Best Practices for Estimating Web Archive Storage Cost

To estimate web archive storage cost, run a small pilot crawl to measure average WARC bytes per page, multiply by your projected page count, then multiply again by 2–3× to account for indexes, derivatives and preservation copies — and add 20–30% headroom. Sizing only the raw WARCs is the classic mistake that leaves projects out of space halfway through. This guide gives a repeatable estimation method and a checklist so every project's figure is documented the same way.

Why can't I just guess from page count?

Because page weight varies enormously. A text-heavy news article might be 200 KB captured; a media-rich page with video, fonts and trackers can exceed 10 MB. That is a 50× spread, so guessing without measurement can leave an estimate off by an order of magnitude or more. The reliable method is measure, then multiply.

Step 1 — Run a pilot and measure bytes per page

Crawl a representative sample (a few hundred pages across the site types you will capture), then divide WARC bytes by pages captured:

bash

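# After a pilot crawl, total WARC bytes (GNU du: -b reports bytes, -c adds a total line)
du -bc crawls/pilot/*.warc.gz | tail -1
# Count captured response records (pages plus sub-resources, so this is not a page count)
# -a treats the decompressed stream as text; ^ matches header lines only
zcat crawls/pilot/*.warc.gz | grep -ac "^WARC-Type: response"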

If 300 seed pages produced 1.8 GB of WARC, that is ~6 MB/page including all sub-resources — use that figure, not the HTML size alone.
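The division can be scripted so the pilot figure is reproducible. This is a minimal sketch, assuming you supply the byte total (for example from `du -bc`) and the page count yourself; `bytes_per_page` is a hypothetical helper name, not part of any crawler.

```bash
# Hypothetical helper: average WARC bytes per crawled page.
#   $1 = total WARC bytes (for example the "total" line from `du -bc`)
#   $2 = number of pages crawled in the pilot (seed pages, not response records)
bytes_per_page() {
  awk -v bytes="$1" -v pages="$2" 'BEGIN {
    if (pages <= 0) { print "pages must be greater than 0" > "/dev/stderr"; exit 1 }
    printf "%.0f\n", bytes / pages
  }'
}

# The example from the text: 1.8 GB of WARC from 300 seed pages.
bytes_per_page 1800000000 300   # prints 6000000 (about 6 MB/page)
```

Dividing by pages rather than by response records keeps the figure in the same unit as the page count you will project later.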

Step 2 — Project the raw WARC volume

Multiply measured bytes/page by your estimated total pages:

raw_warc = avg_bytes_per_page × total_pages
e.g.      = 6 MB × 50,000 pages = ~300 GB raw WARC

Be honest about the page count; crawl scope and depth drive it more than seed count does.

Step 3 — Add the rest of the footprint

Raw WARCs are only part of the bill. Size the whole footprint:

Component	Typical share of raw WARC	Notes
Raw WARCs	1.0× (baseline)	The captures themselves
CDX index	~0.02–0.05×	Small, always needed
Full-text index	1–3× (if used)	Large; only if content search needed
Derivatives	varies	Thumbnails, extracted text
Preservation copies	+1× to +2×	3-2-1 means 2+ extra copies

A safe rule for a CDX-only collection with proper backups: plan for ~2.5× the raw WARC volume (1× raw, a few percent for the CDX index, and one to two extra preservation copies). Add full-text and it climbs toward 4–5×.
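The table can be turned into a small calculator. The sketch below assumes every component is expressed as a ratio of the raw WARC volume, as in the table; the function name `footprint_gb` and its argument order are illustrative choices, not an established tool.

```bash
# Hypothetical helper: total footprint in GB from the component ratios in the table.
#   $1 = raw WARC volume in GB
#   $2 = CDX index ratio            (table: ~0.02-0.05)
#   $3 = full-text index ratio      (table: 1-3, or 0 if not used)
#   $4 = derivatives ratio          (table: "varies"; measure it in the pilot)
#   $5 = extra preservation copies  (table: 1-2, each counted as 1x raw)
footprint_gb() {
  awk -v raw="$1" -v cdx="$2" -v ft="$3" -v deriv="$4" -v copies="$5" 'BEGIN {
    printf "%.1f\n", raw * (1 + cdx + ft + deriv + copies)
  }'
}

# CDX-only collection, 300 GB raw, no derivatives:
footprint_gb 300 0.05 0 0 1   # one extra copy:   prints 615.0 (about 2.05x raw)
footprint_gb 300 0.05 0 0 2   # two extra copies: prints 915.0 (about 3.05x raw)
```

The ~2.5× rule of thumb sits between these two results, which is why it works as a planning default when the number of copies is not yet decided.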

How does deduplication change the numbers?

WARCs are already gzip-compressed per record, so you will not squeeze much more from text. The real lever is revisit records: when a recrawl finds an unchanged resource, the crawler stores a tiny pointer instead of the full payload. On collections that recrawl the same sites frequently, this can cut storage dramatically because cost then tracks how much the sites change, not how often you crawl. Enable deduplication in Browsertrix/Heritrix and your high-frequency collections become affordable.
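To see how much a recrawl actually deduplicated, you can count record types in its WARCs. This is a rough sketch, assuming gzip-compressed WARCs readable by `zcat`; `count_record_type` is a hypothetical helper, and the `crawls/recrawl/` path in the usage comment is a placeholder.

```bash
# Hypothetical helper: count WARC records of one type across gzip-compressed WARCs.
#   $1 = record type (for example "response" or "revisit"); remaining args = files
# grep -a treats the decompressed stream as text; ^ anchors to header lines.
# Note: grep exits non-zero when the count is 0, but still prints "0".
count_record_type() {
  record_type="$1"; shift
  zcat "$@" | grep -a -c "^WARC-Type: ${record_type}"
}

# Usage (placeholder path):
#   count_record_type response crawls/recrawl/*.warc.gz
#   count_record_type revisit  crawls/recrawl/*.warc.gz
```

A high share of revisit records relative to response records suggests most resources were unchanged. Record counts are only a proxy, since the bytes saved depend on the size of the payloads that were skipped.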

What does a recrawl schedule do to cost?

Without dedup, weekly crawls of the same sites multiply storage roughly linearly — 52 nearly identical copies a year. With revisit records, you store the changed bytes plus pointers. Model it as:

yearly_storage ≈ first_full_crawl + (crawls_per_year − 1) × change_rate × first_full_crawl

A site that changes ~5% per week costs far less to recrawl weekly than naive multiplication suggests.
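The model above is easy to evaluate for a few schedules. A minimal sketch, assuming the change rate is given as a fraction of the first full crawl that changes between consecutive crawls; `yearly_storage_gb` is a hypothetical helper name.

```bash
# Hypothetical helper implementing the yearly_storage model.
#   $1 = size of the first full crawl in GB
#   $2 = crawls per year
#   $3 = change rate between crawls as a fraction (0.05 = 5%)
yearly_storage_gb() {
  awk -v first="$1" -v n="$2" -v rate="$3" 'BEGIN {
    printf "%.1f\n", first + (n - 1) * rate * first
  }'
}

yearly_storage_gb 300 52 0.05   # weekly crawls, 5% change: prints 1065.0
yearly_storage_gb 300 52 1      # no dedup, every crawl stored in full: prints 15600.0
```

Setting the change rate to 1 reproduces the naive linear case, so the same function shows both ends of the comparison: about 1 TB a year with revisit records versus about 15.6 TB without.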

A working estimation checklist

[ ] Pilot crawl of a representative sample completed.
[ ] Average WARC bytes per page recorded (with sub-resources).
[ ] Total page count projected from real scope/depth, not seed count.
[ ] Raw WARC volume calculated.
[ ] Multiplied by 2–3× for indexes, derivatives, preservation copies.
[ ] Full-text indexing decision made (adds 1–3×).
[ ] Deduplication enabled for recrawled collections.
[ ] 20–30% headroom added.
[ ] Storage tier (hot/cold) and egress costs noted.
[ ] Figure, assumptions and pilot data documented together.
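One way to keep the figure and its assumptions together is to print them on a single line that can be pasted into the project record. The sketch below chains the steps of this guide; `estimate_gb` is a hypothetical helper, and it assumes decimal units (1 GB = 10^9 bytes).

```bash
# Hypothetical helper: end-to-end estimate in GB (decimal, 1 GB = 10^9 bytes).
#   $1 = average WARC bytes per page (from the pilot)
#   $2 = projected total pages
#   $3 = footprint multiplier (2-3 in this guide)
#   $4 = headroom as a fraction (0.2-0.3 in this guide)
estimate_gb() {
  awk -v bpp="$1" -v pages="$2" -v mult="$3" -v head="$4" 'BEGIN {
    raw = bpp * pages / 1e9
    total = raw * mult * (1 + head)
    printf "raw_gb=%.1f multiplier=%s headroom=%s total_gb=%.1f\n", raw, mult, head, total
  }'
}

estimate_gb 6000000 50000 2.5 0.25
# prints: raw_gb=300.0 multiplier=2.5 headroom=0.25 total_gb=937.5
```

Because the inputs are echoed alongside the result, anyone reviewing the estimate later can see which pilot figure and multipliers produced it.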

What's the costliest mistake?

Sizing for raw WARCs alone and forgetting that a defensible archive needs indexes and multiple preservation copies. The first copy is cheap; the discipline of 3-2-1 backups and a full-text index is what fills the disk. Estimate the full footprint up front and the project never stalls on storage.

Key Takeaways

Measure bytes/page from a pilot, then multiply — never guess from page count.
Page weight varies 50×+, so a representative sample is essential.
Size the whole footprint: WARCs + indexes + derivatives + backups.
Plan for ~2.5× raw WARC for a CDX-only collection with 3-2-1 backups.
Deduplication (revisit records) makes frequent recrawls affordable.
Recrawl cost tracks how much sites change, not how often you crawl.
Always add 20–30% headroom and document every assumption.

Frequently Asked Questions

How do I estimate the storage size of a web crawl before running it?

Run a small pilot crawl of a representative sample, measure the average WARC bytes per page, then multiply by your estimated total page count. Add overhead for indexes, derivatives and preservation copies — typically you size for roughly two to three times the raw WARC volume.

How much does it cost to store a terabyte of web archive per year?

It varies widely by tier and provider, from a few dollars per terabyte-month on cold object storage to ten times that on hot, replicated storage. The dominant cost over time is usually not the first copy but the multiple preservation copies, egress and management you layer on top.

Should I count just the WARCs or the whole footprint?

Count the whole footprint. Beyond raw WARCs you need CDX indexes, optional full-text indexes, derivatives, and at least two backup copies under a 3-2-1 strategy, which together commonly multiply the raw size by two to three.

Does compression reduce web archive storage cost much?

WARCs are usually already gzip-compressed per record, so most text savings are baked in. Additional savings come mostly from deduplicating repeated resources (revisit records), which can cut storage substantially on collections that recrawl the same sites.

How does recrawl frequency affect cost?

Higher frequency multiplies storage roughly linearly unless you deduplicate. With revisit records, unchanged resources are stored once and later crawls only reference them, so cost grows with how much the sites actually change rather than how often you crawl.

What is the most common mistake in storage estimates?

Sizing for the raw WARCs only and forgetting indexes, derivatives and multiple preservation copies, which leads to running out of space mid-project. Always estimate the full footprint and add headroom of at least 20 to 30 percent.
