Best Practices for Understanding the WARC Format

The WARC (Web ARChive) format, standardised as ISO 28500, is a container that stores the raw HTTP requests and responses captured during a crawl, one after another, each wrapped in a record with its own headers and timestamp. Understanding it means understanding three things: record types, deduplication via revisit records, and per-record gzip compression. Get those right and your collection stays consistent, indexable and defensible.

What is inside a WARC record?

Each record has a block of WARC headers (its own metadata) followed by the captured content. A response record, for example, carries WARC-Type: response, a WARC-Record-ID (a URN), a WARC-Date, the WARC-Target-URI, a Content-Type of application/http; msgtype=response, a Content-Length giving the size of the block, and a WARC-Payload-Digest. After a blank line comes the literal HTTP message: the status line, the response headers, and the body bytes.

```text
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:5f7e...>
WARC-Date: 2025-01-22T09:14:55Z
WARC-Target-URI: https://example.org/index.html
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
Content-Type: application/http; msgtype=response
Content-Length: 5821

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
...the page body...
```
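
To make the layout concrete, here is a minimal sketch of splitting one uncompressed record into its version line, WARC headers and content block. It uses only the standard library; `parse_warc_record` and the sample record are hypothetical names and data made up for illustration, and the parser ignores details a real reader such as warcio handles (case-insensitive header names, folded header lines, streaming).

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, block)."""
    # The WARC header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so URIs in values stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the exact size of the block that follows.
    block = rest[: int(headers["Content-Length"])]
    return version, headers, block


# Hypothetical record built for illustration; not taken from a real crawl.
http_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
raw_record = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: https://example.org/index.html\r\n",
    b"Content-Type: application/http; msgtype=response\r\n",
    b"Content-Length: " + str(len(http_block)).encode("ascii") + b"\r\n",
    b"\r\n",          # blank line closing the WARC headers
    http_block,       # the literal HTTP message
    b"\r\n\r\n",      # two CRLFs terminate the record
])

version, headers, block = parse_warc_record(raw_record)
print(version, headers["WARC-Type"], len(block))
```

The point of the sketch is that the WARC Content-Length, not the HTTP one, tells a reader where the record ends.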

What are the record types you actually meet?

warcinfo — one per file, describing the crawler, operator and capture parameters.
request / response — the HTTP pair for each fetched URL.
resource — content captured without an HTTP exchange (e.g. a local file).
revisit — a pointer to an identical earlier payload (deduplication).
metadata — annotations such as detected fields, attached to another record.
conversion — a migrated/normalised version of an earlier record.
continuation — a record split across files.

In a real crawl, response and revisit dominate the counts.

Why do revisit records matter so much?

A revisit record stores no payload body; it says "this URL returned the same bytes at this later time" and points back to the earlier capture. The match is made on the payload digest, and the earlier record is usually identified through the WARC-Refers-To header (newer files also carry WARC-Refers-To-Target-URI and WARC-Refers-To-Date). This deduplication can shrink a recurring crawl dramatically. The trap: a revisit is only resolvable if the record it points to still exists in your collection. If you split, prune or migrate WARCs carelessly, you can orphan revisits and break replay. Best practice: never delete the WARC that holds the original responses a later crawl deduplicates against.
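
Before pruning or migrating files, the orphan risk can be checked mechanically. The sketch below assumes you have already extracted each record's WARC headers into a dict (for example from an index) and that revisits identify their original through WARC-Refers-To; some crawlers write only the target URI and date, which this sketch does not handle. `find_orphaned_revisits` and the record IDs are hypothetical.

```python
def find_orphaned_revisits(records):
    """Return the IDs of revisit records whose WARC-Refers-To target is missing."""
    # Every non-revisit record is a possible target of a revisit.
    known_ids = {
        r["WARC-Record-ID"] for r in records if r.get("WARC-Type") != "revisit"
    }
    orphans = []
    for r in records:
        if r.get("WARC-Type") != "revisit":
            continue
        if r.get("WARC-Refers-To") not in known_ids:
            orphans.append(r["WARC-Record-ID"])
    return orphans


# Made-up header dicts standing in for a small collection.
collection = [
    {"WARC-Type": "response", "WARC-Record-ID": "<urn:uuid:a1>"},
    {"WARC-Type": "revisit", "WARC-Record-ID": "<urn:uuid:r1>",
     "WARC-Refers-To": "<urn:uuid:a1>"},
    # This revisit points at a record that is no longer in the collection.
    {"WARC-Type": "revisit", "WARC-Record-ID": "<urn:uuid:r2>",
     "WARC-Refers-To": "<urn:uuid:gone>"},
]

orphans = find_orphaned_revisits(collection)
print(orphans)
```

Running a check like this over the whole collection, not file by file, matters because a revisit and its original usually live in different WARCs.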

How is a WARC compressed and indexed?

The convention is per-record gzip, producing .warc.gz. Because each record is compressed as its own gzip member, an indexer can store the byte offset of any record and a reader can seek straight to it without decompressing what comes before. That index is the CDX file (or CDXJ, its JSON-based variant) that replay tools such as pywb and OpenWayback read.

```bash
# Generate a CDXJ index from a compressed WARC
cdxj-indexer crawl-00000.warc.gz > crawl-00000.cdxj

# Sanity-check record types and counts
warcio index crawl-00000.warc.gz -f warc-type | sort | uniq -c
```
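
Why per-record compression makes a file seekable can be shown with the standard library alone. This is a sketch of the idea, not how any particular indexer is implemented: `write_members` and `read_member` are hypothetical helpers, and the payloads are placeholder bytes rather than real WARC records.

```python
import gzip
import zlib


def write_members(payloads):
    """Compress each payload as its own gzip member; return (blob, offsets)."""
    blob = bytearray()
    offsets = []
    for payload in payloads:
        offsets.append(len(blob))       # byte offset where this member starts
        blob += gzip.compress(payload)  # one complete gzip member per record
    return bytes(blob), offsets


def read_member(blob, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    # Decompression stops at the end of the first member; later bytes are
    # left untouched in decompressor.unused_data.
    return decompressor.decompress(blob[offset:])


payloads = [b"record one\r\n", b"record two\r\n", b"record three\r\n"]
blob, offsets = write_members(payloads)

second = read_member(blob, offsets[1])  # jump straight to the second record
whole = gzip.decompress(blob)           # the file is still one valid gzip stream
print(offsets, second)
```

With a real file you would `seek()` to the indexed offset and read from there instead of slicing an in-memory blob. A WARC gzipped as a single stream has no such entry points, which is why it cannot be indexed this way.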

How do I keep WARCs consistent across a collection?

| Practice | Recommended default | Why |
| --- | --- | --- |
| File extension | `.warc.gz` | Per-record gzip, seekable |
| File rollover size | ~1 GB | Manageable transfer & fixity |
| warcinfo per file | Always | Documents provenance |
| Payload digest | SHA-1 or SHA-256 | Enables dedup + fixity |
| Naming | `<collection>-<seq>.warc.gz` | Predictable indexing |

Document these choices once and apply them collection-wide; inconsistency is what makes archives hard to defend years later.
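
A naming convention is easiest to keep when it is generated and checked by code. The sketch below assumes a five-digit, zero-padded sequence number, inferred from the `crawl-00000.warc.gz` example used in this article; the width, `warc_name` and `nonconforming` are illustrative choices, not part of the WARC standard.

```python
import re

# Assumed convention: <collection>-<5-digit seq>.warc.gz
NAME_RE = re.compile(r"(?P<collection>.+)-(?P<seq>\d{5})\.warc\.gz")


def warc_name(collection: str, seq: int) -> str:
    """Build a file name that follows the assumed convention."""
    return f"{collection}-{seq:05d}.warc.gz"


def nonconforming(names):
    """Return the names that do not match the convention."""
    return [n for n in names if NAME_RE.fullmatch(n) is None]


first = warc_name("crawl", 0)
bad = nonconforming(["crawl-00000.warc.gz", "crawl-1.warc.gz", "crawl-00002.warc"])
print(first, bad)
```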

How do I validate a WARC is well-formed?

Run a validator before you trust a file. warcio will fail loudly on truncated records, and a quick digest check catches silent corruption:

```bash
# Verify that each record's block and payload digests match the declared values
warcio check crawl-00000.warc.gz
```

A clean warcio check plus a successful replay in two engines is the practical definition of a "good" WARC.
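
For intuition about what a digest check compares, here is a minimal sketch that recomputes a payload digest and matches it against a declared `WARC-Payload-Digest` value. It assumes the `algorithm:BASE32` form shown in the sample record earlier (some tools write hex instead) and that `payload` is the HTTP body with transfer encoding already removed. `payload_digest` and `digest_matches` are hypothetical helpers, not warcio's API.

```python
import base64
import hashlib


def payload_digest(payload: bytes, algorithm: str = "sha1") -> str:
    """Return a digest in the 'algorithm:BASE32' form used in WARC headers."""
    raw = hashlib.new(algorithm, payload).digest()
    return f"{algorithm}:{base64.b32encode(raw).decode('ascii')}"


def digest_matches(declared: str, payload: bytes) -> bool:
    """Recompute the digest named in `declared` and compare the values."""
    algorithm, _, value = declared.partition(":")
    computed = payload_digest(payload, algorithm.lower())
    # Compare only the encoded value; the label's case varies between tools.
    return computed.partition(":")[2] == value


body = b"<html>hello</html>"
declared = payload_digest(body)                        # what a crawler would write
ok = digest_matches(declared, body)                    # intact payload
corrupted = digest_matches(declared, body + b"\x00")   # one extra byte
print(declared, ok, corrupted)
```

One handy value to recognise: the SHA-1 of zero bytes encodes to `3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ`, so that digest in an index means the payload was empty.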

Key Takeaways

WARC (ISO 28500) concatenates raw HTTP request/response records with metadata.
response and revisit records dominate real crawls; learn those first.
Revisit records deduplicate but break replay if their referenced record is deleted.
Use per-record gzip (.warc.gz) so indexers can seek directly to any record.
Write one warcinfo per file and a payload digest per record for provenance and fixity.
Validate with warcio check and confirm replay in two independent engines.

Frequently Asked Questions

What does WARC stand for and what standard defines it?

WARC means Web ARChive. It is defined by ISO 28500, which specifies a container format for concatenating the HTTP requests and responses captured during web crawling, along with metadata records.

What are the main WARC record types?

The core types are warcinfo, request, response, resource, metadata, revisit, conversion and continuation. In practice you mostly deal with response and revisit records, plus one warcinfo per file describing the capture.

What is a revisit record and why does it matter?

A revisit record points to a previously captured identical payload instead of storing it again, deduplicating repeated resources. It keeps collections small but means you must keep the original response record it references, or replay breaks.

Should WARC files be compressed?

Yes, the convention is per-record gzip, giving files ending in .warc.gz. Per-record compression lets indexers seek to any record without decompressing the whole file, which is essential for fast replay.

How big should a single WARC file be?

Crawlers commonly roll over to a new file at around 1 GB. This keeps individual files manageable for transfer, fixity checking and indexing without creating millions of tiny files.
