Web Archiving

To make a web archive searchable you build two things: a CDX (URL) index so replay can locate any capture by address and date, and — if users need to search page content — a separate full-text index in Solr or OpenSearch. Start with the CDX; it is small, fast to build, and required for replay to work at all. This guide takes you from raw WARCs to a queryable archive, with the defaults that work and the canonicalisation pitfalls that cause the dreaded "not in archive" error.

What kind of index do you actually need?

Two different jobs, two different indexes:

  • URL index (CDX/CDXJ) — "show me example.com/page as it looked in 2024." This is mandatory; replay cannot work without it.
  • Full-text index — "show me every captured page mentioning flood defences." This is optional and far heavier.

Build the CDX first and confirm replay works before you even consider full-text. Most archives need only the CDX.

Step 1 — Build the CDX index

For a file-based CDXJ index over a folder of WARCs, cdxj-indexer is the simplest path:

bash
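pip install cdxj-indexer
mkdir -p indexes
# --sort matters: replay tools binary-search the file, so lines must be in key order
cdxj-indexer --sort ./warcs/ > indexes/index.cdxj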

Or let pywb manage it for you, which also wires up replay:

bash
wb-manager init my-collection
wb-manager add my-collection ./warcs/*.warc.gz   # indexes as it ingests
wayback   # replay at http://localhost:8080

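A CDXJ line has three parts: the SURT key, a 14-digit capture timestamp, and a JSON block of metadata. The reversed host at the start of the key (com,example) is what makes a domain's captures sort together: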

com,example)/page 20240927101502 {"url": "https://example.com/page", "mime": "text/html", "status": "200", "digest": "sha1:ABC…", "offset": "1234", "length": "5678", "filename": "crawl.warc.gz"}
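
To make the three-part layout concrete, here is a minimal Python sketch that splits one CDXJ line into its key, timestamp and JSON fields. The `Capture` type and `parse_cdxj_line` function are names invented for this illustration, not part of pywb or cdxj-indexer, and the sketch assumes keys never contain a literal space.

```python
import json
from typing import NamedTuple


class Capture(NamedTuple):
    key: str        # SURT-form URL key
    timestamp: str  # 14-digit capture time, YYYYMMDDhhmmss
    fields: dict    # JSON block: url, mime, status, offset, ...


def parse_cdxj_line(line: str) -> Capture:
    """Split one CDXJ line into its key, timestamp and JSON fields."""
    # Assumption: key and timestamp contain no spaces, so two splits are
    # enough and everything after the second space is the JSON block.
    key, timestamp, raw_json = line.rstrip("\n").split(" ", 2)
    return Capture(key, timestamp, json.loads(raw_json))


sample = (
    'com,example)/page 20240927101502 '
    '{"url": "https://example.com/page", "mime": "text/html", '
    '"status": "200", "offset": "1234", "length": "5678", '
    '"filename": "crawl.warc.gz"}'
)
capture = parse_cdxj_line(sample)
print(capture.key, capture.timestamp, capture.fields["filename"])
```

Note that offset and length arrive as strings in the JSON block, so convert them with int() before seeking into a WARC.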

Step 2 — Use OutbackCDX for scale

A flat CDXJ file is fine up to a few million records. Beyond that, or for a collection that keeps growing, run OutbackCDX as an index server so lookups stay fast and you can append without rewriting one giant file:

bash
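java -jar outbackcdx.jar -p 8901 -d ./cdx-data &
# OutbackCDX does not parse WARCs itself: generate CDX records with your indexer,
# then POST them to a named index (the file and index names here are examples)
curl -X POST --data-binary @indexes/index.cdx http://localhost:8901/my-collection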

Then point pywb at it. OutbackCDX answers prefix and range queries, which is what powers "all captures under this host" browsing.
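
The reason prefix queries are cheap is that SURT keys are stored in sorted order, so every capture under a host sits in one contiguous run. The sketch below illustrates that idea on an in-memory list with Python's bisect module. It is not how OutbackCDX is implemented, and `prefix_range` is a hypothetical helper written for this example.

```python
from bisect import bisect_left


def prefix_range(sorted_keys, prefix):
    """Return every key in sorted_keys that starts with prefix."""
    # Binary-search to the first key >= prefix, then walk forward while
    # the keys still share the prefix; matches are always contiguous.
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


keys = sorted([
    "com,example)/",
    "com,example)/about",
    "com,example)/page",
    "com,example,blog)/post",
    "org,example)/",
])

# Captures on example.com itself
host_only = prefix_range(keys, "com,example)/")
# example.com plus its subdomains (blog.example.com sorts right after)
with_subdomains = prefix_range(keys, "com,example")
print(host_only)
print(with_subdomains)
```

A bare prefix such as "com,example" would also match a host like examples.com, so a real subdomain query needs to check for the "," or ")" that follows the host.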

Step 3 — Add full-text search (only if needed)

Full-text means extracting page text and feeding it to a search engine. The common stack is UKWA's webarchive-discovery producing Solr documents, or a lighter pipeline into OpenSearch:

bash
# Conceptual pipeline: WARC -> extracted text -> search docs
# (warc-to-text and indexer are stand-ins for whatever extractor and loader you use)
warc-to-text ./warcs/ | indexer --target http://localhost:8983/solr/fulltext

Budget for size and time: a full-text index can be several times larger than the WARCs and takes far longer to build than a CDX, so only do it when content search is a real requirement.
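
As a rough picture of the "extract text, build a search document" step, the following Python sketch strips markup from one HTML payload using only the standard library and wraps the result in a dictionary. The field names (id, url, crawl_date, content) are assumptions made for this example, not the webarchive-discovery schema, and a real pipeline would also handle character encodings, PDFs and other MIME types.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def to_search_doc(url, timestamp, html):
    """Build one search document (hypothetical field names) from a capture."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": f"{timestamp}/{url}",
        "url": url,
        "crawl_date": timestamp,
        "content": " ".join(parser.parts),
    }


doc = to_search_doc(
    "https://example.com/page",
    "20240927101502",
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Flood defences</h1><p>Planned for 2025.</p>"
    "<script>track()</script></body></html>",
)
print(doc["content"])
```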

Why does replay say "not in archive"?

This is the number-one indexing bug, and it has two usual causes:

  1. Stale index — you added WARCs but did not re-index. Re-run the indexer.
  2. Canonicalisation mismatch — the requested URL canonicalises to a different SURT than the stored key (trailing slash, www, query-param order, scheme). Compare the SURT of the requested URL with the key stored in the index.

bash
# Check how a URL canonicalises before blaming the data
python -c "from surt import surt; print(surt('https://www.example.com/Page/?b=2&a=1'))"
# com,example)/page?a=1&b=2

If the SURTs differ, fix the canonicalisation rules rather than the WARCs.
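
To see why superficially different URLs can still land on the same key, here is a deliberately simplified canonicaliser written with the standard library only. `toy_surt` is a teaching sketch under the assumptions listed in its comments. It is not the surt library and does not reproduce all of its rules, so use the real library when debugging an actual index.

```python
from urllib.parse import urlsplit


def toy_surt(url: str) -> str:
    """Approximate a SURT key; a simplification, not the surt library."""
    parts = urlsplit(url.lower())           # assumption: case is ignored
    host = parts.hostname or ""
    if host.startswith("www."):             # assumption: leading www. dropped
        host = host[4:]
    reversed_host = ",".join(reversed(host.split(".")))

    path = parts.path
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")             # trailing slash dropped

    # Query parameters sorted so their order does not matter
    query = "&".join(sorted(parts.query.split("&"))) if parts.query else ""

    key = f"{reversed_host}){path or '/'}"  # scheme is not part of the key
    return f"{key}?{query}" if query else key


requested = toy_surt("https://www.example.com/Page/?b=2&a=1")
indexed = toy_surt("http://example.com/page?a=1&b=2")
print(requested)
print(indexed)
print("match" if requested == indexed else "mismatch")
```

If the indexer and the replay tool apply different rule sets, two URLs like these produce different keys, and the lookup misses even though the capture exists.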

A quick pre-flight checklist

  • [ ] All WARCs gzipped per-record (so offsets are valid).
  • [ ] CDX rebuilt after the last WARC was added.
  • [ ] SURT canonicalisation consistent between index and replay.
  • [ ] Replay confirmed on a known-good URL.
  • [ ] Full-text only attempted after CDX-based replay works.
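
The first item on the checklist can be spot-checked in code: if a WARC is gzipped per record, every offset in the index should point at the start of a gzip member, which always begins with the bytes 1f 8b. The sketch below shows that check on a small two-member file built on the fly. The helper names are made up for this example, and the demo file stands in for a real WARC.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"


def offset_starts_gzip_member(path, offset):
    """True if the two bytes at offset are the gzip magic number."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(2) == GZIP_MAGIC


def read_member(path, offset, length):
    """Read and decompress one gzip member, as a replay tool would."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return gzip.decompress(fh.read(length))


# Demo: two independently compressed "records" concatenated into one file,
# which is the layout a per-record gzipped WARC has on disk.
first = gzip.compress(b"record one\n")
second = gzip.compress(b"record two\n")
with tempfile.NamedTemporaryFile(suffix=".warc.gz", delete=False) as tmp:
    tmp.write(first + second)
    demo_path = tmp.name

ok = offset_starts_gzip_member(demo_path, len(first))        # valid offset
payload = read_member(demo_path, len(first), len(second))    # second record
bad = offset_starts_gzip_member(demo_path, len(first) + 1)   # off by one
os.remove(demo_path)
print(ok, payload, bad)
```

Running the same check against the offset and filename of a few index lines is a quick way to catch WARCs that were compressed as a single stream.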

Key Takeaways

  • Build the CDX index first; replay is impossible without it.
  • CDX indexes are small (a few percent of WARC size); full-text indexes are large.
  • Use cdxj-indexer/pywb for small collections, OutbackCDX for scale.
  • The SURT key groups captures by domain and powers prefix queries.
  • "Not in archive" usually means a stale index or canonicalisation mismatch.
  • Always re-index after adding WARCs and verify a known URL replays.
  • Add full-text search only when content search is a genuine requirement.

Frequently Asked Questions

What is a CDX index in web archiving?

A CDX index is a sorted, line-based listing of every capture in your WARC files, keyed by a canonicalised URL (SURT) plus timestamp, with byte offsets so a replay tool can jump straight to a record. It is the lookup table that makes URL-based replay and search fast.

What is the difference between URL indexing and full-text search?

URL indexing (CDX) lets you find captures by their address and time, which is what replay needs. Full-text search indexes the words inside the captured pages so users can search by content, and it requires a separate engine such as Solr or OpenSearch running alongside the CDX.

Which tool should I use to build a CDX index?

Use cdxj-indexer or pywb's wb-manager for a file-based CDXJ index on small to medium collections, and OutbackCDX for a scalable, queryable index server on large or growing collections. pywb can read both.

How big does a CDX index get?

A CDX index is small relative to the WARCs — typically a few percent of the captured data, since each line is just a URL key, timestamp, metadata and an offset. Full-text indexes are much larger because they store tokenised page content.

What is a SURT and why does it matter?

SURT (Sort-friendly URI Reordering Transform) rewrites a URL so the host is reversed, e.g. example.com becomes com,example. This makes captures from the same domain sort together, which is essential for prefix queries and efficient lookups in the index.

Why does my replay say 'not in archive' when the page is captured?

Almost always an indexing or canonicalisation mismatch: the index was not rebuilt after adding WARCs, or the requested URL canonicalises differently from the indexed key. Re-run the indexer and check the SURT form of both URLs.