How to Quality check web archive captures

To quality-check a web archive capture, replay it next to the live page, look for visual gaps, and open the browser console to catch 404s and live-web leakage. For a large crawl, analyse the status-code distribution, verify the seeds, then sample internal pages. The crawl log finishing "successfully" proves nothing about fidelity — only replay does.

What does "good" actually mean for a capture?

A high-quality capture is self-contained (every resource it needs is in the WARC), visually faithful (it looks like the original at capture time), and interactively sound (links, navigation and key widgets work on replay). The single most dangerous failure is live-web leakage: the page looks perfect because a missing image is quietly fetched from the live internet, and it will break the day that image changes.

How do I QA a single page fast?

Open the capture in ReplayWeb.page (or your pywb instance).
Put the live page in a second window and compare them side by side.
Open dev tools → Console and Network, then reload the replay.
Scan for red entries: 404 (missing from archive) and any request to a live-web hostname (leakage).
Click the main navigation and one or two internal links — do they resolve inside the archive?

A clean console plus a matching screenshot is a passing single-page QA.

How do I QA a whole crawl without opening every page?

Start from the status codes. The crawl produces a CDX/CDXJ index; summarise it:

bash

# Count HTTP status codes across the collection
zcat crawl-*.warc.gz | grep -a "^HTTP/" | awk '{print $2}' | sort | uniq -c

# Or from a CDXJ index, pull the status field
cut -d' ' -f4- crawl.cdxj | jq -r '.status' | sort | uniq -c

Read the distribution like a vital sign:

Pattern	Interpretation
Mostly `200`, some `301/302`	Healthy crawl
Cluster of `403`	Bot blocking / WAF
Cluster of `429`	Rate-limited — slow the crawl
`5xx` spikes	Server errors during capture
Unexpected `404`	Missing resources → replay gaps

Then sample: every seed plus a random handful of each templated page type (article, product, listing). If one article page is whole, similarly-built ones usually are too.

How do I detect live-web leakage reliably?

Leakage hides because it looks fine. To force it into the open, replay in a tool or mode that blocks the live web and watch what breaks. In pywb the archival mode blocks most live calls; anything that then fails to load was never actually captured. The console's failed-request list is your leakage report.

text

Console after blocking the live web:
  GET https://cdn.othersite.com/hero.jpg  net::ERR_BLOCKED   <- leakage found

What pitfalls trip people up most?

Trusting the crawl log. "0 errors" can still mean missing lazy-loaded assets.
Checking only the homepage. Templated inner pages are where systematic gaps live.
QA in the same browser that has live cache. Use a fresh profile so cached assets do not mask gaps.
One replay engine. Confirm in both pywb and ReplayWeb.page; a quirk in one is caught by the other.
No record of the QA. Log what you checked and the result, so the capture is defensible later.

How should I record QA results?

Keep a short, structured note per capture — it turns ad-hoc checking into a defensible process:

yaml

capture: org-example_2025-02-11_d2.wacz
qa_date: 2025-02-11
seeds_checked: 4/4 pass
sampled_pages: 12 (1 fail: /events/ leaked CDN image)
status_summary: 200=812, 301=44, 404=3
action: re-patch /events/ then re-QA

That note is what lets a colleague — or you in two years — trust the archive without re-doing the work.

Key Takeaways

Replay-and-compare plus a console check catches most issues in under a minute.
Live-web leakage is the silent killer; force it out by replaying with the live web blocked.
For crawls, read the status-code distribution first, then sample by page type.
A "successful" crawl log does not prove fidelity — only replay does.
Use a fresh browser profile and two replay engines to avoid masked gaps.
Record structured QA notes so each capture is defensible and re-checkable.

Frequently Asked Questions

What is the fastest way to QA a single capture?

Open it in ReplayWeb.page, compare it visually to the live page, then open the browser console and look for 404s or requests reaching the live web. Most quality problems surface within a minute of that check.

What does live-web leakage mean and why is it bad?

Leakage is when a replayed page loads a resource from the live internet instead of the archive, making it look complete when it is not. The archive will silently break once that live resource changes or disappears.

How can I QA a large crawl without checking every page?

Check the crawl's status-code distribution and seed pages programmatically, then sample. Audit the seeds and a random sample of internal pages, paying special attention to templated page types like article or product pages.

Which HTTP status codes signal a problem in a crawl?

A healthy crawl is mostly 200s with some intentional 301/302 redirects. Clusters of 403, 429 or 5xx point to blocking or rate-limiting, and unexpected 404s mean missing resources that will break replay.

Does a successful crawl log mean the capture is good?

No. A crawl can finish without errors yet miss lazy-loaded assets or leak to the live web. Visual and console verification on replay is the only reliable confirmation of fidelity.

What does "good" actually mean for a capture? ​

How do I QA a single page fast? ​

How do I QA a whole crawl without opening every page? ​

How do I detect live-web leakage reliably? ​

What pitfalls trip people up most? ​

How should I record QA results? ​

Key Takeaways ​

Frequently Asked Questions ​

What is the fastest way to QA a single capture? ​

What does live-web leakage mean and why is it bad? ​

How can I QA a large crawl without checking every page? ​

Which HTTP status codes signal a problem in a crawl? ​

Does a successful crawl log mean the capture is good? ​

Related reading ​