Troubleshooting: Describe web archive collections

When the description of a web archive collection goes wrong, the cause is almost always one of four things: seed URLs that do not match what was captured, timestamps in the wrong timezone, description at the wrong level of granularity, or a missing scope note. Fix those four and most "broken metadata" complaints disappear. This guide walks each symptom back to its root cause and gives the concrete repair.

Why doesn't my description match what was captured?

This is the most common failure. You described the seed list but the crawler captured something else — redirects fired, the site canonicalised http to https, or scope rules pulled in or excluded subdomains.

The fix is to describe from the CDX index, not the seed list. The CDX is the ground truth of what is actually in your WARCs:

bash

# List the distinct hosts actually captured
zcat indexes/*.cdx.gz | awk '{print $1}' | sort -u

# Count captures per host to spot unexpected scope creep
zcat indexes/*.cdx.gz | awk '{print $1}' | sort | uniq -c | sort -rn | head

If a host appears that you never seeded, your scope was too broad; if a seed is missing, the crawl failed for it. Either way, your description should reflect reality.

How do I diagnose timestamp errors?

Symptom: crawl dates that are a few hours off, or that sort incorrectly. Root cause: a tool displayed UTC as local time, or you typed a local date into a UTC field.

WARC records store WARC-Date in UTC (e.g. 2025-04-03T09:14:02Z). Always normalise your descriptive crawl dates to ISO 8601 UTC and state the convention. A quick check:

bash

grep -m1 "WARC-Date" record.warc
# WARC-Date: 2025-04-03T09:14:02Z  ->  describe as 2025-04-03, UTC

What level should I describe at?

Over-description is a quiet time sink. Use this rule of thumb:

Level	Always?	Typical fields
Collection	Yes	Title, scope, date range, provenance, rights
Seed / site	Usually	Seed URL, captured host, crawl frequency
Document / page	Rarely	Only for landmark items

If you find yourself hand-cataloguing individual pages, stop — that effort almost never pays off in discovery, and it does not scale across a growing collection.

Why is my scope note causing confusion?

A vague scope note ("news websites, 2024") makes every gap ambiguous. Was a missing page deleted, blocked by robots, out of scope, or a crawl error? Researchers cannot tell. A good scope note is explicit about boundaries:

Included: the three named outlets' politics sections, weekly crawls Jan–Dec 2024, depth 2. Excluded: paywalled articles, third-party ad domains, embedded video. Known gaps: outlet B's 2024-06 crawl failed (rate-limited).

That last sentence — known gaps — is what turns a description into a research aid.

How do I describe an accruing collection?

Open collections need an open date range and explicit cadence. Record it once, then update the "last revised" field on each accrual:

yaml

title: "Regional Election Coverage 2024–"
scope: "Weekly crawls of three outlets' politics sections."
date_range: "2024-01-01/open"
crawl_frequency: "weekly"
last_revised: "2025-04-03"
status: "ongoing"

Which fields get skipped most often?

In order of how often they are missing and how much they hurt: scope, rights, provenance (who crawled it, with what tool), and known gaps. None of these are hard to record; they are simply boring, so they get dropped. Make them required fields in your template and the problem solves itself.

Key Takeaways

Describe from the CDX index, not the seed list — capture rarely equals request.
Store and display all crawl dates as ISO 8601 UTC to kill timezone bugs.
Describe at collection and site level; document-level description rarely pays off.
A scope note must state exclusions and known gaps, not just inclusions.
Model growing collections as open date ranges with a documented cadence.
Make scope, rights, provenance and gaps required template fields.
Pair a descriptive standard (Dublin Core / MODS) with WARC technical metadata.

Frequently Asked Questions

What metadata standard should I use to describe a web archive collection?

Use Dublin Core or MODS for descriptive metadata at the collection and seed level, and pair it with the WARC's own technical metadata. Many institutions add a thin local profile mapping seed URLs, crawl dates and scope notes on top of one of these standards.

Why do my seed URLs not match what was actually captured?

Redirects, canonicalisation and crawl-scope rules mean the requested seed often differs from the captured URL. Cross-check your seed list against the CDX index so your description records what was archived, not just what you asked for.

How granular should web archive description be?

Describe at the collection level always, at the seed/site level usually, and at the document level rarely. Most discovery happens at collection and site level, so over-describing individual pages wastes effort that is better spent on scope notes and provenance.

My crawl dates in the metadata look wrong — what happened?

Crawl dates are usually stored in UTC inside the WARC but displayed in local time by some tools, causing apparent off-by-hours errors. Always store and describe timestamps as ISO 8601 UTC, and note the timezone explicitly.

How do I describe a collection that grows over time?

Treat it as an open, accruing collection: record a date range with an open end, document the crawl frequency, and version the description. Note in the scope statement that the collection is ongoing and when it was last updated.

What is the single most overlooked field in web archive description?

Scope: what was deliberately included and excluded, and why. Without a scope note, researchers cannot tell whether a gap is a deletion, a crawl failure or an intentional boundary.

Why doesn't my description match what was captured? ​

How do I diagnose timestamp errors? ​

What level should I describe at? ​

Why is my scope note causing confusion? ​

How do I describe an accruing collection? ​

Which fields get skipped most often? ​

Key Takeaways ​

Frequently Asked Questions ​

What metadata standard should I use to describe a web archive collection? ​

Why do my seed URLs not match what was actually captured? ​

How granular should web archive description be? ​

My crawl dates in the metadata look wrong — what happened? ​

How do I describe a collection that grows over time? ​

What is the single most overlooked field in web archive description? ​

Related reading ​