Web Archiving

To scope a web archiving crawl, define the seeds, pick a scope type (host, prefix or domain), set a depth and page limit, and add include/exclude rules so the crawler captures exactly what you intend and nothing more. Good scoping is the difference between a clean, citable collection and a runaway crawl that swallows a CDN. This guide takes you from seed list to a tested scope.

What is scoping and why does it decide success?

Scoping is the boundary definition of a crawl. It answers two questions for every URL the crawler discovers: do I capture this? and do I follow its links? Get the boundary too tight and you miss content; too loose and the crawl wanders onto unrelated sites, balloons in size, and may never finish. Scope is set with seeds, a scope type, depth, limits, and include/exclude rules.

Which scope type should I choose?

| Scope type | Boundary | Use when |
| --- | --- | --- |
| host | Exact hostname only | One site, no subdomains |
| prefix | Under a URL path | One section, e.g. /research/ |
| domain | Registered domain + subdomains | Whole organisation, including blog. and docs. subdomains |
| page | Single URL | One page only |
| any | No host restriction | Rarely; needs tight excludes |

prefix is the workhorse for targeted captures. Reserve domain for whole-organisation crawls and never use any without aggressive excludes.
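To make the boundaries in the table concrete, here is a minimal Python sketch of how each scope type might decide whether a URL is in scope. The function name `in_scope` is hypothetical and is not any crawler's real API. The domain check is a deliberate simplification: it strips a leading `www.` from the seed host instead of consulting the Public Suffix List, which a real crawler would need for registered domains such as `example.co.uk`.

```python
from urllib.parse import urlsplit


def in_scope(url: str, seed: str, scope_type: str) -> bool:
    """Return True if `url` falls inside the boundary implied by `seed`."""
    u, s = urlsplit(url), urlsplit(seed)
    u_host = (u.hostname or "").lower()
    s_host = (s.hostname or "").lower()

    if scope_type == "page":
        # Only the seed URL itself (ignoring any #fragment).
        return url.split("#")[0] == seed.split("#")[0]
    if scope_type == "host":
        return u_host == s_host
    if scope_type == "prefix":
        # Same host, and the path sits under the seed's directory.
        prefix = s.path[: s.path.rfind("/") + 1]
        return u_host == s_host and u.path.startswith(prefix)
    if scope_type == "domain":
        # Simplification: treat the seed host minus "www." as the registered domain.
        base = s_host[4:] if s_host.startswith("www.") else s_host
        return u_host == base or u_host.endswith("." + base)
    if scope_type == "any":
        return True
    raise ValueError(f"unknown scope type: {scope_type}")


seed = "https://example.org/research/"
for candidate in ("https://example.org/research/papers/1",
                  "https://example.org/about",
                  "https://blog.example.org/post"):
    verdicts = {t: in_scope(candidate, seed, t) for t in ("host", "prefix", "domain")}
    print(candidate, verdicts)
```

Running the loop shows how the same candidate URL moves in and out of scope as the scope type widens from prefix to host to domain.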

How do I write a scope configuration?

Here is a Browsertrix-style config scoping a crawl to one site section:

yaml
seeds:
  - url: https://example.org/research/
scopeType: prefix          # stay under /research/
depth: 4                   # how many hops from each seed
limit: 2000                # hard cap — always set one
include:
  - "https://example\\.org/research/.*"
exclude:
  - "/research/search\\?"   # faceted search = trap
  - "\\?print=1"            # duplicate print views
  - "/tag/"                 # low-value tag pages

include and exclude are regular expressions evaluated against each candidate URL; when a URL matches both, exclude takes precedence. The limit is your seatbelt: it caps the crawl even when a rule misses something.
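It is worth checking rules like these against a handful of sample URLs before any crawl runs. The sketch below does that in Python, with exclude overriding include as described above. The helper `decide` and the sample URLs are hypothetical, and Python's `re` module stands in for whatever regex engine your crawler uses; simple patterns like these behave the same in most engines, but verify anything more exotic. Note that the doubled backslashes in the YAML become single backslashes in Python raw strings.

```python
import re

# The same patterns as the config, written as Python raw strings.
INCLUDE = [r"https://example\.org/research/.*"]
EXCLUDE = [r"/research/search\?", r"\?print=1", r"/tag/"]


def decide(url: str) -> str:
    """Classify a candidate URL; an exclude match overrides any include match."""
    if any(re.search(p, url) for p in EXCLUDE):
        return "excluded"
    if any(re.search(p, url) for p in INCLUDE):
        return "included"
    return "out of scope"


samples = [
    "https://example.org/research/papers/42",
    "https://example.org/research/search?q=archives",
    "https://example.org/research/papers/42?print=1",
    "https://example.org/research/tag/methods/",
    "https://example.org/about",
]
for url in samples:
    print(f"{decide(url):12}  {url}")
```

If a URL you care about comes back as excluded, the pattern is too broad; if a known trap comes back as included, it is too narrow.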

How do I stop the crawl escaping onto other sites?

Two distinct controls matter, and conflating them is the classic mistake:

  • Navigation scope — whether the crawler follows links to other hosts. Keep this tight (host/prefix).
  • Page requirements — whether it fetches assets (images, fonts, scripts) from other domains so the page renders. Keep this allowed, or your replay will be broken even though navigation stayed in-scope.

Most crawlers let you allow cross-domain page requirements while blocking off-site navigation. Use both: restrict where you go, permit what each page needs.
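The distinction between the two controls can be sketched as a single decision that depends on why a URL is being requested. This is an illustration only: `should_fetch`, the `kind` labels and `SEED_HOST` are hypothetical names, and real crawlers expose this split through their own settings, not through a function like this.

```python
from urllib.parse import urlsplit

SEED_HOST = "example.org"  # hypothetical seed host


def should_fetch(url: str, kind: str) -> bool:
    """Decide whether to request `url`.

    kind is "link" (a page the crawler would navigate to) or
    "asset" (an image, font, script or stylesheet a page needs).
    """
    if kind == "asset":
        # Page requirements: fetch from any host so the page replays correctly.
        return True
    if kind == "link":
        # Navigation: only follow links that stay on the seed host.
        return (urlsplit(url).hostname or "").lower() == SEED_HOST
    raise ValueError(f"unknown kind: {kind}")


print(should_fetch("https://cdn.example.net/fonts/body.woff2", "asset"))
print(should_fetch("https://cdn.example.net/landing.html", "link"))
print(should_fetch("https://example.org/research/", "link"))
```

The same CDN host is allowed as an asset source and refused as a navigation target, which is exactly the separation the two bullets describe.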

How do I handle crawler traps?

Traps are URL patterns that spawn endless links: event calendars (?date=2025-03), faceted search (?color=red&size=l), sort/paginate combinations, and session IDs. Defeat them with targeted excludes plus a page limit:

yaml
exclude:
  - "[?&]date="            # calendar navigation; does not match e.g. lastupdate=
  - "[?&](sort|filter)="   # faceted listings
  - "sessionid="          # session-id explosions
limit: 2000               # absolute ceiling

If page counts climb far past your estimate, stop the crawl and inspect the queue for repeating parameter patterns — that is the trap announcing itself.
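One way to inspect a queue for repeating parameter patterns is to group URLs by path and by the set of query parameter names, ignoring the values. The sketch below assumes you can export the queue as a plain list of URLs; how you do that depends on your crawler and is not shown. The function name `parameter_patterns` and the sample queue are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def parameter_patterns(urls):
    """Count queued URLs by (path, sorted query parameter names)."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        pairs = parse_qsl(parts.query, keep_blank_values=True)
        names = tuple(sorted({key for key, _ in pairs}))
        if names:  # URLs without a query string cannot be parameter traps
            counts[(parts.path, names)] += 1
    return counts


queue = [
    "https://example.org/events?date=2025-03",
    "https://example.org/events?date=2025-04",
    "https://example.org/events?date=2025-05",
    "https://example.org/shop?color=red&size=l",
    "https://example.org/shop?size=m&color=blue",
    "https://example.org/about",
]

# Most frequent patterns first: the likely traps float to the top.
for (path, names), n in parameter_patterns(queue).most_common():
    print(n, path, "?" + "&".join(names))
```

A pattern whose count keeps growing between checks while its path stays the same is a strong candidate for a new exclude rule.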

How do I test a scope before the full crawl?

Run a dry, shallow probe first: same seeds and scope but depth: 1 and a small limit, then inspect which URLs it queued. If off-site or trap URLs appear, tighten the rules and re-probe. Only when the probe queue looks clean do you launch the full-depth crawl. Ten minutes of probing routinely saves hours of re-crawling.
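Reviewing a probe queue by eye gets tedious, so a small script can flag the obvious problems first. The sketch below assumes the probe's queued URLs are available as a list; `SEED_PREFIX`, `TRAP_PATTERNS` and `review_probe` are hypothetical names, and the trap patterns are illustrative, so substitute the ones from your own exclude rules.

```python
import re

SEED_PREFIX = "https://example.org/research/"  # hypothetical seed
TRAP_PATTERNS = [r"[?&]date=", r"[?&](sort|filter)=", r"sessionid="]  # illustrative


def review_probe(queued):
    """Return (url, reason) pairs for queued URLs that should not be there."""
    problems = []
    for url in queued:
        if not url.startswith(SEED_PREFIX):
            problems.append((url, "outside seed prefix"))
        elif any(re.search(p, url, re.IGNORECASE) for p in TRAP_PATTERNS):
            problems.append((url, "matches trap pattern"))
    return problems


probe_queue = [
    "https://example.org/research/papers/1",
    "https://example.org/research/events?date=2025-03",
    "https://social.example.net/share?u=abc",
]
for url, reason in review_probe(probe_queue):
    print(f"{reason}: {url}")
```

An empty result is the signal that the probe queue looks clean; anything else points at the rule to tighten before re-probing.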

Key Takeaways

  • Scoping defines what the crawler captures and which links it follows.
  • Use prefix for a site section, domain for a whole organisation; avoid any.
  • Always set a hard page limit as a safety net regardless of other rules.
  • Separate navigation scope from page requirements: restrict travel, allow assets.
  • Defeat crawler traps with exclude regexes on calendar, facet and session URLs.
  • Probe with a shallow crawl and inspect the queue before committing to the full run.

Frequently Asked Questions

What does scoping a crawl mean?

Scoping defines the boundary of a crawl: which URLs are in scope and followed, and which are out of scope and skipped. It is set with seeds, a scope type, depth limits and include/exclude rules.

What is the difference between host, prefix and domain scope?

Host scope stays on the exact hostname; prefix scope stays under a URL path like /blog/; domain scope includes subdomains of the registered domain. Prefix is the usual choice for capturing one section of a site.

How do I keep a crawl from escaping onto other sites?

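Set a restrictive scope type, avoid following external links unless intended, and add exclude rules for off-site navigation targets such as social widgets and ad domains. Keep CDN-hosted page requirements allowed so that replay still renders. A page limit acts as a final safety net.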

What are crawler traps and how does scoping handle them?

Crawler traps are URL patterns that generate infinite links, such as calendars and faceted search. Scoping handles them with exclude regexes targeting query parameters and known trap paths, plus a hard page limit.

Should I capture page requirements from other domains?

Usually yes. Images, fonts and scripts hosted on CDNs are needed for the page to render, so allow page requirements across domains while still blocking navigation off-site. Most crawlers separate these two controls.