When a Browsertrix Crawler job misbehaves, the cause is almost always one of four things: scope too narrow, dynamic content not triggered, the crawl never terminating, or the browser running out of memory. This guide diagnoses each from its symptoms and gives the exact config keys that fix it, so you spend minutes, not hours.
Why did my crawl only capture the seed page?
Symptom: the WACZ contains one page when you expected dozens. The cause is scope. Browsertrix only follows links allowed by scopeType, and depth: 0 captures the seed alone.
yaml
# crawl-config.yaml: follow in-scope links up to two hops from the seed
seeds:
  - url: https://example.org/news/
    scopeType: prefix  # stay under /news/, not the whole host
    depth: 2           # seed -> linked pages -> their links
limit: 500             # hard page cap so it terminates
Run it with:
bash
docker run -v $PWD:/crawls/ webrecorder/browsertrix-crawler crawl \
  --config /crawls/crawl-config.yaml --generateWACZ --collection news
If you still get one page, the seed's links may be built by JavaScript; see the next section.
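As a rough illustration of what prefix scope means, the sketch below checks candidate URLs against the seed's directory prefix in plain bash. The helper name in_prefix_scope and the sample URLs are made up for this example, and the crawler's own scope logic may differ in detail, so treat it as a way to predict which links are likely to be followed rather than as the crawler's implementation.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Browsertrix Crawler): a URL counts as
# in scope when it starts with the seed URL up to and including its last slash.
# Assumes the seed has a path component, e.g. https://example.org/news/.
in_prefix_scope() {
  local seed="$1" url="$2"
  local prefix="${seed%/*}/"   # strip anything after the last slash, keep the slash
  [[ "$url" == "$prefix"* ]]
}

seed="https://example.org/news/"
for url in \
  "https://example.org/news/2024/story.html" \
  "https://example.org/about/" \
  "https://cdn.example.org/news/img.png"; do
  if in_prefix_scope "$seed" "$url"; then
    echo "in scope:     $url"
  else
    echo "out of scope: $url"
  fi
done
```

Only the first sample URL passes the check. If the pages you expected sit outside the seed's directory, widening the scope or moving the seed up a level is the more likely fix than raising depth.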
Why are images, fonts or whole sections missing?
Symptom: replay shows broken images or empty feeds. The resources are lazy-loaded and the crawler advanced before they fired. Three fixes, applied together:
bash
--behaviors autoscroll,autoplay \
--pageLoadTimeout 90 \
--behaviorTimeout 90 \
--postLoadDelay 5
autoscroll drives the page to the bottom so lazy loaders trigger; the longer timeouts give those network calls time to complete, and the post-load delay waits a further 5 seconds after the page has loaded before behaviors run. For sites with custom widgets, you may need a custom behavior script, but autoscroll resolves the majority of cases.
How do I stop a crawl that never ends?
Symptom: page count climbs into the thousands on a small site. The culprit is usually a crawler trap — calendars, faceted search, or session IDs in URLs generating infinite permutations. Bound the crawl and exclude the traps:
yaml
limit: 1000
exclude:
  - "\\?.*calendar"
  - "/search\\?"
  - "sessionid="
exclude takes regex matched against URLs. Adding a hard limit is your safety net even when an exclude pattern misses something.
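Before starting a long crawl, it can help to try the exclude patterns against a few known URLs. The sketch below does that with bash's built-in regex matching; the helper is_excluded and the sample URLs are hypothetical. Bash uses POSIX extended regex while the crawler uses JavaScript regex, so this is only a reasonable proxy for simple patterns like the three above.

```bash
#!/usr/bin/env bash
# The three exclude patterns as the regex engine sees them:
# YAML double quotes turn "\\?" into \? (a literal question mark).
patterns=('\?.*calendar' '/search\?' 'sessionid=')

# Hypothetical helper: succeeds if any pattern matches anywhere in the URL.
is_excluded() {
  local url="$1" p
  for p in "${patterns[@]}"; do
    if [[ "$url" =~ $p ]]; then
      return 0
    fi
  done
  return 1
}

# Sample URLs are invented for illustration.
for url in \
  "https://example.org/events?view=calendar&month=1" \
  "https://example.org/search?q=news&page=7" \
  "https://example.org/article/42?sessionid=abc" \
  "https://example.org/news/calendar-launch.html"; do
  if is_excluded "$url"; then
    echo "excluded: $url"
  else
    echo "kept:     $url"
  fi
done
```

The last URL is kept because it contains "calendar" only in the path, not after a question mark, which shows how anchoring a pattern to the query string avoids dropping legitimate pages.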
Why does the browser crash or run out of memory?
Symptom: logs show Target closed, Page crashed, or the container is OOM-killed. Headless Chromium needs shared memory; Docker's default /dev/shm of 64 MB is far too small.
bash
docker run --shm-size=2g -v $PWD:/crawls/ \
webrecorder/browsertrix-crawler crawl \
--config /crawls/crawl-config.yaml \
--workers 2 # fewer parallel browsers = less RAM
Rule of thumb: budget roughly 1 to 1.5 GB of RAM per worker. If crashes persist, drop to --workers 1 and confirm the host is not swapping.
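The rule of thumb can be turned into a quick calculation. The sketch below assumes the conservative end of the range, 1536 MB per worker, and never suggests fewer than one worker; max_workers is a hypothetical helper, and the memory figures are examples rather than measurements.

```bash
#!/usr/bin/env bash
# Hypothetical helper: how many workers fit in a memory budget given in MB.
# Second argument overrides the assumed per-worker budget (default 1536 MB).
max_workers() {
  local mem_mb="$1" per_worker_mb="${2:-1536}"
  local n=$(( mem_mb / per_worker_mb ))   # integer division rounds down
  if (( n < 1 )); then
    n=1                                   # always allow at least one worker
  fi
  echo "$n"
}

# Example budgets: the RAM you are willing to give the container.
for mem in 1024 4096 8192; do
  echo "${mem} MB -> --workers $(max_workers "$mem")"
done
```

Pass the result to --workers, and leave headroom for the rest of the container; if the host has other workloads, budget from the memory actually free rather than the total installed.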
How do I crawl pages behind a login?
Create a reusable authenticated profile, then point the crawl at it:
bash
# 1. Interactively log in and save a profile (open http://localhost:9223/ to use the embedded browser)
docker run -it -p 6080:6080 -p 9223:9223 -v $PWD:/crawls/ \
webrecorder/browsertrix-crawler create-login-profile \
--url https://example.org/login
# 2. Reuse it in the crawl (the profile is saved to /crawls/profiles/profile.tar.gz by default)
docker run -v $PWD:/crawls/ webrecorder/browsertrix-crawler crawl \
--profile /crawls/profiles/profile.tar.gz --url https://example.org/members/
Always confirm the site's terms permit archiving authenticated content before you do this.
A quick symptom-to-fix table
| Symptom | Likely cause | Fix |
|---|---|---|
| Only 1 page captured | Scope/depth too low | scopeType: prefix, raise depth |
| Broken images / empty feeds | Lazy load not triggered | --behaviors autoscroll, longer timeouts |
| Crawl never ends | Crawler trap | exclude regex + hard limit |
| Page crashed / OOM | Too little shm/RAM | --shm-size=2g, fewer --workers |
| Login pages not captured | No auth session | create-login-profile + --profile |
Key Takeaways
- A one-page result almost always means scope is too narrow: fix scopeType and depth.
- Missing assets are usually lazy-loaded; autoscroll plus longer timeouts recover them.
- Bound every crawl with limit and exclude patterns to defeat crawler traps.
- Give Chromium room: raise --shm-size and budget ~1.5 GB RAM per worker.
- Use create-login-profile to crawl authenticated pages, with permission.
- Read the crawler logs first; the error string usually names the exact failure.
Frequently Asked Questions
Why did my Browsertrix crawl capture only one page?
Almost always the scope is too narrow. Check scopeType and depth — the default scope can exclude links you expected, and depth 0 captures only the seed. Set scopeType to prefix and raise depth to follow links.
Why are images and fonts missing from my crawl?
They are usually lazy-loaded and never triggered. Enable the autoscroll behavior and increase pageLoadTimeout and behaviorTimeout so the page finishes loading before the crawler moves on.
How do I stop a Browsertrix crawl from running forever?
Bound it with limits: set limit (pageLimit is an alias), depth and a sensible scopeType, and add exclude patterns for calendars and faceted-search URLs that generate infinite link permutations.
Why does my crawl fail with an out-of-memory or browser crash error?
Headless Chromium is memory-hungry. Reduce the number of parallel workers, which lowers how many pages load at once, and give the Docker container more shared memory with a larger --shm-size.
How do I crawl a site that requires login?
Capture a browser profile that is already logged in using the create-login-profile tool, then pass that profile to the crawl so the authenticated session is reused. Confirm you are permitted to archive the logged-in content first.