Troubleshooting: Crawl sites with Browsertrix

When a Browsertrix Crawler job misbehaves, the cause is almost always one of five things: a scope that is too narrow, dynamic content that never gets triggered, a crawl that never terminates, a browser that runs out of memory, or pages that sit behind a login. This guide diagnoses each from its symptoms and gives the config keys and flags that fix it.

Why did my crawl only capture the seed page?

Symptom: the WACZ contains one page when you expected dozens. The cause is scope. Browsertrix only follows links allowed by scopeType, and depth: 0 captures the seed alone.

yaml

# crawl-config.yaml: follow in-site links up to two hops from the seed
seeds:
  - url: https://example.org/news/
scopeType: prefix      # stay under /news/, not the whole host
depth: 2               # seed -> linked pages -> their links
limit: 500             # hard page cap so it terminates

Run it with:

bash

docker run -v $PWD:/crawls/ webrecorder/browsertrix-crawler crawl \
  --config /crawls/crawl-config.yaml --generateWACZ --collection news

If you still get one page, the seed's links may be JavaScript-built — see the next section.
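To confirm how many pages a crawl captured without opening the WACZ, you can count the records in the collection's page list. The sketch below assumes the crawler wrote a pages.jsonl file with one JSON record per line, where each page record carries a "url" field and the first line is a format header without one. The function name count_pages and the path in the usage note are illustrative, not part of Browsertrix.

```shell
# Count captured pages in a pages.jsonl file.
# Assumption: one JSON record per line; page records contain a "url" field,
# and the header line does not.
count_pages() {
  grep -c '"url"' "$1"
}
```

With the volume mount and collection name used above, the file would typically be found at collections/news/pages/pages.jsonl under the current directory, so `count_pages collections/news/pages/pages.jsonl` gives the page count. A result of 1 confirms the seed-only symptom.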

Why are images, fonts or whole sections missing?

Symptom: replay shows broken images or empty feeds. The resources are lazy-loaded and the crawler advanced before they fired. Three fixes, applied together:

bash

--behaviors autoscroll,autoplay \
--pageLoadTimeout 90 \
--behaviorTimeout 90 \
--postLoadDelay 5

autoscroll drives the page to the bottom so lazy loaders trigger, autoplay starts embedded media, the longer timeouts give those network calls time to complete, and --postLoadDelay pauses for a few seconds after the page loads before behaviors run. For sites with custom widgets, you may need a custom behavior script, but autoscroll resolves the majority of cases.

How do I stop a crawl that never ends?

Symptom: page count climbs into the thousands on a small site. The culprit is usually a crawler trap — calendars, faceted search, or session IDs in URLs generating infinite permutations. Bound the crawl and exclude the traps:

yaml

limit: 1000
exclude:
  - "\\?.*calendar"
  - "/search\\?"
  - "sessionid="

exclude takes regex matched against URLs. Adding a hard limit is your safety net even when an exclude pattern misses something.
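Before committing to a long crawl, it can help to dry-run the exclude patterns against a few sample URLs. The sketch below uses grep -E as a stand-in for the crawler's regex engine, which is an assumption: it behaves the same for simple patterns like these but is not a full equivalent. The patterns appear as they read after YAML unescaping (a single backslash), and is_excluded is a hypothetical helper name.

```shell
# Return 0 if the URL matches any exclude pattern, 1 otherwise.
# Assumption: grep -E (POSIX ERE) approximates the crawler's regex matching
# for these simple patterns.
is_excluded() {
  url=$1
  for pattern in '\?.*calendar' '/search\?' 'sessionid='; do
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      return 0
    fi
  done
  return 1
}

# Sample URLs (illustrative only).
is_excluded 'https://example.org/events?view=calendar&month=1' && echo "excluded"
is_excluded 'https://example.org/news/article-1' || echo "kept"
```

Note that the first pattern only matches "calendar" inside a query string; a path such as /calendar/ would need its own pattern.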

Why does the browser crash or run out of memory?

Symptom: logs show Target closed, Page crashed, or the container is OOM-killed. Headless Chromium needs shared memory; Docker's default /dev/shm of 64 MB is far too small.

bash

docker run --shm-size=2g -v $PWD:/crawls/ \
  webrecorder/browsertrix-crawler crawl \
  --config /crawls/crawl-config.yaml \
  --workers 2          # fewer parallel browsers = less RAM

Rule of thumb: budget roughly 1 to 1.5 GB of RAM per worker. If crashes persist, drop to --workers 1 and confirm the host is not swapping.
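The rule of thumb translates into simple integer arithmetic. The sketch below is a hypothetical helper (suggest_workers is not a Browsertrix command) that divides a RAM budget in MB by a per-worker allowance, defaulting to 1536 MB, the upper end of the range above, and never returns fewer than one worker.

```shell
# Suggest a --workers value from a RAM budget in MB.
# Assumption: 1536 MB per worker unless a second argument overrides it.
suggest_workers() {
  ram_mb=$1
  per_worker_mb=${2:-1536}
  workers=$((ram_mb / per_worker_mb))
  # Always run at least one worker.
  if [ "$workers" -lt 1 ]; then
    workers=1
  fi
  echo "$workers"
}

suggest_workers 4096   # prints 2
```

Treat the result as a starting point and lower it if the logs still show crashes.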

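How do I crawl pages behind a login?

Symptom: pages behind the login are not captured, because the crawler has no authenticated session. Create a reusable authenticated profile, then point the crawl at it: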

bash

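# 1. Interactively log in and save a profile
#    (-it keeps the session interactive; port 9223 serves the embedded browser UI)
docker run -it -p 6080:6080 -p 9223:9223 -v $PWD:/crawls/ \
  webrecorder/browsertrix-crawler create-login-profile \
  --url https://example.org/login

# 2. Reuse it in the crawl
#    (by default the profile is saved under /crawls/profiles/; adjust if yours differs)
docker run -v $PWD:/crawls/ webrecorder/browsertrix-crawler crawl \
  --profile /crawls/profiles/profile.tar.gz --url https://example.org/members/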

Always confirm the site's terms permit archiving authenticated content before you do this.

A quick symptom-to-fix table

Symptom	Likely cause	Fix
Only 1 page captured	Scope/depth too low	`scopeType: prefix`, raise `depth`
Broken images / empty feeds	Lazy load not triggered	`--behaviors autoscroll`, longer timeouts
Crawl never ends	Crawler trap	`exclude` regex + hard `limit`
`Page crashed` / OOM	Too little shm/RAM	`--shm-size=2g`, fewer `--workers`
Login pages not captured	No auth session	`create-login-profile` + `--profile`

Key Takeaways

A one-page result almost always means scope is too narrow — fix scopeType and depth.
Missing assets are usually lazy-loaded; autoscroll plus longer timeouts recover them.
Bound every crawl with limit and exclude patterns to defeat crawler traps.
Give Chromium room: raise --shm-size and budget roughly 1 to 1.5 GB of RAM per worker.
Use create-login-profile to crawl authenticated pages, with permission.
Read the crawler logs first; the error string usually names the exact failure.

Frequently Asked Questions

Why did my Browsertrix crawl capture only one page?

Almost always the scope is too narrow. Check scopeType and depth — the default scope can exclude links you expected, and depth 0 captures only the seed. Set scopeType to prefix and raise depth to follow links.

Why are images and fonts missing from my crawl?

They are usually lazy-loaded and never triggered. Enable the autoscroll behavior and increase pageLoadTimeout and behaviorTimeout so the page finishes loading before the crawler moves on.

How do I stop a Browsertrix crawl from running forever?

Bound it: set limit (the hard page cap), depth and a sensible scopeType, and add exclude patterns for calendars and faceted-search URLs that generate infinite link permutations.

Why does my crawl fail with an out-of-memory or browser crash error?

Headless Chromium is memory-hungry. Reduce the number of parallel workers, which lowers how many pages load at once, and give the Docker container more shared memory with a larger --shm-size.

How do I crawl a site that requires login?

Capture a browser profile that is already logged in using the create-login-profile tool, then pass that profile to the crawl so the authenticated session is reused. Confirm you are permitted to archive the logged-in content first.
