Web Archiving

Dynamic JavaScript sites are hard to archive because the content you see is not in the page's initial HTML — it is fetched by scripts after load, usually from APIs. The fix is to capture in a real browser that runs the JavaScript, while triggering every interaction (scrolling, clicking, opening each view) so the underlying network requests are recorded. Do that and the page replays offline; skip it and you archive an empty shell.

What makes a site "dynamic"?

A traditional page sends complete HTML in its first response. A dynamic site sends a small shell plus JavaScript; that script then calls back to the server (often a JSON API) to fetch the actual articles, images or feed. Think of an infinite-scroll timeline or a map that loads pins as you pan. The content exists only after the script runs and the extra requests complete.

You can spot this in your browser's developer tools: open the Network tab, reload, and watch a stream of fetch/XHR requests arrive after the HTML. Those requests are exactly what your archive must capture.

Why does a basic crawler miss this content?

A naive crawler reads the first HTML response and follows the <a href> links it finds. On a dynamic site that first response is nearly empty, and the "links" are built by JavaScript that the crawler never executes. So it captures the shell, finds nothing to follow, and stops. The recorded archive then shows a blank or skeletal page on replay.
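To make that failure concrete, here is a minimal sketch of the link-following step, using only Python's standard library. The two HTML snippets and the names `LinkExtractor` and `extract_links` are hypothetical illustrations, not taken from any real crawler.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, as a link-following crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Parse one HTML response and return the links found in it."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical first responses: a traditional page and a JavaScript shell.
STATIC_PAGE = """
<html><body>
  <a href="/articles/1">First article</a>
  <a href="/articles/2">Second article</a>
</body></html>
"""

DYNAMIC_SHELL = """
<html><body>
  <div id="root"></div>
  <script src="/static/app.js"></script>
</body></html>
"""

static_links = extract_links(STATIC_PAGE)
shell_links = extract_links(DYNAMIC_SHELL)
print("static page links:", static_links)
print("dynamic shell links:", shell_links)
```

The shell yields no links at all, because the script that would build them is never run. That is why the crawl ends after a single page.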

How do I capture a dynamic page (worked example)?

The trick is to run the browser yourself or let an automated browser exercise the page. Here is the automated route with Browsertrix Crawler:

bash
docker run -v "$PWD/crawls:/crawls/" webrecorder/browsertrix-crawler crawl \
  --url https://example.org/feed \
  --scopeType prefix \
  --behaviors autoscroll,autoplay \
  --behaviorTimeout 90 \
  --pageLoadTimeout 90 \
  --generateWACZ \
  --collection dynamic-demo

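autoscroll drives the page to the bottom so each lazy-loaded batch fires its API request, autoplay starts any embedded media so its streams are requested too, and the generous timeouts (both values are in seconds) give those requests time to finish recording. For the manual route, ArchiveWeb.page records the same requests as you scroll and click yourself.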
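If you run the same crawl against several sites, it can help to build the command in a script. The sketch below only assembles the argument list, using the flags from the command above. The function name `build_crawl_command`, its `crawl_dir` parameter and the `/tmp/crawls` path are assumptions made for illustration.

```python
import os
import shlex


def build_crawl_command(url, collection, timeout_seconds=90, crawl_dir=None):
    """Return the docker argument list for one Browsertrix Crawler run."""
    if crawl_dir is None:
        # Mirror the $PWD/crawls mount used in the shell command.
        crawl_dir = os.path.join(os.getcwd(), "crawls")
    return [
        "docker", "run",
        "-v", f"{crawl_dir}:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--scopeType", "prefix",
        "--behaviors", "autoscroll,autoplay",
        "--behaviorTimeout", str(timeout_seconds),
        "--pageLoadTimeout", str(timeout_seconds),
        "--generateWACZ",
        "--collection", collection,
    ]


command = build_crawl_command(
    "https://example.org/feed", "dynamic-demo", crawl_dir="/tmp/crawls"
)
print(shlex.join(command))
# To start the crawl for real (requires Docker), pass the list to
# subprocess.run(command, check=True).
```

Passing a list rather than one long string avoids shell-quoting problems when a URL contains characters such as `&` or `?`.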

How do I know the capture is complete?

Open the result in ReplayWeb.page and reproduce what a visitor would do. Then open the browser console and watch for failures:

text
Good sign:  the feed renders, images appear, scroll loads more (from the archive)
Bad sign:   console shows "Failed to fetch /api/items?page=3" -> that API call
            was never captured, so that content is missing

Every red request in the console is a hole in your archive. Note the URL, then re-capture while specifically triggering that view.
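If you save the console output to a file, a few lines of Python can turn it into a re-capture list. This is a rough sketch that assumes messages shaped like the "Failed to fetch <URL>" example above. Real browsers word these errors differently, so treat the pattern and the sample log as hypothetical and adjust them to what your console actually prints.

```python
import re

# Assumed message shape, based on the example above; adapt to your browser.
FAILED_FETCH = re.compile(r"Failed to fetch\s+(\S+)")


def missing_urls(console_lines):
    """Return the distinct URLs from failed-fetch messages, in first-seen order."""
    seen = []
    for line in console_lines:
        match = FAILED_FETCH.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Hypothetical console output saved from a replay session.
console_log = [
    "GET /api/items?page=1 200",
    "Failed to fetch /api/items?page=3",
    "Failed to fetch /api/profile",
    "Failed to fetch /api/items?page=3",
]

gaps = missing_urls(console_log)
for url in gaps:
    print("re-capture needed:", url)
```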

What are the common beginner pitfalls?

| Pitfall | What happens | Fix |
| --- | --- | --- |
| No autoscroll | Infinite feed stops at first batch | Enable autoscroll |
| Timeout too short | API calls cut off mid-flight | Raise behaviorTimeout |
| Skipping views/tabs | Their API responses never recorded | Visit every view |
| Trusting one replay tool | A subtle gap hides | Replay in two engines |
| Ignoring the console | Missing data goes unnoticed | Watch for failed fetches |

Why might a captured single-page app still go blank?

Single-page apps route entirely in JavaScript, so one "page" may need several API endpoints. If even one of those endpoints was never requested during capture, the app may throw an error on replay and render nothing. The cure is exhaustive interaction: open each route, let it settle, scroll, and only then stop recording. When in doubt, capture more than you think you need: unused records cost little, but missing ones cannot be recovered once the live site changes or disappears.
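The toy model below shows why a single missing endpoint is enough to blank a view. It is a simplified sketch, not how any real replay tool is implemented. `ReplayArchive`, `render_route`, the routes and the endpoint URLs are all hypothetical.

```python
class ReplayArchive:
    """Toy stand-in for a replay tool: serves only responses that were recorded."""

    def __init__(self, recorded):
        self.recorded = dict(recorded)

    def fetch(self, url):
        if url not in self.recorded:
            raise LookupError(f"not captured: {url}")
        return self.recorded[url]


def render_route(archive, endpoints):
    """Render a view only if every endpoint it needs can be served."""
    try:
        parts = [archive.fetch(url) for url in endpoints]
    except LookupError:
        return ""  # the app errors out and the visitor sees a blank screen
    return " | ".join(parts)


# Hypothetical routes and the API endpoints each one calls.
ROUTES = {
    "/home": ["/api/feed", "/api/user"],
    "/settings": ["/api/user", "/api/preferences"],
}

# Only /home was opened during capture, so /api/preferences was never recorded.
archive = ReplayArchive({"/api/feed": "feed data", "/api/user": "user data"})

home = render_route(archive, ROUTES["/home"])
settings = render_route(archive, ROUTES["/settings"])
print("home:", repr(home))
print("settings:", repr(settings))
```

The settings view shares one endpoint with the home view, yet it still renders nothing. A partial capture is not enough when the app treats any failed request as fatal.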

Key Takeaways

  • Dynamic content lives in post-load API calls, not the initial HTML.
  • Always capture in a real browser so the JavaScript runs and calls are recorded.
  • Use autoscroll and generous timeouts to trigger lazy and infinite-scroll content.
  • Visit every view and tab; an unvisited route means uncaptured data.
  • Verify in ReplayWeb.page and treat any failed console request as a gap.
  • For single-page apps, over-capture: missing endpoints cause blank replays.

Frequently Asked Questions

Why are JavaScript-heavy sites hard to archive?

Because the content is not in the initial HTML; it is fetched by scripts after the page loads, often from APIs. A simple crawler that reads the first HTML response sees an almost empty shell and misses the real content.

What is the single most important technique for dynamic sites?

Capturing in a real browser so the JavaScript actually runs, combined with behaviors like autoscroll that trigger lazy-loaded and infinite-scroll content. This records the API calls the scripts make, not just the shell.

Will my archived dynamic page work without the internet?

Only if every resource the scripts request — including JSON API responses — was captured during recording. Replay tools serve those recorded responses, so the page works offline as long as nothing it needs was missed.

Why does my archived single-page app show a blank screen on replay?

Usually a script tried to call an API endpoint that was never captured, so it gets an error and renders nothing. Re-capture while exercising every view, then check the console for failed requests.

Which tools handle dynamic sites well for a beginner?

ArchiveWeb.page and Conifer for human-driven capture, and Browsertrix Crawler with autoscroll behaviors for automated capture. All three run a real browser, which is what dynamic sites need.