Web Archiving

To archive social media pages well, capture them with a browser-based, high-fidelity tool (Browsertrix Crawler, ArchiveWeb.page or Conifer) that scrolls and triggers the page's JavaScript, save the result as WARC/WACZ, and immediately record the URL, UTC timestamp, tool version and login state. Static crawlers alone will miss most posts because timelines load dynamically. The rest is discipline: a repeatable checklist so every capture in a collection is documented the same way.

Why are social media pages so hard to archive?

Three things break naive crawls. First, infinite scroll: the initial HTML contains almost no posts; they arrive via XHR/fetch as you scroll. Second, rate limiting and bot detection: platforms throttle or serve a login wall to anything that looks automated. Third, ephemerality: stories, live video and edited posts vanish, so you often get one chance.

The practical consequence is that you must drive a real browser, behave somewhat like a human (scroll, pause, expand replies), and accept that you are capturing a moment, not the canonical record.

Which tools actually work?

Tool                | Strength                                | Best for
Browsertrix Crawler | Scriptable, headless Chromium, WACZ out | Batch capture of many profiles
ArchiveWeb.page     | Interactive browser extension           | One-off threads, manual QA
Conifer             | Hosted, session recording               | Logged-in or tricky pages
Heritrix            | Scale, robustness                       | Static news sites, NOT social feeds

For a collection, Browsertrix is the workhorse. For a single fragile thread you want to babysit, ArchiveWeb.page lets you scroll by hand and watch what is captured.
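
For batch work, one option is to generate a crawl invocation per profile from a list. The sketch below only assembles and prints the commands. It reuses the flags from the Browsertrix command in the next section; the helper name `build_crawl_command`, the example profile URL, the `/data/crawls` path and the collection-naming scheme are assumptions for illustration.

```python
import shlex


def build_crawl_command(profile_url, collection, crawls_dir):
    """Return the docker argv for one profile (hypothetical helper).

    The interactive -it flags are left out on the assumption that a
    batch run has no terminal attached.
    """
    return [
        "docker", "run",
        "-v", f"{crawls_dir}:/crawls",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", profile_url,
        "--scopeType", "page",
        "--behaviors", "autoscroll,autoplay,siteSpecific",
        "--behaviorTimeout", "120",
        "--pageLoadTimeout", "60",
        "--generateWACZ",
        "--collection", collection,
    ]


# Example profile list (placeholder handle and URL).
profiles = {"accounthandle": "https://example.social/@accounthandle"}

for handle, url in profiles.items():
    cmd = build_crawl_command(url, f"{handle}-2025-02", "/data/crawls")
    # Print a shell-safe version for review before running anything.
    print(shlex.join(cmd))
```

Once the printed commands look right, each list could be passed to `subprocess.run(cmd, check=True)`, ideally with a pause between profiles to stay clear of rate limits.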

How do I run a Browsertrix capture of a feed?

Give the crawler behaviors and a generous timeout so the autoscroll has time to pull in lazy-loaded posts:

```bash
# Quote the volume path so a working directory containing spaces still mounts.
docker run -v "$PWD/crawls:/crawls" -it webrecorder/browsertrix-crawler crawl \
  --url "https://example.social/@accounthandle" \
  --scopeType page \
  --behaviors autoscroll,autoplay,siteSpecific \
  --behaviorTimeout 120 \
  --pageLoadTimeout 60 \
  --generateWACZ \
  --collection accounthandle-2025-02
```

Then open the WACZ in ReplayWeb.page and actually scroll it before you call it done. A capture that "succeeded" but replays as an empty timeline is worse than no capture, because it looks complete.
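
Before replaying, a quick structural check can flag obviously empty captures. The sketch below assumes the usual WACZ layout (a ZIP with page records in `pages/pages.jsonl` and WARC data under `archive/`); the function name `summarize_wacz` is hypothetical. It is a first filter only and does not replace scrolling the replay by hand.

```python
import json
import zipfile


def summarize_wacz(path):
    """Return (page_urls, warc_bytes) for a WACZ file.

    Assumes pages are listed one JSON object per line in
    pages/pages.jsonl and that WARC data lives under archive/.
    """
    page_urls = []
    with zipfile.ZipFile(path) as wacz:
        if "pages/pages.jsonl" in wacz.namelist():
            with wacz.open("pages/pages.jsonl") as fh:
                for raw in fh:
                    raw = raw.strip()
                    if not raw:
                        continue
                    record = json.loads(raw)
                    # The first line is a header record without a "url" key.
                    if "url" in record:
                        page_urls.append(record["url"])
        # Uncompressed size of everything stored under archive/.
        warc_bytes = sum(
            info.file_size
            for info in wacz.infolist()
            if info.filename.startswith("archive/")
        )
    return page_urls, warc_bytes
```

Calling `summarize_wacz("crawls/collections/accounthandle-2025-02/accounthandle-2025-02.wacz")` (an assumed output path) returns the recorded page URLs and the total WARC size. A profile capture with no pages or only a few kilobytes of WARC data is a strong hint that the timeline never loaded.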

A working capture checklist

Run this for every page:

  • [ ] Record the canonical URL (strip tracking params, keep the handle/post ID).
  • [ ] Note the capture start time in UTC and the local timezone.
  • [ ] Log tool name + version and whether you were logged in.
  • [ ] Confirm behaviors expanded replies and "show more" sections.
  • [ ] Replay the WACZ; spot-check the first, middle and last visible post.
  • [ ] Capture any linked media (images, video) — verify it plays in replay.
  • [ ] Compute a fixity checksum (SHA-256) of the final WACZ.
  • [ ] Write a one-line provenance note and store it with the file.
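
The first two checklist items are easy to automate. The following is a minimal sketch, assuming that `utm_*` parameters plus a short hand-picked set (`fbclid`, `gclid`, `igshid`) are the tracking parameters to drop; that list and both function names are assumptions and should be adjusted per platform.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend this for the platforms you capture.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid", "igshid"}


def canonical_url(url):
    """Drop known tracking parameters; keep path, other params and fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def utc_timestamp():
    """Current time in UTC, formatted like 2025-02-18T14:32:07Z."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(canonical_url(
    "https://example.social/@handle/status/1234567890?utm_source=share&fbclid=abc"
))
print(utc_timestamp())
```

Parameters that are not on the list are deliberately kept, since some platforms use query parameters to select the post or view being shown.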

How do I document a capture so it stays defensible?

A capture is only evidence if a future reader can trust it. Keep a small sidecar record per item — JSON or a CSV row:

```json
{
  "url": "https://example.social/@handle/status/1234567890",
  "captured_at_utc": "2025-02-18T14:32:07Z",
  "tool": "browsertrix-crawler 1.3.0",
  "logged_in": false,
  "wacz_sha256": "9f2c…",
  "operator": "E. Reed",
  "notes": "Autoscroll captured 41 replies; one quote-tweet image 404'd."
}
```

This is the difference between a screenshot folder and a citable archive: provenance, completeness notes and fixity travel with the bytes.
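
One way to keep fixity and provenance together is to compute the checksum and write the sidecar in a single step. The sketch below uses the field names from the record above; the helper names and the `<file>.wacz.json` naming convention are assumptions, not an established standard.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large WACZ files are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(wacz_path, record):
    """Write <name>.wacz.json next to the WACZ, adding its SHA-256."""
    wacz_path = Path(wacz_path)
    full_record = dict(record, wacz_sha256=sha256_of(wacz_path))
    sidecar = wacz_path.with_name(wacz_path.name + ".json")
    sidecar.write_text(
        json.dumps(full_record, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return sidecar


def verify_fixity(wacz_path, sidecar_path):
    """Return True if the WACZ still matches the checksum in its sidecar."""
    recorded = json.loads(Path(sidecar_path).read_text(encoding="utf-8"))
    return recorded["wacz_sha256"] == sha256_of(wacz_path)
```

Running `verify_fixity` on a schedule, or before handing a capture to a researcher, turns the stored checksum into an actual integrity check. Keeping a second copy of the sidecar away from the WACZ makes it harder to alter both unnoticed.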

What about ethics and personal data?

Public does not mean consequence-free. Bystanders, minors and deleted-then-rearchived content all raise risks. Decide up front whether you will publish captures or keep them dark, and apply a takedown and review process. For sensitive accounts, capture but restrict access until a rights review is done.

Key Takeaways

  • Use a browser-based tool (Browsertrix, ArchiveWeb.page, Conifer); static crawlers miss dynamic feeds.
  • Always enable autoscroll behaviors and raise the timeouts: roughly 60s for page load and 120s for behaviors.
  • Replay and spot-check every capture before declaring success.
  • Record URL, UTC timestamp, tool version and login state for each item.
  • Add a SHA-256 fixity value so tampering is detectable later.
  • Treat public data as still subject to copyright, ToS and privacy law.
  • Standardise one checklist so a whole collection is documented identically.

Frequently Asked Questions

Why does the Wayback Machine fail on so many social media pages?

Modern social platforms render content with JavaScript and lazy-loading, and they aggressively rate-limit or block headless crawlers. Standalone Heritrix-style crawls capture the shell but miss the dynamically loaded posts, so a browser-based tool that scrolls the page is usually required.

Is it legal to archive public social media pages?

Capturing publicly visible pages for non-commercial research and preservation is generally defensible in many jurisdictions, but terms of service, copyright and personal-data rules (e.g. GDPR) still apply. Document your legal basis and consult your institution's policy before publishing captures.

What tool should I use to archive a Twitter/X or Instagram thread?

Use a high-fidelity browser-based capturer such as Browsertrix Crawler, the ArchiveWeb.page extension or Conifer, which drive a real Chromium instance and scroll the timeline. They produce WARC/WACZ that replays the interactive page far better than a static crawler.

How do I capture comments and replies that load on scroll?

Configure behaviors that auto-scroll and click 'load more' affordances, give each page a generous behavior timeout (60-120s), and verify the replay before declaring success. In Browsertrix, set --behaviors autoscroll,autoplay,siteSpecific and raise --behaviorTimeout.

How should I name and describe a social media capture?

Record the canonical URL, the exact capture timestamp in UTC, the tool and version, the account handle, and whether you were logged in. Store this alongside the WACZ so a future researcher can judge completeness and provenance.

Can I archive content behind a login?

Technically yes, by capturing an authenticated browser session, but only do so with the account holder's consent or a clear legal mandate. Authenticated captures can expose private data of third parties, so they need stricter access controls.