To archive social media records, capture both how a post looked and the data behind it: use a tool like Browsertrix or Conifer to save the rendered page into a WARC file, keep any structured JSON export alongside it, and stamp each capture with a fixity checksum and the original URL the moment you take it. A social media record is the content plus its context — author, timestamp, replies and engagement counts — so preserving only the text loses half the evidence. This guide walks a beginner through the ideas and one small worked example.
What exactly are you trying to keep?
A social media "record" is more than a sentence of text. It is the post and the context that makes it meaningful: who wrote it, when, the thread it sits in, the images or video attached, the like and reply counts, and the URL it lived at. Archivists treat that context as part of the record. If you screenshot a tweet you have a picture; if you capture it properly you have evidence with provenance.
Capture the page or capture the data?
There are two complementary approaches, and beginners often think they must choose. You don't.
- Web capture (WARC): preserves the HTTP responses behind the rendered page, so the post can be replayed as it appeared, including layout and images. This is what tools like Browsertrix produce.
- Data capture (JSON): preserves the structured fields — author, id, timestamp, counts — that are easy to search and analyse.
A strong record keeps both: the WARC shows what a human saw, the JSON gives you clean data. Where API access is restricted, the WARC alone is still a solid record.
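One simple way to keep the two together is to store them under a shared base name. The sketch below only illustrates that layout: the directory, the file names and every JSON field name are assumptions made for the example, not any platform's real export schema.

```bash
# Sketch: keep the structured export next to the web capture under one base name.
# File names and JSON fields are illustrative assumptions, not a real platform schema.
set -eu

record_dir="$(mktemp -d)"   # stand-in for your archive folder
base="post-12345"

# Empty placeholder for the WARC a crawler would have written.
: > "$record_dir/$base.warc.gz"

# Structured fields saved alongside it (hypothetical values).
cat > "$record_dir/$base.json" <<'EOF'
{
  "id": "12345",
  "author": "user",
  "created_at": "2024-10-28T14:00:00Z",
  "reply_count": 3,
  "like_count": 17,
  "source_url": "https://example.social/user/status/12345"
}
EOF

# Both halves of the record now sit side by side.
ls "$record_dir"
```

Sharing a base name means anyone who finds the WARC can find the matching data file without a separate index.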
A small worked example
Say you want to preserve one public post and its thread. Here is the whole flow, start to finish.
```bash
# 1. Capture the rendered page into a WARC with a headless browser crawl
browsertrix-crawler crawl \
  --url "https://example.social/user/status/12345" \
  --scopeType page \
  --collection post-12345

# 2. Stamp fixity the moment the capture finishes
sha256sum collections/post-12345/archive/*.warc.gz > post-12345.sha256

# 3. Record minimal provenance alongside it
cat > post-12345.meta.txt <<'EOF'
source_url: https://example.social/user/status/12345
captured: 2024-10-28T14:30:00Z
tool: browsertrix-crawler 1.x
EOF
```

That gives you a verifiable web capture, a checksum to prove it has not changed, and a note of where and when it came from: the three things every social media record needs.
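The provenance note in step 3 can also be generated rather than typed, which avoids a mistyped timestamp. The following is a minimal sketch under assumptions: the variable names are hypothetical, the tool string is copied from the example above, and a temporary file stands in for `post-12345.meta.txt`.

```bash
# Sketch: build the provenance note from variables instead of typing it by hand.
set -eu

source_url="https://example.social/user/status/12345"
tool="browsertrix-crawler 1.x"

# Current time in UTC, in the same ISO 8601 form used in the example above.
captured="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Hypothetical output path; in the worked example this would be post-12345.meta.txt.
meta_file="$(mktemp)"

# Unquoted EOF so the shell expands the variables inside the note.
cat > "$meta_file" <<EOF
source_url: $source_url
captured: $captured
tool: $tool
EOF

cat "$meta_file"
```

Run it straight after the crawl finishes so the recorded time stays close to the actual capture time.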
How do you prove a capture is authentic?
Authenticity rests on three habits. Capture into WARC, which records the actual HTTP request and response rather than a re-typed copy. Generate a fixity checksum immediately, so any later change is detectable. And write down the capture date, tool, version and original URL. With those, you can stand behind the record years later.
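To see why the checksum matters, it helps to watch a check pass and then fail. The sketch below uses a small stand-in file rather than a real WARC (an assumption made so it runs anywhere `sha256sum` is installed); the file names are hypothetical.

```bash
# Sketch: a fixity check passes on an untouched file and fails after any change.
set -eu

workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for a real capture; any file behaves the same way.
printf 'captured bytes\n' > capture.warc.gz

# Record the checksum at capture time.
sha256sum capture.warc.gz > capture.sha256

# Later: verify. Exit status 0 means the file still matches.
if sha256sum -c capture.sha256 >/dev/null 2>&1; then before="ok"; else before="changed"; fi

# Simulate an accidental modification, then verify again.
printf 'extra bytes\n' >> capture.warc.gz
if sha256sum -c capture.sha256 >/dev/null 2>&1; then after="ok"; else after="changed"; fi

echo "before edit: $before, after edit: $after"
# prints "before edit: ok, after edit: changed"
```

The same `sha256sum -c` call, pointed at the checksum file written during capture, is what you would rerun on a schedule to confirm that stored captures are intact.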
Is it legal and ethical?
Public posts can generally be captured, but three things still apply: copyright (the author owns their words and images), platform terms of service, and data-protection law such as GDPR where living individuals are involved. The practical defaults: prefer public content, minimise the personal data you keep, document your lawful basis, and publish a takedown policy so people can ask you to remove material. When in doubt about a private or sensitive account, ask before you capture.
Why is social media so awkward to archive?
It fights you in predictable ways, and knowing them saves frustration.
| Challenge | Why it happens | Beginner workaround |
|---|---|---|
| Infinite scroll | Content loads as you scroll | Use a browser-based crawler that scrolls |
| Login walls | Platforms gate content | Capture only what is publicly visible |
| JavaScript rendering | Markup builds in the browser | Use Browsertrix/Conifer, not plain wget |
| Markup churn | Platforms redesign often | Re-test captures; expect breakage |
| API restrictions | Reduced research access | Fall back to web capture |
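Because breakage is expected, a quick sanity check after each crawl is worthwhile. One rough approach, sketched below, is to count the `response` records in the WARC and list the URLs they hold. The sample file here is built by hand and is deliberately incomplete (an assumption made so the example runs without a crawler); on a real capture you would point `warc` at the crawler's output instead.

```bash
# Sketch: re-test a capture by counting response records and listing their URLs.
set -eu

workdir="$(mktemp -d)"
warc="$workdir/sample.warc.gz"

# Hand-built stand-in with one response record header (not a complete, valid capture).
printf 'WARC/1.1\r\nWARC-Type: response\r\nWARC-Target-URI: https://example.social/user/status/12345\r\n\r\n' \
  | gzip -c > "$warc"

# Count response records; "|| true" keeps the script going when the count is zero.
responses="$(gzip -dc "$warc" | grep -a -c '^WARC-Type: response' || true)"

# List the URLs the WARC claims to hold, stripping the CRLF line endings.
gzip -dc "$warc" | grep -a '^WARC-Target-URI:' | tr -d '\r'

echo "response records: $responses"
```

A count of zero, or a URL list that lacks the post you asked for, is a sign the crawl hit a login wall or a markup change and should be repeated.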
Key Takeaways
- A social media record is content plus context — author, timestamp, thread, counts and URL.
- Capture both the rendered page (WARC) and the structured data (JSON) where you can.
- Browsertrix and Conifer handle dynamic, JavaScript-heavy pages that plain downloaders cannot.
- Stamp every capture with a fixity checksum and note the date, tool, version and source URL.
- Respect copyright, platform terms and data-protection law; minimise personal data and keep a takedown policy.
- Expect breakage from infinite scroll, login walls and markup changes, and re-check your captures.
Frequently Asked Questions
What is a social media record?
It is the archivable content and context of a post or account: the text, images and video, plus the metadata around it — author, timestamp, likes, replies and the URL. The context is part of the record, not an extra.
Should I capture social media as web pages or as data?
Both have a place. A WARC web capture preserves how a post looked; an API or JSON export preserves the structured data behind it. For a robust record, capture the rendered page and keep any structured export alongside it.
Which tools archive social media for beginners?
Browsertrix and Conifer capture the rendered page into WARC files; the Wayback Machine's Save Page Now is the quickest option for a one-off capture. For structured data, platform exports or research APIs give you JSON, though access has tightened in recent years.
Is it legal and ethical to archive someone's posts?
Public posts can usually be captured, but copyright, platform terms of service and data-protection law (such as GDPR for living individuals) still apply. Document your basis, minimise personal data, and have a takedown policy.
How do I prove a captured post is authentic?
Capture into WARC, which records the HTTP exchange, and generate a fixity checksum immediately. Note the capture date, the tool and version, and the original URL so the record is verifiable later.
Why is social media so hard to archive?
Pages are dynamic and JavaScript-heavy, content scrolls infinitely, login walls block crawlers, and platforms change their markup and restrict APIs often — so captures break and need checking.