Appearance
To get started with web archiving, install a browser-based capture tool such as ArchiveWeb.page, record the pages you care about while you browse them, and export a WACZ file you can replay offline. That single workflow gives you a standards-compliant archive in under ten minutes, with no server to maintain. Everything else in this guide is about scaling up and avoiding the mistakes that quietly corrupt collections.
What does "web archiving" actually mean?
Web archiving is the practice of capturing the HTTP responses a website sends — HTML, images, CSS, JavaScript, fonts, JSON API calls — and storing them so the page can be replayed later exactly as it was. The key distinction from a screenshot is fidelity: a screenshot is a picture, while a real web archive is the original traffic, so links still work and the layout is reconstructed by a replay engine.
The container for that traffic is the WARC format (ISO 28500). Each record holds a request or response plus headers and a timestamp. WACZ simply zips a set of WARCs together with an index and metadata so the whole thing moves as one file.
What tools should a beginner choose?
Start with low-friction tools and graduate as your needs grow:
| Tool | Type | Best for | Server needed |
|---|---|---|---|
| ArchiveWeb.page | Browser extension | First captures, single pages | No |
| Conifer | Hosted/manual | Interactive, logged-in pages | Hosted |
| Browsertrix Crawler | Automated crawler | Whole sites, scheduled jobs | Docker |
| pywb | Replay + capture | Hosting and viewing WARCs | Yes |
A realistic learning path is ArchiveWeb.page for week one, then Browsertrix Crawler once you need to capture more than a handful of pages.
How do I make my first capture step by step?
bash
# 1. Install the ArchiveWeb.page extension, then for an automated crawl:
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
--url https://example.org \
--scopeType prefix \
--depth 1 \
--generateWACZ \
--collection my-first-archiveThat command captures example.org, follows links one level deep, stays inside the site (--scopeType prefix), and writes a .wacz to ./crawls/collections/my-first-archive/. Open the file in ReplayWeb.page to confirm it works.
How do I check a capture is good?
Open the archive in a fresh browser profile and click around. Watch for three failure modes: missing images (the crawler did not reach lazy-loaded assets), broken interactive widgets (JavaScript fired against a live API that was not recorded), and "live web" leakage where a resource loads from the internet instead of the archive. The browser console showing 404s against your replay host is the clearest signal of gaps.
What are the most common beginner mistakes?
- Capturing while logged in and forgetting it. Your session cookie may bake personal data into the WARC.
- Crawling without scope. A bare crawl can wander onto a CDN and balloon to gigabytes.
- Trusting a single tool's replay. Always re-open archives in an independent viewer; if it replays in both pywb and ReplayWeb.page, the WARC is sound.
- No metadata. Record the capture date, the seed URL and the tool version, or the archive becomes unciteable.
How should I store and name my archives?
Keep one WACZ per logical capture and name it predictably, e.g. org-example_2024-09-18_d1.wacz. Compute a checksum so you can detect rot later:
bash
sha256sum my-first-archive.wacz > my-first-archive.wacz.sha256Apply a 3-2-1 backup mindset from day one — even a single drive failure should never cost you a unique capture.
Key Takeaways
- The fastest start is ArchiveWeb.page recording to a WACZ; no servers required.
- A real web archive preserves HTTP traffic (WARC/WACZ), not a flat screenshot.
- Use
--scopeType prefixand--depthto stop crawls from wandering off-site. - Always verify replay in a second, independent viewer before trusting an archive.
- Record capture date, seed URL and tool version, or the archive is hard to cite.
- Checksum every WACZ and back it up; unique captures are irreplaceable.
Frequently Asked Questions
What is the simplest way to make a first web archive?
Install ArchiveWeb.page as a browser extension, click record, browse the pages you want, then download a single WACZ file. It requires no servers and produces a standards-based archive in minutes.
What file format should a web archive use?
Use WARC (ISO 28500) or its packaged form WACZ. Both are open, well-documented standards supported by replay tools like pywb and ReplayWeb.page, which protects your captures from tool lock-in.
Is saving a PDF or screenshot the same as web archiving?
No. A PDF or screenshot flattens the page into a static image and discards links, scripts and the original HTTP responses. A WARC preserves the actual server traffic so the page can be replayed interactively later.
Do I need permission to archive a website?
For personal or research reference, capturing a public page is usually low-risk, but redistribution can raise copyright and data-protection questions. Institutions typically rely on legal deposit, takedown policies or rights statements rather than seeking permission per site.
How much storage does web archiving need?
A single content-rich page is often 1 to 10 MB once images and fonts are included. A small focused crawl of a few hundred pages typically lands between 200 MB and 2 GB, so a laptop is fine to begin with.