When to Replay WARCs with pywb

Reach for pywb when you need a persistent, server-hosted, multi-collection web archive with stable URLs and access control — the kind of thing a library, archive or research group runs for years. If you only need to open one WACZ occasionally on a laptop, a serverless tool like ReplayWeb.page is the better fit and pywb's hosting overhead is not worth it. This guide is about making that call deliberately.

What problem does pywb solve?

pywb (Python Wayback) indexes WARC files (and, in recent releases, the contents of WACZ files) and serves them through a Wayback-style interface: a user requests an archived URL at a timestamp, and pywb rewrites and returns the captured response. It supports multiple collections, collection-level access rules, lookup by URL and capture date through its CDX index (a URL index, not a full-text one), and stable canonical replay URLs. Those are the features you cannot easily get from a one-file, in-browser viewer.
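To make the "archived URL at a timestamp" idea concrete, here is a minimal sketch of how a Wayback-style replay URL is put together: base address, collection, a 14-digit UTC timestamp (YYYYMMDDhhmmss), then the original URL. The helper name `replay_url`, the host and the collection name are assumptions for illustration, not part of pywb's own API.

```python
from datetime import datetime, timezone


def replay_url(base: str, collection: str, when: datetime, target: str) -> str:
    """Build a Wayback-style replay URL: <base>/<collection>/<timestamp>/<target>."""
    # Wayback timestamps are 14 digits in UTC: YYYYMMDDhhmmss.
    ts = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{base.rstrip('/')}/{collection}/{ts}/{target}"


# Hypothetical host and collection, matching a local test instance.
example = replay_url(
    "http://localhost:8080",
    "my-collection",
    datetime(2024, 3, 1, 12, 0, 0, tzinfo=timezone.utc),
    "https://example.com/",
)
print(example)
# http://localhost:8080/my-collection/20240301120000/https://example.com/
```

A URL like this is what others would cite. When no capture exists at exactly that second, Wayback-style replay generally resolves to the nearest capture in time.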

When is pywb the right choice?

Choose pywb when two or more of these are true:

You have many WARCs/WACZs that must live in one searchable place.
You need durable, shareable URLs others will cite or link.
You require access control (embargoes, staff-only collections).
You will add captures over time and want one growing archive.
You want to run it on your own infrastructure for sovereignty or scale.

If most of those are false — say you just QA a single crawl — the lighter tool wins.
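The "two or more" rule above can be written down as a small checklist. This is only a sketch of the heuristic in this guide; the field names, the `recommend` helper and the default threshold are illustrative assumptions, not anything pywb ships.

```python
from dataclasses import dataclass, fields


@dataclass
class AccessNeeds:
    """One flag per criterion in the checklist above (names are illustrative)."""
    many_archives: bool = False    # many WARCs/WACZs in one searchable place
    citable_urls: bool = False     # durable, shareable URLs
    access_control: bool = False   # embargoes, staff-only collections
    grows_over_time: bool = False  # captures added over time
    self_hosted: bool = False      # own infrastructure for sovereignty or scale


def recommend(needs: AccessNeeds, threshold: int = 2) -> str:
    """Return 'pywb' when at least `threshold` criteria are true."""
    score = sum(bool(getattr(needs, f.name)) for f in fields(needs))
    return "pywb" if score >= threshold else "ReplayWeb.page"


print(recommend(AccessNeeds(citable_urls=True, access_control=True)))  # pywb
print(recommend(AccessNeeds(many_archives=True)))                      # ReplayWeb.page
```

Treat the threshold as a prompt for discussion rather than a hard rule: a single strict requirement, such as an embargo, can be decisive on its own.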

When should I not use pywb?

One-off viewing. Drag a WACZ into ReplayWeb.page; no install.
No server to run. pywb needs a host, a WSGI server and ongoing maintenance.
Strict offline/air-gapped sharing. A self-contained WACZ + ReplayWeb.page travels on a USB stick; a pywb instance does not.

pywb vs. ReplayWeb.page at a glance

Factor	pywb	ReplayWeb.page
Hosting	Server required	None (in-browser)
Collections	Many, managed	One file at a time
Stable shareable URLs	Yes	Limited
Access control	Yes	No
Setup effort	Higher	Near zero
Best for	Institutional archives	QA & ad-hoc viewing

How do I stand up pywb to test the fit?

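```bash
pip install pywb
wb-manager init my-collection                       # create the collection directory layout
wb-manager add my-collection crawl-00000.warc.gz    # copy the WARC in and index it
# For a WACZ, recent pywb releases unpack it instead (check your version; file name is a placeholder):
#   wb-manager add --unpack-wacz my-collection crawl.wacz
wayback                                             # dev server on :8080
```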

Open http://localhost:8080/my-collection/ and request an archived URL. For production you would front this with uWSGI/Gunicorn and a reverse proxy, plus config.yaml for access rules — that production gap is exactly the cost you are weighing.
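Once the dev server is running, one way to confirm that the collection was actually indexed is to ask pywb's CDX server endpoint which captures it holds for a URL. The sketch below assumes a local instance at localhost:8080 with a collection named my-collection, and assumes that `output=json` returns one JSON object per line; the helper names are hypothetical, and the response fields may differ between pywb versions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def cdx_query_url(base: str, collection: str, target: str, limit: int = 5) -> str:
    """Build a CDX lookup URL of the form <base>/<collection>/cdx?url=...&output=json."""
    query = urlencode({"url": target, "output": "json", "limit": limit})
    return f"{base.rstrip('/')}/{collection}/cdx?{query}"


def parse_cdx_lines(body: str) -> list:
    """Parse a newline-delimited JSON response into a list of capture records."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_captures(base: str, collection: str, target: str, limit: int = 5) -> list:
    """Query a running pywb instance (network call; requires the server to be up)."""
    with urlopen(cdx_query_url(base, collection, target, limit), timeout=10) as resp:
        return parse_cdx_lines(resp.read().decode("utf-8"))
```

With the server up, `fetch_captures("http://localhost:8080", "my-collection", "https://example.com/")` should return a list of capture records; an empty list suggests the URL was never captured or the WARC was not indexed.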

How do I judge replay quality and leakage?

Replay fidelity is the real test of whether pywb fits your sources. Load a captured page, open the browser's developer tools, and check the Network panel and console for requests that escape to the live web (they appear as external hostnames or unexpected 404s). pywb rewrites URLs both on the server and inside the page, through its client-side rewriting script, so that most requests are routed back to the archive, but heavy client-side JavaScript can still try to phone home. If your sources are JavaScript-dense single-page apps, budget extra time to validate replay before committing to pywb as the public access layer.
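Eyeballing the browser tools gets tedious across many pages, so it can help to export the session as a HAR file from the Network panel and scan it. The sketch below is one possible check, not a pywb feature: it reads the standard HAR layout (`log.entries[].request.url`) and reports every HTTP(S) request whose host is not the replay server. The function name and the sample data are illustrative assumptions.

```python
from urllib.parse import urlsplit


def find_leaks(har: dict, archive_host: str) -> list:
    """Return request URLs in a HAR capture that were not sent to the archive host."""
    leaks = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry["request"]["url"]
        parts = urlsplit(url)
        # Ignore data:, blob: and similar schemes; they never leave the browser.
        if parts.scheme in ("http", "https") and parts.netloc != archive_host:
            leaks.append(url)
    return leaks


# Tiny in-memory HAR standing in for a real browser export.
sample_har = {"log": {"entries": [
    {"request": {"url": "http://localhost:8080/my-collection/20240301120000/https://example.com/"}},
    {"request": {"url": "https://cdn.example.net/app.js"}},
    {"request": {"url": "data:image/png;base64,AAAA"}},
]}}

print(find_leaks(sample_har, "localhost:8080"))
# ['https://cdn.example.net/app.js']
```

For a real export, load the file with `json.load` and pass the host and port of your pywb instance. Any URL in the result is a candidate leak to investigate.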

What does pywb cost to run over time?

The honest costs are operational, not licensing (it is open source): a maintained host, periodic upgrades, index management as collections grow, and monitoring. For a steady archive these are modest; for a one-time project they are pure overhead. Match the tool to the lifespan of the access need, not just the capture task.

Key Takeaways

pywb fits persistent, multi-collection, server-hosted archives with stable URLs.
For one-off viewing of a single WACZ, ReplayWeb.page is lighter and sufficient.
The dominant cost of pywb is operational: a host, a WSGI server, upkeep.
pywb supports server-side access control and URL lookup across a whole collection, which a single-file client-side viewer does not provide.
Always test replay and watch the console for live-web leakage on JS-heavy sites.
Decide based on the lifespan of the access need, not just the capture.

Frequently Asked Questions

What is pywb?

pywb (Python Wayback) is an open-source toolkit for indexing and replaying WARC and WACZ files. It powers self-hosted Wayback-style access and is the engine behind several institutional web archives.

When should I choose pywb over ReplayWeb.page?

Choose pywb when you need a persistent, multi-collection, server-hosted archive with access control and stable URLs. Choose ReplayWeb.page for serverless, in-browser replay of a single WACZ with zero infrastructure.

Does pywb need a server to run?

Yes. pywb runs as a Python web application, typically behind a server like uWSGI or Gunicorn with a reverse proxy. That hosting requirement is the main cost compared with client-side replay tools.

Can pywb replay WACZ files directly?

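Yes, with a caveat. Recent pywb versions can ingest WACZ as well as raw WARC, so you can host content from either format. For a WACZ, wb-manager add takes an --unpack-wacz flag that unpacks the bundled WARCs and index into the collection rather than serving the .wacz file as-is; check that your installed version supports it. Because a WACZ bundles its own index, adding one to a collection is simpler than indexing loose WARCs.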

Does pywb prevent live-web leakage during replay?

Mostly. pywb rewrites URLs on the server and in the page so that requests are routed back to the archive instead of the live internet, but complex JavaScript can still attempt live calls. Always test replay and watch the browser's Network panel and console for requests escaping the archive.
