Appearance
Choose Archive-It when you want crawls, storage, replay and a preservation copy handled for you and you lack dedicated engineering capacity; choose self-hosting when you have stable technical staff and a hard requirement — data residency, privacy, volume or custom crawling — that a hosted service cannot meet. For most small teams and one-off projects, Archive-It wins on total cost of ownership; for institutions with infrastructure and constraints, self-hosting earns its keep. The decision is about capacity and constraints, not features.
What does each option actually involve?
Archive-It is a subscription. You log in, define seeds and scope, schedule crawls, and the Internet Archive runs them, stores the WARCs, indexes them, and serves replay. Support and a preservation copy come bundled.
Self-hosting means assembling the open-source stack yourself: Browsertrix Crawler for capture, pywb for replay, OutbackCDX for indexing, plus storage, backups and monitoring. You own every layer — and every outage.
How do the costs really compare?
The sticker price is misleading. Archive-It's fee looks like pure cost; self-hosting looks free. But self-hosting consumes staff time, which is the most expensive resource in any archive.
| Cost factor | Archive-It | Self-hosting |
|---|---|---|
| Subscription fee | Annual, by data volume | None |
| Storage | Included | You provision + back up |
| Engineering time | Minimal | Ongoing, significant |
| Replay infrastructure | Managed | You run pywb |
| Preservation copy | At Internet Archive | Your responsibility |
| Upgrades / patching | Handled | You do it |
A rough rule: if you cannot name a person who owns the stack for the next five years, you cannot afford to self-host.
When should you NOT self-host?
Self-hosting is the wrong call when any of these are true:
- You have no reliable engineering capacity beyond the project's lifetime.
- The collection is small and your need is occasional.
- You cannot commit to 3-2-1 backups and routine fixity checks.
- You want a guaranteed off-site preservation copy without building one.
In these cases a hosted service is not a compromise — it is the responsible choice, because an unmaintained self-hosted archive bit-rots silently.
When does self-hosting clearly win?
Reach for the open-source stack when:
- Data residency / privacy: law or policy forbids third-party storage (sensitive, born-digital, or personal data).
- Volume: at very large scale, subscription pricing exceeds infrastructure cost.
- Custom crawling: you need behaviours, login flows or scoping the service does not expose.
- Integration: the archive must plug into an existing preservation repository (OAIS-aligned).
Can I switch later, or am I locked in?
You are not locked in at the data layer. Archive-It lets you export your WARC/WACZ. A migration path looks like:
bash
# After exporting WARCs from Archive-It, index and serve them locally
wb-manager init my-collection
wb-manager add my-collection ./exports/*.warc.gz
wayback # pywb now serves replay from your own infrastructureYou will rebuild indexes and replay, but the captured bytes — the irreplaceable part — travel with you. This makes "start hosted, migrate if you outgrow it" a perfectly sound strategy.
A quick decision checklist
Answer yes/no:
- Do you have funded engineering capacity for 5+ years? (No → Archive-It)
- Does policy forbid third-party storage? (Yes → self-host)
- Is the collection large and continuous? (Yes → lean self-host)
- Do you need a guaranteed external preservation copy you won't build? (Yes → Archive-It)
- Do you need custom crawl behaviours? (Yes → lean self-host)
Most "yes to hosted" answers means subscribe; most "yes to self-host" means build.
Key Takeaways
- The choice hinges on capacity and constraints, not feature lists.
- Archive-It bundles crawling, storage, replay, support and a preservation copy.
- Self-hosting trades the subscription fee for ongoing engineering time.
- You are not locked in: Archive-It WARCs export and ingest into pywb.
- Self-host for data residency, very high volume, or custom crawling.
- Never self-host without a named long-term owner and 3-2-1 backups.
- "Start hosted, migrate later" is a legitimate, low-risk path.
Frequently Asked Questions
What is the main difference between Archive-It and self-hosting?
Archive-It is a hosted subscription service from the Internet Archive that runs crawls, stores WARCs and provides replay for you. Self-hosting means you run the open-source stack (Browsertrix, pywb, OutbackCDX) on your own infrastructure and own the whole pipeline yourself.
Is self-hosting cheaper than Archive-It?
Only if you already have staff time and infrastructure. Archive-It's subscription bundles storage, replay, support and a guaranteed copy at the Internet Archive; self-hosting trades that fee for engineering hours, backups and on-call responsibility, which often costs more in practice for small teams.
Can I move my data out of Archive-It later?
Yes. Archive-It lets you download your WARC/WACZ files, so you are not locked in at the data level. You can ingest those WARCs into a self-hosted pywb instance if you decide to migrate, though you will rebuild indexes and replay yourself.
When does self-hosting clearly make sense?
When you have stable engineering capacity, strict data-residency or privacy requirements that forbid third-party storage, very high volumes, or a need for custom crawling behaviours the service does not expose. Born-digital and sensitive collections often push teams toward self-hosting.
Does Archive-It handle JavaScript-heavy and social media sites?
Archive-It has added browser-based crawling (Brozzler/Browsertrix-style) that handles many dynamic sites, but the hardest pages still need tuning. Self-hosting Browsertrix gives you finer control over behaviours, but you do the tuning yourself either way.
Who maintains a permanent copy of the data?
With Archive-It, a copy is preserved at the Internet Archive as part of the service. With self-hosting, preservation is entirely your responsibility, so you must implement a 3-2-1 backup strategy and fixity checking or risk silent loss.