The honest short answer: archiving a web page usually means making a copy, copyright covers most web content, and whether your copy is lawful depends on where you are, why you are doing it, and what you then do with it. Beginners go wrong by treating capture and publication as one decision. Separate them. Capturing for preservation and research often falls under an exception; publishing the capture is a different, higher bar. This guide explains the core ideas in plain language and works one small example end to end.
Does copyright really cover an ordinary web page?
Yes, automatically. The text, photos, layout and even the source code are protected from the moment they are created, with no registration required. Archiving creates a reproduction, and reproduction is one of the rights reserved to the copyright owner. So every capture relies on either the owner's permission or a legal exception.
That sounds alarming, but most legitimate archiving stands on solid ground; you just need to know which leg you are standing on.
What legal bases can I stand on?
There are three common ones, and which applies depends on who you are:
| Basis | Who can use it | Typical limits |
|---|---|---|
| Legal deposit | Designated national libraries | National scope, often access-restricted |
| Copyright exception (fair use / fair dealing) | Researchers, libraries | Purpose, amount, market effect matter |
| Permission / licence | Anyone | Only covers what the owner can grant |
A national library archiving the country's web operates under legal deposit. A university researcher capturing pages for study leans on a research/fair-dealing exception. A project republishing captures publicly often needs permission or must rely carefully on fair use.
A small worked example
Say you want to archive a defunct local-history blog and put it in your institution's collection.
- Capture the blog for preservation. In most jurisdictions, making a preservation copy for a library/archive collection is permitted.
- Check personal data. The blog names living people in comments — flag it; data-protection law now applies on top of copyright.
- Decide on access. You may keep the full capture dark (preserved but not public) and provide reading-room access, which is far easier to justify than open publication.
- Document the basis. Record: "Preservation copy under [exception]; comments contain personal data; access restricted pending review."
Notice the capture happened early and confidently; the publishing decision was deliberately conservative.
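The "document the basis" step can be as small as one line per capture in a log file. The following is a minimal sketch, assuming a tab-separated file is enough for your collection; the function name `record_basis`, the file name `capture-basis.tsv` and the column names are illustrative assumptions, not a standard.

```shell
#!/bin/sh
# Hypothetical helper: append one provenance record per capture to a TSV log.
# The file name and column names are illustrative assumptions.
LOG_FILE="${LOG_FILE:-capture-basis.tsv}"

record_basis() {
    # $1 = captured URL, $2 = legal basis, $3 = personal data (yes/no), $4 = access level
    # Fields are assumed not to contain tab characters.
    url=$1 basis=$2 personal_data=$3 access=$4
    # Write a header row if the log is missing or empty.
    if [ ! -s "$LOG_FILE" ]; then
        printf 'date\turl\tbasis\tpersonal_data\taccess\n' >> "$LOG_FILE"
    fi
    printf '%s\t%s\t%s\t%s\t%s\n' \
        "$(date -u +%Y-%m-%d)" "$url" "$basis" "$personal_data" "$access" >> "$LOG_FILE"
}

# Example: the defunct local-history blog from the steps above.
record_basis "https://example.org/local-history-blog/" \
    "preservation copy under [exception]" "yes" "restricted pending review"
```

Keeping the basis, the personal-data flag and the access decision in separate columns makes it easy to find every capture still awaiting review.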
Do I have to follow robots.txt?
Robots.txt is a convention, not legislation. Ignoring it is rarely illegal in itself, but it can undermine a fair-use posture and may breach a site's terms of service. The pragmatic default is to honour it for routine crawls and override it only with a clear, documented preservation rationale:
```bash
# robots.txt handling differs between crawlers and versions; confirm the
# default in your crawler's documentation rather than assuming it.
browsertrix-crawler crawl --url https://example.org --useSitemap
# If you override robots.txt, do it deliberately: review your legal basis
# first and record the reason alongside the capture.
```
Where does GDPR fit in?
Copyright protects creators; data protection protects people. If a capture contains personal data about identifiable living individuals (names, photos, comments), then GDPR (EU/UK) or your local equivalent applies independently of copyright. You need a lawful basis, and people may have rights to access or erasure. This is a common blind spot for beginners, and it is why access controls and a review workflow are not optional.
How should I prepare for takedown requests?
Write the policy before you need it. A workable takedown process:
- Publish a contact route and a stated policy.
- Log every request with date and claim.
- Assess validity (is the claimant the rights holder? is the use exempt?).
- Respond with a proportionate action: restrict, redact, or remove-but-keep-dark.
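Keeping a dark preservation copy after removing public access resolves many copyright complaints while protecting the historical record. An erasure request under data-protection law may require more than removing public access, so assess those separately.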
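The logging and response steps can be sketched as a small shell function. This is an illustration under stated assumptions rather than a complete workflow: the function name `log_takedown`, the file name `takedown-requests.tsv` and the action labels are hypothetical, chosen to mirror the options listed above.

```shell
#!/bin/sh
# Hypothetical takedown log: one tab-separated line per request.
# The file name and action labels are illustrative assumptions.
TAKEDOWN_LOG="${TAKEDOWN_LOG:-takedown-requests.tsv}"

log_takedown() {
    # $1 = claimant, $2 = affected URL, $3 = claim summary, $4 = action taken
    # Fields are assumed not to contain tab characters.
    case $4 in
        pending|restrict|redact|remove-keep-dark) ;;
        *) echo "unknown action: $4" >&2; return 1 ;;
    esac
    printf '%s\t%s\t%s\t%s\t%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4" >> "$TAKEDOWN_LOG"
}

# Example: log a request on arrival, then log the outcome after review.
log_takedown "Jane Example" "https://example.org/post/12" "copyright claim" "pending"
log_takedown "Jane Example" "https://example.org/post/12" "copyright claim" "restrict"
```

Rejecting unknown action labels keeps the log limited to the responses your written policy actually allows.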
Key Takeaways
- Capture and publication are separate decisions — analyse them apart.
- Copyright covers ordinary web content automatically; archiving makes a copy.
- Your legal footing is one of: legal deposit, an exception, or permission.
- Legal deposit covers national libraries only, not general researchers.
- Robots.txt is a convention; ignoring it is rarely illegal in itself, but it can weaken a fair-use argument and breach a site's terms of service.
- GDPR applies independently when captures hold personal data.
- Have a written takedown policy and keep dark copies after removal.
Frequently Asked Questions
Is it legal to archive a web page without permission?
It depends on jurisdiction, purpose and what you do with the copy. Many countries allow archiving and research uses under exceptions such as fair use or fair dealing, and designated national libraries archive under legal deposit, but publishing or republishing the captured content can still infringe copyright. Always separate capturing from publishing in your analysis.
Does copyright apply to web pages?
Yes. Text, images, code and design on a web page are protected by copyright the moment they are created. Archiving makes a copy, which is one of the rights reserved to the owner, so you rely on an exception or permission for that copy to be lawful.
What is legal deposit and does it cover me?
Legal deposit is a statutory mandate that lets designated national libraries archive a country's published web content. It covers those institutions only; a researcher or small archive cannot invoke it and must rely on copyright exceptions or permission instead.
Do I have to obey robots.txt?
Robots.txt is a technical convention, not a law, so ignoring it is rarely itself illegal, but doing so can weaken a fair-use argument and breach a site's terms of service. Many archives crawl politely and honour robots.txt unless there is a strong preservation reason not to.
What about personal data and GDPR?
If captures contain personal data about identifiable living people, data-protection law (such as GDPR in the EU/UK) applies regardless of copyright. You need a lawful basis, and individuals may have rights over that data, which is why access controls and review processes matter.
How do I handle a takedown request?
Have a written takedown and review policy before you publish anything. When a request arrives, log it, assess whether the claim is valid, and respond — options include restricting access, redacting, or removing the item while keeping a dark preservation copy.