Beginner's Guide to Legal Issues in Web Archiving

The honest short answer: archiving a web page usually means making a copy, copyright covers most web content, and whether your copy is lawful depends on where you are, why you are doing it, and what you then do with it. Beginners go wrong by treating capture and publication as one decision. Separate them. Capturing for preservation and research often falls under an exception; publishing the capture has to clear a different, higher bar. This guide explains the core ideas in plain language and works through one small example end to end.

Does copyright really cover an ordinary web page?

Yes — automatically. The text, photos, layout and even the source code are protected from the moment they are written, with no registration required. Archiving creates a reproduction, and reproduction is one of the rights reserved to the copyright owner. So every capture relies on either the owner's permission or a legal exception.

That sounds alarming, but most legitimate archiving stands on solid ground; you just need to know which leg you are standing on.

What legal bases can I stand on?

There are three common ones, and which applies depends on who you are:

Basis	Who can use it	Typical limits
Legal deposit	Designated national libraries	National scope, often access-restricted
Copyright exception (fair use / fair dealing)	Researchers, libraries	Purpose, amount, market effect matter
Permission / licence	Anyone	Only covers what the owner can grant

A national library archiving the country's web operates under legal deposit. A university researcher capturing pages for study leans on a research/fair-dealing exception. A project republishing captures publicly often needs permission or must rely carefully on fair use.

A small worked example

Say you want to archive a defunct local-history blog and put it in your institution's collection.

Capture the blog for preservation. Many jurisdictions have an exception that lets libraries and archives make preservation copies for their collections, though the conditions vary, so confirm which one you are relying on.
Check personal data. The blog names living people in comments — flag it; data-protection law now applies on top of copyright.
Decide on access. You may keep the full capture dark (preserved but not public) and provide reading-room access, which is far easier to justify than open publication.
Document the basis. Record: "Preservation copy under [exception]; comments contain personal data; access restricted pending review."

Notice the capture happened early and confidently; the publishing decision was deliberately conservative.
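
The "document the basis" step can be as simple as a small text note stored beside the capture. The sketch below is one way to do that, not an established convention: the function name `write_basis_note`, the field names, the `.basis.txt` suffix and the example file name are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a short legal-basis note next to a capture file.
# The field names and the ".basis.txt" suffix are illustrative, not a standard.
write_basis_note() {
  local capture="$1" basis="$2" personal_data="$3" access="$4"
  local note="${capture}.basis.txt"
  {
    printf 'capture: %s\n' "$capture"
    printf 'recorded: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'basis: %s\n' "$basis"
    printf 'personal_data: %s\n' "$personal_data"
    printf 'access: %s\n' "$access"
  } > "$note"
  printf '%s\n' "$note"   # print the note's path so callers can use it
}

# Example call; the capture file name is made up and need not exist.
workdir="$(mktemp -d)"
note="$(write_basis_note "$workdir/local-history-blog.wacz" \
  "Preservation copy under [exception]" \
  "yes: comments name living people" \
  "restricted pending review")"
cat "$note"
```

Keeping the note as a plain file next to the capture means the reasoning travels with the data, which helps when someone else reviews the access decision later.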

Do I have to follow robots.txt?

Robots.txt is a convention, not legislation. Ignoring it is rarely illegal in itself, but it can undermine a fair-use posture and may breach a site's terms of service. The pragmatic default is to honour it for routine crawls and override it only with a clear, documented preservation rationale:

bash

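# Robots.txt handling varies by crawler and version; check your tool's documentation rather than assuming a default
browsertrix-crawler crawl --url https://example.org --useSitemap
# --useSitemap only seeds the crawl from the site's sitemap; --behaviors controls in-page actions such as scrolling.
# Neither flag changes robots.txt handling. If you decide not to honour robots.txt,
#   review your legal basis and document why first.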
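
Before a routine crawl it can help to look at what the site's robots.txt actually says about the paths you plan to capture. The sketch below is a deliberately simplified check, assuming you have already saved the file locally (for example with `curl -fsS https://example.org/robots.txt -o robots.txt`). The function name `robots_disallows` is hypothetical, and it handles only `User-agent: *` groups with plain-prefix `Disallow` rules; `Allow` lines, wildcards and agent-specific groups are ignored, so treat the result as a prompt for review rather than a full implementation of the protocol.

```shell
#!/usr/bin/env bash
# Simplified check: does a saved robots.txt disallow a path for all agents ("*")?
# Ignores Allow lines, wildcards and agent-specific groups.
robots_disallows() {
  local robots_file="$1" path="$2"
  awk -v path="$path" '
    # Return the text after the first colon, trimmed of surrounding blanks.
    function value(line) {
      sub(/^[^:]*:[ \t]*/, "", line)
      sub(/[ \t]+$/, "", line)
      return line
    }
    { sub(/\r$/, ""); sub(/#.*/, "") }        # drop carriage returns and comments
    tolower($0) ~ /^[ \t]*user-agent[ \t]*:/ {
      if (!prev_ua) in_star = 0               # a new group starts here
      if (value($0) == "*") in_star = 1
      prev_ua = 1
      next
    }
    /[^ \t]/ { prev_ua = 0 }                  # any other rule ends the run of User-agent lines
    in_star && tolower($0) ~ /^[ \t]*disallow[ \t]*:/ {
      rule = value($0)
      if (rule != "" && index(path, rule) == 1) found = 1
    }
    END { exit(found ? 0 : 1) }
  ' "$robots_file"
}

# Example with a tiny robots.txt written to a temporary file.
robots_file="$(mktemp)"
printf 'User-agent: *\nDisallow: /private/\n' > "$robots_file"
if robots_disallows "$robots_file" "/private/page.html"; then
  echo "disallowed: honour it, or document why you are overriding"
else
  echo "not disallowed by the wildcard group"
fi
```

The point of checking first is that any override becomes a recorded decision rather than something the crawler did silently.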

Where does GDPR fit in?

Copyright protects creators; data protection protects people. If a capture contains personal data about identifiable living individuals (names, photos, comments), then GDPR (EU/UK) or your local equivalent applies independently of copyright. You need a lawful basis, and people may have rights to access or erasure. This is the single most common blind spot for beginners, and it is why access controls and a review workflow are not optional.
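
A review workflow needs some way to decide which captures a person should look at first. The sketch below is a crude first pass under stated assumptions: it assumes the captured pages have already been extracted to plain text or HTML files in a directory, the function name `flag_possible_personal_data` is hypothetical, and it only looks for email-like strings. It will not detect names, photos or other personal data, so it supplements human review and does not replace it.

```shell
#!/usr/bin/env bash
# Hypothetical first-pass filter: list files that contain email-like strings.
# Only email addresses are matched; names and images need human review.
flag_possible_personal_data() {
  local dir="$1"
  # -r: recurse, -l: print matching file names only, -E: extended regex
  grep -rlE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' "$dir"
}

# Example with two small files in a temporary directory.
demo_dir="$(mktemp -d)"
printf 'Contact me at jane@example.org\n' > "$demo_dir/comment.txt"
printf 'A history of the old mill\n' > "$demo_dir/post.txt"
flag_possible_personal_data "$demo_dir"
```

Files the function lists are candidates for restricted access pending review; files it does not list are not thereby cleared.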

How should I prepare for takedown requests?

Write the policy before you need it. A workable takedown process:

Publish a contact route and a stated policy.
Log every request with date and claim.
Assess validity (is the claimant the rights holder? is the use exempt?).
Respond with a proportionate action: restrict, redact, or remove-but-keep-dark.

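Keeping a dark preservation copy after removing public access resolves many copyright complaints while protecting the historical record. Requests that rest on personal data need a separate check against data-protection law, because restricting access may not be enough.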
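
The logging step in the process above can be sketched as an append-only, tab-separated file. This is a minimal illustration rather than a recommended system: the function name `log_takedown`, the column order and the action names (`pending`, `restrict`, `redact`, `remove-keep-dark`, `rejected`) are assumptions chosen to mirror the options listed in this section.

```shell
#!/usr/bin/env bash
# Hypothetical takedown log: one tab-separated line per request.
# Columns: date, claimant, URL, claim, action. All names are illustrative.
log_takedown() {
  local log_file="$1" claimant="$2" url="$3" claim="$4" action="$5"
  case "$action" in
    pending|restrict|redact|remove-keep-dark|rejected) ;;
    *) echo "unknown action: $action" >&2; return 1 ;;
  esac
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%d)" "$claimant" "$url" "$claim" "$action" >> "$log_file"
}

# Example: record a new request, then the outcome after assessment.
# A temporary file is used here; in practice point this at a durable path.
log_file="$(mktemp)"
log_takedown "$log_file" "J. Example" "https://example.org/post/1" "copyright: photo" "pending"
log_takedown "$log_file" "J. Example" "https://example.org/post/1" "copyright: photo" "remove-keep-dark"
cat "$log_file"
```

Appending a new line for each decision, instead of editing the earlier one, keeps a dated history of how each request was handled.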

Key Takeaways

Capture and publication are separate decisions — analyse them apart.
Copyright covers ordinary web content automatically; archiving makes a copy.
Your legal footing is one of: legal deposit, an exception, or permission.
Legal deposit covers national libraries only, not general researchers.
Robots.txt is a convention, not law; ignoring it is rarely illegal in itself but can weaken a fair-use argument and breach a site's terms of service.
GDPR applies independently when captures hold personal data.
Have a written takedown policy and keep dark copies after removal.

Frequently Asked Questions

Is it legal to archive a web page without permission?

It depends on jurisdiction, purpose and what you do with the copy. Many countries allow archiving and research uses under exceptions like fair use, fair dealing or legal deposit, but publishing or republishing the captured content can still infringe copyright. Always separate capturing from publishing in your analysis.

Does copyright apply to web pages?

Yes. Text, images, code and design on a web page are protected by copyright the moment they are created. Archiving makes a copy, which is one of the rights reserved to the owner, so you rely on an exception or permission for that copy to be lawful.

What is legal deposit and does it cover me?

Legal deposit is a statutory mandate that lets designated national libraries archive a country's published web content. It covers those institutions only; a researcher or small archive cannot invoke it and must rely on copyright exceptions or permission instead.

Do I have to obey robots.txt?

Robots.txt is a technical convention, not a law, so ignoring it is rarely itself illegal, but doing so can weaken a fair-use argument and breach a site's terms of service. Many archives crawl politely and honour robots.txt unless there is a strong preservation reason not to.

What about personal data and GDPR?

If captures contain personal data about identifiable living people, data-protection law (such as GDPR in the EU/UK) applies regardless of copyright. You need a lawful basis, and individuals may have rights over that data, which is why access controls and review processes matter.

How do I handle a takedown request?

Have a written takedown and review policy before you publish anything. When a request arrives, log it, assess whether the claim is valid, and respond — options include restricting access, redacting, or removing the item while keeping a dark preservation copy.
