Best Practices to Scrape Heritage Data Ethically

To scrape heritage data ethically, harvest only what a sanctioned channel cannot give you, identify yourself honestly in the User-Agent, throttle hard enough that a small archive's server never notices you, and keep a manifest that records exactly what you took and under what terms. Ethical scraping is less about clever code and more about restraint and documentation: a defensible harvest is one you could explain, line by line, to the institution whose collection you copied.

Should you scrape at all, or is there a sanctioned channel?

Before writing a single line of requests code, look for the front door. Cultural-heritage infrastructure is unusually generous with structured access, and scraping HTML when a clean feed exists is both rude and fragile.

IIIF — a manifest.json exposes images and metadata in a stable schema; harvest manifests, not gallery pages.
OAI-PMH — most repositories offer a ListRecords verb returning Dublin Core or MODS XML.
Bulk dumps — Europeana, the DPLA, and many national libraries publish full data exports under open licences.
APIs — check /api, the developer portal, or the data.gov-style catalogue.

Only fall back to scraping the rendered page when none of these covers the field you need.
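
As a rough illustration of using the front door, the sketch below parses one page of an OAI-PMH ListRecords response instead of a gallery page. The sample XML is invented for the example and `parse_list_records` is a hypothetical helper name; a real harvest would fetch each page from the repository's endpoint and keep requesting with the resumption token until none is returned.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response, trimmed to the elements the parser reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:item-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Parish map</dc:title>
          <dc:rights>http://rightsstatements.org/vocab/NoC-US/1.0/</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Return (records, resumption_token) from one ListRecords page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "rights": rec.findtext(".//dc:rights", namespaces=NS),
        })
    # An absent or empty token means this was the last page.
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


records, token = parse_list_records(SAMPLE)
```

Each record arrives with its identifier and rights statement already separated, which is exactly what the HTML page would have forced you to guess at.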

How do you set request rates that respect a small archive?

Heritage sites are frequently hosted on shared university servers or single VPS boxes. Treat their bandwidth as a borrowed resource. Read robots.txt first, honour any Crawl-delay, and keep concurrency low.

python

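import time

import requests

HEADERS = {
    "User-Agent": "DigitalRelics-research/1.0 (+https://digitalrelics.uk; [email protected])"
}

def polite_get(url, session, delay=1.5, max_retries=3):
    """Fetch one URL, backing off on 429/503 and pausing after each request."""
    for _ in range(max_retries + 1):   # bounded, so a broken server cannot loop us forever
        r = session.get(url, headers=HEADERS, timeout=30)
        if r.status_code not in (429, 503):
            break
        retry_after = r.headers.get("Retry-After", "")
        # Retry-After may be an HTTP date rather than seconds; fall back to 30 s then.
        time.sleep(int(retry_after) if retry_after.isdigit() else 30)
    time.sleep(delay)          # one request in flight, pause between each
    return r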

A real name and contact address in the User-Agent matter: they let a sysadmin email you instead of silently blocking your institution.
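
Reading robots.txt before the first request can be sketched with the standard library's `urllib.robotparser`. The robots.txt body and the `archive.example` URLs below are invented for illustration, and `load_rules` and `effective_delay` are hypothetical helper names; a live run would point the parser at the real file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

AGENT = "DigitalRelics-research"  # product token from the User-Agent header

# Invented robots.txt body standing in for the file a real site would serve.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""


def load_rules(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def effective_delay(parser, agent, default=1.5):
    """Use the site's Crawl-delay when it is stricter than our own default."""
    declared = parser.crawl_delay(agent)
    if declared is None:
        return default
    return max(default, float(declared))


rules = load_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch(AGENT, "https://archive.example/items/42")
blocked = rules.can_fetch(AGENT, "https://archive.example/admin/login")
delay = effective_delay(rules, AGENT)
```

Taking the larger of the two delays means the site's stated preference can only slow the harvest down, never speed it up.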

What metadata should you actually keep?

You rarely need the pixels. For most network analysis, mapping, or catalogue research, the metadata plus a persistent link is enough, and storing no image files sharply reduces your copyright exposure.

Field	Keep it?	Why
Persistent identifier (ARK, DOI, handle)	Always	Lets anyone re-fetch the original
Rights statement (`rightsstatements.org` URI)	Always	Governs what you may republish
Descriptive metadata (title, date, creator)	Usually	The research payload
IIIF image URL	Usually	Re-derive the image on demand
The image file itself	Rarely	Storage and copyright cost; only if offline analysis needs it
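
One way to apply the table is to reduce each record to the fields worth keeping before it is written to disk. This is a minimal sketch: the field names (`identifier`, `rights_uri`, `iiif_image_url`, `image_bytes`) and the sample record are assumptions made for the example, and would need mapping to whatever schema the source actually uses.

```python
# Hypothetical field names; map them to the source's own schema.
ALWAYS = ("identifier", "rights_uri")
USUALLY = ("title", "date", "creator", "iiif_image_url")


def slim_record(raw):
    """Keep the 'Always' and 'Usually' fields and drop everything else."""
    missing = [field for field in ALWAYS if not raw.get(field)]
    if missing:
        raise ValueError(f"record lacks required fields: {missing}")
    kept = {field: raw[field] for field in ALWAYS}
    # Optional fields are copied only when they carry a value.
    kept.update({field: raw[field] for field in USUALLY if raw.get(field)})
    return kept


raw = {
    "identifier": "ark:/99999/example1",
    "rights_uri": "http://rightsstatements.org/vocab/InC/1.0/",
    "title": "Harbour view",
    "creator": "",
    "iiif_image_url": "https://iiif.example/iiif/example1/full/max/0/default.jpg",
    "image_bytes": b"\x89PNG",  # present in the source, never stored
}
slim = slim_record(raw)
```

Refusing records that lack an identifier or rights statement keeps the two "Always" rows enforceable rather than aspirational.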

How do you make a harvest reproducible and defensible?

Write a manifest alongside the data. If anyone — a peer reviewer, the source institution, your future self — asks what you did, the answer should be a file, not a memory.

yaml

# harvest-manifest.yaml
source: "Wellcome Collection IIIF"
base_url: "https://iiif.wellcomecollection.org/"
harvested_on: "2024-11-04"
robots_txt_sha256: "f3a1...c0"
rate_limit: "1 req / 1.5 s, 1 connection"
records_collected: 4218
rights_field: "usage.text"
code_commit: "a7c91e4"
licence_of_dataset: "CC-BY-4.0 (metadata only)"
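
A manifest like this is easiest to trust when the harvest script writes it, not a person. The sketch below builds the same kind of fields with the standard library only; `build_manifest` and `to_yaml` are hypothetical helpers, every argument value is a placeholder, and the YAML is emitted by hand as flat `key: value` lines on the assumption that the manifest never nests.

```python
import hashlib
import json
from datetime import date


def build_manifest(source, base_url, robots_bytes, delay_s, records,
                   commit, licence, today=None):
    """Collect the facts about one harvest run into a flat dictionary."""
    return {
        "source": source,
        "base_url": base_url,
        "harvested_on": (today or date.today()).isoformat(),
        # Hash of the robots.txt that was in force when the run started.
        "robots_txt_sha256": hashlib.sha256(robots_bytes).hexdigest(),
        "rate_limit": f"1 req / {delay_s} s, 1 connection",
        "records_collected": records,
        "code_commit": commit,
        "licence_of_dataset": licence,
    }


def to_yaml(manifest):
    """Emit flat key: value lines; JSON scalars are also valid YAML scalars."""
    lines = [f"{key}: {json.dumps(value)}" for key, value in manifest.items()]
    return "\n".join(lines) + "\n"


manifest = build_manifest(
    source="Example archive",            # placeholder values throughout
    base_url="https://archive.example/",
    robots_bytes=b"User-agent: *\n",
    delay_s=1.5,
    records=2,
    commit="0000000",
    licence="CC-BY-4.0 (metadata only)",
    today=date(2024, 1, 2),
)
manifest_text = to_yaml(manifest)
```

Hashing robots.txt instead of copying it keeps the manifest short while still proving which version of the rules the run obeyed.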

What about robots.txt, terms of use, and database rights?

These three layers are independent and you must satisfy all of them. robots.txt is a machine-readable request you should honour even though it is not law. The site's terms of use are a contract that may forbid automated access outright. And in the UK and EU, the sui generis database right can protect a substantial extraction even when individual records are public-domain. When the terms are silent or ambiguous, email and ask — a one-line permission saves a takedown later.

How do you avoid harvesting sensitive or restricted material?

Public visibility is not the same as ethical reusability. A digitised parish register may name living people; a colonial photograph may depict communities under terms they never consented to. Filter on the rights statement during the harvest, not afterwards, and exclude anything marked restricted, with-conditions, or under embargo. When in doubt, keep the identifier but not the content, and flag the record for human review.
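
Filtering during the harvest can be sketched as a small triage step that runs on every record before anything is saved. The allow-list below is only an example of a project decision, and the `access_flags` field, its values, and the helper names are hypothetical, since each catalogue marks restrictions in its own vocabulary.

```python
# Example allow-list of rights URIs treated as safe to keep in full.
OPEN_RIGHTS = {
    "http://rightsstatements.org/vocab/NoC-US/1.0/",
    "https://creativecommons.org/publicdomain/mark/1.0/",
    "https://creativecommons.org/publicdomain/zero/1.0/",
}
# Hypothetical access flags; real catalogues use their own terms.
BLOCKED_FLAGS = {"restricted", "with-conditions", "embargo"}


def triage(record):
    """Return 'exclude', 'keep', or 'review' for one record."""
    if BLOCKED_FLAGS & set(record.get("access_flags", [])):
        return "exclude"
    if record.get("rights_uri") in OPEN_RIGHTS:
        return "keep"
    return "review"  # rights unknown or not on the allow-list


def harvest_entry(record):
    """Apply the triage decision: drop, keep in full, or keep the ID only."""
    decision = triage(record)
    if decision == "exclude":
        return None
    if decision == "review":
        return {"identifier": record["identifier"], "needs_review": True}
    return dict(record, needs_review=False)


open_item = {"identifier": "id-1",
             "rights_uri": "http://rightsstatements.org/vocab/NoC-US/1.0/"}
embargoed = {"identifier": "id-2", "access_flags": ["embargo"],
             "rights_uri": "http://rightsstatements.org/vocab/NoC-US/1.0/"}
unclear = {"identifier": "id-3", "title": "Untitled portrait"}

results = [harvest_entry(r) for r in (open_item, embargoed, unclear)]
```

Checking the blocking flags first means an embargo wins even when the rights statement alone would have allowed the record through.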

Key Takeaways

Prefer IIIF, OAI-PMH, bulk dumps, or APIs; scrape HTML only as a last resort.
Identify yourself with a real name and contact in the User-Agent.
Cap concurrency at one or two requests in flight, pause 1 to 2 seconds between requests, and back off on 429 and 503.
Store metadata and persistent IDs rather than image files to cut copyright risk.
Record every harvest in a manifest so the run is reproducible and defensible.
Satisfy robots.txt, terms of use, and database rights as three separate obligations.
Filter on rights statements during the harvest to keep restricted material out.

Frequently Asked Questions

Is it legal to scrape a museum or archive website?

Scraping public pages is often lawful, but legality depends on the site's terms of use, the copyright status of the items, and your jurisdiction's database rights. Always check the terms and prefer a documented API or bulk download where one exists.

Should I scrape if a IIIF or OAI-PMH endpoint exists?

No. If a IIIF manifest, OAI-PMH feed, or data dump is published, harvest that instead. It is the sanctioned channel, returns clean structured metadata, and spares the institution's servers.

What request rate is polite for heritage sites?

Many small archives run on modest hardware, so cap concurrency at one or two requests in flight and add a one to two second delay. Honour any Crawl-delay in robots.txt and back off on HTTP 429 or 503.

Do I need to keep image files I scrape?

Often you only need the metadata and a stable IIIF or persistent URL, not the pixels. Storing only identifiers reduces your copyright exposure and your storage footprint.

How do I credit the source institution?

Record the rights statement, the persistent identifier, and the repository name for every record, and reproduce the institution's required attribution in any published dataset or visualisation.

What is a scraping manifest and why keep one?

It is a small file recording the base URL, date, robots.txt snapshot, rate limit, and code version of a harvest. It makes a run reproducible and lets you defend exactly what you collected and how.
