Scrape archive websites with Python: A Practical Guide

To scrape an archive website with Python, fetch each page with the requests library, parse the HTML with BeautifulSoup, extract the fields you need by CSS selector, and write the results to CSV or SQLite — all while pausing politely between requests and caching what you download. But before writing a line of scraping code, check for an official API, an IIIF manifest, or a downloadable dataset: those are faster, more stable, and explicitly sanctioned. Scraping is the tool of last resort, and this guide shows how to do it responsibly.

Should you scrape at all, or is there a better route?

Run through this checklist first:

Does the archive offer an API or OAI-PMH endpoint? Catalogue systems like AtoM and Archives Hub often do.
Is there a bulk download — a CSV, a data dump, a GitHub release?
Are images exposed via IIIF? If so you can pull manifests as clean JSON.
Only if all three fail should you parse HTML.

Scraping HTML is brittle: a redesign breaks your script overnight. APIs change far less often.

How do you fetch a page politely?

The golden rule is to behave like a considerate human, not a stampede. Identify yourself and slow down:

python

import requests, time

HEADERS = {
    "User-Agent": "HistoryResearch/1.0 ([email protected])"
}

def fetch(url):
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    time.sleep(1.5)          # be kind to the server
    return resp.text

The contact email in the User-Agent matters: if your script misbehaves, an administrator can email you rather than silently blocking your institution's whole IP range.

How do you extract fields from the HTML?

Open the page in your browser, right-click a value and choose "Inspect" to find the element holding it. Then target it with BeautifulSoup:

python

from bs4 import BeautifulSoup

html = fetch("https://archive.example.org/item/12345")
soup = BeautifulSoup(html, "lxml")

record = {
    "reference": soup.select_one(".field-reference").get_text(strip=True),
    "title":     soup.select_one("h1.item-title").get_text(strip=True),
    "date":      soup.select_one(".field-date").get_text(strip=True),
    "url":       "https://archive.example.org/item/12345",
}

Prefer stable hooks like semantic class names over fragile ones like deeply nested div positions. If a field may be missing, guard it so one absent value does not crash a run of 5,000 records.

How do you walk through a result list?

Most catalogues paginate. Find the "next page" link and loop until it disappears, collecting item URLs as you go:

python

def crawl_results(start_url):
    url = start_url
    while url:
        soup = BeautifulSoup(fetch(url), "lxml")
        for a in soup.select("a.result-link"):
            yield a["href"]
        nxt = soup.select_one("a.pagination-next")
        url = nxt["href"] if nxt else None

Why should you cache every download?

Caching turns a fragile network job into a repeatable local one. Save the raw HTML the first time, and re-read from disk on every later pass:

python

import os, hashlib

def cached_fetch(url, cache_dir="cache"):
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.md5(url.encode()).hexdigest() + ".html"
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        return open(path, encoding="utf-8").read()
    html = fetch(url)
    open(path, "w", encoding="utf-8").write(html)
    return html

This lets you re-parse and fix extraction bugs without hammering the archive again — and gives you an audit trail of exactly what you downloaded and when.

How do you save the results cleanly?

Write to CSV for a quick spreadsheet, or SQLite when records cross-reference each other:

python

import csv

with open("records.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["reference", "title", "date", "url"])
    writer.writeheader()
    for url in crawl_results(START):
        writer.writerow(parse_item(url))

Always keep the source URL and a fetch timestamp in each row so the data stays auditable.

What are the common pitfalls?

Ignoring robots.txt and the terms of use.
Running in parallel threads, which looks like an attack.
Hard-coding fragile selectors that break on redesign.
Scraping rights-restricted images and republishing them.
Forgetting encoding, mangling accented names — set resp.encoding when needed.

Key Takeaways

Always prefer an API, OAI-PMH, IIIF, or bulk download over HTML scraping.
Identify yourself with a User-Agent containing contact details, and add a 1-2 second delay.
Parse with BeautifulSoup or lxml using stable CSS selectors.
Cache raw HTML so you can re-parse without re-downloading.
Store records with their source URL and a fetch timestamp for auditability.
Respect robots.txt, terms of use, and copyright in the records and images.

Frequently Asked Questions

Is it legal to scrape archive websites?

It depends on the site's terms of use, copyright in the records, and your jurisdiction. Public-domain catalogue metadata is usually fine to harvest politely; check robots.txt and the terms, and never republish rights-restricted images without permission.

Should I use an API instead of scraping?

Always prefer an official API, IIIF manifest, or bulk data dump if one exists. They are stable, documented and sanctioned; scraping HTML should be your last resort when no structured access is offered.

How do I avoid overloading an archive's server?

Add a delay between requests (start at 1-2 seconds), run sequentially rather than in parallel, set a descriptive User-Agent with your contact details, and cache responses so you never fetch the same page twice.

What's the difference between requests and Selenium for scraping?

requests fetches the raw HTML the server sends, which is enough for most catalogues. Use Selenium or Playwright only when content is rendered by JavaScript after the page loads.

How do I extract data once I have the HTML?

Parse it with BeautifulSoup or lxml and target elements by CSS selector or XPath. Inspect the page in your browser's developer tools to find stable class names or table structures.

How should I store scraped records?

Write each record as a row to CSV or a SQLite database, keep the original HTML alongside it, and record the URL and a fetch timestamp so the data stays auditable and re-checkable.

Should you scrape at all, or is there a better route? ​

How do you fetch a page politely? ​

How do you extract fields from the HTML? ​

How do you walk through a result list? ​

Why should you cache every download? ​

How do you save the results cleanly? ​

What are the common pitfalls? ​

Key Takeaways ​

Frequently Asked Questions ​

Is it legal to scrape archive websites? ​

Should I use an API instead of scraping? ​

How do I avoid overloading an archive's server? ​

What's the difference between requests and Selenium for scraping? ​

How do I extract data once I have the HTML? ​

How should I store scraped records? ​

Related reading ​

Should you scrape at all, or is there a better route?

How do you fetch a page politely?

How do you extract fields from the HTML?

How do you walk through a result list?

Why should you cache every download?

How do you save the results cleanly?

What are the common pitfalls?

Key Takeaways

Frequently Asked Questions

Is it legal to scrape archive websites?

Should I use an API instead of scraping?

How do I avoid overloading an archive's server?

What's the difference between requests and Selenium for scraping?

How do I extract data once I have the HTML?

How should I store scraped records?

Related reading