Python for Historians

To scrape an archive website with Python, fetch each page with the requests library, parse the HTML with BeautifulSoup, extract the fields you need by CSS selector, and write the results to CSV or SQLite — all while pausing politely between requests and caching what you download. But before writing a line of scraping code, check for an official API, an IIIF manifest, or a downloadable dataset: those are faster, more stable, and explicitly sanctioned. Scraping is the tool of last resort, and this guide shows how to do it responsibly.

Should you scrape at all, or is there a better route?

Run through this checklist first:

  1. Does the archive offer an API or OAI-PMH endpoint? Catalogue software such as AtoM and aggregators such as Archives Hub often do.
  2. Is there a bulk download — a CSV, a data dump, a GitHub release?
  3. Are images exposed via IIIF? If so you can pull manifests as clean JSON.
  4. Only if all three fail should you parse HTML.

Scraping HTML is brittle: a redesign breaks your script overnight. APIs change far less often.
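
If the catalogue does expose OAI-PMH, a harvest can be quite small. The sketch below uses only the standard library. The endpoint address and the sample response are invented for illustration, `list_records_url` and `parse_records` are hypothetical helper names, and a real endpoint may offer richer metadata formats than `oai_dc`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query

def parse_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "date": rec.findtext(".//dc:date", default="", namespaces=NS),
        })
    # A non-empty resumption token means there are more pages to request.
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)

# Invented sample response, trimmed to the parts the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example.org:12345</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Letter book</dc:title>
          <dc:date>1843</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_records(SAMPLE_RESPONSE)
```

In a real harvest you would download `list_records_url(...)` with the same polite fetching used for HTML pages, then keep requesting with the resumption token until none is returned.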

How do you fetch a page politely?

The golden rule is to behave like a considerate human, not a stampede. Identify yourself and slow down:

python
import requests, time

HEADERS = {
    # Placeholder address: substitute your own contact email.
    "User-Agent": "HistoryResearch/1.0 (you@example.org)"
}

def fetch(url):
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    time.sleep(1.5)          # be kind to the server
    return resp.text

The contact email in the User-Agent matters: if your script misbehaves, an administrator can email you rather than silently blocking your institution's whole IP range.
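
A fixed sleep after every request works, but it also waits when parsing has already taken longer than the delay. One possible refinement, sketched here under the assumption that you make all requests from a single thread, is a small throttle that sleeps only for whatever remains of the minimum gap. `Throttle` is a hypothetical helper, not part of requests.

```python
import time

class Throttle:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval=1.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each `requests.get(...)` keeps requests at least 1.5 seconds apart without adding delay where the script was already slow.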

How do you extract fields from the HTML?

Open the page in your browser, right-click a value and choose "Inspect" to find the element holding it. Then target it with BeautifulSoup:

python
from bs4 import BeautifulSoup

html = fetch("https://archive.example.org/item/12345")
soup = BeautifulSoup(html, "lxml")

record = {
    "reference": soup.select_one(".field-reference").get_text(strip=True),
    "title":     soup.select_one("h1.item-title").get_text(strip=True),
    "date":      soup.select_one(".field-date").get_text(strip=True),
    "url":       "https://archive.example.org/item/12345",
}

Prefer stable hooks like semantic class names over fragile ones like deeply nested div positions. If a field may be missing, guard it so one absent value does not crash a run of 5,000 records.
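
One way to guard missing fields is a small helper that returns None instead of raising. This is a minimal sketch: `text_or_none` and `parse_record` are hypothetical names, the sample HTML is invented, and it uses the built-in "html.parser" so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

def parse_record(html):
    """Extract the catalogue fields, tolerating any that are absent."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "reference": text_or_none(soup, ".field-reference"),
        "title": text_or_none(soup, "h1.item-title"),
        "date": text_or_none(soup, ".field-date"),
    }

# Invented sample page with no date field.
SAMPLE_HTML = """
<h1 class="item-title"> Letter book </h1>
<span class="field-reference">MS 12/3</span>
"""

record = parse_record(SAMPLE_HTML)
```

A None in the output is easy to filter or count afterwards, which also tells you how many records lacked a given field.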

How do you walk through a result list?

Most catalogues paginate. Find the "next page" link and loop until it disappears, collecting item URLs as you go:

python
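from urllib.parse import urljoin

def crawl_results(start_url):
    """Yield the absolute URL of every item across all result pages."""
    url = start_url
    while url:
        soup = BeautifulSoup(fetch(url), "lxml")
        for a in soup.select("a.result-link[href]"):
            # hrefs are often relative, so resolve them against the page URL
            yield urljoin(url, a["href"])
        nxt = soup.select_one("a.pagination-next[href]")
        url = urljoin(url, nxt["href"]) if nxt else None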
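
Two things can trip up a pagination loop: "next" links that are relative, and catalogues whose last page links back to an earlier one. The sketch below separates the walking logic from the fetching so both cases can be handled and tested offline. `walk_pages` and the `get_next_href` callback are hypothetical, and the page cap of 500 is an arbitrary assumption.

```python
from urllib.parse import urljoin

def walk_pages(start_url, get_next_href, max_pages=500):
    """Yield each result-page URL once, stopping on a loop or at max_pages.

    get_next_href(url) should return the href of the "next page" link
    found on that page, or None when there is no such link.
    """
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        href = get_next_href(url)
        # Resolve relative links such as "?page=2" against the current page.
        url = urljoin(url, href) if href else None
```

In practice `get_next_href` would fetch the page, parse it, and return the `href` of the pagination link.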

Why should you cache every download?

Caching turns a fragile network job into a repeatable local one. Save the raw HTML the first time, and re-read from disk on every later pass:

python
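import os, hashlib

def cached_fetch(url, cache_dir="cache"):
    """Return the page from disk if already saved, otherwise download and save it."""
    os.makedirs(cache_dir, exist_ok=True)
    # The hash is only a filesystem-safe filename, not a security measure.
    key = hashlib.md5(url.encode()).hexdigest() + ".html"
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html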

This lets you re-parse and fix extraction bugs without hammering the archive again. It also gives you an audit trail: the saved files show exactly what you downloaded, and their modification times show when.
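
Because the cache filenames are hashes, the files alone do not tell you which URL each one came from. One option, sketched here as an assumption rather than a required step, is to append a line to an index file on every download. `log_fetch`, `load_index` and the `index.jsonl` filename are hypothetical.

```python
import json
import os
from datetime import datetime, timezone

def log_fetch(url, filename, cache_dir="cache"):
    """Append one line recording which URL was saved to which file, and when."""
    os.makedirs(cache_dir, exist_ok=True)
    entry = {
        "url": url,
        "file": filename,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(os.path.join(cache_dir, "index.jsonl"), "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry

def load_index(cache_dir="cache"):
    """Read the index back as a list of dicts (empty if nothing was logged)."""
    path = os.path.join(cache_dir, "index.jsonl")
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `log_fetch(url, key, cache_dir)` right after the HTML is written to disk keeps the index in step with the cache.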

How do you save the results cleanly?

Write to CSV for a quick spreadsheet, or SQLite when records cross-reference each other:

python
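import csv

# START (the first results-page URL) and parse_item (which turns one item URL
# into a dict like `record` above) are assumed to be defined elsewhere.
# The fetched_at column stays blank unless parse_item supplies that key.
FIELDS = ["reference", "title", "date", "url", "fetched_at"]

with open("records.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    for url in crawl_results(START):
        writer.writerow(parse_item(url))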

Always keep the source URL and a fetch timestamp in each row so the data stays auditable.
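
For the SQLite route, the standard library's sqlite3 module is enough. The following is a minimal sketch: the table layout, the `open_db` and `save_record` helpers, and the choice of the URL as primary key are all assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone

FIELDS = ("url", "reference", "title", "date")

def open_db(path="records.sqlite"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               url        TEXT PRIMARY KEY,
               reference  TEXT,
               title      TEXT,
               date       TEXT,
               fetched_at TEXT NOT NULL
           )"""
    )
    return conn

def save_record(conn, record, fetched_at=None):
    """Insert one record, replacing any earlier row for the same URL."""
    row = {name: record.get(name) for name in FIELDS}   # absent fields become NULL
    row["fetched_at"] = fetched_at or datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO records (url, reference, title, date, fetched_at) "
            "VALUES (:url, :reference, :title, :date, :fetched_at)",
            row,
        )
```

Keying on the URL means a re-run updates existing rows instead of duplicating them, while the timestamp shows which fetch each row reflects.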

What are the common pitfalls?

  • Ignoring robots.txt and the terms of use.
  • Running in parallel threads, which looks like an attack.
  • Hard-coding fragile selectors that break on redesign.
  • Scraping rights-restricted images and republishing them.
  • Forgetting encoding, which mangles accented names: if the server declares the wrong charset, set resp.encoding (for example to resp.apparent_encoding) before reading resp.text.
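
The first pitfall can be checked in code with the standard library's robots.txt parser. The sketch below parses an invented robots.txt from a string so that it runs offline; `load_rules` is a hypothetical helper and the rules shown are made up.

```python
from urllib.robotparser import RobotFileParser

AGENT = "HistoryResearch/1.0"

# Invented robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

rules = load_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch(AGENT, "https://archive.example.org/item/12345")
blocked = not rules.can_fetch(AGENT, "https://archive.example.org/admin/users")
delay = rules.crawl_delay(AGENT)   # None when the file sets no Crawl-delay
```

Against a live site you would instead call `set_url()` with the address of its robots.txt and then `read()`. If a crawl delay is declared and is longer than your own pause, use the longer one.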

Key Takeaways

  • Always prefer an API, OAI-PMH, IIIF, or bulk download over HTML scraping.
  • Identify yourself with a User-Agent containing contact details, and add a 1-2 second delay.
  • Parse with BeautifulSoup or lxml using stable CSS selectors.
  • Cache raw HTML so you can re-parse without re-downloading.
  • Store records with their source URL and a fetch timestamp for auditability.
  • Respect robots.txt, terms of use, and copyright in the records and images.

Frequently Asked Questions

Is it legal to scrape an archive website?

It depends on the site's terms of use, copyright in the records, and your jurisdiction. Public-domain catalogue metadata is usually fine to harvest politely; check robots.txt and the terms, and never republish rights-restricted images without permission.

Should I use an API instead of scraping?

Always prefer an official API, IIIF manifest, or bulk data dump if one exists. They are stable, documented and sanctioned; scraping HTML should be your last resort when no structured access is offered.

How do I avoid overloading an archive's server?

Add a delay between requests (start at 1-2 seconds), run sequentially rather than in parallel, set a descriptive User-Agent with your contact details, and cache responses so you never fetch the same page twice.

What's the difference between requests and Selenium for scraping?

requests fetches the raw HTML the server sends, which is enough for most catalogues. Use Selenium or Playwright only when content is rendered by JavaScript after the page loads.

How do I extract data once I have the HTML?

Parse it with BeautifulSoup or lxml and target elements by CSS selector or XPath. Inspect the page in your browser's developer tools to find stable class names or table structures.

How should I store scraped records?

Write each record as a row to CSV or a SQLite database, keep the original HTML alongside it, and record the URL and a fetch timestamp so the data stays auditable and re-checkable.