Python for Historians

To call a heritage API with Python, use the requests library to send a GET request to the service's endpoint with your search parameters, check the response status, and parse the returned JSON with .json(). Heritage APIs — Europeana, the Digital Public Library of America (DPLA), the V&A, Wikidata's SPARQL endpoint — give you clean, structured, sanctioned data without the fragility of scraping. The whole pattern is short; the craft is in pagination, rate limits, error handling and keeping your API key out of your code.

How do you make your first API call?

Start with a single, well-formed request. Here is Europeana's search API:

python
import os
import requests

resp = requests.get(
    "https://api.europeana.eu/record/v2/search.json",
    params={
        "wskey": os.environ["EUROPEANA_KEY"],   # key from an env var
        "query": "shipwreck",
        "rows": 12,
    },
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data["totalResults"], "items")

Three habits to adopt immediately: pass parameters via params= (so requests encodes them safely), always set a timeout, and call raise_for_status() so failures surface loudly instead of returning broken data.
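To see what params= protects you from, the sketch below compares naive string-building with proper URL encoding. It uses the standard library's urllib.parse.urlencode as a stand-in, on the assumption that requests applies the same kind of encoding to params=. The search phrase "ship & wreck" is an invented example.

```python
from urllib.parse import urlencode

BASE = "https://api.europeana.eu/record/v2/search.json"

# Hypothetical search terms containing spaces and an ampersand.
params = {"query": "ship & wreck", "rows": 12}

# Naive string-building: the raw "&" splits the query into two parameters.
naive_url = BASE + "?query=" + params["query"] + "&rows=" + str(params["rows"])

# Proper encoding: each space becomes "+" and "&" becomes "%26".
safe_query = urlencode(params)
safe_url = BASE + "?" + safe_query

print(naive_url)
print(safe_url)
```

In the naive URL the server would see a query of "ship " followed by a stray parameter, which is exactly the kind of silent breakage that params= avoids.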

Where should you keep your API key?

Never paste keys into the script you commit to Git. Read them from the environment:

bash
# set once in your shell, or in a .env file that you source and .gitignore
export EUROPEANA_KEY="your-key-here"

python
import os

key = os.environ["EUROPEANA_KEY"]   # raises KeyError if the variable is unset

A leaked key in a public repository can be abused and may get your access revoked — environment variables keep it out of your history.
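A bare KeyError is not very helpful to a collaborator who has not set the variable yet. One possible refinement is a small helper that fails with a readable message. The name get_api_key and its env argument are hypothetical, introduced here only for illustration.

```python
import os

def get_api_key(name, env=None):
    """Return the key stored under `name`, or raise a readable error."""
    env = os.environ if env is None else env   # `env` can be swapped out in tests
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before running this script."
        )
    return key
```

In a script this would be called as key = get_api_key("EUROPEANA_KEY"), with the key itself still living only in the environment.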

How do you read the JSON that comes back?

A heritage API response is nested JSON. Explore it before assuming a structure:

python
for item in data["items"]:
    print(item.get("title", ["(untitled)"])[0],
          "—", item.get("dataProvider", ["?"])[0])

Use .get() with a default for every field; heritage metadata is uneven, and one record missing a title should not crash a loop over thousands. Inspect a single record with import json; print(json.dumps(item, indent=2)) to learn the shape.
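A default in .get() covers a missing field, but not a field that is present and empty, or one that holds a plain string where you expected a list. The helper below is one way to handle all three cases. The function name first and the sample records are invented for illustration and are not real API output.

```python
import json

def first(record, field, default="?"):
    """Return the first value of a list-valued field, or `default`."""
    value = record.get(field)
    if isinstance(value, list):
        return value[0] if value else default   # empty list falls back to default
    return value if value is not None else default

# Invented sample records, shaped like the items in the loop above.
sample = [
    {"title": ["Wreck of the Amsterdam"], "dataProvider": ["Example Museum"]},
    {"dataProvider": []},            # no title, empty provider list
    {"title": "A plain string"},     # a string where a list was expected
]

rows = [(first(r, "title", "(untitled)"), first(r, "dataProvider")) for r in sample]
for title, provider in rows:
    print(title, "|", provider)

# Pretty-print one record to learn its shape.
print(json.dumps(sample[0], indent=2))
```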

How do you page through thousands of results?

No API hands you 50,000 records at once. You request pages and follow a cursor or offset:

python
import time
import requests

def europeana_all(query, key, rows=100):
    cursor = "*"
    while cursor:
        r = requests.get(
            "https://api.europeana.eu/record/v2/search.json",
            params={"wskey": key, "query": query, "rows": rows, "cursor": cursor},
            timeout=30,
        )
        r.raise_for_status()
        payload = r.json()
        yield from payload.get("items", [])
        cursor = payload.get("nextCursor")    # None when finished
        time.sleep(1)                          # be polite

A generator like this streams results as they arrive, so memory stays flat even for very large harvests.
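Not every API offers a cursor; some expect an offset and a page size instead. The sketch below shows the same generator idea for that case. The fetch_page callable is a hypothetical stand-in for a real requests.get call, and the 250 fake records exist only so the example runs offline. A real harvest would also pause between pages, as in the cursor version.

```python
from itertools import islice

def paged(fetch_page, page_size=100):
    """Yield items page by page until a short or empty page signals the end.

    `fetch_page(offset, limit)` is a hypothetical callable returning a list
    of items; in real use it would wrap an HTTP request.
    """
    offset = 0
    while True:
        items = fetch_page(offset, page_size)
        yield from items
        if len(items) < page_size:   # last page reached
            return
        offset += page_size

# Stand-in for an API holding 250 records.
RECORDS = [f"record-{i}" for i in range(250)]
calls = []                           # records which offsets were requested

def fake_fetch(offset, limit):
    calls.append(offset)
    return RECORDS[offset:offset + limit]

first_five = list(islice(paged(fake_fetch), 5))   # stops after the first page
everything = list(paged(fake_fetch))              # walks all three pages
```

Because the generator is lazy, taking five items with islice triggers only one page request, which is handy while you are still testing your parsing.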

How should you handle errors and rate limits?

Treat HTTP status codes as a control flow signal, not an afterthought:

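| Status    | Meaning           | What to do                           |
|-----------|-------------------|--------------------------------------|
| 200       | OK                | Proceed                              |
| 400       | Bad request       | Fix your parameters; do not retry    |
| 401 / 403 | Auth problem      | Check your key                       |
| 429       | Too many requests | Wait, honour Retry-After, then retry |
| 500 / 503 | Server error      | Back off and retry a few times       |

python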
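import time
import requests

for attempt in range(5):
    r = requests.get(url, params=params, timeout=30)
    if r.status_code in (429, 500, 503):
        # honour Retry-After when it is a number of seconds, else back off
        retry_after = r.headers.get("Retry-After", "")
        wait = int(retry_after) if retry_after.isdigit() else 2 ** attempt
        time.sleep(wait)
        continue
    r.raise_for_status()      # 400/401/403 and other errors: fail loudly, no retry
    break
else:
    raise RuntimeError("still failing after 5 attempts")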

Retrying transient errors with backoff turns a flaky long-running harvest into a reliable one.
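The waiting logic can be pulled out into a helper of its own. The sketch below assumes a Retry-After header may hold either a number of seconds or an HTTP date, and falls back to exponential backoff with a little random jitter when the header is absent or unreadable. The function name wait_seconds and its defaults are illustrative choices, not part of any library.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def wait_seconds(attempt, retry_after=None, base=1.0, cap=60.0):
    """Decide how long to sleep before retry number `attempt` (0-based)."""
    if retry_after:
        if retry_after.strip().isdigit():           # e.g. "30"
            return float(retry_after)
        try:                                         # e.g. an HTTP date
            when = parsedate_to_datetime(retry_after)
            delta = (when - datetime.now(timezone.utc)).total_seconds()
            return max(delta, 0.0)                   # a past date means no wait
        except (TypeError, ValueError):
            pass                                     # unparseable: fall through
    # Exponential backoff (1s, 2s, 4s, ...) capped at `cap`, plus up to 10% jitter.
    delay = min(base * 2 ** attempt, cap)
    return delay + random.uniform(0, delay * 0.1)
```

Inside a retry loop this would be used as time.sleep(wait_seconds(attempt, r.headers.get("Retry-After"))). The jitter keeps several scripts that failed at the same moment from all retrying at the same moment.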

Should you cache responses?

Yes — caching is the difference between a courteous researcher and an accidental denial-of-service. Save each page to disk keyed by its query, and re-read locally while you develop. You will re-run your parsing dozens of times; the API should see each unique request only once.
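A minimal sketch of such a disk cache follows, assuming each response is JSON and that the URL plus parameters identify a request uniquely. The function name cached_fetch, the cache directory name, and the fetch callable are hypothetical; in real use fetch might wrap requests.get(...).json().

```python
import hashlib
import json
from pathlib import Path

def cached_fetch(url, params, fetch, cache_dir="cache"):
    """Return the JSON payload for (url, params), calling `fetch` at most once.

    `fetch(url, params)` is a hypothetical callable returning parsed JSON.
    """
    # Hash the request so the filename is short, stable and filesystem-safe.
    raw = json.dumps([url, params], sort_keys=True).encode("utf-8")
    path = Path(cache_dir) / (hashlib.sha256(raw).hexdigest() + ".json")

    if path.exists():                       # cache hit: no network call
        return json.loads(path.read_text(encoding="utf-8"))

    payload = fetch(url, params)            # cache miss: ask the server once
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload), encoding="utf-8")
    return payload
```

Remember to add the cache directory to .gitignore, and delete it when you want fresh data, since this sketch never expires an entry.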

What about SPARQL endpoints like Wikidata?

Some heritage data lives behind SPARQL rather than a REST search. The pattern is the same — a GET with a query parameter, requesting JSON:

python
r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": "SELECT ?item WHERE { ?item wdt:P31 wd:Q3947 } LIMIT 10",
            "format": "json"},
    headers={"User-Agent": "HistoryResearch/1.0 (you@example.org)"},  # placeholder contact address
    timeout=60,
)
r.raise_for_status()
print(r.json()["results"]["bindings"])

Wikidata in particular expects a descriptive User-Agent that includes contact details, and may block or throttle requests that send a missing or generic one.
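The bindings come back wrapped: each variable maps to a small object with a type and a value, rather than to the value itself. The sketch below flattens that shape into plain dictionaries. The function name flatten_bindings is illustrative, and the sample is hand-written in the SPARQL JSON results layout with placeholder entity IDs, not real query output.

```python
def flatten_bindings(result):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in result["results"]["bindings"]
    ]

# Hand-written sample; the entity IDs are placeholders.
sample = {
    "head": {"vars": ["item"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"}},
    ]},
}

rows = flatten_bindings(sample)
print(rows)
```

With a live response you would pass r.json() in place of sample.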

Key Takeaways

  • Use requests.get with params=, a timeout, and raise_for_status() as your default shape.
  • Read API keys from environment variables, never hard-code them in committed code.
  • Use .get() with defaults when reading uneven heritage JSON.
  • Page through large result sets with the API's cursor or offset, pausing between calls.
  • Map HTTP status codes to actions: retry 429/503 with backoff, fix 400/401 yourself.
  • Cache responses so each unique request hits the server only once.
  • Send a descriptive User-Agent, especially to Wikidata's SPARQL endpoint.

Frequently Asked Questions

What is a heritage API and why use one?

A heritage API is a sanctioned web endpoint that returns structured data, usually JSON, from a museum, archive, or library, such as Europeana, the V&A, or Wikidata's SPARQL service. Using it is more reliable than scraping the HTML site, and it is a route the institution explicitly permits.

How do I call an API in Python?

Use the requests library: send a GET request with your parameters, check the status code, and call .json() on the response. Most heritage APIs need no more than a base URL, a few query parameters, and sometimes an API key.

Do I need an API key?

Often yes. Services like Europeana and the Digital Public Library of America require a free key passed as a parameter or header. Keep keys out of your code by reading them from an environment variable.

How do I get all the results when there are thousands?

Use the API's pagination: request a page, read the cursor or offset and total count from the response, then loop requesting the next page until you have everything, pausing politely between calls.

What should I do when an API call fails?

Check the HTTP status code. Retry transient errors like 429 or 503 with a short backoff, stop and read the message for 400 or 401, and always set a timeout so a hung request never freezes your script.

How do I avoid hitting rate limits?

Read the documented limit, add a delay between calls, cache responses so you never repeat a request, and watch for a 429 status with a Retry-After header telling you exactly how long to wait.