To call a heritage API with Python, use the requests library to send a GET request to the service's endpoint with your search parameters, check the response status, and parse the returned JSON with .json(). Heritage APIs — Europeana, the Digital Public Library of America (DPLA), the V&A, Wikidata's SPARQL endpoint — give you clean, structured, sanctioned data without the fragility of scraping. The whole pattern is short; the craft is in pagination, rate limits, error handling and keeping your API key out of your code.
How do you make your first API call?
Start with a single, well-formed request. Here is Europeana's search API:
```python
import os
import requests

resp = requests.get(
    "https://api.europeana.eu/record/v2/search.json",
    params={
        "wskey": os.environ["EUROPEANA_KEY"],  # key from an env var
        "query": "shipwreck",
        "rows": 12,
    },
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data["totalResults"], "items")
```
Three habits to adopt immediately: pass parameters via params= (so requests encodes them safely), always set a timeout, and call raise_for_status() so failures surface loudly instead of returning broken data.
Where should you keep your API key?
Never paste keys into the script you commit to Git. Read them from the environment:
```bash
# set once in your shell or a .env file (which you .gitignore)
export EUROPEANA_KEY="your-key-here"
```
```python
import os

key = os.environ["EUROPEANA_KEY"]
```
A leaked key in a public repository can be abused and may get your access revoked. Environment variables keep it out of your history.
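Reading os.environ["EUROPEANA_KEY"] directly raises a bare KeyError when the variable is unset, which can be confusing in a long script. One possible refinement is a small wrapper that fails early with a readable message. The function name load_key is hypothetical and not part of any library; treat this as a sketch rather than a required pattern.

```python
import os


def load_key(name: str = "EUROPEANA_KEY") -> str:
    """Return the API key stored in the environment variable `name`.

    `load_key` is a hypothetical helper, not part of requests or any API client.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        # Fail early with a readable message instead of a bare KeyError
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it in your shell or load it from a .env file."
        )
    return key
```

Calling load_key() once at the top of a script means a missing key is reported before any request is sent.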
How do you read the JSON that comes back?
A heritage API response is nested JSON. Explore it before assuming a structure:
```python
for item in data["items"]:
    print(item.get("title", ["(untitled)"])[0],
          "—", item.get("dataProvider", ["?"])[0])
```
Use .get() with a default for every field; heritage metadata is uneven, and one record missing a title should not crash a loop over thousands. Inspect a single record with import json; print(json.dumps(item, indent=2)) to learn the shape.
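A .get() default only covers a missing key. A field that is present but holds an empty list still raises IndexError on [0], and some fields may arrive as a bare string rather than a list. The sketch below shows one way to cover those cases. The helper name first_value is hypothetical, and the sample records are invented for illustration, not real API output.

```python
def first_value(record: dict, field: str, default: str = "?") -> str:
    """Return the first value of a list-valued field, or `default`."""
    value = record.get(field)
    if isinstance(value, list):
        # Present but empty lists fall back to the default too
        return str(value[0]) if value else default
    if value is None or value == "":
        return default
    return str(value)  # field arrived as a bare string or number


# Invented sample records covering the awkward cases
sample = [
    {"title": ["Shipwreck at dawn"], "dataProvider": ["Example Museum"]},
    {"title": [], "dataProvider": "Example Archive"},  # empty list, bare string
    {},                                                # nothing at all
]
rows = [
    (first_value(r, "title", "(untitled)"), first_value(r, "dataProvider"))
    for r in sample
]
for title, provider in rows:
    print(title, "|", provider)
```

Every record produces a row, so a loop over thousands of items keeps going even when individual records are incomplete.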
How do you page through thousands of results?
No API hands you 50,000 records at once. You request pages and follow a cursor or offset:
```python
import time
import requests

def europeana_all(query, key, rows=100):
    cursor = "*"
    while cursor:
        r = requests.get(
            "https://api.europeana.eu/record/v2/search.json",
            params={"wskey": key, "query": query, "rows": rows, "cursor": cursor},
            timeout=30,
        )
        r.raise_for_status()
        payload = r.json()
        yield from payload.get("items", [])
        cursor = payload.get("nextCursor")  # None when finished
        time.sleep(1)  # be polite
```
A generator like this streams results as they arrive, so memory stays flat even for very large harvests.
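The example above follows a cursor. For APIs that page by offset instead, the loop looks slightly different. The sketch below assumes a hypothetical fetch_page(start, rows) callable that performs one request and returns the decoded JSON, and it assumes the response uses items and totalResults keys; both are illustrative, so check the names your API actually uses. A fake in-memory fetcher stands in for the network so the logic can be run as-is.

```python
from typing import Callable, Iterator


def paged_by_offset(
    fetch_page: Callable[[int, int], dict],
    rows: int = 100,
) -> Iterator[dict]:
    """Yield every item from an offset-paginated API.

    `fetch_page(start, rows)` is a hypothetical callable returning a dict
    with "items" and, optionally, "totalResults" (assumed key names).
    """
    start = 0
    while True:
        payload = fetch_page(start, rows)
        items = payload.get("items", [])
        if not items:
            break  # an empty page means there is nothing left
        yield from items
        start += len(items)
        total = payload.get("totalResults")
        if total is not None and start >= total:
            break  # reached the total the API reported


# Stand-in for a real API: 250 fake records served from memory
FAKE_RECORDS = [{"id": i} for i in range(250)]


def fake_fetch(start: int, rows: int) -> dict:
    return {
        "totalResults": len(FAKE_RECORDS),
        "items": FAKE_RECORDS[start:start + rows],
    }


harvested = list(paged_by_offset(fake_fetch, rows=100))
print(len(harvested), "records harvested")
```

In a real harvest the fetch_page callable would wrap requests.get and include a pause between calls, as in the cursor example.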
How should you handle errors and rate limits?
Treat HTTP status codes as a control flow signal, not an afterthought:
| Status | Meaning | What to do |
|---|---|---|
| 200 | OK | Proceed |
| 400 | Bad request | Fix your parameters; do not retry |
| 401 / 403 | Auth problem | Check your key |
| 429 | Too many requests | Wait, honour Retry-After, then retry |
| 500 / 503 | Server error | Back off and retry a few times |
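```python
import time
import requests

# url and params: the endpoint and query dict for your request
for attempt in range(5):
    r = requests.get(url, params=params, timeout=30)
    if r.status_code in (429, 500, 503):
        # Honour Retry-After when it is given in seconds; otherwise back off exponentially
        retry_after = r.headers.get("Retry-After", "")
        wait = int(retry_after) if retry_after.isdigit() else 2 ** attempt
        time.sleep(wait)
        continue
    r.raise_for_status()  # 400/401/403 fail at once: retrying will not help
    break
else:
    raise RuntimeError(f"giving up after 5 attempts (last status {r.status_code})")
```
Retrying transient errors with backoff turns a flaky long-running harvest into a reliable one.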
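The status table above can also be written as a small pure function, which makes the retry policy easy to test without touching the network. This is a sketch under stated assumptions: the name next_action and the choice of exponential backoff (2 to the power of the attempt number, in seconds) are illustrative, not something the APIs prescribe. Retry-After may also be sent as an HTTP date, and this sketch simply falls back to backoff in that case.

```python
from typing import Optional, Tuple

RETRYABLE = {429, 500, 503}  # the transient statuses from the table


def next_action(status: int, retry_after: Optional[str], attempt: int) -> Tuple[str, float]:
    """Map a status code to ("ok" | "retry" | "fail", seconds_to_wait).

    `next_action` is a hypothetical helper; `attempt` counts from 0.
    """
    if 200 <= status < 300:
        return ("ok", 0.0)
    if status in RETRYABLE:
        if retry_after and retry_after.strip().isdigit():
            return ("retry", float(retry_after))  # server told us how long
        return ("retry", float(2 ** attempt))     # assumed backoff schedule
    return ("fail", 0.0)  # 400/401/403 and the rest: fix the request instead


print(next_action(429, "7", 0))
print(next_action(503, None, 3))
print(next_action(401, None, 0))
```

A request loop can then sleep for the returned number of seconds on retry and stop on fail.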
Should you cache responses?
Yes — caching is the difference between a courteous researcher and an accidental denial-of-service. Save each page to disk keyed by its query, and re-read locally while you develop. You will re-run your parsing dozens of times; the API should see each unique request only once.
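One way to implement this is to hash the URL and parameters into a file name and store each response as JSON on disk. The sketch below is a minimal illustration, not a full caching layer: cached_json, the fetch callable and the cache directory are all hypothetical names, and leaving the wskey parameter out of the hash is a design assumption so that rotating a key does not invalidate the cache.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_json(
    url: str,
    params: dict,
    fetch: Callable[[str, dict], dict],
    cache_dir: Path = Path("cache"),
) -> dict:
    """Return the JSON for (url, params), calling `fetch` only on a cache miss.

    `fetch(url, params)` is a hypothetical callable that performs the request
    and returns decoded JSON.
    """
    # Leave the API key out of the cache key so a new key reuses old pages
    public = {k: v for k, v in params.items() if k != "wskey"}
    digest = hashlib.sha256(
        json.dumps([url, public], sort_keys=True).encode("utf-8")
    ).hexdigest()
    path = cache_dir / f"{digest}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))  # cache hit
    payload = fetch(url, params)  # cache miss: one real request
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload), encoding="utf-8")
    return payload
```

With requests, the fetch argument could be a short function that calls requests.get(url, params=params, timeout=30), checks raise_for_status(), and returns .json().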
What about SPARQL endpoints like Wikidata?
Some heritage data lives behind SPARQL rather than a REST search. The pattern is the same — a GET with a query parameter, requesting JSON:
```python
import requests

r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": "SELECT ?item WHERE { ?item wdt:P31 wd:Q3947 } LIMIT 10",
            "format": "json"},
    # replace the contact address with your own
    headers={"User-Agent": "HistoryResearch/1.0 ([email protected])"},
    timeout=60,
)
r.raise_for_status()
print(r.json()["results"]["bindings"])
```
Wikidata in particular requires a descriptive User-Agent, and may block requests that omit one.
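The bindings list returned by a SPARQL endpoint wraps every value in a small object with type and value keys, following the W3C SPARQL JSON results format. A short helper can flatten that into plain dictionaries. The function name flatten_bindings is hypothetical, and the sample payload is hand-written in the standard shape with invented example.org values, not real Wikidata output.

```python
def flatten_bindings(payload: dict) -> list:
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    rows = []
    for binding in payload.get("results", {}).get("bindings", []):
        # Each cell looks like {"type": "uri", "value": "..."}; keep the value
        rows.append({var: cell.get("value") for var, cell in binding.items()})
    return rows


# Hand-written sample in the SPARQL JSON results shape (invented values)
sample = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://example.org/entity/A"},
         "itemLabel": {"type": "literal", "xml:lang": "en", "value": "Example house"}},
        {"item": {"type": "uri", "value": "http://example.org/entity/B"}},
    ]},
}
flat = flatten_bindings(sample)
for row in flat:
    print(row)
```

Variables that are unbound for a given row are simply absent from that row's dictionary, so downstream code should still read them with .get().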
Key Takeaways
- Use requests.get with params=, a timeout, and raise_for_status() as your default shape.
- Read API keys from environment variables, never hard-code them in committed code.
- Use .get() with defaults when reading uneven heritage JSON.
- Page through large result sets with the API's cursor or offset, pausing between calls.
- Map HTTP status codes to actions: retry 429/503 with backoff, fix 400/401 yourself.
- Cache responses so each unique request hits the server only once.
- Send a descriptive User-Agent, especially to Wikidata's SPARQL endpoint.
Frequently Asked Questions
What is a heritage API and why use one?
A heritage API is a sanctioned web endpoint that returns structured data, usually JSON, from a museum, archive, or library, such as Europeana, the V&A, or Wikidata's SPARQL service. Using it is more reliable than scraping the HTML site, and it is the access route the institution has explicitly permitted.
How do I call an API in Python?
Use the requests library: send a GET request with your parameters, check the status code, and call .json() on the response. Most heritage APIs need no more than a base URL, a few query parameters, and sometimes an API key.
Do I need an API key?
Often yes. Services like Europeana and the Digital Public Library of America require a free key passed as a parameter or header. Keep keys out of your code by reading them from an environment variable.
How do I get all the results when there are thousands?
Use the API's pagination: request a page, read the cursor or offset and total count from the response, then loop requesting the next page until you have everything, pausing politely between calls.
What should I do when an API call fails?
Check the HTTP status code. Retry transient errors like 429 or 503 with a short backoff, stop and read the message for 400 or 401, and always set a timeout so a hung request never freezes your script.
How do I avoid hitting rate limits?
Read the documented limit, add a delay between calls, cache responses so you never repeat a request, and watch for a 429 status with a Retry-After header telling you exactly how long to wait.