Fetch web data with OpenRefine: A Practical Guide

To fetch web data with OpenRefine, use Edit column > Add column by fetching URLs, build each request URL from your cell values with a GREL expression, set a polite throttle delay, and then parse the stored JSON or HTML responses into new columns with parseJson() or parseHtml(). Always keep the raw responses as your reproducible snapshot and respect each service's rate limits and terms.

What does "fetching URLs" let you do?

This feature turns a column of identifiers into enriched records. Given a column of place names you can call a geocoding API to add coordinates; given catalogue numbers you can pull metadata; given VIAF IDs you can retrieve authority data. OpenRefine issues one HTTP request per row, stores the response, and lets you extract fields from it — all without leaving the project.

How do you set up a fetch?

On the source column choose Edit column > Add column by fetching URLs. The dialog asks for an expression that builds the URL for each row. For a geocoding lookup against a hypothetical API:

text

"https://api.example.org/geocode?q=" + value.escape("url")

Note escape("url") — it percent-encodes spaces and accents so Stratford upon Avon becomes a valid query. Name the new column (e.g. geo_response) and set the throttle before running.

What throttle delay is safe?

The Throttle delay field sets milliseconds between requests. This is not optional politeness — it determines whether the job finishes or gets you blocked.

Service type	Suggested delay	Note
Public open API	1000–2000 ms	check documented rate limit
Heritage/cultural API	2000–5000 ms	often small servers
Unofficial scraping	5000 ms+	and reconsider whether you should

For 5,000 rows at a 2,000 ms delay, expect roughly three hours — fetching is slow by design. Run it when you can leave the machine alone.

How do you parse the responses?

The fetched column holds raw text (usually JSON). Add a column based on it: Edit column > Add column based on this column, then extract with GREL.

For JSON:

text

value.parseJson().results[0].geometry.location.lat

For HTML, when no API exists:

text

value.parseHtml().select("h1.title")[0].xmlText()

Build one extraction column per field you need (latitude, longitude, label) so each stays auditable.

What about failures and partial results?

Some rows will fail — bad input, a 404, a rate-limit 429. Failed cells contain an error string or an error body, not data. To recover:

Facet geo_response (a text or custom facet on value.startsWith("Error") or on the HTTP status in the body).
Isolate the failed rows.
Fix their input or raise the throttle, and re-fetch only those rows rather than the whole column.

Re-fetching everything wastes the API's goodwill and your time.

Why keep the raw responses?

Treat the response column as a cached snapshot, not scratch space. APIs change, go offline, or return different data tomorrow; the stored responses are what make your enrichment reproducible. Keep geo_response in the project (and in your export if licensing allows) alongside the parsed columns, and note the fetch date. If you must reproduce the dataset, you parse from the snapshot rather than hitting the live service again.

Is it allowed? Ethics and law in one place

Before any fetch, check the service's terms of service, its robots.txt, and any API licence. Practical rules:

Prefer official APIs over scraping HTML.
Identify yourself with a contact user agent if the service asks.
Throttle generously — heritage APIs often run on modest hardware.
Minimise personal data: fetch only what your project lawfully needs, and mind GDPR for any living individuals.

Good citizenship here keeps these shared services available for everyone.

Key Takeaways

Use Add column by fetching URLs, building each URL from cell values and escaping input with escape("url").
Set a throttle delay of 1–5 seconds; it decides whether the job completes or gets blocked.
Parse responses with parseJson() for JSON or parseHtml() with select() for HTML, one field per column.
Facet failed responses and re-fetch only those rows instead of the whole column.
Keep the raw response column as a dated, reproducible snapshot rather than discarding it.
Check terms, robots rules and licences; prefer official APIs and minimise personal data.

Frequently Asked Questions

How do I fetch data from an API in OpenRefine?

Use Edit column then Add column by fetching URLs. Build the request URL from your cell values with a GREL expression, set a throttle delay, and OpenRefine stores each response. You then parse the responses into new columns.

What throttle delay should I set when fetching URLs?

Start at 1000 to 5000 milliseconds per request and check the API's rate-limit and terms first. A polite delay avoids being blocked and is the difference between a job that completes and one that gets your IP throttled.

How do I parse the JSON that OpenRefine fetched?

Add a new column based on the response column and use GREL's parseJson, for example value.parseJson().results[0].name. For XML or HTML use parseHtml together with select and xmlText to extract the elements you need.

What if some URL fetches fail or return errors?

Failed cells contain an error or an HTTP error body rather than data. Facet the response column to find non-200 responses, fix the URLs or back off the rate, and re-fetch only the failed rows rather than the whole column.

Should I cache fetched responses?

Yes. Keep the raw response column in the project rather than discarding it after parsing. Re-running a fetch hits the API again and may differ or fail, so the stored responses are your reproducible snapshot.

Is it ethical and legal to scrape with OpenRefine?

Check the site's terms of service, robots rules and any API licence before fetching. Prefer official APIs, identify yourself if a user agent is requested, throttle politely, and never fetch personal data beyond what your project lawfully needs.

What does "fetching URLs" let you do? ​

How do you set up a fetch? ​

What throttle delay is safe? ​

How do you parse the responses? ​

What about failures and partial results? ​

Why keep the raw responses? ​

Is it allowed? Ethics and law in one place ​

Key Takeaways ​

Frequently Asked Questions ​

How do I fetch data from an API in OpenRefine? ​

What throttle delay should I set when fetching URLs? ​

How do I parse the JSON that OpenRefine fetched? ​

What if some URL fetches fail or return errors? ​

Should I cache fetched responses? ​

Is it ethical and legal to scrape with OpenRefine? ​

Related reading ​

What does "fetching URLs" let you do?

How do you set up a fetch?

What throttle delay is safe?

How do you parse the responses?

What about failures and partial results?

Why keep the raw responses?

Is it allowed? Ethics and law in one place

Key Takeaways

Frequently Asked Questions

How do I fetch data from an API in OpenRefine?

What throttle delay should I set when fetching URLs?

How do I parse the JSON that OpenRefine fetched?

What if some URL fetches fail or return errors?

Should I cache fetched responses?

Is it ethical and legal to scrape with OpenRefine?

Related reading