OpenRefine handles large datasets well into the hundreds of thousands of rows on an ordinary laptop, and into the low millions if you raise the JVM heap — but past a few million rows the browser interface becomes the bottleneck and a database or pandas pipeline serves you better. The practical workflow for big data is: give the JVM more memory, reduce what the UI renders, work on filtered subsets, and automate repeated runs headlessly. This guide walks the whole path end to end.
How many rows can OpenRefine actually take?
There is no hard limit, but useful rules of thumb for a 16 GB laptop:
| Rows | Experience | Recommendation |
|---|---|---|
| < 100k | Snappy, default heap fine | Just work normally |
| 100k–500k | Fine with 2–4 GB heap | Raise memory, fewer rows/page |
| 500k–2M | Workable but heavy | 8 GB heap, subset everything |
| > 2–3M | UI struggles | Split file or move to pandas/SQLite |
OpenRefine keeps the whole project in memory and replays operations, so RAM and distinct-value counts matter more than raw row count.
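Because distinct-value counts matter as much as row counts, it can help to measure both before importing. The following is a minimal sketch, assuming a simple comma-separated file with one header line and no quoted commas or embedded newlines; the function name `profile_column` and the sample data are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: count data rows and distinct values in one column of a CSV.
# Assumes a simple CSV: one header line, no quoted commas or embedded newlines.
set -euo pipefail

# profile_column FILE COLUMN_NUMBER -> prints "rows=<n> distinct=<m>"
profile_column() {
  local file=$1 col=$2
  awk -F',' -v col="$col" '
    NR > 1 {                       # skip the header line
      rows++
      if (!($col in seen)) { seen[$col] = 1; distinct++ }
    }
    END { printf "rows=%d distinct=%d\n", rows, distinct }
  ' "$file"
}

# Tiny hypothetical sample so the sketch runs on its own.
sample=$(mktemp)
printf 'id,parish\n1,Ashby\n2,Ashby\n3,Barton\n4,ashby\n' > "$sample"
profile_column "$sample" 2
rm -f "$sample"
```

A column whose distinct count is close to its row count (free-text names, for example) is the kind that makes facets and clustering expensive.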
How do I give OpenRefine more memory?
The single highest-impact change is raising the maximum JVM heap (-Xmx). From the command line:
```bash
# Linux / macOS
./refine -m 8192M

# Windows: edit refine.ini, set
REFINE_MEMORY=8192M
```

Set it to roughly half your physical RAM, never more than you can spare for the OS. Restart OpenRefine for the change to take effect, then load your big file.
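The "roughly half your physical RAM" rule can be turned into a number with a few lines of shell. This is a sketch only: the helper names `heap_mb_from_kb` and `total_ram_kb` are hypothetical, and it assumes either a Linux `/proc/meminfo` or the macOS `sysctl -n hw.memsize` key is available.

```bash
#!/usr/bin/env bash
# Sketch: suggest a -m value of about half of physical RAM.
set -euo pipefail

# heap_mb_from_kb TOTAL_KB -> half of the total, in whole megabytes
heap_mb_from_kb() {
  echo $(( $1 / 1024 / 2 ))
}

# total_ram_kb -> physical RAM in kB (Linux /proc/meminfo, else macOS sysctl)
total_ram_kb() {
  if [ -r /proc/meminfo ]; then
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
  else
    echo $(( $(sysctl -n hw.memsize) / 1024 ))
  fi
}

heap_mb=$(heap_mb_from_kb "$(total_ram_kb)")
echo "Suggested launch command: ./refine -m ${heap_mb}M"
```

Treat the result as an upper bound rather than a target; if the machine also runs a browser with many tabs, a smaller value leaves more room for the OS.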
Why is it slow even when RAM is free?
Counter-intuitively, many slowdowns are UI-rendering, not memory. The browser repaints every visible cell and recomputes facets over the whole column. Fix it by reducing what is displayed:
- Set rows per page to 10 instead of 50.
- Remove facets you are not actively using — each one recomputes on every edit.
- Collapse columns you are not editing via the column header menu.
- Close other heavy browser tabs; OpenRefine is a local web app and competes for browser resources.
Should I split the file before importing?
For very large or naturally partitioned data, yes. Split by a natural key — year, parish, region — into chunks of a few hundred thousand rows:
```bash
# Split a 4M-row CSV into ~500k-row files, keeping the header on each.
# Assumes one record per line (no quoted fields containing newlines).
head -n 1 big.csv > header.txt                     # save the header row
tail -n +2 big.csv | split -l 500000 - chunk_      # split the data rows only
for f in chunk_??; do cat header.txt "$f" > "${f}.csv" && rm "$f"; done
```

Clean one chunk, extract its operations JSON, then apply the identical JSON to every other chunk. You get responsive projects and consistent results across the whole collection.
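After splitting, a quick sanity check is to confirm that the chunks together hold exactly the data rows of the original file. The sketch below builds a small hypothetical CSV, splits it the same way, and compares the counts; `count_data_rows` is an illustrative helper, and the check assumes one record per line.

```bash
#!/usr/bin/env bash
# Sketch: confirm that header-prefixed chunk files together contain every
# data row of the original CSV. Assumes one record per line.
set -euo pipefail

# count_data_rows FILE... -> total line count minus one header line per file
count_data_rows() {
  local total=0 f n
  for f in "$@"; do
    n=$(wc -l < "$f")
    total=$(( total + n - 1 ))
  done
  echo "$total"
}

# Hypothetical demo data: a 10-row CSV split into 4-row chunks.
work=$(mktemp -d)
cd "$work"
{ echo 'id,year'; for i in 1 2 3 4 5 6 7 8 9 10; do echo "$i,1901"; done; } > big.csv
head -n 1 big.csv > header.txt
tail -n +2 big.csv | split -l 4 - chunk_
for f in chunk_??; do cat header.txt "$f" > "$f.csv"; rm "$f"; done

original=$(count_data_rows big.csv)
chunked=$(count_data_rows chunk_*.csv)
if [ "$original" -eq "$chunked" ]; then
  echo "OK: $chunked rows across chunks"
else
  echo "MISMATCH: original=$original chunks=$chunked" >&2
  exit 1
fi
```

A mismatch usually points to records that span several lines, in which case a line-based split is the wrong tool for that file.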
Do facets and clustering scale?
Faceting and clustering cost grows with both row count and the number of distinct values, so they are the slowest operations at scale. The practical pattern:
- Apply a text filter to narrow to the rows you suspect are dirty.
- Facet within that subset.
- Run Cluster on the subset. The key collision method with the fingerprint function is fast; ngram-fingerprint is somewhat slower and nearest neighbour is much slower, so reserve those for smaller subsets.
- Merge, then move to the next subset.
Clustering 2 million distinct strings with nearest-neighbour can take many minutes; clustering a filtered 20,000 takes seconds.
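To get a feel for why key-based clustering is cheap, the sketch below approximates a fingerprint-style key in shell: lowercase the value, drop punctuation, then sort and de-duplicate the tokens. It is a simplification for illustration, not OpenRefine's implementation (OpenRefine's fingerprint keying also normalises accented characters, which this does not), and the sample values are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: approximate a fingerprint-style key and list values next to it.
# Simplified: no accent folding, ASCII punctuation only.
set -euo pipefail
export LC_ALL=C

# fingerprint STRING -> lowercase, punctuation removed,
#                       tokens sorted and de-duplicated
fingerprint() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -d '[:punct:]' \
    | tr -s '[:space:]' '\n' \
    | sed '/^$/d' \
    | sort -u \
    | paste -sd' ' -
}

# Hypothetical dirty values; variants of the same name share one key.
while IFS= read -r value; do
  printf '%s\t%s\n' "$(fingerprint "$value")" "$value"
done <<'EOF' | sort
Smith, John
john smith
John  SMITH.
Barton Parish
EOF
```

Each value is keyed once and then grouped by sorting, so the cost grows roughly with the number of values. Nearest-neighbour methods compare values against each other, which is why they are better kept for filtered subsets.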
Can I run big jobs without the UI?
For repeated or scheduled work, skip the browser entirely. The community openrefine-client drives a headless OpenRefine instance:
```bash
# Assumes an OpenRefine server is already running locally (default port 3333).
openrefine-client --create big.csv --projectName=census
openrefine-client --apply operations.json census
openrefine-client --export --output=clean.csv census
```

This is dramatically faster and more reliable than clicking through a multi-million-row project, and it makes the run reproducible.
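For a file that has been split into chunks, the same three steps can be looped. The sketch below defaults to a dry run that only prints the commands, so it can be tried without OpenRefine installed. The chunk file names, the `clean_` output prefix, and the helpers `run` and `process_chunk` are hypothetical. Passing the project name as a positional argument to `--apply` and `--export` is an assumption about the client's usage, so confirm it with `openrefine-client --help`.

```bash
#!/usr/bin/env bash
# Sketch: run create -> apply -> export for every chunk file.
# DRY_RUN=1 (the default here) only prints the commands.
set -euo pipefail
DRY_RUN=${DRY_RUN:-1}

# run CMD ARGS... -> print the command in dry-run mode, otherwise execute it
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# process_chunk FILE OPERATIONS_JSON -> one project per chunk, named after the file
process_chunk() {
  local file=$1 ops=$2 name
  name=$(basename "$file" .csv)
  run openrefine-client --create "$file" --projectName="$name"
  run openrefine-client --apply "$ops" "$name"
  run openrefine-client --export --output="clean_${name}.csv" "$name"
}

# Hypothetical chunk names.
for chunk in chunk_aa.csv chunk_ab.csv; do
  process_chunk "$chunk" operations.json
done
```

Setting `DRY_RUN=0` executes the commands instead of printing them; because of `set -e`, the loop stops at the first chunk that fails rather than exporting partial results.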
Key Takeaways
- OpenRefine is comfortable to a few hundred thousand rows, stretches to low millions with more heap, and yields to pandas/SQLite beyond that.
- Raising -Xmx (e.g. refine -m 8192M) is the biggest single performance lever.
- Many slowdowns are UI rendering — cut rows-per-page, drop unused facets, collapse columns.
- Split very large files by a natural key and apply one operations JSON to every chunk.
- Facet and cluster on filtered subsets, not the whole dataset, to keep them fast.
- Use openrefine-client to run large, repeatable jobs headlessly.
Frequently Asked Questions
How many rows can OpenRefine realistically handle?
OpenRefine comfortably handles tens of thousands to a few hundred thousand rows on a typical laptop, and can manage low millions with extra heap memory. Beyond several million rows the browser UI becomes sluggish and a database or pandas pipeline is usually a better fit.
How do I give OpenRefine more memory for big datasets?
Increase the maximum JVM heap. On Linux or macOS, launch with ./refine -m 4096M; on Windows, set REFINE_MEMORY in refine.ini. Either way the -Xmx value rises from the default to something like 4G or 8G, depending on your RAM.
Why does OpenRefine slow down on large projects even with spare RAM?
Most slowdowns come from rendering thousands of rows and from facets that compute over every value. Reducing the rows-per-page, removing unused facets, and collapsing the column display usually restores responsiveness without more memory.
Should I split a huge file before importing into OpenRefine?
Often yes. Splitting by year, region, or another natural key into chunks of a few hundred thousand rows keeps each project responsive, and you can apply the same operations JSON to every chunk for consistency.
Do facets and clustering get slower as the dataset grows?
Yes, clustering and faceting scale with the number of distinct values and rows, so they grow noticeably slower on large data. Facet a filtered subset, cluster on that, and apply merges, rather than clustering the full dataset at once.
Can I automate large OpenRefine jobs without the UI?
Yes. Tools such as the openrefine-client command-line utility let you create projects, apply an operations JSON, and export results headlessly, which is far faster and more reliable for repeated runs on big files.