Skip to content
Data Cleaning with OpenRefine

OpenRefine is a free, local tool for cleaning messy tabular data: you install it, point your browser at http://127.0.0.1:3333, create a project from a CSV or spreadsheet, and start fixing inconsistencies through facets and transformations. Nothing is sent online and your original file is never altered. This guide gets you from download to a usable first result in about thirty minutes.

What is OpenRefine and when should you reach for it?

OpenRefine (formerly Google Refine) excels at one job: taming inconsistent, human-entered tabular data. If your spreadsheet has "London", "london" and "Londn" in the same column, or dates written six different ways, OpenRefine clusters and normalises them faster than any formula. It is not a database, not a charting tool, and not a replacement for pandas when you need scripted pipelines. Reach for it when the data is dirty and the rules only become clear as you explore.

How do you install and launch OpenRefine?

OpenRefine ships as a download for Windows, macOS and Linux, and requires a Java runtime (the bundled installers for recent 3.x versions include one). After installing:

  1. Launch the application — it starts a small local server.
  2. Your default browser opens automatically at http://127.0.0.1:3333.
  3. If it does not, open that address manually.

The interface lives entirely in the browser, but the engine runs on your machine. Closing the browser tab does not stop OpenRefine; quit the application window or terminal process to shut the server down.

Creating your first project

On the Create Project tab, choose This Computer, select your file, and click Next. The parsing preview is the most important screen new users rush past. Check:

  • The correct column separator is detected (comma, tab, semicolon).
  • Character encoding is UTF-8 — fix garbled accents here, not later.
  • Parse cell text into numbers/dates is left off for now, so nothing is silently coerced.

Give the project a clear name and click Create Project.

What are facets and why do they matter?

Facets are OpenRefine's superpower. A facet summarises every distinct value in a column so you can see the mess at a glance. To create one, click a column header and choose Facet > Text facet. A side panel lists each value with its count; click a value to filter rows, or hover and click edit to correct it in place.

text
Text facet on "country":
  United Kingdom   412
  UK                88   <- edit to "United Kingdom"
  U.K.              17
  england           3    <- a different problem entirely

Use a numeric facet to spot impossible values (a birth year of 18999) and a timeline facet to catch dates outside your expected range.

Cleaning a column without breaking anything

Two safe first moves on any text column, via Edit cells > Common transforms:

  • Trim leading and trailing whitespace — removes invisible spaces that split otherwise-identical values.
  • Collapse consecutive whitespace — turns double spaces into single.

For custom edits, Edit cells > Transform opens the GREL expression box. The expression value.trim().toTitlecase() standardises capitalisation. Crucially, every transform is reversible.

How does the undo history protect you?

The left-hand Undo / Redo tab logs every operation in order. Click any earlier step to roll the whole project back to that state — the steps after it are kept, not deleted, until you make a new change. You can also click Extract to copy the operations as JSON, then Apply that JSON to a fresh import of next month's data. That single feature turns ad-hoc cleaning into a repeatable recipe.

Default settings worth changing early

SettingDefaultRecommended for real work
Memory (-Xmx)~1.4 GB4 GB+ for files over 100k rows
Encoding on importauto-detectedforce UTF-8
Parse numbers/datesonoff until you have inspected the data
Project autosave5 minleave as is

Raise memory by editing refine.ini (or passing ./refine -m 4096M on Linux/macOS) before importing a large file.

Key Takeaways

  • OpenRefine runs locally in your browser; data stays on your machine and the source file is never modified.
  • Always verify separator and UTF-8 encoding in the import preview before creating the project.
  • Facets reveal inconsistency instantly — start every cleaning session by faceting suspect columns.
  • Trim and collapse whitespace first; many "duplicate" values differ only by invisible spaces.
  • The undo history makes every action reversible and can be exported as a replayable JSON recipe.
  • Raise the memory allocation to 4 GB or more before opening datasets above roughly 100,000 rows.

Frequently Asked Questions

Is OpenRefine free and where does my data go?

OpenRefine is free, open-source software released under the BSD licence. It runs locally on your own machine as a small server, so your data never leaves your computer unless you explicitly fetch URLs or reconcile against an external service.

Does OpenRefine need an internet connection?

No. The core cleaning, faceting and clustering features work entirely offline. You only need a connection for optional steps such as reconciliation, fetching URLs or installing extensions.

What file formats can OpenRefine import?

It reads CSV, TSV, fixed-width text, Excel (.xls and .xlsx), JSON, XML, RDF/N-Triples, ODS and Google Sheets. You can also paste clipboard text directly into a new project.

How much memory does OpenRefine need for large files?

The default allocation is around 1.4 GB. For files over roughly 100,000 rows you should raise the heap with the -Xmx flag or the refine.ini setting, giving it 4 GB or more if your machine allows.

Will OpenRefine change my original file?

Never. OpenRefine works on an internal copy inside its own project store, so your source file stays untouched. You only overwrite anything when you deliberately export the result.

Can I undo a mistake in OpenRefine?

Yes. Every operation is recorded in a full undo/redo history, and you can step back to any earlier state or extract the operations as JSON to replay them on another dataset.