Data Cleaning with OpenRefine

OpenRefine cleaning becomes reproducible when you keep three artefacts together and under version control: the unmodified raw input, the operations JSON that records every transformation, and a note of the OpenRefine version and any external services you called. With those, a colleague — or you in two years — can replay the exact cleaning and reach the same output. The cleaned file alone is not enough; it is an endpoint with no provenance.

What actually makes a workflow reproducible?

Reproducibility is the ability to regenerate your result from inputs you have preserved. In OpenRefine that means three files living side by side:

text
project/
  raw/census-1881-raw.csv      # never edited
  cleaning/operations.json     # the recipe
  cleaning/environment.md      # OpenRefine 3.8.x, codefork VIAF endpoint, 2026-04
  output/census-1881-clean.csv # regenerable

If you can delete output/ and rebuild it from raw/ plus operations.json, you are reproducible. If you cannot, you are not.
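A quick way to keep yourself honest is to check that the three artefacts exist and to record a checksum of the raw file, so any accidental edit to it shows up later. The following is a minimal sketch, assuming the directory layout shown above; the helper names `sha256_of` and `missing_artefacts` are illustrative and not part of OpenRefine.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_artefacts(project: Path, required: list[str]) -> list[str]:
    """List the required relative paths that do not exist under project."""
    return [name for name in required if not (project / name).is_file()]


# Assumed layout, mirroring the example tree in this article.
REQUIRED = [
    "raw/census-1881-raw.csv",
    "cleaning/operations.json",
    "cleaning/environment.md",
]

if __name__ == "__main__":
    project = Path("project")
    missing = missing_artefacts(project, REQUIRED)
    if missing:
        print("Not reproducible yet, missing:", ", ".join(missing))
    else:
        # Paste this digest into environment.md next to the version note.
        print("raw sha256:", sha256_of(project / REQUIRED[0]))
```

Storing the digest alongside the version note means a later replay can confirm it started from exactly the same bytes.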

How do I export the operations JSON?

This is the core habit. After cleaning:

  1. Open the Undo / Redo tab.
  2. Click Extract.
  3. Tick the operations to keep (usually all) and copy the JSON.
  4. Save it as operations.json next to the raw data.

To replay it on the same source, create a fresh project from the raw file, open the Undo / Redo tab, click Apply, paste the JSON, and run it. A snippet of what you are preserving:

json
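[
  {
    "op": "core/text-transform",
    "engineConfig": { "facets": [], "mode": "row-based" },
    "columnName": "surname",
    "expression": "grel:value.toTitlecase().trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Title-case and trim surname"
  },
  {
    "op": "core/mass-edit",
    "engineConfig": { "facets": [], "mode": "row-based" },
    "columnName": "county",
    "expression": "value",
    "edits": [
      { "from": ["Yorks"], "fromBlank": false, "fromError": false, "to": "Yorkshire" }
    ],
    "description": "Normalise county abbreviations"
  }
]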
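Because the recipe is plain JSON, it can be turned into a short review listing without opening OpenRefine. This is a hedged sketch: `summarise` is a hypothetical helper, and `EXAMPLE` is a shortened form of the two operations in the snippet above, not a full export.

```python
import json


def summarise(operations_json: str) -> list[str]:
    """Return one human-readable line per operation, in replay order."""
    lines = []
    for index, op in enumerate(json.loads(operations_json), start=1):
        # Not every operation targets a single column, so fall back to "-".
        column = op.get("columnName", "-")
        lines.append(f"{index}. {op.get('op', 'unknown')} on {column}")
    return lines


# Shortened, illustrative copy of the two operations discussed above.
EXAMPLE = """
[
  {"op": "core/text-transform", "columnName": "surname",
   "expression": "grel:value.toTitlecase().trim()"},
  {"op": "core/mass-edit", "columnName": "county",
   "edits": [{"from": ["Yorks"], "to": "Yorkshire"}]}
]
"""

if __name__ == "__main__":
    for line in summarise(EXAMPLE):
        print(line)
```

A listing like this is easy to paste into a methods section or a pull request description, where reviewers can check the order of steps at a glance.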

Why keep the recipe instead of just the cleaned file?

A cleaned CSV answers what but never how or why. The operations JSON is an auditable recipe: a reviewer reads it and sees every transformation, a referee can rerun it on corrected source data, and an archivist can verify nothing undocumented happened. For scholarship that depends on defensible data, the recipe is the evidence.

What silently breaks reproducibility?

Three things, ranked by how often they catch people:

| Hazard                     | Why it breaks replay                  | Mitigation                                           |
|----------------------------|---------------------------------------|------------------------------------------------------|
| Manual single-cell edits   | Recorded but order-fragile and opaque | Prefer transforms/mass-edit; document any manual fix |
| fetch URL / reconciliation | Depends on a live remote service      | Cache responses into columns; record endpoint + date |
| Undocumented version       | GREL/behaviour can change             | Note OpenRefine version in environment.md            |

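Manual edits are captured in the project history, but a reviewer cannot understand a bare cell change without a note, and depending on your OpenRefine version a single-cell edit may not be offered for extraction into the operations JSON at all. Check the Extract dialog, and write the note either way.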
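The second hazard in the table can be spotted mechanically by scanning the recipe for steps that need the network. The sketch below assumes a small set of operation IDs (`core/recon`, `core/extend-reconciled-data`, `core/column-addition-by-fetching-urls`); the set is not exhaustive, since extensions can register their own operations, and `remote_steps` is an illustrative name.

```python
import json

# Operation IDs assumed to call a remote service. Not exhaustive:
# extensions can add operations of their own.
REMOTE_OPS = {
    "core/recon",
    "core/extend-reconciled-data",
    "core/column-addition-by-fetching-urls",
}


def remote_steps(operations: list[dict]) -> list[tuple[int, str]]:
    """Return (1-based position, op id) for each step that needs the network."""
    return [
        (position, op["op"])
        for position, op in enumerate(operations, start=1)
        if op.get("op") in REMOTE_OPS
    ]


# Illustrative recipe: one local transform followed by a reconciliation step.
SAMPLE = json.loads("""
[
  {"op": "core/text-transform", "columnName": "surname"},
  {"op": "core/recon", "columnName": "author"}
]
""")

if __name__ == "__main__":
    for position, op_id in remote_steps(SAMPLE):
        print(f"step {position} ({op_id}) depends on a live service")
```

Any step this reports is a candidate for the caching approach described in the next section.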

How do I make reconciliation and fetch steps replayable?

External calls are the deepest reproducibility trap because the remote data can change. The fix is to snapshot:

  • After reconciling, extract cell.recon.match.id and cell.recon.best.score into real columns so the result is frozen in your export.
  • After a fetch URL, keep the fetched payload as a column rather than re-fetching.
  • Record the endpoint URL and the date you ran it in environment.md.

Now even if VIAF or Wikidata changes tomorrow, your cached snapshot reproduces the original output.
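The snapshot columns can themselves be written as operations, so the freezing step is part of the recipe. The sketch below builds two `core/column-addition` operations using the GREL expressions named above. The column names (`author`, `author_viaf_id`, `author_viaf_score`) and the helper `snapshot_op` are hypothetical, and the field layout is an assumption based on what Extract typically emits for "Add column based on this column"; compare it with an export from your own OpenRefine version before relying on it.

```python
import json


def snapshot_op(base_column: str, new_column: str, grel: str, index: int) -> dict:
    """Build a column-addition operation that copies recon data into a plain column."""
    return {
        "op": "core/column-addition",
        "engineConfig": {"facets": [], "mode": "row-based"},
        "baseColumnName": base_column,
        "expression": f"grel:{grel}",
        # Unreconciled cells have no match; leave the snapshot blank for them.
        "onError": "set-to-blank",
        "newColumnName": new_column,
        "columnInsertIndex": index,
        "description": f"Snapshot {grel} from {base_column} into {new_column}",
    }


# Hypothetical column names; adjust to your project.
SNAPSHOT_OPS = [
    snapshot_op("author", "author_viaf_id", "cell.recon.match.id", 1),
    snapshot_op("author", "author_viaf_score", "cell.recon.best.score", 2),
]

if __name__ == "__main__":
    # Paste the printed JSON into Undo / Redo > Apply after reconciling.
    print(json.dumps(SNAPSHOT_OPS, indent=2))
```

Once these columns exist, the exported CSV carries the identifiers and scores as ordinary values, independent of the reconciliation service.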

Can I script the whole thing?

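Yes, and it is the gold standard. openrefine-batch and openrefine-client run OpenRefine headlessly so cleaning becomes a version-controlled pipeline step. With openrefine-batch, -a is the input directory, -b is a directory holding one or more operations JSON files (not a path to a single file), -c is the output directory, and -e sets the export format. Run openrefine-batch.sh -h to confirm the options for your version:

bash
# config/ contains operations.json
openrefine-batch.sh \
  -a input/ \
  -b config/ \
  -c output/ \
  -e csv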

Commit operations.json and this command to your repository, and CI can re-run the cleaning on every change, so reproducibility is tested automatically rather than assumed.

A reproducibility checklist

  • [ ] Raw input preserved unchanged.
  • [ ] operations.json extracted and committed.
  • [ ] OpenRefine version and external endpoints recorded.
  • [ ] Reconciliation/fetch results cached into columns.
  • [ ] Manual edits documented with a reason.
  • [ ] Replay tested from scratch (delete output, rebuild).
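The last item on the checklist can be automated by comparing a rebuilt output with the previous one. This is a minimal sketch that compares parsed CSV rows rather than raw bytes, so harmless differences in line endings or quoting do not raise false alarms; `first_difference` is a hypothetical helper that takes the two files' contents as strings.

```python
import csv
import io


def csv_rows(text: str) -> list[list[str]]:
    """Parse CSV text into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


def first_difference(expected: str, rebuilt: str):
    """Return (row_number, expected_row, rebuilt_row) for the first mismatch, or None."""
    old, new = csv_rows(expected), csv_rows(rebuilt)
    for number, (a, b) in enumerate(zip(old, new), start=1):
        if a != b:
            return number, a, b
    if len(old) != len(new):
        # One file has extra rows; report the first unmatched row.
        number = min(len(old), len(new)) + 1
        if len(old) > len(new):
            return number, old[number - 1], None
        return number, None, new[number - 1]
    return None


if __name__ == "__main__":
    before = "surname,county\nSmith,Yorkshire\n"
    after = "surname,county\r\nSmith,Yorkshire\r\n"
    # Different line endings, same data: no difference is reported.
    print(first_difference(before, after))
```

In a CI job, a non-None result would be the signal to fail the build and inspect the reported row.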

Key Takeaways

  • Reproducibility = raw input + operations JSON + environment note, all version-controlled.
  • Extract the operations JSON via Undo/Redo and store it beside the raw data.
  • Keep the recipe, not just the cleaned file — it is your auditable provenance.
  • Manual edits, live fetch/reconciliation, and undocumented versions break exact replay.
  • Snapshot external results into columns so remote changes cannot alter your output.
  • Script with openrefine-batch/openrefine-client to put cleaning under CI.

Frequently Asked Questions

What makes an OpenRefine cleaning workflow reproducible?

Reproducibility comes from keeping three things together: the unmodified raw input, the exported operations JSON that records every transformation, and a record of the OpenRefine version and any external services used. With these, anyone can replay your cleaning and get the same output.

How do I export my cleaning steps from OpenRefine?

Open the Undo / Redo tab and click Extract. Select the operations you want and copy the JSON. Save it as a versioned file such as operations.json alongside your raw data so the run can be replayed with Apply.

Why is the operations JSON better than just saving the cleaned file?

The cleaned file is an endpoint with no provenance, whereas the operations JSON is a transparent, auditable recipe. Reviewers can see exactly what you changed, re-run it on corrected source data, and trust the result.

What breaks reproducibility in OpenRefine?

Manual single-cell edits, fetch-URL and reconciliation steps that depend on live external services, and undocumented OpenRefine versions all break exact replay. Record or cache these so the same inputs always produce the same output.

How do I make reconciliation and fetch-URL steps reproducible?

Cache the responses: store the reconciliation results and fetched data as columns in your export, and note the service endpoint and date. If the live service changes later, your cached snapshot still reproduces the original result.

Can I run OpenRefine operations JSON in a scripted pipeline?

Yes. The openrefine-client and openrefine-batch tools let you create a project, apply operations JSON, and export headlessly, so your cleaning becomes a scripted step you can put under version control and rerun in CI.