Cluster and dedupe with OpenRefine: A Practical Guide

To cluster and dedupe with OpenRefine, first normalise inconsistent spellings with the Cluster feature inside a text facet, then remove genuine duplicate rows using a sort-and-blank-down pattern. Clustering fixes values; deduplication removes rows — they are two distinct jobs, and doing them in that order is what makes the result reliable.

Why clustering comes before deduplication

If you delete duplicates first, "Smith, John" and "John Smith" survive as two records because they are not byte-identical. Clustering collapses those variants into one canonical form, which exposes the real duplicates. So the workflow is always: standardise the key column, then remove repeats. Skipping the order is the single most common reason a dedupe job leaves duplicates behind.

How do OpenRefine's clustering methods differ?

Open a text facet, then click Cluster in the panel header. You get two families of method:

Key collision — generates a key for each value and groups values sharing a key. Keying functions: fingerprint (default), ngram-fingerprint, metaphone3, cologne-phonetic, daitch-mokotoff.
Nearest neighbour — compares values pairwise by edit distance. Functions: levenshtein and PPM, both controlled by a radius and block chars parameter.

Method	Speed	Catches	False positives	Run order
Fingerprint	Very fast	case, punctuation, word order	Very low	First
N-gram fingerprint	Fast	minor transpositions	Low	Second
Metaphone3 / phonetic	Fast	sound-alike spellings	Medium	For names
Levenshtein / PPM	Slow	typos, OCR errors	Higher	Last, reviewed

What does a fingerprint actually do?

The fingerprint keying function lowercases the text, strips punctuation, splits into tokens, sorts them, removes duplicates and rejoins them. So all of these collapse to one key:

text

"University of Oxford"   -> "of oxford university"
"OXFORD, University of"  -> "of oxford university"
"oxford  university"     -> "of oxford university"

That is why fingerprint is safe as a first pass: it only merges values that are demonstrably the same words in a different shape.

Walking through a real cluster review

For a column of place names from parish registers:

Facet > Text facet on the place column.
Click Cluster.
Method key collision, keying fingerprint. Review each proposed group.
Tick Merge? only where you are confident; edit the New Cell Value to your authoritative form.
Click Merge Selected & Re-Cluster to keep going on the reduced set.
Switch the method to nearest neighbour, PPM, radius 1.0, and review the typo clusters.

Never tick Select All blindly — phonetic and nearest-neighbour methods will occasionally group "Newton" with "Newtown", which are different villages.

How do you actually remove duplicate rows?

OpenRefine has no one-click "remove duplicates" button. The canonical pattern, once your key column is clustered clean:

Sort by the key column (so duplicates are adjacent) and make the sort permanent via Sort > Reorder rows permanently.
On the key column, Edit cells > Blank down — this empties cells identical to the one above.
Facet > Customized facets > Facet by blank on that column, select true (the now-blank repeats).
All > Edit rows > Remove matching rows.

The first occurrence of each value keeps its data; the blanked repeats are removed.

Verifying the result and exporting the recipe

After dedupe, re-facet the key column: the number of distinct values should equal your remaining row count if every key is unique. Then open the Undo / Redo tab and click Extract to save the clustering and dedupe operations as JSON. Next year's register can be cleaned by re-importing and clicking Apply — a documented, repeatable process rather than a one-off scrub.

Key Takeaways

Cluster (normalise values) first, then deduplicate (remove rows) — order is what makes dedupe complete.
Begin with the fingerprint key-collision method; it is fast and almost never produces false merges.
Reserve nearest-neighbour methods for typos and OCR noise, and review every cluster they propose.
Clustering edits values within one column and never deletes rows by itself.
Remove duplicates with sort, then blank-down, then facet-by-blank, then remove matching rows.
Extract the operation history as JSON so the same cleaning runs reproducibly on future data.

Frequently Asked Questions

What is the difference between clustering and deduplication in OpenRefine?

Clustering groups cell values that probably refer to the same thing so you can merge their spellings into one. Deduplication removes whole rows that are genuine duplicate records. Clustering normalises values; dedupe removes rows.

Which clustering method should I start with?

Start with key collision using the fingerprint keying function. It is fast, conservative and catches case, punctuation and word-order variants with almost no false positives, making it safe to apply across a large column quickly.

When should I use nearest-neighbour clustering instead of key collision?

Use nearest-neighbour (PPM or Levenshtein) when typos and single-character errors remain after key collision. It is slower and needs a tuned radius, so run it second and review every cluster it proposes.

How do I remove duplicate rows in OpenRefine?

Sort or facet by a unique key, use blank-down to flag repeats, then create a facet on the blanked column and remove all matching rows. OpenRefine has no single delete-duplicates button, so this two-step pattern is the standard approach.

Does clustering merge rows or just edit values?

Clustering only edits cell values within one column; it never merges or deletes rows. Choosing a single value for a cluster simply rewrites every member to that value, which then makes true duplicates visible.

Can I trust the merged value OpenRefine suggests?

Treat the suggested value as a default, not a verdict. OpenRefine proposes the most frequent or longest form, but you should set the authoritative spelling yourself, especially for historical names and places where frequency is not authority.

Why clustering comes before deduplication ​

How do OpenRefine's clustering methods differ? ​

What does a fingerprint actually do? ​

Walking through a real cluster review ​

How do you actually remove duplicate rows? ​

Verifying the result and exporting the recipe ​

Key Takeaways ​

Frequently Asked Questions ​

What is the difference between clustering and deduplication in OpenRefine? ​

Which clustering method should I start with? ​

When should I use nearest-neighbour clustering instead of key collision? ​

How do I remove duplicate rows in OpenRefine? ​

Does clustering merge rows or just edit values? ​

Can I trust the merged value OpenRefine suggests? ​

Related reading ​