Data Cleaning with OpenRefine

To reconcile data to Wikidata reliably in OpenRefine, add the built-in Wikidata reconciliation service, set the correct entity type (for example human, Q5), supply supporting properties to disambiguate, and review every match below the confidence threshold before accepting it. The difference between a defensible result and a noisy one is almost entirely about type constraints and evidence, not the matching algorithm.

What reconciliation does and why it matters

Reconciliation links your free-text values to stable identifiers — Wikidata QIDs like Q937 for Albert Einstein. Once a column is reconciled, you can pull in birth dates, coordinates or VIAF IDs automatically, and your data becomes part of the linked-data graph rather than an island of strings. For historians this turns a name list into a queryable, citable resource.

How do you set up the Wikidata service?

In OpenRefine 3.x the Wikibase service is built in. On the column you want to reconcile, choose Reconcile > Start reconciling. If Wikidata is not yet listed, click Add Standard Service and paste the Wikidata reconciliation endpoint, then select it. OpenRefine samples your values and suggests candidate types automatically — but do not accept the suggestion blindly.

Why is setting the correct type the most important step?

A type constraint restricts candidates to one class of entity. Searching "Cambridge" with no type returns a city, a university, a band and a dictionary publisher. Constrain to city (Q515) and the noise vanishes.

Without type:   Cambridge -> 9 candidates, top score 71
With type Q515: Cambridge -> 2 candidates, top score 94

Always pick the narrowest type that still covers every value in the column. For people use human (Q5); for settlements, if villages and towns are mixed, city (Q515) is too narrow, so consider human settlement (Q486972) instead.
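To make the effect of a type constraint concrete, here is a minimal Python sketch of the kind of query a reconciliation client sends. The helper name build_typed_query is hypothetical, and the payload keys (query, type, limit) are assumed to follow the Reconciliation Service API that OpenRefine speaks; check them against the service you actually use.

```python
import json

def build_typed_query(value, type_qid=None, limit=5):
    """Build one reconciliation query; type_qid restricts candidates to one class."""
    query = {"query": value, "limit": limit}
    if type_qid is not None:
        query["type"] = type_qid
    return query

# A batch is keyed by arbitrary query ids (assumed payload shape).
batch = {
    "q0": build_typed_query("Cambridge"),                   # no type: every class competes
    "q1": build_typed_query("Cambridge", type_qid="Q515"),  # constrained to city (Q515)
}
payload = json.dumps(batch, sort_keys=True)
print(payload)
```

The only difference between the two queries is the type key, which is why choosing it well has such a large effect on the candidate list.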

Adding properties to disambiguate

Properties are the evidence that separates two people with the same name. In the reconciliation dialog, map your other columns to Wikidata properties:

  • A birth year column to date of birth (P569).
  • A country column to country of citizenship (P27) for people, or to country (P17) for places.
  • An occupation column to occupation (P106).

Each mapped property raises the score of matching candidates and demotes the rest. In practice, adding one strong property such as date of birth can lift auto-match rates from roughly 50% to over 80% on a person dataset.
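The column-to-property mapping can be sketched in Python as follows. The column names, the COLUMN_TO_PID table and the helper build_person_query are hypothetical, and the properties list of pid/v pairs is assumed to follow the Reconciliation Service API; treat it as an illustration, not as what OpenRefine sends byte for byte.

```python
# Hypothetical mapping from spreadsheet column names to Wikidata property ids.
COLUMN_TO_PID = {"birth_year": "P569", "occupation": "P106"}

def build_person_query(row, name_column="name"):
    """Turn one row (a dict) into a reconciliation query constrained to human (Q5)."""
    properties = [
        {"pid": pid, "v": str(row[column])}
        for column, pid in COLUMN_TO_PID.items()
        if row.get(column) not in (None, "")  # skip empty cells: no evidence to send
    ]
    return {"query": row[name_column], "type": "Q5", "properties": properties}

row = {"name": "Albert Einstein", "birth_year": 1879, "occupation": ""}
query = build_person_query(row)
print(query)
```

Empty cells are skipped rather than sent as blank values, so a row with missing data is reconciled on whatever evidence it does have.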

A reconciliation quality checklist

Step | Check | Why
Type | Narrowest correct class set | Removes wrong-class collisions
Properties | At least one disambiguator mapped | Separates same-name entities
Sample review | 20+ matches inspected by hand | Catches systematic errors early
Threshold | Below-threshold matches reviewed | Avoids false auto-matches
Snapshot | QIDs stored as text | Reproducible if Wikidata changes
New entities | Genuinely missing items flagged | Distinguishes "no match" from "wrong match"
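For the sample-review step, one option is to draw the rows for hand inspection with a fixed random seed, so that a colleague can re-check exactly the same sample. This is a small Python sketch working on exported rows; the function name, the seed and the placeholder rows are all illustrative assumptions.

```python
import random

def sample_for_review(matched_rows, size=20, seed=42):
    """Return a reproducible random sample of matched rows for hand inspection."""
    if len(matched_rows) <= size:
        return list(matched_rows)  # small datasets: review everything
    rng = random.Random(seed)      # fixed seed makes the sample repeatable
    return rng.sample(matched_rows, size)

# Placeholder rows standing in for an export of matched cells.
rows = [{"row": i, "name": f"Name {i}"} for i in range(200)]
sample = sample_for_review(rows)
print(len(sample))
```

Recording the seed alongside the sample size in your project notes lets anyone reproduce the review.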

How should you handle the matches afterwards?

Use the judgment facet: Reconcile > Facets > By judgment. Select "none" to see unmatched rows and "matched" to verify accepted ones. Accept a candidate from the choices listed in the cell, or use Reconcile > Actions > Match each cell to its best candidate only after you have confirmed the threshold is safe for this dataset. Never apply it as a reflex on historical names.
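The review policy described here can be written down as a small decision rule. The Python sketch below assumes each candidate is a dict with a score and a match flag, as in Reconciliation Service API responses; the function name triage, the cut-off of 90 and the candidate ids are hypothetical and should be tuned per dataset.

```python
def triage(candidates, review_below=90.0):
    """Classify one cell's candidates as 'auto', 'review' or 'none'.

    'auto'   : the service flagged the best candidate as a match AND it scores
               at or above the cut-off.
    'review' : there are candidates, but a human should decide.
    'none'   : the service returned no candidates at all.
    """
    if not candidates:
        return "none"
    best = max(candidates, key=lambda c: c["score"])
    if best.get("match") and best["score"] >= review_below:
        return "auto"
    return "review"

# Illustrative candidates with placeholder ids, not real Wikidata items.
print(triage([{"id": "example-1", "score": 94, "match": True}]))
print(triage([{"id": "example-2", "score": 71, "match": False}]))
print(triage([]))
```

Requiring both the service's own match flag and a score cut-off is deliberately conservative: either signal alone sends the cell to manual review.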

Locking in reproducibility

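Reconciliation judgments are stored inside the OpenRefine project and point at Wikidata items that can change, so capture your results as plain data. On the reconciled column, choose Edit column > Add column based on this column and enter: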

GREL: cell.recon.match.id

This writes the matched QID as text. Record the date, the service version and your type/property settings in the project README. Wikidata items get merged and edited constantly; a snapshot of QIDs plus a date is what keeps your dataset citable a year from now.
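Outside OpenRefine, the same snapshot idea can be sketched in Python: store each value with its QID, the reconciliation date and the settings used. The function name write_snapshot, the column layout and the example date are assumptions for illustration only.

```python
import csv
import io
from datetime import date

def write_snapshot(rows, reconciled_on, settings):
    """Serialise matched QIDs plus provenance (date, type, properties) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["value", "qid", "reconciled_on", "type", "properties"])
    for value, qid in rows:
        writer.writerow([
            value,
            qid,
            reconciled_on.isoformat(),
            settings["type"],
            ";".join(settings["properties"]),
        ])
    return buffer.getvalue()

snapshot = write_snapshot(
    [("Albert Einstein", "Q937")],
    date(2024, 1, 15),  # placeholder date; use the day you actually reconciled
    {"type": "Q5", "properties": ["P569"]},
)
print(snapshot)
```

Keeping the settings in every row is redundant but means any extract of the file still carries its own provenance.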

Key Takeaways

  • The Wikidata reconciliation service is built into OpenRefine 3.x — no extension required.
  • Setting the narrowest correct entity type is the highest-impact step for precision.
  • Map supporting properties (date of birth, country, occupation) to disambiguate same-name entities.
  • Review every below-threshold match by hand, especially for historical people and places.
  • Store matched QIDs as plain text with cell.recon.match.id so results survive Wikidata edits.
  • Document type, properties and the reconciliation date for a reproducible, defensible result.

Frequently Asked Questions

Do I need an extension to reconcile against Wikidata?

No. Modern OpenRefine 3.x ships with the Wikibase reconciliation service built in. You simply add the Wikidata endpoint as a reconciliation service the first time and it is then always available.

How do I improve reconciliation match rates?

Set a type constraint and map supporting properties such as a date of birth or a country before reconciling. Together they give the service extra evidence and sharply reduce ambiguous candidates, often lifting auto-match rates from around half to over eighty percent.

Should I auto-match high-scoring candidates?

Auto-match only candidates the service marks as confident, and only after spot-checking a sample. For historical people and places, always review matches scoring below the auto-match threshold manually, because name collisions are common.

What is a reconciliation type and why set one?

A reconciliation type constrains candidates to a class such as human (Q5) or city (Q515). Setting the correct type filters out irrelevant entities of the same name and is the single most effective way to raise precision.

How do I record which Wikidata IDs I matched?

After reconciling, add a column from the reconciled column using the GREL expression cell.recon.match.id to store the QID as plain text, then export it. This preserves your decisions independently of the live reconciliation state.

Can reconciliation results change over time?

Yes, because Wikidata is edited continuously. Always snapshot the matched QIDs and the reconciliation date so your dataset remains reproducible even if entities are later merged, renamed or deleted.
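If you re-run reconciliation later, comparing the new result with your stored snapshot shows exactly which decisions drifted. The Python sketch below assumes both snapshots are simple value-to-QID dictionaries; the function name diff_snapshots and the ids in the example are placeholders, not real Wikidata items.

```python
def diff_snapshots(old, new):
    """Compare two {value: qid} snapshots.

    Returns (changed, missing): values whose QID differs between the two runs,
    and values that were matched before but have no match now.
    """
    changed = {
        value: (old[value], new[value])
        for value in old
        if value in new and old[value] != new[value]
    }
    missing = sorted(value for value in old if value not in new)
    return changed, missing

# Placeholder ids for illustration only.
old = {"Name A": "id-1", "Name B": "id-2", "Name C": "id-3"}
new = {"Name A": "id-1", "Name B": "id-9"}
changed, missing = diff_snapshots(old, new)
print(changed, missing)
```

A changed QID is not necessarily an error, since merged items legitimately redirect, but each one deserves a manual look before you update the snapshot.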