Linked Open Data

When reconciliation goes wrong it is almost always one of four causes: an unreachable or misconfigured service, ambiguous source strings with no disambiguating constraint, a mismatch between your data's time period and the target's present-day worldview, or storing labels instead of stable URIs. Diagnose by isolating one record and testing it by hand against the reconciliation API; the failure mode you reproduce there tells you which fix applies. Below are the recurring problems and the fixes that actually hold.

Why are there zero candidates for everything?

A total blank is an infrastructure problem, not a data problem. Walk this checklist:

  1. Open the service manifest URL directly in a browser. A reconciliation endpoint must return JSON describing its name and identifier space.
  2. Confirm https vs http and no stray trailing characters.
  3. Check for rate limiting; public Wikidata reconciliation throttles aggressively under load.
  4. Test one query by hand:
bash
curl -s "https://wikidata.reconci.link/en/api" \
  --data-urlencode 'queries={"q0":{"query":"Vermeer","type":"Q5"}}' | jq .

If that returns candidates but OpenRefine does not, the problem is your column type or service registration inside the tool, not the network.
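The same hand test can be scripted so that it is repeatable. What follows is a minimal Python sketch using only the standard library. It assumes the endpoint from the curl example and a response that carries a `result` list under each query key, which is how reconciliation services commonly reply. The helper names `build_queries`, `top_candidates`, and `reconcile_one` are illustrative, not part of any tool.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above; swap in your own service URL.
ENDPOINT = "https://wikidata.reconci.link/en/api"


def build_queries(value, type_id=None):
    """Build the form-encoded body for a single-query reconciliation request."""
    query = {"query": value}
    if type_id:
        query["type"] = type_id
    payload = {"queries": json.dumps({"q0": query})}
    return urllib.parse.urlencode(payload).encode()


def top_candidates(response, key="q0", limit=3):
    """Extract (id, name, score) tuples from a parsed reconciliation response."""
    results = response.get(key, {}).get("result", [])
    return [(c.get("id"), c.get("name"), c.get("score")) for c in results[:limit]]


def reconcile_one(value, type_id=None, endpoint=ENDPOINT, timeout=10):
    """POST one query and return its top candidates (performs a network call)."""
    request = urllib.request.Request(endpoint, data=build_queries(value, type_id))
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return top_candidates(json.load(reply))
```

Calling `reconcile_one("Vermeer", "Q5")` should list a few candidates if the service is reachable. An empty list or an exception here points at the service or the network rather than at OpenRefine.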

Why does it confidently pick the wrong entity?

High-confidence wrong matches come from ambiguous strings reconciled without constraints. "John Smith" has thousands of candidates; the service returns the most prominent, not the right one. Fix it with two levers:

  • A type constraint: restrict to Q5 (human), Q3957 (town), and so on.
  • Property mappings: feed an additional column as evidence.
text
Column "name"      -> primary query
Column "born"      -> property P569 (date of birth)
Column "role"      -> property P106 (occupation)

With a birth year attached, two homonymous painters separate cleanly and the correct candidate jumps to the top.
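Outside the OpenRefine dialog, the column mapping above corresponds to a query object carrying a `type` and a list of `properties`. The sketch below builds such an object in Python. It assumes the `pid`/`v` property shape used by the Reconciliation Service API; the function name and the sample row are hypothetical.

```python
import json


def build_constrained_query(name, type_id, evidence):
    """Build one reconciliation query with a type constraint and property evidence.

    `evidence` maps property IDs (for example "P569") to cell values.
    Blank cells are skipped so that they are not sent as evidence.
    """
    properties = [
        {"pid": pid, "v": value}
        for pid, value in evidence.items()
        if value not in (None, "")
    ]
    query = {"query": name, "type": type_id}
    if properties:
        query["properties"] = properties
    return query


# Hypothetical row: the name alone is ambiguous, the extra columns narrow it.
row = {"name": "John Smith", "born": "1781", "role": "painter"}
query = build_constrained_query(
    row["name"], "Q5", {"P569": row["born"], "P106": row["role"]}
)
print(json.dumps({"q0": query}, indent=2))
```

How strictly a service weighs each property varies, so it is worth checking the ranking on a few known records before running the whole column.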

How do I reconcile historical places that have changed or vanished?

Reconciling a 1648 settlement against a present-day place database fails because the name, the polity, and sometimes the location have all changed. Use a date-aware source.

Source                     | Strength                     | Watch out for
World Historical Gazetteer | Period-aware, variant names  | Coverage uneven by region
GeoNames                   | Huge, alternate-name table   | Present-day bias
Pleiades                   | Ancient world authority      | Classical scope only
Wikidata                   | Links onward to everything   | Modern entity per place

Match on historical variants, then store both the resolved URI and the asserted period so that a later reader knows the link is time-bounded.
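One way to keep the period attached to the link is to store them together in a single record. The following Python sketch is an assumption about how that record might look: the class name, field names, and sample values are all hypothetical, and the URI is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlaceLink:
    """A reconciled place link together with the period it is asserted for."""

    source_name: str           # name as it appears in the source record
    uri: str                   # resolved gazetteer URI
    valid_from: Optional[int]  # first year the link is asserted for, if known
    valid_to: Optional[int]    # last year the link is asserted for, if known

    def covers(self, year: int) -> bool:
        """Return True if `year` falls inside the asserted period."""
        after_start = self.valid_from is None or year >= self.valid_from
        before_end = self.valid_to is None or year <= self.valid_to
        return after_start and before_end


# Hypothetical record; the name, years, and URI are placeholders.
link = PlaceLink("Example Settlement", "https://example.org/place/123", 1600, 1700)
print(link.covers(1648), link.covers(1750))  # prints: True False
```

Leaving either bound as `None` records an open-ended period without inventing a date.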

Should I trust the auto-match score?

No blanket threshold is safe. The reliable rule: auto-accept only where the service returns exactly one candidate flagged as an exact match; queue everything else for human judgement. A facet in OpenRefine makes this fast.

text
Facet by judgement -> "none"      : needs review
Facet by candidate count -> ">1"  : ambiguous, review

Confident-but-wrong matches are the most expensive errors because nothing downstream flags them.
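If you process service responses in a script rather than in OpenRefine, the same rule can be written as a small filter. This Python sketch assumes each candidate is a dict with an `id` and a boolean `match` flag, and that each reply holds its candidates under `result`; the function names are illustrative.

```python
def triage(candidates):
    """Return "accept" only for a single candidate flagged as an exact match."""
    if len(candidates) == 1 and candidates[0].get("match") is True:
        return "accept"
    return "review"


def split_by_triage(results):
    """Split a {query_key: reply} mapping into accepted IDs and a review queue."""
    accepted, review = {}, []
    for key, reply in results.items():
        candidates = reply.get("result", [])
        if triage(candidates) == "accept":
            accepted[key] = candidates[0]["id"]
        else:
            review.append(key)
    return accepted, review


# Hypothetical replies: one clean match, one ambiguous, one empty.
replies = {
    "q0": {"result": [{"id": "Q1", "match": True}]},
    "q1": {"result": [{"id": "Q2", "match": True}, {"id": "Q3", "match": False}]},
    "q2": {"result": []},
}
print(split_by_triage(replies))
```

Note that the score is deliberately ignored: a high score on an ambiguous string still goes to review.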

Why did good matches rot after a few months?

Two root causes. First, you may have stored the label ("Rembrandt") instead of the URI; labels are not identifiers. Second, the target entity was merged or deleted upstream. Defences:

  • Persist the canonical URI in its own column at reconciliation time.
  • Periodically run a validation pass that fetches each URI and checks for a redirect (such as HTTP 301) or HTTP 410 Gone.
python
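import requests

def still_live(uri):
    """Return (status_code, redirect_target) for a URI without following redirects."""
    # HEAD skips the response body; a 3xx status plus a Location header signals a move.
    r = requests.head(uri, allow_redirects=False, timeout=10)
    return r.status_code, r.headers.get("location")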

A merge shows up as a redirect; act on it by updating to the surviving URI.
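Turning a status code and a redirect target into an action takes a little interpretation. The sketch below is one possible reading, not a rule from any particular service. It assumes that a redirect whose target ends in the same identifier is only a protocol or format hop, and that a redirect to a different identifier indicates a merge. Treating 404 like 410 is also an assumption, and the function names are hypothetical.

```python
from urllib.parse import urlparse

REDIRECT_CODES = {301, 302, 303, 307, 308}
GONE_CODES = {404, 410}  # 404 is included here as an assumption


def last_segment(uri):
    """Return the final path segment of a URI, e.g. the entity identifier."""
    return urlparse(uri).path.rstrip("/").rsplit("/", 1)[-1]


def classify(uri, status, location=None):
    """Map a HEAD result to (verdict, target) for a stored URI."""
    if status in GONE_CODES:
        return ("gone", None)
    if status in REDIRECT_CODES and location:
        if last_segment(location) == last_segment(uri):
            return ("live", uri)       # same identifier: keep the stored URI
        return ("merged", location)    # different identifier: follow up on target
    if 200 <= status < 300:
        return ("live", uri)
    return ("unknown", None)           # e.g. server errors: retry later


stored = "http://example.org/entity/Q1"  # placeholder URI
print(classify(stored, 301, "https://example.org/entity/Q1"))
print(classify(stored, 301, "https://example.org/entity/Q2"))
print(classify(stored, 410))
```

For a "merged" verdict, treat the target as a lead to the surviving URI and confirm it before overwriting the stored value.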

Reconciliation is painfully slow. What helps?

Most slowness is wasted work. If 200,000 rows contain only 4,000 distinct values, reconcile the distinct set, then join back. Practical steps:

  1. Cluster and dedupe the column first.
  2. Reconcile the unique values only.
  3. Increase batch size if the service allows it.
  4. Cache the value-to-URI map so reruns are instant.
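Steps 2 and 4 can be sketched in a few lines of Python. The `reconcile` argument is a stand-in for whatever performs the real service call, and `fake_reconcile` and its example.org URIs are hypothetical. Clustering (step 1) and batching (step 3) are left out.

```python
def reconcile_column(values, reconcile, cache=None):
    """Reconcile only distinct values, then map results back onto every row.

    `reconcile` is any callable taking a string and returning a URI or None.
    `cache` is a dict that can be kept between runs to skip known values.
    """
    cache = {} if cache is None else cache
    for value in dict.fromkeys(values):  # distinct values, first-seen order
        if value not in cache:
            cache[value] = reconcile(value)
    return [cache[value] for value in values]


calls = []


def fake_reconcile(value):
    """Stand-in for the service call; records each value it is asked about."""
    calls.append(value)
    return f"https://example.org/entity/{value.lower()}"


rows = ["Delft", "Leiden", "Delft", "Delft", "Leiden"]
uris = reconcile_column(rows, fake_reconcile)
print(len(rows), "rows,", len(calls), "service calls")  # prints: 5 rows, 2 service calls
```

Saving the cache dict to disk, for example as JSON, lets a rerun reuse earlier answers and call the service only for values it has not seen.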

Key Takeaways

  • Isolate one record and hit the API by hand; the reproduced failure names the cause.
  • Zero candidates everywhere is almost always a service URL, protocol, or rate-limit issue.
  • Cure confident-wrong matches with type constraints plus disambiguating property columns.
  • Reconcile historical places against date-aware gazetteers and record the period.
  • Auto-match only single exact candidates; review the rest by hand.
  • Store stable URIs, never labels, and revalidate to catch merges and deletions.
  • Deduplicate before reconciling to cut wasted API calls dramatically.

Frequently Asked Questions

Why does OpenRefine match a person to the wrong entity with high confidence?

Usually because the name string is ambiguous and no type or property constraint is applied. Add a type filter (e.g. human) and supply a disambiguating property like a birth year or occupation column so the reconciliation service ranks the right candidate first.

My reconciliation service returns zero candidates for everything. What is wrong?

Check the endpoint URL and that the service manifest loads in a browser; a trailing slash, an http/https mismatch, or a CORS or rate-limit block will silently zero out results. Test one known value by hand against the API before blaming your data.

How do I reconcile place names that no longer exist?

Reconcile against a date-aware historical gazetteer such as the World Historical Gazetteer (or Pleiades for the ancient world) rather than a present-day place database. GeoNames can supply alternate names but carries a present-day bias. Match on historical name variants and record the period so a renamed or vanished settlement still resolves.

Should I auto-match everything above a score threshold?

No. Auto-match only exact, unambiguous single candidates, and review the rest manually. A blanket threshold silently accepts confident-but-wrong matches, which are the hardest errors to find later.

Why do my matched IDs break a few months after reconciliation?

Either the target entity was merged or deleted, or you stored a label instead of a stable URI. Always persist the canonical URI, not the display name, and periodically revalidate against the source to catch merges and redirects.

Reconciliation is extremely slow on a large file. How do I speed it up?

Batch and deduplicate first: reconcile distinct values only, raise the service's batch size if allowed, and cache results locally. Reconciling 200,000 rows where only 4,000 values are unique wastes most of the calls.