When linking entities to Wikidata goes wrong, the symptom is almost always one of four things: a confident link to the wrong item, no candidates returned, silent rate-limiting, or links that look fine but fail on type and date. The fix in nearly every case is to stop ranking on name similarity alone and add structured constraints — type, time, and place — to your candidate query. This page is a diagnostic walkthrough for each failure mode.
Wikidata linking is seductive because the easy cases work instantly. The danger is that the easy cases lull you into trusting the hard ones, where a 1640s clothier quietly links to a modern namesake.
Why did my entity link to the wrong item?
The classic root cause: the matcher scored candidates on label string similarity and picked the most prominent match, not the correct one. Prominence signals such as sitelink and statement counts track how famous an item is, not whether it is your obscure subject.
The fix is to constrain candidates structurally. In a SPARQL candidate query, require the right type and a plausible lifespan:
```sparql
SELECT ?item ?itemLabel WHERE {
  ?item rdfs:label "John Wright"@en ;
        wdt:P31 wd:Q5 .                 # instance of human
  OPTIONAL { ?item wdt:P569 ?birth. }   # date of birth
  FILTER(!BOUND(?birth) || YEAR(?birth) < 1700)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```

If two candidates survive the constraints, that is a genuine disambiguation problem: escalate to human review rather than auto-picking.
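The constrain-then-escalate rule can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names `build_candidate_query` and `decide` and the `born_before` parameter are hypothetical, and only the ids P31, Q5, and P569 come from the query above.

```python
def build_candidate_query(name, born_before=1700, lang="en"):
    """Build a SPARQL candidate query for a human with a plausible birth year.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Escape backslashes and double quotes so the name is a valid SPARQL literal.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item rdfs:label "{safe}"@{lang} ;\n'
        "        wdt:P31 wd:Q5 .\n"
        "  OPTIONAL { ?item wdt:P569 ?birth . }\n"
        f"  FILTER(!BOUND(?birth) || YEAR(?birth) < {int(born_before)})\n"
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}". }}\n'
        "}"
    )


def decide(candidate_qids):
    """Map surviving candidates to an action: no link, auto-link, or human review."""
    unique = sorted(set(candidate_qids))
    if not unique:
        return ("no_link", None)
    if len(unique) == 1:
        return ("link", unique[0])
    # More than one survivor is a genuine ambiguity: never auto-pick.
    return ("review", unique)
```

Keeping the decision separate from the query makes the escalation rule easy to audit: the only path to an automatic link is exactly one surviving candidate.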
Why are no candidates coming back?
Walk these causes in order:
| Symptom | Likely cause | Fix |
|---|---|---|
| Empty result, valid name | Type filter too tight | Drop wdt:P31 temporarily |
| Empty for diacritic names | Exact label mismatch | Normalise/fold the string, try skos:altLabel |
| Error mentioning a property | Wrong P-number | Verify the property id on Wikidata |
| Works in browser, not in code | Missing User-Agent | Set a descriptive header |
Archaic spelling is a frequent culprit. "Shakspere" will not match a label of "William Shakespeare"; query against skos:altLabel as well as rdfs:label to catch variants.
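One way to handle diacritics and spelling variants is sketched below. It assumes two hypothetical helpers: `fold`, which strips diacritics for local comparison of strings, and `build_variant_query`, which asks for any of several spellings on either `rdfs:label` or `skos:altLabel`. Label matching in SPARQL is exact, so the variants are sent as written and the folding is only for comparing results on your side.

```python
import unicodedata


def fold(text):
    """Strip diacritics and normalise case and whitespace for tolerant comparison."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.casefold().split())


def build_variant_query(variants, lang="en"):
    """Match any spelling variant against rdfs:label or skos:altLabel."""
    literals = " ".join(
        '"{}"@{}'.format(v.replace("\\", "\\\\").replace('"', '\\"'), lang)
        for v in variants
    )
    return (
        "SELECT DISTINCT ?item WHERE {\n"
        f"  VALUES ?name {{ {literals} }}\n"
        "  ?item rdfs:label|skos:altLabel ?name .\n"
        "}"
    )
```

Whether a given archaic spelling is recorded as an alias depends on the item, so an empty result here still does not prove the item is missing.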
How do I stop getting rate-limited?
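Wikidata's endpoints throttle aggressive clients. Symptoms are HTTP 429 responses or sudden timeouts mid-batch. The main mitigations, shown together here, are caching, a delay between requests, and a descriptive User-Agent:

```python
import time
import requests

WDQS = "https://query.wikidata.org/sparql"
HEADERS = {"User-Agent": "ProsopographyProject/1.0 ([email protected])"}
_CACHE = {}  # name -> parsed JSON; lives for the whole run

def q(name):
    # Minimal stand-in query builder; swap in your own type and date constraints.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    return 'SELECT ?item WHERE { ?item rdfs:label "%s"@en ; wdt:P31 wd:Q5 . }' % safe

def reconcile(name):
    if name in _CACHE:
        return _CACHE[name]  # cache hit: no request, no delay
    r = requests.get(WDQS, params={"query": q(name), "format": "json"},
                     headers=HEADERS, timeout=30)
    r.raise_for_status()  # surface 429s and 5xx instead of caching an error body
    time.sleep(0.5)  # be polite after every real request
    _CACHE[name] = r.json()
    return _CACHE[name]
```

Caching is the biggest win: a corpus repeats names constantly, so memoising can turn thousands of calls into hundreds. For large jobs, prefer the OpenRefine Wikidata reconciliation service, which batches and respects limits for you.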
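When a 429 does arrive, a common complement to caching and delays is to wait and retry. The sketch below is one possible approach, not a prescribed one: `wait_seconds` and `get_with_retry` are hypothetical names, `send` is any zero-argument callable that performs the request, and the code assumes the server may send a `Retry-After` header expressed in seconds.

```python
import time


def wait_seconds(retry_after, attempt, base=1.0, cap=60.0):
    """Prefer the server's Retry-After (in seconds); else back off exponentially."""
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # Retry-After may be an HTTP date; fall back to backoff
    return min(base * (2 ** attempt), cap)


def get_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    `send` must return an object with `.status_code` and `.headers`,
    for example a lambda wrapping requests.get.
    """
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(wait_seconds(response.headers.get("Retry-After"), attempt))
    return response  # still 429 after all attempts; the caller decides what to do
```

Passing the request in as a callable keeps the retry logic independent of any one HTTP library and makes it testable without network access.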
Should I store QIDs or labels?
Always store the QID. A QID like Q5598 is a stable identifier: even when two items are merged, the old QID redirects to the surviving item. The English label, by contrast, can be edited or replaced tomorrow. Keep the label alongside purely so humans can read the record, and treat it as a cache that may go stale.
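As an illustration, a link record might keep the QID as the key and the label as a dated cache. The class and field names here (`WikidataLink`, `cached_label`, `label_fetched`) are assumptions for the sketch, not an established schema.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# A QID is the letter Q followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


@dataclass(frozen=True)
class WikidataLink:
    qid: str                              # the actual link
    cached_label: Optional[str] = None    # display only; may go stale
    label_fetched: Optional[date] = None  # when the label was last copied

    def __post_init__(self):
        # Reject labels or URLs accidentally stored in the QID field.
        if not QID_PATTERN.fullmatch(self.qid):
            raise ValueError(f"not a QID: {self.qid!r}")
```

Recording the fetch date makes it straightforward to refresh stale labels later without touching the link itself.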
What if the right item simply does not exist?
This is the most common situation in archival work and the one people handle worst. The overwhelming majority of ordinary historical people, minor places, and local institutions are not in Wikidata. The correct response is to mint a local stable identifier and leave the Wikidata field null — not to attach the nearest plausible QID. A wrong link is harder to detect and undo than an honest blank.
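Minting a local identifier can be as simple as the sketch below. It assumes each entity has a key that is unique within your archive (the `source_key` argument) and derives a deterministic UUID from it, so re-running the pipeline yields the same id. The namespace URL, the `local:` prefix, and the field names are hypothetical.

```python
import uuid

# Assumption: a namespace derived from a URL your project controls (hypothetical).
PROJECT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/prosopography")


def mint_local_id(source_key):
    """Derive a stable local id from a key that is unique in your archive."""
    return "local:" + str(uuid.uuid5(PROJECT_NAMESPACE, source_key))


def make_record(source_key, name, qid=None):
    """Every record gets a local id; the Wikidata field stays None when no item fits."""
    return {"local_id": mint_local_id(source_key), "name": name, "wikidata_qid": qid}
```

Because every record has a local id regardless of whether a QID exists, a blank Wikidata field never leaves an entity unidentifiable.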
How do I validate a batch of links?
Pull a random sample of at least 50 links and open each Wikidata item. For every one, confirm three facts agree with your entity: type (human, place, organisation), dates, and place. Record precision. For historical persons, aim above 0.95 — wrong links propagate into every downstream query and are rarely caught later. Where you are unsure, downgrade to a "candidate" status rather than a firm link.
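The sampling and precision arithmetic can be sketched as follows. The function names are illustrative, and the three booleans passed to `is_correct` are assumed to come from a human reviewer who has opened the item.

```python
import random


def draw_sample(links, size=50, seed=0):
    """Pick a reproducible random sample; returns everything if the batch is small."""
    rng = random.Random(seed)
    return rng.sample(links, min(size, len(links)))


def is_correct(type_ok, dates_ok, place_ok):
    """A link counts as correct only if type, dates, and place all agree."""
    return type_ok and dates_ok and place_ok


def precision(judgements):
    """Share of sampled links judged correct (an iterable of booleans)."""
    judgements = list(judgements)
    if not judgements:
        raise ValueError("no judgements recorded")
    return sum(judgements) / len(judgements)
```

For example, 48 correct links in a sample of 50 gives a precision of 0.96, just above the 0.95 target. A fixed seed lets a second reviewer re-draw the same sample.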
Key Takeaways
- Wrong-item links come from name-only ranking; add type, date, and place constraints.
- Empty candidates usually mean a too-tight filter or an unnormalised string.
- Cache and throttle to avoid rate limits; reuse the OpenRefine reconciler.
- Store the QID as the link; keep the label only as a convenience.
- If no correct item exists, leave the link blank and mint a local id.
- Validate batches by sampling and checking type, dates, and place.
- A wrong link is worse than no link — escalate ambiguity to humans.
Frequently Asked Questions
Why does my entity match the wrong Wikidata item?
Usually because the matcher ranked on label similarity alone and picked a famous namesake. Add type and date constraints to the candidate query so a 17th-century weaver cannot match a modern footballer.
What do I do when a person has no Wikidata item at all?
Most archival people do not. Leave the link empty and mint a local identifier instead. Do not force a link to a vaguely similar item — a wrong link is worse than no link.
Why is my reconciliation returning no candidates?
Common causes are an over-restrictive type filter, a misspelled property id, or an entity name with diacritics or archaic spelling. Relax the type, normalise the string, and retry.
How do I avoid getting rate-limited by the Wikidata API?
Batch your queries, add a small delay, set a descriptive User-Agent header, and prefer the reconciliation service or SPARQL endpoint over many single-item calls. Cache results so you never ask twice.
Should I store the QID or the label?
Store the QID (e.g. Q42) as the stable link and keep the label only as a human-readable convenience. Labels change; QIDs are persistent identifiers.
How do I check a batch of links is correct?
Sample at least 50 links, open each item, and confirm type, dates, and place match your entity. Track precision; for historical people anything below about 0.95 needs review because wrong links propagate silently.