Troubleshooting: Link entities to Wikidata

When linking entities to Wikidata goes wrong, the symptom is almost always one of four things: a confident link to the wrong item, no candidates returned, silent rate-limiting, or links that look fine but fail on type and date. The fix in nearly every case is to stop ranking on name similarity alone and add structured constraints — type, time, and place — to your candidate query. This page is a diagnostic walkthrough for each failure mode.

Wikidata linking is seductive because the easy cases work instantly. The danger is that the easy cases lull you into trusting the hard ones, where a 1640s clothier quietly links to a modern namesake.

Why did my entity link to the wrong item?

The classic root cause: the matcher scored candidates on label string similarity and picked the most prominent match, not the correct one. Prominence (sitelinks, statements) correlates with fame, not with your obscure subject.

The fix is to constrain candidates structurally. In a SPARQL candidate query, require the right type and a plausible lifespan:

```sparql
SELECT ?item ?itemLabel WHERE {
  ?item rdfs:label "John Wright"@en ;
        wdt:P31 wd:Q5 .                 # instance of human
  OPTIONAL { ?item wdt:P569 ?birth. }   # date of birth
  FILTER(!BOUND(?birth) || YEAR(?birth) < 1700)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```

If two candidates survive the constraints, that is a genuine disambiguation problem — escalate to human review rather than auto-picking.
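The escalation rule can be written down as a small decision function. This is a minimal sketch rather than part of any Wikidata tooling: the `decide` name and the status strings are illustrative assumptions, and `candidates` is assumed to be the list of QIDs that survived the type and date constraints.

```python
def decide(candidates):
    """Turn the surviving candidate QIDs into a linking decision.

    Returns a (status, qid) pair. The status strings are illustrative.
    """
    unique = sorted(set(candidates))  # one item can appear in several result rows
    if not unique:
        return ("no_match", None)      # nothing survived: leave the link empty
    if len(unique) == 1:
        return ("linked", unique[0])   # exactly one survivor: safe to link
    return ("needs_review", None)      # several survivors: a human decides


# Placeholder QIDs, used only to show the three outcomes.
print(decide([]))                  # ('no_match', None)
print(decide(["Q111", "Q111"]))    # ('linked', 'Q111')
print(decide(["Q111", "Q222"]))    # ('needs_review', None)
```

Returning an explicit status instead of silently taking the first row keeps ambiguous cases visible in the output, so they can be queued for review.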

Why are no candidates coming back?

Walk these causes in order:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Empty result, valid name | Type filter too tight | Drop `wdt:P31` temporarily |
| Empty for diacritic names | Exact label mismatch | Normalise/fold the string, try `skos:altLabel` |
| Error mentioning a property | Wrong P-number | Verify the property id on Wikidata |
| Works in browser, not in code | Missing User-Agent | Set a descriptive header |

Archaic spelling is a frequent culprit. "Shakspere" will not match a label of "William Shakespeare"; query against `skos:altLabel` as well as `rdfs:label` to catch variants.
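A minimal sketch of both fixes, under stated assumptions: `fold` and `label_or_alias_pattern` are hypothetical helper names, not Wikidata APIs. Folding is useful for comparing your string with labels you have already retrieved, since an exact-match query still needs the exact stored string. The query fragment uses a SPARQL 1.1 property path so one pattern covers both the label and the aliases.

```python
import unicodedata


def fold(text):
    """Strip combining marks and case so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def label_or_alias_pattern(name, lang="en"):
    """Build a SPARQL triple pattern matching `name` as a label or an alias."""
    # Escape backslashes and quotes so the name cannot break the string literal.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'?item rdfs:label|skos:altLabel "{safe}"@{lang} .'


print(fold("Dvořák") == fold("DVORAK"))        # True
print(label_or_alias_pattern("Shakspere"))
# ?item rdfs:label|skos:altLabel "Shakspere"@en .
```

The fragment replaces the `rdfs:label` line in a candidate query; the type and date constraints stay as they are.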

How do I stop getting rate-limited?

Wikidata's endpoints throttle aggressive clients. Symptoms are HTTP 429 responses or sudden timeouts mid-batch. The main mitigations are a cache, a descriptive User-Agent header, and a short delay between live requests:

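```python
import time

import requests

WDQS = "https://query.wikidata.org/sparql"  # Wikidata Query Service endpoint
HEADERS = {"User-Agent": "ProsopographyProject/1.0 ([email protected])"}
_cache = {}  # name -> parsed JSON results, shared across calls

def q(name):
    # Assumed helper (not shown in the original): builds a minimal candidate query.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')  # keep the literal intact
    return f'SELECT ?item WHERE {{ ?item rdfs:label "{safe}"@en ; wdt:P31 wd:Q5 . }}'

def reconcile(name):
    if name in _cache:             # repeated names never hit the network
        return _cache[name]
    r = requests.get(WDQS, params={"query": q(name), "format": "json"},
                     headers=HEADERS, timeout=30)
    r.raise_for_status()           # never cache an error response such as a 429
    time.sleep(0.5)                # be polite between live requests
    _cache[name] = r.json()
    return _cache[name]
```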

Caching is the biggest win: a corpus repeats names constantly, so memoising turns thousands of calls into hundreds. For large jobs, prefer the OpenRefine Wikidata reconciliation service, which batches and respects limits for you.
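When a 429 does arrive, retrying immediately makes things worse. The sketch below shows one way to back off, assuming the server may send a `Retry-After` header in seconds and falling back to exponential delays otherwise. `get_with_backoff` and the `send` callable are hypothetical names; `send` stands in for whatever function performs the request and returns `(status_code, headers, body)`.

```python
import time


def get_with_backoff(send, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning HTTP 429, waiting between tries.

    `send` is a zero-argument callable returning (status_code, headers, body).
    `sleep` is injectable so the waiting can be observed without real delays.
    """
    for attempt in range(max_tries):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_tries - 1:
            break                      # out of tries: do not wait again
        # Use the server's hint when it is a plain number of seconds;
        # otherwise wait base_delay, then double it on each further attempt.
        hint = headers.get("Retry-After", "")
        delay = float(hint) if hint.isdigit() else base_delay * 2 ** attempt
        sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_tries} tries")


# Simulated responses: two throttled replies, then success.
replies = iter([(429, {"Retry-After": "2"}, ""), (429, {}, ""), (200, {}, "ok")])
waits = []
print(get_with_backoff(lambda: next(replies), sleep=waits.append))  # (200, 'ok')
print(waits)                                                        # [2.0, 2.0]
```

Raising after the last try, rather than returning an empty result, keeps a throttled batch from being mistaken for a batch with no candidates.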

Should I store QIDs or labels?

Always store the QID. A QID like Q5598 is a persistent identifier, whereas the English label can be edited at any time. If two items are merged, the old QID remains as a redirect to the surviving item. Keep the label alongside purely so humans can read the record, and treat it as a cache that may go stale.
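One way to enforce this in a record is to validate the QID on the way in and to store the label with the date it was copied. This is a sketch under assumptions: `parse_qid`, `WikidataLink`, and the field names are illustrative, not a standard schema.

```python
import re
from dataclasses import dataclass
from datetime import date

QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def parse_qid(value):
    """Return the QID from a bare id or an entity URI, or raise ValueError."""
    candidate = value.rstrip("/").rsplit("/", 1)[-1]  # keep the last path segment
    if not QID_PATTERN.fullmatch(candidate):
        raise ValueError(f"not a QID: {value!r}")
    return candidate


@dataclass
class WikidataLink:
    qid: str             # the stable link
    label: str           # cached display text, may go stale
    label_fetched: date  # when the label was last copied from Wikidata


link = WikidataLink(
    qid=parse_qid("http://www.wikidata.org/entity/Q42"),
    label="Douglas Adams",
    label_fetched=date(2024, 1, 1),  # example date only
)
print(link.qid)  # Q42
```

Rejecting anything that is not a QID at write time stops a label or a URL fragment from ending up in the link column.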

What if the right item simply does not exist?

This is the most common situation in archival work and the one people handle worst. The overwhelming majority of ordinary historical people, minor places, and local institutions are not in Wikidata. The correct response is to mint a local stable identifier and leave the Wikidata field null — not to attach the nearest plausible QID. A wrong link is harder to detect and undo than an honest blank.
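A sketch of what "mint a local id and leave the field null" can look like. Everything named here is an assumption for illustration: the `example.org` namespace URL, the `EntityRecord` fields, and the idea of deriving the id from an archival source reference so that re-running the pipeline yields the same id.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical project namespace; replace with a URL your project controls.
PROJECT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/prosopography/")


@dataclass
class EntityRecord:
    local_id: str
    name: str
    wikidata_qid: Optional[str] = None  # None means "no matching item found"


def mint_record(source_ref, name):
    """Create a record whose local id is derived from the source reference."""
    # uuid5 is deterministic: the same source_ref always yields the same id.
    local_id = f"local:{uuid.uuid5(PROJECT_NAMESPACE, source_ref)}"
    return EntityRecord(local_id=local_id, name=name)


record = mint_record("parish-register/1643/f12r", "John Wright")  # made-up reference
print(record.wikidata_qid)  # None
```

If the person is added to Wikidata later, the QID can be filled in without changing the local id that other records already point to.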

How do I validate a batch of links?

Pull a random sample of at least 50 links and open each Wikidata item. For every one, confirm three facts agree with your entity: type (human, place, organisation), dates, and place. Record precision, meaning the number of correct links divided by the number of links checked. For historical persons, aim above 0.95, because wrong links propagate into every downstream query and are rarely caught later. Where you are unsure, downgrade to a "candidate" status rather than a firm link.
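The sampling and the precision figure are simple enough to script. A minimal sketch, assuming the links are held in a list and that each manual check is recorded as `True` (correct) or `False`; the function names are illustrative.

```python
import random


def draw_sample(links, size=50, seed=0):
    """Pick a reproducible random sample of links for manual checking."""
    rng = random.Random(seed)  # a fixed seed lets a second reviewer redraw the same sample
    return rng.sample(links, min(size, len(links)))


def precision(verdicts):
    """Share of checked links judged correct; `verdicts` is a list of booleans."""
    if not verdicts:
        raise ValueError("no links were checked")
    return sum(verdicts) / len(verdicts)


links = [f"link-{i}" for i in range(200)]  # stand-in for real link records
sample = draw_sample(links)
print(len(sample))                             # 50
print(precision([True] * 48 + [False] * 2))    # 0.96
```

With 48 of 50 links correct the batch sits just above the 0.95 target; the two failures should still be traced, since they may share a cause that affects unsampled links.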

Key Takeaways

Wrong-item links come from name-only ranking; add type, date, and place constraints.
Empty candidates usually mean a too-tight filter or an unnormalised string.
Cache and throttle to avoid rate limits; reuse the OpenRefine reconciler.
Store the QID as the link; keep the label only as a convenience.
If no correct item exists, leave the link blank and mint a local id.
Validate batches by sampling and checking type, dates, and place.
A wrong link is worse than no link — escalate ambiguity to humans.

Frequently Asked Questions

Why does my entity match the wrong Wikidata item?

Usually because the matcher ranked on label similarity alone and picked a famous namesake. Add type and date constraints to the candidate query so a 17th-century weaver cannot match a modern footballer.

What do I do when a person has no Wikidata item at all?

Most archival people do not. Leave the link empty and mint a local identifier instead. Do not force a link to a vaguely similar item — a wrong link is worse than no link.

Why is my reconciliation returning no candidates?

Common causes are an over-restrictive type filter, a misspelled property id, or an entity name with diacritics or archaic spelling. Relax the type, normalise the string, and retry.

How do I avoid getting rate-limited by the Wikidata API?

Batch your queries, add a small delay, set a descriptive User-Agent header, and prefer the reconciliation service or SPARQL endpoint over many single-item calls. Cache results so you never ask twice.

Should I store the QID or the label?

Store the QID (e.g. Q42) as the stable link and keep the label only as a human-readable convenience. Labels change; QIDs are persistent identifiers.

How do I check a batch of links is correct?

Sample at least 50 links, open each item, and confirm type, dates, and place match your entity. Track precision; for historical people anything below about 0.95 needs review because wrong links propagate silently.
