Managing Wikidata data quality is mostly a troubleshooting loop: detect the defect, trace it to a root cause, fix it in a way that holds, and add a guard so it does not recur. The five problems that account for most heritage-data pain are duplicate items, constraint violations, missing references, wrong-property statements, and ambiguous links. This guide walks each from symptom to durable fix.
Why do duplicate items appear and how do I merge them?
Duplicates usually come from importing without reconciliation: a new item is created for a person who already exists. Symptoms are split statement counts and two Q-numbers in search.
To diagnose, query for likely twins:
```sparql
SELECT ?a ?aLabel ?b ?bLabel WHERE {
  ?a wdt:P31 wd:Q5 ; rdfs:label ?name .
  ?b wdt:P31 wd:Q5 ; rdfs:label ?name .
  FILTER(?a != ?b && STR(?a) < STR(?b))
  FILTER(LANG(?name) = "en")
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 50
```
Fix with the Merge gadget (enable it in Preferences → Gadgets): it moves statements onto the lower QID and leaves a redirect. The durable guard is to reconcile against Wikidata before every import.
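The same twin check can run locally before anything is uploaded. The following is a minimal Python sketch, not a replacement for a reconciliation service: it assumes each record is a dict with hypothetical keys `id`, `label` and `birth_year`, and it treats an identical normalised label plus birth year as a likely duplicate.

```python
import unicodedata
from collections import defaultdict


def name_key(label: str) -> str:
    """Normalise a label: strip accents, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def likely_twins(records):
    """Group record ids whose normalised label and birth year coincide.

    `records` is an iterable of dicts with the assumed keys
    'id', 'label' and 'birth_year'.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(name_key(rec["label"]), rec["birth_year"])].append(rec["id"])
    # Only groups with more than one member are candidate duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


# Illustrative data: two import rows plus one row exported from Wikidata.
rows = [
    {"id": "row-1", "label": "José  Martínez", "birth_year": 1788},
    {"id": "existing-1", "label": "Jose Martinez", "birth_year": 1788},
    {"id": "row-2", "label": "Jose Martinez", "birth_year": 1850},
]
print(likely_twins(rows))  # [['row-1', 'existing-1']]
```

Anything the function flags still needs a human look; a shared name and birth year is a hint, not proof of identity.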
How do I read and clear constraint violations?
Each property carries constraints (single-value, value-type, format, allowed-qualifiers). Violations show as a coloured icon next to the statement. Root causes split cleanly:
| Violation type | Typical cause | Fix |
|---|---|---|
| Value-type | linked to the wrong class of item | repoint the value to a correctly typed Q-item |
| Format | external ID typed by hand | correct the ID against the source register |
| Single-value | a true second value, or a duplicate | set preferred rank on the best value with a "reason for preferred rank" qualifier, or delete the dupe |
| Mandatory-qualifier | missing date or determination method | add the qualifier |
Pull a bulk list from the per-property constraint report rather than checking items one at a time.
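Format violations in particular can be caught before upload. The sketch below, in Python, splits a column of hand-typed external IDs into those that match a pattern and those that do not. The regular expression here is purely illustrative; the real one should be copied from the format constraint on the property's own page.

```python
import re

# Illustrative pattern only (an assumption): a number of 4 to 10 digits
# with no leading zero. Replace it with the property's published format.
EXAMPLE_ID_PATTERN = re.compile(r"[1-9]\d{3,9}")


def split_by_format(ids, pattern=EXAMPLE_ID_PATTERN):
    """Return (valid, invalid) lists after trimming surrounding whitespace."""
    valid, invalid = [], []
    for raw in ids:
        candidate = raw.strip()
        if pattern.fullmatch(candidate):
            valid.append(candidate)
        else:
            invalid.append(candidate)
    return valid, invalid


valid, invalid = split_by_format(["500115588", " 12345 ", "0123", "12a45"])
print(valid)    # ['500115588', '12345']
print(invalid)  # ['0123', '12a45']
```

Rows in `invalid` go back to the source register for correction rather than into the batch.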
What do I do about missing references?
A heritage statement without a reference is unverifiable. Diagnose by querying statements that lack a prov:wasDerivedFrom reference node, then add P248 (stated in) or a reference URL. The lasting fix is to write references at import time in the same QuickStatements row, never as a later pass that quietly never happens:
```text
Q12345|P569|+1788-03-12T00:00:00Z/11|S248|Q5375741|S813|+2025-01-04T00:00:00Z/11
```
Here S248 attaches the source and S813 records when it was retrieved.
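Both halves of this step can be scripted. The Python sketch below uses helper names that are this example's own, not part of any tool. One helper builds a query string for statements of a given property that have no `prov:wasDerivedFrom` node, for pasting into the Wikidata Query Service. The other assembles a QuickStatements row that always carries its S248 and S813 parts.

```python
from datetime import date


def unreferenced_query(prop: str, limit: int = 100) -> str:
    """Build a SPARQL query for statements of `prop` with no reference node."""
    return (
        "SELECT ?item ?statement WHERE {\n"
        f"  ?item p:{prop} ?statement .\n"
        "  FILTER NOT EXISTS { ?statement prov:wasDerivedFrom ?ref . }\n"
        f"}} LIMIT {limit}"
    )


def day_value(d: date) -> str:
    """Format a date as a QuickStatements time value with day precision (/11)."""
    return f"+{d.isoformat()}T00:00:00Z/11"


def referenced_row(qid, prop, value, source_qid, retrieved: date) -> str:
    """Build one pipe-separated row; the reference is not optional here."""
    return "|".join(
        [qid, prop, value, "S248", source_qid, "S813", day_value(retrieved)]
    )


row = referenced_row(
    "Q12345", "P569", day_value(date(1788, 3, 12)), "Q5375741", date(2025, 1, 4)
)
print(row)
print(unreferenced_query("P569", limit=50))
```

Because `referenced_row` has no way to omit the source, a row without a reference cannot be produced by accident.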
Why is wrong-property data so easy to introduce?
Heritage editors often reach for a familiar property when a more precise one exists, for example using P276 (location) where P195 (collection) is correct, or P170 (creator) where P50 (author) fits. The symptom is queries that return too few or too many rows. Diagnose by sampling: list items with the property and check whether the values are the kind of thing that property expects. Fix by moving the values to the right property, and prevent recurrence by writing a short property map for your project.
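A property map can be as small as a dictionary kept next to the import scripts. The Python sketch below shows one possible shape. The spreadsheet column names are hypothetical, and the move helper emits a QuickStatements removal row (leading `-`) followed by an addition row.

```python
# Hypothetical project property map: source-sheet column -> Wikidata property.
PROPERTY_MAP = {
    "held_by": "P195",     # collection
    "found_at": "P276",    # location
    "painted_by": "P170",  # creator
    "written_by": "P50",   # author
}


def move_statement(qid: str, wrong_prop: str, right_prop: str, value: str):
    """Return two rows: remove the value from one property, add it to another."""
    return [
        f"-{qid}|{wrong_prop}|{value}",
        f"{qid}|{right_prop}|{value}",
    ]


# Example: an item's holding institution was entered as a location.
# Q12345 and Q67890 are placeholder identifiers.
for line in move_statement("Q12345", "P276", PROPERTY_MAP["held_by"], "Q67890"):
    print(line)
# -Q12345|P276|Q67890
# Q12345|P195|Q67890
```

Note that removing and re-adding a statement this way does not carry over its qualifiers or references, so those need to be re-attached in the addition row.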
How do I keep ambiguous links from poisoning queries?
An item linked to the wrong namesake is silent and corrosive. The defence is reconciliation discipline: match on more than the label — birth year, occupation, an external ID like VIAF or ULAN. When confidence is low, leave it unlinked rather than guess; an empty cell is honest, a wrong link is not.
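The "leave it unlinked" rule can be made explicit in code. The following Python sketch assumes records and candidates are dicts with hypothetical keys `label`, `birth_year`, `viaf` and `qid`. It accepts a candidate only when something beyond the label agrees, and returns `None` when zero or several candidates qualify.

```python
def choose_match(record, candidates):
    """Return the QID of the single corroborated candidate, else None."""
    corroborated = []
    for cand in candidates:
        # The label must agree, but agreement on the label alone is not enough.
        if cand["label"].casefold() != record["label"].casefold():
            continue
        rec_viaf, rec_year = record.get("viaf"), record.get("birth_year")
        shared_id = rec_viaf is not None and rec_viaf == cand.get("viaf")
        same_year = rec_year is not None and rec_year == cand.get("birth_year")
        if shared_id or same_year:
            corroborated.append(cand["qid"])
    # Ambiguous (several) or unsupported (none): stay honest and do not link.
    return corroborated[0] if len(corroborated) == 1 else None


# Placeholder QIDs, for illustration only.
candidates = [
    {"qid": "Q111", "label": "John Smith", "birth_year": 1788},
    {"qid": "Q222", "label": "John Smith", "birth_year": 1850},
]
print(choose_match({"label": "John Smith", "birth_year": 1850}, candidates))  # Q222
print(choose_match({"label": "John Smith"}, candidates))                      # None
```

The thresholds are a design choice; a stricter project might require an external ID match and treat birth year as supporting evidence only.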
Can I batch-fix without breaking things?
Yes, but make every batch reversible. Before running QuickStatements or an OpenRefine schema, export the QIDs and current values you intend to change. Run on five items, verify in the live UI, then scale. If a batch goes wrong, the recorded QIDs let you reverse it precisely instead of trawling your contributions.
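One way to make reversibility concrete is to generate the rollback batch at the same time as the forward batch. The Python sketch below assumes each planned change is a tuple of QID, property, prior value and new value; the function names and identifiers are illustrative only.

```python
def plan_batch(changes):
    """Return (forward, rollback) QuickStatements rows for value replacements.

    `changes` is a list of (qid, prop, old_value, new_value) tuples,
    recorded before anything is edited.
    """
    forward, rollback = [], []
    for qid, prop, old, new in changes:
        forward += [f"-{qid}|{prop}|{old}", f"{qid}|{prop}|{new}"]
        rollback += [f"-{qid}|{prop}|{new}", f"{qid}|{prop}|{old}"]
    return forward, rollback


def pilot(changes, size=5):
    """Split the plan into a small pilot run and the remainder."""
    return changes[:size], changes[size:]


# Placeholder identifiers: seven items whose P276 value is being replaced.
changes = [(f"Q{1000 + i}", "P276", "Q111", "Q222") for i in range(7)]
first, rest = pilot(changes)
forward, rollback = plan_batch(first)
print(len(first), len(rest))  # 5 2
print(forward[0])             # -Q1000|P276|Q111
print(rollback[0])            # -Q1000|P276|Q222
```

Saving `rollback` to a file before running `forward` means the undo is ready even if the session that produced it is long gone.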
Key Takeaways
- The quality loop is detect → root cause → durable fix → add a guard.
- Duplicates are the costliest defect; prevent them by reconciling before import.
- Clear constraint violations in bulk from per-property reports, not item by item.
- Write references at import time (S248/S813), never as an optional later pass.
- Use the right property; keep a short project property map to stop drift.
- Prefer an honest empty cell to a confident wrong link.
- Make every batch reversible by recording QIDs and prior values first.
Frequently Asked Questions
What does data quality mean on Wikidata specifically?
On Wikidata, quality means each statement is correct, sourced with a reference, uses the right property and datatype, and does not violate the property's constraints. For heritage data it also means stable external identifiers and unambiguous links to the right Q-item.
How do I find constraint violations on an item?
Open the item and look for the small constraint-violation icons next to statements, or run a query against the constraint-violation reports. The Wikidata Query Service and the KrBot / constraint report pages list violations per property in bulk.
Why are duplicate items the worst quality problem?
Duplicates split statements, references and external IDs across two Q-numbers, so queries undercount and reconciliation breaks. They are also slow to fix because every inbound link must be redirected, which is why prevention via reconciliation beats cleanup.
Can I fix many quality issues at once?
Yes. QuickStatements applies batched additions and corrections from a TSV, and OpenRefine can reconcile and patch in bulk. Always test on a handful of items first and keep the batch reversible by recording the QIDs you touched.
How do I stop bad data getting in again?
Add property constraints, write references at import time rather than later, validate source spreadsheets before upload, and reconcile to existing items so you do not create duplicates. Prevention at the import stage is far cheaper than downstream repair.
Are missing references a quality problem or just a nicety?
They are a real quality problem. An unsourced heritage claim cannot be verified or trusted, and reusers will discard it. Every non-trivial statement should carry a reference (P248 stated in, or a reference URL) pointing to the source.