Resolving entities across records means deciding which mentions in different sources refer to the same real person, place or organisation, then binding them under one stable identifier despite variant spellings, changed names and missing data. The practical workflow has five steps: normalise lightly, block to make comparison tractable, score candidate pairs on multiple attributes, decide with a confidence band that routes the uncertain middle to human review, then assign a stable identifier. This guide runs that workflow with concrete examples a historian or archivist can apply directly.
Why isn't exact name matching enough?
Historical names are unstable. Marie/Mary/Maria, Smith/Smyth/Smythe, married-name changes, and the sheer recurrence of common names mean exact matching fails both ways: it misses true links and merges distinct people who happen to share a name. Resolution has to reason over more than the surface string.
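A minimal sketch of both failure modes, using three made-up mentions and the standard library's `difflib` as a stand-in for a proper name-similarity metric. The record IDs, names and years are illustrative assumptions, not drawn from any real source.

```python
from difflib import SequenceMatcher

# Hypothetical mentions: (record id, name as written, birth year)
mentions = [
    ("a1", "Mary Smith", 1702),
    ("b7", "Maria Smyth", 1702),   # plausibly the same woman, spelled differently
    ("c3", "Mary Smith", 1788),    # same string, a different generation
]

def exact_match(x, y):
    # Compare only the surface string, as a naive join would.
    return x[1] == y[1]

def similarity(x, y):
    # Character-level ratio in [0, 1]; a rough stand-in for a name metric.
    return SequenceMatcher(None, x[1].lower(), y[1].lower()).ratio()

a1, b7, c3 = mentions
print(exact_match(a1, b7))   # False: a likely true link is missed
print(exact_match(a1, c3))   # True: two distinct people would be merged
print(round(similarity(a1, b7), 2), abs(a1[2] - c3[2]))
```

The similarity score rescues the first pair, and the 86-year gap in birth years is what separates the second; neither signal is visible to an exact string comparison.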
What does the end-to-end workflow look like?
```text
1. Normalise  light, reversible cleaning (case, punctuation, abbreviations)
2. Block      group into candidate buckets to avoid all-pairs comparison
3. Compare    score each candidate pair across name + dates + place + relations
4. Decide     confidence band: auto-accept / review / auto-reject
5. Assign     mint a stable ID; link out to an authority where confident
```
Each step is cheap on its own; the discipline is keeping them separate and auditable.
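As an illustration of step 1, here is one way "light, reversible" cleaning might look: the cleaned string is stored next to the untouched original rather than replacing it. The `ABBREVIATIONS` table and the `normalise` helper are assumptions made for this sketch; a real abbreviation list would be built from the sources at hand.

```python
import re
import unicodedata

# Hypothetical abbreviation table; entries are illustrative only.
ABBREVIATIONS = {"wm": "william", "jno": "john", "eliz": "elizabeth"}

def normalise(raw):
    """Return a cleaned comparison key alongside the untouched original."""
    text = unicodedata.normalize("NFKC", raw).lower()   # unify Unicode forms, fold case
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation becomes a space
    tokens = [ABBREVIATIONS.get(t, t) for t in text.split()]
    return {"original": raw, "key": " ".join(tokens)}

print(normalise("Wm. SMITH, of  York"))
```

Because the original string is never overwritten, any cleaning rule can be changed later and the keys regenerated without going back to the archive.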
How does blocking make this tractable?
Comparing every pair of records is quadratic: a 10,000-record dataset is ~50 million comparisons. Blocking groups records into buckets so you only compare plausible candidates — for example, the same surname Soundex and the same birth decade.
```python
import jellyfish

def block_key(rec):
    # Bucket by surname Soundex code and birth decade (None when the year is missing).
    decade = (rec["birth_year"] // 10) * 10 if rec["birth_year"] else None
    return (jellyfish.soundex(rec["surname"]), decade)

blocks = {}
for rec in records:  # records: an iterable of dicts with "surname" and "birth_year"
    blocks.setdefault(block_key(rec), []).append(rec)
# now compare only pairs within the same block
```
Choose a blocking key loose enough not to lose true matches but tight enough to cut the comparison count by orders of magnitude.
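To make the saving concrete, the sketch below counts comparisons with and without blocking and generates candidate pairs only inside each block. The even split into 500 blocks of 20 is a hypothetical best case; real blocks are skewed, with common surnames producing much larger buckets.

```python
from itertools import combinations

def all_pairs(n):
    # Number of unordered pairs among n records: n(n-1)/2
    return n * (n - 1) // 2

def blocked_pairs(block_sizes):
    # Pairs are only formed inside each block, so sum the per-block counts.
    return sum(all_pairs(size) for size in block_sizes)

def candidate_pairs(blocks):
    # Yield every unordered pair of records that share a block key.
    for members in blocks.values():
        yield from combinations(members, 2)

# Hypothetical, perfectly even split: 10,000 records in 500 blocks of 20.
sizes = [20] * 500
print(all_pairs(10_000))      # 49995000
print(blocked_pairs(sizes))   # 95000
```

Under that assumption the comparison count drops by a factor of roughly 500, which is the kind of reduction the blocking key is meant to buy.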
How do you score a candidate pair?
Never rely on the name alone. Combine evidence from several attributes into a single score:
| Attribute | Comparison | Weight |
|---|---|---|
| Surname | Jaro-Winkler similarity | high |
| Forename | Jaro-Winkler + nickname map | medium |
| Birth/death dates | year proximity | high |
| Place | gazetteer-resolved match | medium |
| Relations | shared parent/spouse | strong corroboration |
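Two of the comparisons in the table, the nickname map and year proximity, could be sketched as below. The `NICKNAMES` entries, the function names and the five-year tolerance are all assumptions chosen for illustration; the right tolerance depends on how reliable the dates in your sources are.

```python
# Hypothetical nickname map; a real one would be compiled for the period and region.
NICKNAMES = {"polly": "mary", "molly": "mary", "bess": "elizabeth", "jack": "john"}

def canonical_forename(name):
    # Map a nickname to its formal form; unknown names pass through unchanged.
    name = name.strip().lower()
    return NICKNAMES.get(name, name)

def year_proximity(y1, y2, tolerance=5):
    """1.0 for identical years, falling linearly to 0.0 at `tolerance` years apart.

    Returns None when either year is missing, so the caller can skip the
    attribute instead of treating a gap as disagreement.
    """
    if y1 is None or y2 is None:
        return None
    return max(0.0, 1.0 - abs(y1 - y2) / tolerance)

print(canonical_forename("Polly") == canonical_forename("Mary"))
print(year_proximity(1702, 1704))
```

A graded date score is more forgiving than strict equality when baptism and birth years are recorded interchangeably or ages are estimated.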
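```python
import jellyfish

def pair_score(a, b):
    # Weighted blend of surname similarity, birth-year agreement and place agreement.
    name = jellyfish.jaro_winkler_similarity(a["surname"], b["surname"])
    # A missing value counts as no evidence, so two blanks never score as agreement.
    date = 1.0 if a["birth_year"] is not None and a["birth_year"] == b["birth_year"] else 0.0
    place = 1.0 if a["place_id"] is not None and a["place_id"] == b["place_id"] else 0.0
    return 0.5 * name + 0.3 * date + 0.2 * place
```
How do you decide without forcing a binary?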
Use a confidence band, not a single threshold. Auto-accept pairs above a high score, auto-reject below a low one, and send the ambiguous middle to a human with the evidence attached.
```python
def classify(score):
    if score >= 0.90:
        return "match"
    if score <= 0.55:
        return "no-match"
    return "review"  # human decides, evidence shown
```
This keeps the costly human attention on exactly the cases that need judgement.
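One possible way to carry the evidence along with each decision is to route scored pairs into three queues. The `route` function, the tuple layout and the sample pairs are hypothetical; the thresholds reuse the 0.90 and 0.55 cut-offs from the band above.

```python
def route(scored_pairs, accept=0.90, reject=0.55):
    """Split (id_a, id_b, score, evidence) tuples into three queues."""
    queues = {"match": [], "review": [], "no-match": []}
    for id_a, id_b, score, evidence in scored_pairs:
        if score >= accept:
            status = "match"
        elif score <= reject:
            status = "no-match"
        else:
            status = "review"
        # Keep the per-attribute evidence so a reviewer sees why the score is what it is.
        queues[status].append({"pair": (id_a, id_b), "score": score, "evidence": evidence})
    return queues

# Hypothetical scored pairs
pairs = [
    ("r1", "r2", 0.94, {"surname": 0.97, "birth_year": "1666 = 1666"}),
    ("r3", "r4", 0.71, {"surname": 0.92, "birth_year": "missing"}),
    ("r5", "r6", 0.30, {"surname": 0.41, "birth_year": "1702 vs 1788"}),
]
queues = route(pairs)
print({status: len(items) for status, items in queues.items()})
```

Keeping the thresholds as parameters makes it easy to widen the review band while the scoring is still being calibrated.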
Local identifiers or an external authority?
Do both. Mint your own stable local IDs as the backbone — they keep you in control and survive even if an external resource changes. Then link confident matches out to Wikidata or VIAF for reuse and extra disambiguation power.
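Minting can be as simple as a sequential counter with a type prefix. The sketch below assumes a `per-NNNNN` scheme inferred from the example identifier in this guide; `make_minter` is a hypothetical helper, and a real system would persist the counter so that identifiers are never reused.

```python
import itertools

def make_minter(prefix="per", start=1, width=5):
    """Return a function that hands out sequential IDs like 'per-00001'."""
    counter = itertools.count(start)

    def mint():
        return f"{prefix}-{next(counter):0{width}d}"

    return mint

mint_person = make_minter(start=417)
entity = {
    "local_id": mint_person(),
    "preferred_name": "Mary Astell",
    "variants": ["Mrs Astell", "Astell, Mary"],
    "wikidata": None,   # filled in only once an external match is confident
}
print(entity["local_id"])   # per-00417
```

The local ID exists from the moment the entity is created, so the external link can be added, corrected or removed later without disturbing anything that already cites the record.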
```json
{
  "local_id": "per-00417",
  "preferred_name": "Mary Astell",
  "variants": ["Mrs Astell", "Astell, Mary"],
  "wikidata": "Q269055",
  "match_status": "match",
  "confidence": 0.94
}
```
How do you avoid wrongly merging two people?
A false merge is far harder to detect and undo than a missed link, so bias toward caution. Require corroborating attributes beyond the name, and manually review any merge that rests on the name alone. Keep merges reversible by storing the source mentions that justified each one, so a later reviewer can split them if the evidence does not hold.
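A reversible merge can be sketched as a log entry that records which source mentions came from which entity and why. The `Entity` class and the `merge` and `split` helpers are assumptions for illustration, and the sketch assumes mention IDs are unique across the dataset; it does not preserve the absorbed entity's own earlier merge history.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    local_id: str
    mentions: list = field(default_factory=list)    # source mention IDs
    merge_log: list = field(default_factory=list)   # one entry per merge

def merge(target, source, evidence):
    """Fold `source` into `target`, recording which mentions came across and why."""
    target.merge_log.append({
        "from": source.local_id,
        "mentions": list(source.mentions),
        "evidence": evidence,
    })
    target.mentions.extend(source.mentions)
    return target

def split(target, from_id):
    """Undo the merge from `from_id` and return the restored entity."""
    entry = next(e for e in target.merge_log if e["from"] == from_id)
    target.merge_log.remove(entry)
    target.mentions = [m for m in target.mentions if m not in entry["mentions"]]
    return Entity(from_id, entry["mentions"])

a = Entity("per-00417", ["m1", "m2"])
b = Entity("per-00902", ["m3"])
merge(a, b, {"surname": 0.97, "birth_year": "1666 = 1666"})
restored = split(a, "per-00902")
print(a.mentions, restored.local_id, restored.mentions)
```

Because the absorbed entity's identifier is kept in the log, a split restores it under its original ID instead of minting a new one.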
Key Takeaways
- Exact name matching fails both ways on historical data; resolve on multiple attributes.
- Run the workflow as distinct, auditable steps: normalise, block, compare, decide, assign.
- Blocking turns an infeasible all-pairs comparison into a tractable one — choose the key carefully.
- Score pairs across name, dates, place and relations, not name alone.
- Use a confidence band so humans review only the ambiguous middle.
- Mint stable local identifiers and link out to Wikidata or VIAF where confident.
- Bias toward caution on merges and keep them reversible; a false merge is costlier than a missed link.
Frequently Asked Questions
What is entity resolution across historical records?
It is deciding which mentions in different records refer to the same real person, place or organisation, despite spelling variants, name changes and missing data, then assigning a stable identifier that ties them together.
Why can't I just match on exact names?
Historical names are unstable: spelling varies, women's surnames change, abbreviations abound and the same name recurs across people. Exact matching both misses true links (false negatives) and merges distinct people who share a name (false positives).
What is blocking and why do I need it?
Blocking groups records into candidate buckets (for example by Soundex of the surname and birth decade) so you only compare plausibly-matching pairs. Without it, the number of comparisons grows quadratically: 10,000 records already means roughly 50 million pairs, and the cost quickly becomes impractical as the dataset grows.
Should I link to an external authority or build my own identifiers?
Both. Mint your own stable local identifiers as the backbone of your dataset, and link out to Wikidata or VIAF where a confident match exists. Local IDs keep you in control; external links add reuse and disambiguation power.
How do I handle uncertain matches?
Keep a confidence score and a match-status field rather than forcing a binary decision. Auto-accept high-confidence pairs, auto-reject low ones, and route the ambiguous middle band to human review with the evidence attached.
How do I avoid wrongly merging two different people?
Require corroborating attributes beyond the name, such as dates, places or relations, and review merges that rest on name alone. A bad merge is harder to detect and undo later than a missed link, so bias the threshold toward caution.