Resolve entities across records: A Practical Guide

Resolving entities across records means deciding which mentions in different sources refer to the same real person, place or organisation, then binding them under one stable identifier despite variant spellings, changed names and missing data. The practical workflow is five steps: normalise lightly, block to make comparison tractable, score candidate pairs on multiple attributes, decide with a confidence band that routes the uncertain middle to human review, then assign a stable identifier. This guide runs that workflow with concrete examples a historian or archivist can apply directly.

Why isn't exact name matching enough?

Historical names are unstable. Marie/Mary/Maria, Smith/Smyth/Smythe, married-name changes, and the sheer recurrence of common names mean exact matching fails both ways: it misses true links and merges distinct people who happen to share a name. Resolution has to reason over more than the surface string.
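Both failure modes are easy to reproduce. The sketch below uses made-up records and a hypothetical `exact_match` helper; it is an illustration of the problem, not part of any library.

```python
# Hypothetical records: ids 1 and 2 are the same woman under variant spellings,
# ids 3 and 4 are different men who happen to share a name.
records = [
    {"id": 1, "name": "Mary Smith",  "birth_year": 1702},
    {"id": 2, "name": "Maria Smyth", "birth_year": 1702},
    {"id": 3, "name": "John Smith",  "birth_year": 1650},
    {"id": 4, "name": "John Smith",  "birth_year": 1731},
]

def exact_match(a, b):
    """Naive rule: two records are the same person if the name strings are equal."""
    return a["name"] == b["name"]

by_id = {r["id"]: r for r in records}

missed_link = not exact_match(by_id[1], by_id[2])  # the true link is missed
false_merge = exact_match(by_id[3], by_id[4])      # two distinct people are merged

print(missed_link, false_merge)
```

Both values come out true: the naive rule fails in each direction on just four records.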

What does the end-to-end workflow look like?

text

1. Normalise   light, reversible cleaning (case, punctuation, abbreviations)
2. Block       group into candidate buckets to avoid all-pairs comparison
3. Compare     score each candidate pair across name + dates + place + relations
4. Decide      confidence band: auto-accept / review / auto-reject
5. Assign      mint a stable ID; link out to an authority where confident

Each step is cheap on its own; the discipline is keeping them separate and auditable.
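As one possible reading of "light, reversible cleaning" in step 1, the sketch below keeps the original string next to the cleaned form so nothing is lost. The `normalise_name` function and the `ABBREVIATIONS` table are assumptions made for illustration; a real project would build the table from its own sources.

```python
import re
import unicodedata

# Hypothetical abbreviation table; extend it from your own material.
ABBREVIATIONS = {"wm": "william", "jno": "john", "eliz": "elizabeth"}

def normalise_name(raw):
    """Return a comparison form of a name while keeping the original string."""
    text = unicodedata.normalize("NFKD", raw)
    # Drop combining marks so accented and unaccented spellings compare equal
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lower-case and turn punctuation into spaces
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Expand known abbreviations token by token
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return {"raw": raw, "norm": " ".join(tokens)}

print(normalise_name("Wm. Smyth"))
```

Comparison code reads `norm`; display and audit code reads `raw`, which is what makes the step reversible.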

How does blocking make this tractable?

Comparing every pair of records is quadratic: a 10,000-record dataset is ~50 million comparisons. Blocking groups records into buckets so you only compare plausible candidates — for example, the same surname Soundex and the same birth decade.

python

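import jellyfish

def block_key(rec):
    # Key = (Soundex code of the surname, birth decade or None if the year is unknown)
    year = rec.get("birth_year")
    decade = (year // 10) * 10 if year else None
    return (jellyfish.soundex(rec["surname"]), decade)

# `records` is assumed to be a list of dicts with "surname" and "birth_year" keys
blocks = {}
for rec in records:
    blocks.setdefault(block_key(rec), []).append(rec)
# now compare only pairs within the same block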

Choose a blocking key loose enough not to lose true matches but tight enough to cut the comparison count by orders of magnitude.
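To see how much a key saves, count the pairs before and after blocking. The sketch below is self-contained and therefore uses a deliberately crude stand-in key (surname initial plus birth decade) instead of Soundex; `simple_block_key`, `candidate_pairs` and the sample records are hypothetical.

```python
from itertools import combinations

def all_pairs_count(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def simple_block_key(rec):
    # Stand-in key (assumption): surname initial + birth decade
    decade = (rec["birth_year"] // 10) * 10 if rec["birth_year"] else None
    return (rec["surname"][:1].upper(), decade)

def candidate_pairs(records):
    """Yield only pairs of records that share a block key."""
    blocks = {}
    for rec in records:
        blocks.setdefault(simple_block_key(rec), []).append(rec)
    for members in blocks.values():
        yield from combinations(members, 2)

sample = [
    {"surname": "Smith", "birth_year": 1702},
    {"surname": "Smyth", "birth_year": 1705},
    {"surname": "Smith", "birth_year": 1731},
    {"surname": "Astell", "birth_year": 1666},
]
print(all_pairs_count(10_000))             # 49995000
print(all_pairs_count(len(sample)))        # 6
print(len(list(candidate_pairs(sample))))  # 1
```

Note that a record with no birth year lands in a block of its own decade value (`None`) and is never compared with dated records. If your sources have many gaps, a second pass with a looser key (surname only, for instance) may be worth the extra comparisons.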

How do you score a candidate pair?

Never rely on the name alone. Combine evidence from several attributes into a single score:

Attribute	Comparison	Weight
Surname	Jaro-Winkler similarity	high
Forename	Jaro-Winkler + nickname map	medium
Birth/death dates	year proximity	high
Place	gazetteer-resolved match	medium
Relations	shared parent/spouse	strong corroboration

python

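import jellyfish

def pair_score(a, b):
    # Surname similarity in [0, 1]
    name = jellyfish.jaro_winkler_similarity(a["surname"], b["surname"])
    # A missing value is not evidence: never let None == None count as agreement
    date = 1.0 if a["birth_year"] is not None and a["birth_year"] == b["birth_year"] else 0.0
    place = 1.0 if a["place_id"] is not None and a["place_id"] == b["place_id"] else 0.0
    # Illustrative weights; tune them against a hand-labelled sample
    return 0.5 * name + 0.3 * date + 0.2 * place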
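The table also lists year proximity and a nickname map, which a simple equality test does not capture. A minimal sketch of both follows, under stated assumptions: the `NICKNAMES` entries, the linear decay and the five-year `tolerance` are illustrative choices, not recommended values.

```python
# Hypothetical nickname map: each variant points to a canonical forename.
NICKNAMES = {
    "polly": "mary", "molly": "mary", "maria": "mary", "marie": "mary",
    "bess": "elizabeth", "betty": "elizabeth",
}

def canonical_forename(name):
    """Lower-case a forename and map known variants to one canonical form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)

def forename_agrees(a, b):
    """True when two forenames reduce to the same canonical form."""
    return canonical_forename(a) == canonical_forename(b)

def year_proximity(y1, y2, tolerance=5):
    """1.0 for identical years, falling linearly to 0.0 at `tolerance` years apart.

    Missing years give 0.0: the absence of a date is not evidence of agreement.
    """
    if y1 is None or y2 is None:
        return 0.0
    return max(0.0, 1.0 - abs(y1 - y2) / tolerance)

print(forename_agrees("Polly", "Mary"), year_proximity(1702, 1704))
```

A graded date score tolerates the small discrepancies common in age-derived birth years, while still giving no credit to dates a decade apart.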

How do you decide without forcing a binary?

Use a confidence band, not a single threshold. Auto-accept pairs above a high score, auto-reject below a low one, and send the ambiguous middle to a human with the evidence attached.

python

def classify(score):
    if score >= 0.90: return "match"
    if score <= 0.55: return "no-match"
    return "review"        # human decides, evidence shown

This keeps the costly human attention on exactly the cases that need judgement.
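One way to keep the evidence attached is to carry it alongside the score into three queues. In this sketch the `route` function, the tuple layout and the sample scores are all hypothetical; the thresholds repeat the ones used above and are exposed as parameters so they can be tuned.

```python
def classify(score, accept=0.90, reject=0.55):
    # Same band as above, with the thresholds exposed as parameters
    if score >= accept:
        return "match"
    if score <= reject:
        return "no-match"
    return "review"

def route(scored_pairs):
    """Split (id_a, id_b, score, evidence) tuples into three queues."""
    queues = {"match": [], "no-match": [], "review": []}
    for id_a, id_b, score, evidence in scored_pairs:
        queues[classify(score)].append(
            {"pair": (id_a, id_b), "score": score, "evidence": evidence}
        )
    return queues

# Made-up scores and evidence, for illustration only
scored = [
    ("r1", "r2", 0.96, {"surname": 0.96, "birth_year": "same", "place": "same"}),
    ("r3", "r4", 0.78, {"surname": 1.0, "birth_year": "same", "place": "differs"}),
    ("r5", "r6", 0.41, {"surname": 0.82, "birth_year": "differs", "place": "differs"}),
]
queues = route(scored)
print([item["pair"] for item in queues["review"]])  # [('r3', 'r4')]
```

The reviewer sees why a pair scored as it did, rather than a bare number.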

Local identifiers or an external authority?

Do both. Mint your own stable local IDs as the backbone — they keep you in control and survive even if an external resource changes. Then link confident matches out to Wikidata or VIAF for reuse and extra disambiguation power.
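Minting can be as simple as a counter with a fixed prefix. The sketch below assumes the `per-00417` style of identifier used in this guide; `make_minter` and `new_entity` are hypothetical helpers, and a real system would persist the counter so that an identifier is never issued twice.

```python
import itertools

def make_minter(prefix="per", start=1, width=5):
    """Return a function that mints sequential IDs such as 'per-00001'."""
    counter = itertools.count(start)

    def mint():
        return f"{prefix}-{next(counter):0{width}d}"

    return mint

def new_entity(mint, preferred_name, variants=(), wikidata=None):
    """Build an entity record; the external link stays None until confirmed."""
    return {
        "local_id": mint(),
        "preferred_name": preferred_name,
        "variants": list(variants),
        "wikidata": wikidata,
    }

mint_person = make_minter(start=417)
entity = new_entity(mint_person, "Mary Astell", ["Mrs Astell", "Astell, Mary"])
print(entity["local_id"])  # per-00417
```

Leaving the external link empty until a match is confirmed keeps the local ID usable on its own.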

json

{"local_id": "per-00417",
 "preferred_name": "Mary Astell",
 "variants": ["Mrs Astell", "Astell, Mary"],
 "wikidata": "Q269055",
 "match_status": "match", "confidence": 0.94}

How do you avoid wrongly merging two people?

A false merge is far harder to detect and undo than a missed link, so bias toward caution. Require corroborating attributes beyond the name, and manually review any merge that rests on the name alone. Keep merges reversible by storing the source mentions that justified each one, so a later reviewer can split them if the evidence does not hold.
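A reversible merge needs only a log of which mentions arrived from which entity. The following is a minimal sketch under simplifying assumptions: the `merge` and `split` functions, the record layout and the mention IDs are hypothetical, and mention IDs are assumed to be unique across entities.

```python
def merge(entities, keep_id, absorb_id, reason):
    """Fold absorb_id into keep_id, recording which mentions came from where."""
    kept, absorbed = entities[keep_id], entities.pop(absorb_id)
    kept["merge_log"].append(
        {"from": absorb_id, "reason": reason, "mentions": list(absorbed["mentions"])}
    )
    kept["mentions"].extend(absorbed["mentions"])

def split(entities, keep_id, absorbed_id):
    """Undo an earlier merge using the stored log entry."""
    kept = entities[keep_id]
    entry = next(e for e in kept["merge_log"] if e["from"] == absorbed_id)
    kept["merge_log"].remove(entry)
    kept["mentions"] = [m for m in kept["mentions"] if m not in entry["mentions"]]
    entities[absorbed_id] = {"mentions": entry["mentions"], "merge_log": []}

# Made-up entities and mention IDs, for illustration only
entities = {
    "per-00001": {"mentions": ["parish-1702-p3", "will-1731-p1"], "merge_log": []},
    "per-00002": {"mentions": ["census-1700-p9"], "merge_log": []},
}
merge(entities, "per-00001", "per-00002", reason="same surname, birth year, parish")
split(entities, "per-00001", "per-00002")
print(sorted(entities))  # ['per-00001', 'per-00002']
```

This version does not restore any merge history the absorbed entity had of its own; a fuller implementation would store that in the log entry as well.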

Key Takeaways

Exact name matching fails both ways on historical data; resolve on multiple attributes.
Run the workflow as distinct, auditable steps: normalise, block, compare, decide, assign.
Blocking turns an infeasible all-pairs comparison into a tractable one — choose the key carefully.
Score pairs across name, dates, place and relations, not name alone.
Use a confidence band so humans review only the ambiguous middle.
Mint stable local identifiers and link out to Wikidata or VIAF where confident.
Bias toward caution on merges and keep them reversible; a false merge is costlier than a missed link.

Frequently Asked Questions

What is entity resolution across historical records?

It is deciding which mentions in different records refer to the same real person, place or organisation, despite spelling variants, name changes and missing data, then assigning a stable identifier that ties them together.

Why can't I just match on exact names?

Historical names are unstable: spelling varies, women's surnames change, abbreviations abound and the same name recurs across people. Exact matching both misses true links (false negatives) and merges distinct people who share a name (false positives).

What is blocking and why do I need it?

Blocking groups records into candidate buckets (for example by Soundex of the surname and birth decade) so you only compare plausibly matching pairs. Without it, the number of comparisons grows quadratically: 10,000 records already produce roughly 50 million pairs, and the cost keeps climbing from there.

Should I link to an external authority or build my own identifiers?

Both. Mint your own stable local identifiers as the backbone of your dataset, and link out to Wikidata or VIAF where a confident match exists. Local IDs keep you in control; external links add reuse and disambiguation power.

How do I handle uncertain matches?

Keep a confidence score and a match-status field rather than forcing a binary decision. Auto-accept high-confidence pairs, auto-reject low ones, and route the ambiguous middle band to human review with the evidence attached.

How do I avoid wrongly merging two different people?

Require corroborating attributes beyond the name, such as dates, places or relations, and review merges that rest on name alone. A bad merge is harder to detect and undo later than a missed link, so bias the threshold toward caution.
