Disambiguate historical people: A Practical Guide

Disambiguating historical people means turning a pile of name mentions into a set of person records, deciding for each pair of mentions whether they are the same individual. The reliable workflow is: block candidates by surname, score each pair on contextual evidence (dates, places, relations), auto-merge the confident matches, and route the rest to human review with documented reasons. Done well, this is what lets you say "this John Wright" rather than "a John Wright" in your scholarship.

This is the step most projects underestimate. Extraction is fast; deciding that three "Mary Coke" mentions are two women, not one, is slow, evidential, and where the historical argument lives.

Why can't I just merge identical names?

Because name identity and person identity are different things. Two records reading "Thomas Clark" may be father and son, or unrelated namesakes a county apart. Conversely, one person appears as "Cromwell", "O. C.", and "the Lord Protector". Merging on string equality alone produces both false merges and false splits — the two errors you are trying to balance.

How do I structure the disambiguation workflow?

Use a four-stage pipeline:

Block: group mentions that could be the same person, usually by normalised surname or a phonetic key, so you never have to compare every mention against every other.
Score: for each pair within a block, compute a similarity score from several features.
Cluster: merge pairs above a high threshold automatically; flag the middle band for review.
Review: a human resolves flagged clusters and records the decision.

python

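import jellyfish

def block_key(name):
    """Return a phonetic blocking key for the surname of a name mention."""
    tokens = name.split()
    if not tokens:
        return ""  # empty mention: no surname to key on
    # Assumes the surname is the last token; "Wright, John" needs reordering first.
    surname = tokens[-1].lower()
    return jellyfish.metaphone(surname)

# Mentions sharing a block_key are candidate matches.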

Blocking makes an O(n²) problem tractable: compared all against all, 5,000 mentions yield roughly 12.5 million pairs, whereas after blocking the same corpus becomes a few hundred small comparison sets.
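To see the saving on a toy scale, here is a minimal sketch that groups mentions into blocks and counts the comparisons before and after. It uses only the standard library, so `simple_block_key` (a lowercased last token) is a stand-in assumption for the phonetic key, and the mention list is invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def simple_block_key(name):
    # Stand-in for a phonetic key: the lowercased last token of the name.
    return name.split()[-1].lower()

def build_blocks(mentions):
    """Group mention strings by their block key."""
    blocks = defaultdict(list)
    for mention in mentions:
        blocks[simple_block_key(mention)].append(mention)
    return dict(blocks)

def candidate_pairs(blocks):
    """Yield pairs within each block only, never across blocks."""
    for members in blocks.values():
        yield from combinations(members, 2)

# Invented sample mentions.
mentions = ["John Wright", "J. Wright", "Mary Coke",
            "Thomas Clark", "Tho. Clark", "Jn Wright"]

blocks = build_blocks(mentions)
all_pairs = len(mentions) * (len(mentions) - 1) // 2
blocked_pairs = sum(1 for _ in candidate_pairs(blocks))
print(all_pairs, blocked_pairs)  # prints "15 4"
```

A plain last-token key will not group spelling variants such as "Wright" and "Wryght"; that is the job a phonetic key does in a real run.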

What features actually separate two same-named people?

Score pairs on evidence, not spelling alone:

Feature	Why it helps	Example signal
Date overlap	People have lifespans	active 1610 vs active 1690
Place	Most people stay regional	York parish vs Bristol parish
Occupation/title	Stable identity marker	"draper" vs "mariner"
Kin terms	Strong personal anchor	"wife Anne", "son of Robert"
Co-mentions	Social networks repeat	appears beside the same names

A weighted sum of these, calibrated on a hand-labelled sample, gives a defensible score. Date conflict should be near-decisive: someone cannot be active before birth or after death.
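One way to sketch such a score is below. Everything specific in it is an assumption made for illustration: the `Mention` fields, the values in `WEIGHTS`, and the `MAX_SPAN_YEARS` rule that treats a combined activity span longer than 60 years as a date conflict. In a real project the weights come from your labelled sample, and the date rule should use birth and death dates where you have them.

```python
from dataclasses import dataclass

# Hypothetical weights; calibrate them on a hand-labelled sample.
WEIGHTS = {"place": 0.25, "occupation": 0.20, "kin": 0.35, "co_mentions": 0.20}

# Assumed ceiling on one person's documented activity span, in years.
MAX_SPAN_YEARS = 60

@dataclass
class Mention:
    name: str
    active_from: int          # earliest year attested
    active_to: int            # latest year attested
    place: str = ""
    occupation: str = ""
    kin: frozenset = frozenset()
    co_mentions: frozenset = frozenset()

def dates_conflict(a, b):
    """True when the two mentions together span more than one plausible career."""
    span = max(a.active_to, b.active_to) - min(a.active_from, b.active_from)
    return span > MAX_SPAN_YEARS

def jaccard(x, y):
    """Set overlap in [0, 1]; 0 when either side has no evidence."""
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)

def pair_score(a, b):
    """Weighted evidence score in [0, 1]; a date conflict vetoes the match."""
    if dates_conflict(a, b):
        return 0.0
    features = {
        "place": 1.0 if a.place and a.place == b.place else 0.0,
        "occupation": 1.0 if a.occupation and a.occupation == b.occupation else 0.0,
        "kin": jaccard(a.kin, b.kin),
        "co_mentions": jaccard(a.co_mentions, b.co_mentions),
    }
    return sum(WEIGHTS[k] * v for k, v in features.items())

# Invented sample mentions.
york_a = Mention("John Smith", 1605, 1615, "York", "draper",
                 frozenset({"wife Anne"}), frozenset({"Robert Hall"}))
york_b = Mention("John Smith", 1608, 1620, "York", "draper",
                 frozenset({"wife Anne"}), frozenset({"Robert Hall", "Wm Pope"}))
bristol = Mention("John Smith", 1688, 1692, "Bristol", "mariner")

print(round(pair_score(york_a, york_b), 2))
print(pair_score(york_a, bristol))
```

Treating the date conflict as a veto rather than as one more weighted feature is a design choice: it keeps a pile of weak agreements from outvoting a chronological impossibility.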

When should I link to an external authority?

Link to VIAF, Wikidata, or a national biography only when the person genuinely has an established record, which typically means prominent figures. For the vast majority of archival individuals (a churchwarden, an apprentice, a litigant) no authority entry exists, and you should mint a local stable identifier instead. Forcing an obscure person onto a famous namesake's Wikidata ID is a classic, damaging error.

How do I handle uncertainty honestly?

Do not pretend to certainty you lack. Record three relationship types: same_as (confident), possibly_same_as (plausible, evidence noted), and different_from (actively ruled out). Each gets a confidence value and a one-line justification.

json

{
  "person_id": "loc:p0481",
  "mentions": ["doc12:230-241", "doc12:1980-1990"],
  "links": [{"type": "possibly_same_as", "target": "loc:p0512", "confidence": 0.4,
             "note": "same parish, but 30-year date gap; could be father"}]
}

This lets a later researcher revisit your judgement instead of inheriting a silent merge.
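If you generate these records in code, it helps to refuse links that lack a type, a sensible confidence, or a justification. The sketch below is one possible shape, not a standard schema: the `Link` class and its validation rules are assumptions, and the identifiers are the ones from the example above.

```python
import json
from dataclasses import dataclass, asdict

LINK_TYPES = {"same_as", "possibly_same_as", "different_from"}

@dataclass
class Link:
    type: str
    target: str
    confidence: float
    note: str

    def __post_init__(self):
        # Reject anything a later researcher could not interpret.
        if self.type not in LINK_TYPES:
            raise ValueError(f"unknown link type: {self.type!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not self.note.strip():
            raise ValueError("every link needs a one-line justification")

link = Link("possibly_same_as", "loc:p0512", 0.4,
            "same parish, but 30-year date gap; could be father")

record = {
    "person_id": "loc:p0481",
    "mentions": ["doc12:230-241", "doc12:1980-1990"],
    "links": [asdict(link)],
}
print(json.dumps(record, indent=2))
```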

How do I know my disambiguation is reliable?

Evaluate with pairwise or B-cubed precision and recall against a gold cluster set you label by hand for one surname block. Low pairwise precision (as a working rule, below about 0.9) means you are over-merging, fusing two lives into one record and so creating composite people who never existed; low recall means you are over-splitting, leaving the same person fragmented across several records. Report both numbers in your methods note.
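Both measures are short enough to compute yourself. The sketch below assumes that predicted and gold clusterings are lists of sets covering exactly the same mention ids; the function names and the five sample ids are invented for illustration.

```python
from itertools import combinations

def same_cluster_pairs(clusters):
    """All unordered pairs of mentions that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def pairwise_pr(predicted, gold):
    """Precision and recall over same-cluster pairs."""
    pred, true = same_cluster_pairs(predicted), same_cluster_pairs(gold)
    hits = len(pred & true)
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(true) if true else 1.0
    return precision, recall

def bcubed_pr(predicted, gold):
    """B-cubed precision and recall, averaged over mentions."""
    pred_of = {m: c for c in map(frozenset, predicted) for m in c}
    gold_of = {m: c for c in map(frozenset, gold) for m in c}
    items = list(gold_of)
    precision = sum(len(pred_of[m] & gold_of[m]) / len(pred_of[m]) for m in items) / len(items)
    recall = sum(len(pred_of[m] & gold_of[m]) / len(gold_of[m]) for m in items) / len(items)
    return precision, recall

# Invented example: the system split m3 from its true cluster and merged it elsewhere.
gold = [{"m1", "m2", "m3"}, {"m4", "m5"}]
predicted = [{"m1", "m2"}, {"m3", "m4", "m5"}]

print(pairwise_pr(predicted, gold))  # (0.5, 0.5)
print(bcubed_pr(predicted, gold))
```

Pairwise scores punish a single misplaced mention in a large cluster heavily because it breaks many pairs at once; B-cubed averages per mention and is gentler, which is one reason to report which measure you used.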

Key Takeaways

Disambiguation groups mentions into people; it is not string matching.
Block first to keep comparisons tractable, then score on evidence.
Use dates, place, occupation, kin, and co-mentions — not spelling alone.
Auto-merge only the confident band; send the rest to human review.
Mint local identifiers for ordinary people; reserve authority links for established figures.
Record possibly_same_as with confidence and a cited reason.
Validate with pairwise or B-cubed precision and recall on a gold block.

Frequently Asked Questions

What does disambiguating historical people mean?

It is deciding which name mentions refer to the same real individual and which refer to different people who happen to share a name. The output is a set of person records, each linking the mentions that belong to it.

How is disambiguation different from extraction?

Extraction finds name mentions in text; disambiguation groups those mentions into people. You can extract "40 mentions of John Smith" and still need to decide whether they are two men or five.

What evidence helps tell two same-named people apart?

Dates of activity, place, occupation, kin relations, and co-occurring names. A "John Smith" active in 1610 in York with a wife Anne is almost certainly not the John Smith of 1690 in Bristol.

Can I automate disambiguation completely?

No. You can automate clustering and candidate ranking, but contested cases need human judgement and documented decisions. Treat the algorithm as a triage tool, not an oracle.
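As a rough sketch of what that triage can look like: pairs scoring above a high threshold are merged with a union-find pass, the middle band goes to a review queue, and everything else is left apart. The thresholds `AUTO_MERGE` and `REVIEW`, the mention ids, and the scores are all hypothetical.

```python
AUTO_MERGE = 0.85  # hypothetical: merge without review at or above this score
REVIEW = 0.50      # hypothetical: below this, leave the pair unlinked

def triage(scored_pairs):
    """Split (a, b, score) triples into auto-merge pairs and a review queue."""
    merge, review = [], []
    for a, b, score in scored_pairs:
        if score >= AUTO_MERGE:
            merge.append((a, b))
        elif score >= REVIEW:
            review.append((a, b, score))
    return merge, review

def cluster(mention_ids, merge_pairs):
    """Union-find: connected components over the auto-merged pairs."""
    parent = {m: m for m in mention_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in merge_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for m in mention_ids:
        groups.setdefault(find(m), set()).add(m)
    return sorted(groups.values(), key=sorted)

# Invented scores for four mentions in one block.
scored = [("m1", "m2", 0.92), ("m2", "m3", 0.88),
          ("m3", "m4", 0.60), ("m1", "m4", 0.10)]

merge, review = triage(scored)
clusters = cluster(["m1", "m2", "m3", "m4"], merge)
print(clusters)
print(review)
```

Note that merging is transitive here: m1 and m3 end up together although that pair was never scored directly, so long chains of auto-merges deserve a spot check.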

Should I link to Wikidata or VIAF during disambiguation?

Link to an authority only for people who genuinely have an established identifier, usually well-known figures. Most archival people are not in any authority file, so you mint local identifiers instead.

How do I record an uncertain decision?

Give every merge or split a confidence value and a short note citing the evidence. Keep a "possibly same as" link rather than forcing a hard merge when the evidence is thin.
