Named Entities in History

Disambiguating historical people means turning a pile of name mentions into a set of person records, deciding for each pair of mentions whether they are the same individual. The reliable workflow is: block candidates by surname, score each pair on contextual evidence (dates, places, relations), auto-merge the confident matches, and route the rest to human review with documented reasons. Done well, this is what lets you say "this John Wright" rather than "a John Wright" in your scholarship.

This is the step most projects underestimate. Extraction is fast; deciding that three "Mary Coke" mentions are two women, not one, is slow, evidential, and where the historical argument lives.

Why can't I just merge identical names?

Because name identity and person identity are different things. Two records reading "Thomas Clark" may be father and son, or unrelated namesakes a county apart. Conversely, one person appears as "Cromwell", "O. C.", and "the Lord Protector". Merging on string equality alone produces both false merges and false splits — the two errors you are trying to balance.

How do I structure the disambiguation workflow?

Use a four-stage pipeline:

  1. Block: group mentions that could be the same person, usually by normalised surname or a phonetic key, so you never compare every mention against every other.
  2. Score: for each within-block pair, compute a similarity score from several features.
  3. Cluster: merge pairs above a high threshold automatically, flag the middle band for review, and leave pairs below the low threshold unmerged.
  4. Review: a human resolves flagged clusters and records the decision and the reason for it.
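```python
import jellyfish  # third-party: pip install jellyfish

def block_key(name):
    """Return a phonetic blocking key for the surname in a name mention."""
    tokens = name.split()
    if not tokens:  # empty or whitespace-only mention: no usable key
        return ""
    # Naive assumption: the surname is the last whitespace-separated token.
    surname = tokens[-1].lower()
    return jellyfish.metaphone(surname)

# mentions sharing a block_key are candidate matches
```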

Blocking turns an O(n²) problem into something tractable. Compared exhaustively, 5,000 mentions would mean roughly 12.5 million pairs; blocked by surname key, the same corpus becomes a few hundred small comparison sets.
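The grouping step can be sketched with the standard library alone. This is a minimal illustration, not a prescribed implementation: `simple_block_key`, `build_blocks`, and `candidate_pairs` are hypothetical names, and the key is a plain normalised surname so the example runs without third-party packages. A phonetic key could be swapped in to catch spelling variants as well.

```python
from collections import defaultdict
from itertools import combinations

def simple_block_key(name):
    """Stand-in blocking key (assumption): lower-cased last token, letters only."""
    tokens = name.split()
    if not tokens:
        return ""
    return "".join(ch for ch in tokens[-1].lower() if ch.isalpha())

def build_blocks(mentions):
    """Group (mention_id, name) tuples by their blocking key."""
    blocks = defaultdict(list)
    for mention_id, name in mentions:
        blocks[simple_block_key(name)].append(mention_id)
    return dict(blocks)

def candidate_pairs(blocks):
    """Yield within-block pairs only; cross-block pairs are never compared."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

# Illustrative mentions, not real data.
mentions = [
    ("m1", "Thomas Clark"),
    ("m2", "Tho. Clark"),
    ("m3", "Mary Coke"),
    ("m4", "Mary Coke"),
    ("m5", "John Wright"),
]

blocks = build_blocks(mentions)
pairs = list(candidate_pairs(blocks))
all_pairs = len(mentions) * (len(mentions) - 1) // 2
print(f"{len(pairs)} of {all_pairs} possible pairs need scoring")
```

Only the pairs that survive blocking go on to the scoring stage; a mention alone in its block, like "John Wright" here, generates no comparisons at all.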

What features actually separate two same-named people?

Score pairs on evidence, not spelling alone:

| Feature | Why it helps | Example signal |
| --- | --- | --- |
| Date overlap | People have lifespans | active 1610 vs active 1690 |
| Place | Most people stay regional | York parish vs Bristol parish |
| Occupation/title | Stable identity marker | "draper" vs "mariner" |
| Kin terms | Strong personal anchor | "wife Anne", "son of Robert" |
| Co-mentions | Social networks repeat | appears beside the same names |

A weighted sum of these, calibrated on a hand-labelled sample, gives a defensible score. Date conflict should be near-decisive: someone cannot be active before birth or after death.
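One way such a weighted score might look is sketched below. Everything numeric here is an assumption for illustration: the weights, the 60-year bound on a plausible span of activity, and the cap applied on a date conflict would all need calibrating on your own labelled sample. The `Mention` class and function names are likewise hypothetical. Missing evidence is skipped and the remaining weights renormalised, so an absent field neither helps nor hurts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mention:
    name: str
    year: Optional[int] = None          # year of activity, if known
    place: Optional[str] = None
    occupation: Optional[str] = None
    kin: frozenset = frozenset()        # e.g. {"wife Anne"}
    co_mentions: frozenset = frozenset()

# Illustrative values only; calibrate on a hand-labelled sample.
WEIGHTS = {"date": 0.30, "place": 0.20, "occupation": 0.15, "kin": 0.25, "co": 0.10}
MAX_ACTIVE_SPAN = 60   # assumed upper bound on one person's years of activity
CONFLICT_CAP = 0.05    # a date conflict caps the score near zero

def exact(a, b):
    """1.0 or 0.0 for a case-insensitive match; None if either side is missing."""
    if a is None or b is None:
        return None
    return 1.0 if a.lower() == b.lower() else 0.0

def jaccard(a, b):
    """Set overlap in [0, 1]; None if either set is empty (no evidence)."""
    if not a or not b:
        return None
    return len(a & b) / len(a | b)

def date_similarity(a, b):
    """Closer years score higher; None if either year is unknown."""
    if a.year is None or b.year is None:
        return None
    gap = abs(a.year - b.year)
    return max(0.0, 1.0 - gap / MAX_ACTIVE_SPAN)

def pair_score(a, b):
    features = {
        "date": date_similarity(a, b),
        "place": exact(a.place, b.place),
        "occupation": exact(a.occupation, b.occupation),
        "kin": jaccard(a.kin, b.kin),
        "co": jaccard(a.co_mentions, b.co_mentions),
    }
    known = {k: v for k, v in features.items() if v is not None}
    if not known:
        return 0.0
    total_weight = sum(WEIGHTS[k] for k in known)
    score = sum(WEIGHTS[k] * v for k, v in known.items()) / total_weight
    # Near-decisive date conflict: override whatever the other features say.
    dates_known = a.year is not None and b.year is not None
    if dates_known and abs(a.year - b.year) > MAX_ACTIVE_SPAN:
        score = min(score, CONFLICT_CAP)
    return score

york_1610 = Mention("John Smith", 1610, "York", "draper", frozenset({"wife Anne"}))
york_1615 = Mention("John Smith", 1615, "York", "draper", frozenset({"wife Anne"}))
york_1690 = Mention("John Smith", 1690, "York", "draper")

print(round(pair_score(york_1610, york_1615), 2))  # close in date, all else agrees
print(round(pair_score(york_1610, york_1690), 2))  # same place and trade, 80 years apart
```

The second pair agrees on place and occupation, yet the 80-year gap holds its score at the cap. That is the behaviour "near-decisive" asks for: agreement on softer features should not be able to outvote an impossible lifespan.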

When should I link to an authority file?

Link to VIAF, Wikidata, or a national biography only when the person genuinely has an established record, which in practice means prominent figures. For the vast majority of archival individuals (a churchwarden, an apprentice, a litigant) no authority entry exists, and you should mint a local stable identifier instead. Forcing an obscure person onto a famous namesake's Wikidata id is a classic, damaging error.
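Minting local identifiers can be as simple as a counter that never reuses a value. The sketch below is hypothetical throughout: the `LocalIdMinter` class, the `loc:p` prefix, the four-digit padding, and the placeholder authority id are illustrative choices, not an established scheme. A real project would also persist the counter so identifiers stay stable between runs.

```python
class LocalIdMinter:
    """Hypothetical minter for project-local person identifiers."""

    def __init__(self, prefix="loc:p", width=4, start=1):
        self.prefix = prefix
        self.width = width
        self._next = start

    def mint(self):
        """Return a fresh identifier; values are never reused."""
        identifier = f"{self.prefix}{self._next:0{self.width}d}"
        self._next += 1
        return identifier

def choose_identifier(verified_authority_id, minter):
    """Use an authority id only if a human has verified it; otherwise mint locally."""
    if verified_authority_id:
        return verified_authority_id
    return minter.mint()

minter = LocalIdMinter()
churchwarden_id = choose_identifier(None, minter)
apprentice_id = choose_identifier(None, minter)
# Placeholder string, not a real authority identifier.
prominent_id = choose_identifier("authority:EXAMPLE-ID", minter)
print(churchwarden_id, apprentice_id, prominent_id)
```

Making the authority id an explicit, human-verified input keeps the default safe: a person gets a local identifier unless someone has positively confirmed the authority match.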

How do I handle uncertainty honestly?

Do not pretend to certainty you lack. Record three relationship types: same_as (confident), possibly_same_as (plausible, evidence noted), and different_from (actively ruled out). Each gets a confidence value and a one-line justification.

```json
{
  "person_id": "loc:p0481",
  "mentions": ["doc12:230-241", "doc12:1980-1990"],
  "links": [{"type": "possibly_same_as", "target": "loc:p0512", "confidence": 0.4,
             "note": "same parish, but 30-year date gap; could be father"}]
}
```

This lets a later researcher revisit your judgement instead of inheriting a silent merge.
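Turning a pair score into one of the three link types might look like the sketch below. The two thresholds are assumptions chosen for illustration and should come from your own calibration. The function names are hypothetical, and `confidence` is taken here to mean the estimated likelihood that the two records are the same person. Assigning different_from from a low score alone is also a simplification; you may prefer to reserve it for cases a reviewer has actively ruled out.

```python
import json

AUTO_MERGE = 0.90   # assumed: at or above this, merge without review
RULE_OUT = 0.20     # assumed: at or below this, treat as different people

def link_type(score):
    """Map a pair score in [0, 1] to one of the three relationship types."""
    if score >= AUTO_MERGE:
        return "same_as"
    if score <= RULE_OUT:
        return "different_from"
    return "possibly_same_as"

def make_link(target, score, note):
    """Build a link entry; refuse to record a decision without a justification."""
    if not note.strip():
        raise ValueError("every link needs a one-line justification")
    return {
        "type": link_type(score),
        "target": target,
        "confidence": round(score, 2),
        "note": note,
    }

def needs_review(link):
    """Only the middle band is routed to a human."""
    return link["type"] == "possibly_same_as"

record = {
    "person_id": "loc:p0481",
    "mentions": ["doc12:230-241", "doc12:1980-1990"],
    "links": [
        make_link("loc:p0512", 0.4,
                  "same parish, but 30-year date gap; could be father"),
    ],
}

print(json.dumps(record, indent=2))
print("review queue:", [l["target"] for l in record["links"] if needs_review(l)])
```

Rejecting an empty note at write time is a small design choice that enforces the documentation habit: a link without a reason cannot enter the dataset.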

How do I know my disambiguation is reliable?

Evaluate with pairwise or B-cubed precision and recall against a gold set of clusters that you label by hand for at least one surname block. Low precision means you are over-merging, inventing people who lived two lives in one record; as a rule of thumb, treat pairwise precision below 0.9 as a warning sign. Low recall means you are over-splitting, leaving the same person fragmented across several records. Report both numbers in your methods note.
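Both metric families are short to compute once predicted and gold clusters are expressed as sets of mention ids. The sketch below assumes the two clusterings cover exactly the same mentions; the function names and the five-mention example are illustrative, not real results.

```python
from itertools import combinations

def same_cluster_pairs(clusters):
    """All unordered mention pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def pairwise_scores(predicted, gold):
    """Precision and recall over same-cluster pairs."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    true_pos = len(pred_pairs & gold_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 1.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall

def b_cubed_scores(predicted, gold):
    """B-cubed precision and recall, averaged over mentions."""
    pred_of = {m: frozenset(c) for c in predicted for m in c}
    gold_of = {m: frozenset(c) for c in gold for m in c}
    mentions = list(gold_of)
    precision = sum(len(pred_of[m] & gold_of[m]) / len(pred_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(pred_of[m] & gold_of[m]) / len(gold_of[m])
                 for m in mentions) / len(mentions)
    return precision, recall

# Toy block: gold says two people; the system misplaced mention m3.
gold = [{"m1", "m2", "m3"}, {"m4", "m5"}]
predicted = [{"m1", "m2"}, {"m3", "m4", "m5"}]

pw_p, pw_r = pairwise_scores(predicted, gold)
b3_p, b3_r = b_cubed_scores(predicted, gold)
print(f"pairwise  P={pw_p:.2f} R={pw_r:.2f}")
print(f"B-cubed   P={b3_p:.2f} R={b3_r:.2f}")
```

A single misplaced mention costs more under the pairwise measure than under B-cubed in this example, because it spoils every pair it takes part in. That difference is one reason to state which measure you report.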

Key Takeaways

  • Disambiguation groups mentions into people; it is not string matching.
  • Block first to keep comparisons tractable, then score on evidence.
  • Use dates, place, occupation, kin, and co-mentions — not spelling alone.
  • Auto-merge only the confident band; send the rest to human review.
  • Mint local identifiers for ordinary people; reserve authority links for established figures.
  • Record possibly_same_as with confidence and a cited reason.
  • Validate with pairwise or B-cubed precision and recall on a gold block.

Frequently Asked Questions

What does disambiguating historical people mean?

It is deciding which name mentions refer to the same real individual and which refer to different people who happen to share a name. The output is a set of person records, each linking the mentions that belong to it.

How is disambiguation different from extraction?

Extraction finds name mentions in text; disambiguation groups those mentions into people. You can extract "40 mentions of John Smith" and still need to decide whether they are two men or five.

What evidence helps tell two same-named people apart?

Dates of activity, place, occupation, kin relations, and co-occurring names. A "John Smith" active in 1610 in York with a wife Anne is almost certainly not the John Smith of 1690 in Bristol.

Can I automate disambiguation completely?

No. You can automate clustering and candidate ranking, but contested cases need human judgement and documented decisions. Treat the algorithm as a triage tool, not an oracle.

Should every person be linked to an authority file?

No. Link to an authority only for people who genuinely have an established identifier, usually well-known figures. Most archival people are not in any authority file, so you mint local identifiers instead.

How do I record an uncertain decision?

Give every merge or split a confidence value and a short note citing the evidence. Keep a "possibly same as" link rather than forcing a hard merge when the evidence is thin.