Quantitative History Methods

To link historical records, you compare entries from two sources on shared attributes such as name, age, and birthplace, score how similar each candidate pair is, and accept pairs above a threshold as the same person. Because historical sources almost never carry a unique identifier, you treat linkage as a probabilistic decision built from several imperfect signals rather than an exact key lookup. The reliable workflow is: clean and standardise, block, compare, score, threshold, then review.

What does record linkage actually do?

It answers one question repeatedly: do these two rows describe the same entity? Given an 1881 census household and an 1891 one, you want to follow the same Mary Ashworth across the decade despite a changed age, a new address, and a transcriber's spelling. The output is a crosswalk table of id_1, id_2, score that downstream analysis (mobility, mortality, wages) depends on.

How do I prepare the data first?

Standardisation is where most accuracy is won or lost. Lowercase everything, strip punctuation, and normalise names against a variant table (Wm to William, Eliz to Elizabeth). Convert ages to estimated birth years so a one-year census rounding error becomes a tolerance you control.

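```r
library(dplyr)

# Assumes `raw` has surname, forename, age, and census_year columns.
clean <- raw |>
  mutate(
    surname_raw  = surname,   # keep originals for auditing
    forename_raw = forename,
    # Lowercase, trim, strip punctuation so "Wm." and "wm" compare equal.
    surname  = gsub("[[:punct:]]", "", tolower(trimws(surname))),
    forename = gsub("[[:punct:]]", "", tolower(trimws(forename))),
    # Expand abbreviations; unlisted values pass through unchanged.
    forename = recode(forename, wm = "william", jas = "james"),
    birthyr  = census_year - age  # estimated birth year
  )
```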

Keep the raw columns alongside the cleaned ones. You will want to audit any surprising match later.

What is blocking and how do I choose a key?

Comparing every row in source A against every row in source B takes n * m comparisons; for two national censuses that is trillions of pairs. Blocking limits comparisons to records that agree on a coarse attribute. A good key is stable across sources and selective enough to split the data into many small blocks.

| Blocking key | True links kept | Pairs to compare | Verdict |
| --- | --- | --- | --- |
| Exact surname | Low | Low | Drops spelling variants |
| Soundex(surname) + birth decade | High | Moderate | Solid default |
| First letter only | Very high | Very high | Too slow |
| Birthplace parish | High | Low | Great if recorded |

Use Soundex or Double Metaphone of the surname plus birth decade as your starting block. Run two or three passes with different keys and union the results to recover links a single key would miss.
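The multi-pass idea can be sketched in plain Python. This is a minimal illustration, not a production indexer: the record layout, the `candidate_pairs` helper, and the second key (birthplace plus Soundex of forename) are assumptions made for the example, and the `soundex` function is a compact version of standard American Soundex.

```python
from collections import defaultdict

# Letter-to-digit table for American Soundex; vowels, h, w and y have no code.
_CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code of a name ("" if no letters)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return "".join(out)[:4].ljust(4, "0")


def key_surname_decade(rec):
    """Pass 1: Soundex of surname plus birth decade."""
    return (soundex(rec["surname"]), rec["birthyr"] // 10 * 10)


def key_place_forename(rec):
    """Pass 2 (hypothetical choice): birthplace plus Soundex of forename."""
    return (rec["birthplace"], soundex(rec["forename"]))


def candidate_pairs(source_a, source_b, key_funcs):
    """Union of (index_a, index_b) pairs that share a key in any pass."""
    pairs = set()
    for key_func in key_funcs:
        index = defaultdict(list)
        for j, rec in enumerate(source_b):
            index[key_func(rec)].append(j)
        for i, rec in enumerate(source_a):
            for j in index.get(key_func(rec), []):
                pairs.add((i, j))
    return pairs


# Toy records, invented for illustration only.
census_a = [
    {"surname": "Ashworth", "forename": "mary", "birthyr": 1859, "birthplace": "bolton"},
    {"surname": "Smith", "forename": "john", "birthyr": 1851, "birthplace": "leeds"},
]
census_b = [
    {"surname": "Ashwerth", "forename": "mary", "birthyr": 1861, "birthplace": "bolton"},
    {"surname": "Smyth", "forename": "john", "birthyr": 1852, "birthplace": "leeds"},
    {"surname": "Jones", "forename": "thomas", "birthyr": 1859, "birthplace": "bolton"},
]

single_pass = candidate_pairs(census_a, census_b, [key_surname_decade])
two_passes = candidate_pairs(census_a, census_b,
                             [key_surname_decade, key_place_forename])
```

In this toy data the first pass misses the Ashworth pair because the two estimated birth years fall on either side of a decade boundary; the second pass recovers it. The union is still only candidate pairs, so every pair goes on to scoring.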

How do I score candidate pairs?

Within each block, compare attributes and combine the comparisons. Jaro-Winkler handles name typos; absolute difference handles age. The classic Fellegi-Sunter model weights each agreement by how informative it is: agreeing on a rare surname counts far more than agreeing on a common age.

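```python
import recordlinkage as rl

# dfA and dfB are assumed to be pandas DataFrames with the cleaned columns.
# Candidate pairs come from blocking; "block_key" is a placeholder column name.
indexer = rl.Index()
indexer.block("block_key")
pairs = indexer.index(dfA, dfB)

c = rl.Compare()
c.string("surname", "surname", method="jarowinkler", label="sn")
c.string("forename", "forename", method="jarowinkler", label="fn")
c.numeric("birthyr", "birthyr", offset=1, scale=2, label="by")
features = c.compute(pairs, dfA, dfB)

# ECM works on 0/1 features; 0.85 is an illustrative cut-off, not a tuned one.
ecm = rl.ECMClassifier(binarize=0.85)
ecm.fit(features)
links = ecm.predict(features)  # MultiIndex of pairs classified as matches
scores = ecm.prob(features)    # match probability for every candidate pair
```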
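To make the Fellegi-Sunter weighting concrete, here is a small sketch of the usual log-ratio weights. For each field, m is the probability that the field agrees when the pair is a true match and u is the probability that it agrees when the pair is not. The m and u values below are invented for illustration, not estimated from any census, and the field names are hypothetical.

```python
import math


def fs_weights(m, u):
    """Return (agreement_weight, disagreement_weight) in bits.

    m = P(field agrees | true match), u = P(field agrees | non-match).
    """
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


# Illustrative probabilities only. A rare surname agrees by chance far less
# often than a common one, so its u is much smaller.
FIELDS = {
    "rare_surname": (0.90, 0.0005),
    "common_surname": (0.90, 0.02),
    "birthyr_within_1": (0.85, 0.06),
}


def pair_score(agreements):
    """Sum the agreement or disagreement weight of each compared field."""
    total = 0.0
    for field, agrees in agreements.items():
        agree_w, disagree_w = fs_weights(*FIELDS[field])
        total += agree_w if agrees else disagree_w
    return total


rare = pair_score({"rare_surname": True, "birthyr_within_1": True})
common = pair_score({"common_surname": True, "birthyr_within_1": True})
```

With these made-up numbers, agreement on the rare surname is worth roughly 10.8 bits against roughly 5.5 bits for the common one, so the same pattern of agreements yields a higher total score when the surname is rare.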

Where should I set the threshold?

Plot the score distribution: you usually see a high-scoring true-match hump, a low-scoring non-match mass, and an ambiguous middle. Accept the high band automatically, reject the low band, and route the middle to manual review. Report your precision and recall against a hand-linked gold-standard sample of 200 to 500 pairs so the numbers are defensible.
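The banding and the evaluation step can be sketched as follows. The two cut-offs, the pair identifiers, and the scores are hypothetical; in practice you would read the cut-offs off your own score distribution and use your hand-linked sample as `gold`.

```python
def band(score, lower, upper):
    """Classify a score as 'accept', 'review', or 'reject'."""
    if score >= upper:
        return "accept"
    if score < lower:
        return "reject"
    return "review"


def precision_recall(predicted, gold):
    """Precision and recall of predicted links against gold-standard links."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented scores for five candidate pairs.
scored = {
    ("a1", "b1"): 0.97,
    ("a2", "b2"): 0.91,
    ("a3", "b9"): 0.93,  # high score, but not a true link
    ("a4", "b4"): 0.55,  # true link stuck in the ambiguous middle
    ("a5", "b7"): 0.10,
}
gold = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

LOWER, UPPER = 0.30, 0.90  # hypothetical cut-offs
accepted = [pair for pair, s in scored.items() if band(s, LOWER, UPPER) == "accept"]
to_review = [pair for pair, s in scored.items() if band(s, LOWER, UPPER) == "review"]

precision, recall = precision_recall(accepted, gold)
```

Here two of the three auto-accepted pairs are correct and two of the three gold links are found, so both precision and recall are two thirds before review. The one pair in `to_review` is the true link that manual checking would recover.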

What pitfalls trip people up?

  • One-to-many matches. Enforce that each record links at most once unless your design allows split households; resolve conflicts by taking the highest score.
  • Common names. "John Smith, age 30" matches too many people. Down-weight matches on frequent name-age combinations.
  • Systematic bias. Married women change surnames and migrants disappear, so your linked sample over-represents stable, single-surname men. State this limitation explicitly.
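The one-to-one rule in the first bullet can be sketched with a greedy pass over the crosswalk. This assumes rows of the form (id_1, id_2, score); the function name and the sample rows are hypothetical. Greedy selection is simple but is not guaranteed to find the assignment with the highest total score, for which an optimal assignment solver would be needed.

```python
def enforce_one_to_one(scored_pairs):
    """Keep the highest-scoring link for each record on either side.

    scored_pairs: iterable of (id_1, id_2, score) tuples.
    Returns the kept tuples, highest score first.
    """
    used_1, used_2, kept = set(), set(), []
    for id_1, id_2, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if id_1 in used_1 or id_2 in used_2:
            continue  # one side is already linked by a higher-scoring pair
        used_1.add(id_1)
        used_2.add(id_2)
        kept.append((id_1, id_2, score))
    return kept


# Invented crosswalk rows with conflicts on a1 and b1.
crosswalk = [
    ("a1", "b1", 0.95),
    ("a1", "b2", 0.90),
    ("a2", "b1", 0.88),
    ("a2", "b3", 0.70),
]
resolved = enforce_one_to_one(crosswalk)
```

In this example a1 keeps b1, and a2 falls back to b3 because its better candidate b1 is already taken. Links that survive only as a fallback are good candidates for manual review.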

Key Takeaways

  • Linkage is probabilistic: combine several weak signals, never trust one field.
  • Clean and standardise before anything else; convert age to birth year.
  • Block to make comparison feasible, and run multiple blocking passes.
  • Score with Jaro-Winkler plus Fellegi-Sunter weights via reclin2 or recordlinkage.
  • Choose thresholds from the score distribution and hand-review the middle band.
  • Always measure precision and recall against a gold-standard sample.
  • Document the linkage rate by sex, region, and name frequency to expose bias.

Frequently Asked Questions

What is record linkage in historical research?

Record linkage is the process of deciding which entries in two or more historical sources refer to the same person, household, or place. It turns separate snapshots, such as two census years, into a single life-course dataset.

Do I need a unique identifier to link records?

No. Most historical sources predate identity numbers, so you link on combinations of name, age, birthplace, and household. You build a probabilistic match from several weak signals rather than relying on one strong key.

What is a blocking key and why does it matter?

A blocking key restricts comparisons to records that share some coarse attribute, such as birth county or Soundex of surname. It cuts the number of pairwise comparisons from trillions to a feasible number while losing few true matches.

How accurate can historical record linkage be?

Well-tuned automated linkage of nineteenth-century censuses typically reaches 60 to 85 percent of true links at a precision above 95 percent. Coverage is lower for women, migrants, and common names.

Which tools are best for beginners?

Start with the R package reclin2 or the Python library recordlinkage. Both expose blocking, comparison, and Fellegi-Sunter scoring without forcing you to code the maths from scratch.

Should I link records by hand or by machine?

Use machine linkage for scale and consistency, then hand-review the uncertain middle band of scores. A hybrid keeps throughput high while a human resolves the genuinely ambiguous cases.