How to extract people from primary sources

To extract people from primary sources, run a named entity recognition (NER) model over cleaned, transcribed text, filter to the PERSON label, and then verify every span by hand against the original. The model gives you a fast draft; the human pass turns it into evidence you can cite. For a typical corpus of a few hundred pages, expect a usable list within an afternoon if your text is already digitised.

The hard part is rarely the model. It is the messiness of historical text: variant spellings, honorifics, OCR noise, and people referred to only by office or kinship ("the widow Hale", "my lord of Norfolk"). This guide walks through a workflow that survives all of that.

What does a person extraction pipeline actually look like?

At a high level the pipeline has four stages: prepare text, run extraction, capture mentions with offsets, then review. Keep them separate so you can rerun any one stage without redoing the others.

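```python
import spacy
from pathlib import Path

# Load the transformer pipeline (the en_core_web_trf model must be installed).
nlp = spacy.load("en_core_web_trf")
text = Path("petition_1647.txt").read_text(encoding="utf-8")
doc = nlp(text)

# Keep only PERSON entities, with character offsets into `text`.
mentions = [
    {"surface": ent.text, "start": ent.start_char, "end": ent.end_char}
    for ent in doc.ents
    if ent.label_ == "PERSON"
]
```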

Storing start_char and end_char is non-negotiable. Offsets are what let a reviewer jump straight to the mention in context and what let you regenerate a corrected list later without losing provenance.
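As a minimal sketch of what offsets buy you, the helper below pulls a window of surrounding text for one stored mention so a reviewer can see it in context. The name `context_window`, the sample sentence, and the 12-character width are illustrative assumptions, not part of spaCy.

```python
def context_window(text: str, start: int, end: int, width: int = 40) -> str:
    """Return the mention marked with [[ ]] plus up to `width` characters either side."""
    left = text[max(0, start - width):start]
    right = text[end:end + width]
    return f"{left}[[{text[start:end]}]]{right}"


# Hypothetical sample text and mention, in the same shape as the `mentions` list.
sample = "The humble petition of Thomas Hale, yeoman, sheweth that"
start = sample.index("Thomas Hale")
mention = {"surface": "Thomas Hale", "start": start, "end": start + len("Thomas Hale")}

print(context_window(sample, mention["start"], mention["end"], width=12))
```

Because the window is rebuilt from the offsets each time, a corrected mention list can be regenerated without re-running the model.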

How do I prepare the text so the model behaves?

Three cheap fixes recover most lost recall:

Normalise the long s (ſ becomes s) and resolve common ligatures.
Repair line-break hyphenation: join Crom-\nwell into Cromwell before NER, or the model sees two fragments.
Lowercase nothing. Capitalisation is a strong signal for names; preserve it.
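The first fix can be sketched with a plain translation table. The table below is an assumption about which characters matter; extend it for your own corpus. It leaves capitalisation alone, in line with the third point.

```python
# Map the long s and common Latin ligatures to plain letters.
# Illustrative table only: add or remove entries to suit your sources.
NORMALISE = str.maketrans({
    "\u017f": "s",    # long s
    "\ufb00": "ff",   # ff ligature
    "\ufb01": "fi",   # fi ligature
    "\ufb02": "fl",   # fl ligature
    "\ufb03": "ffi",  # ffi ligature
    "\ufb04": "ffl",  # ffl ligature
    "\ufb05": "st",   # long s + t ligature
    "\ufb06": "st",   # st ligature
})


def normalise_early_modern(text: str) -> str:
    """Replace long s and ligatures; case and all other characters are untouched."""
    return text.translate(NORMALISE)


print(normalise_early_modern("the \u017faid Ma\u017fter Gri\ufb03n"))
```

Expanding a ligature changes the length of the string, so offsets recorded later refer to the normalised text. Keep that normalised file alongside the raw transcription.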

A quick regex pass handles hyphenation:

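```python
import re

# Rejoin words split at a line break ("Crom-\nwell" -> "Cromwell"); this also
# joins genuine hyphenated compounds that happen to break at a line end.
text = re.sub(r"(\w+)-[ \t]*\r?\n[ \t]*(\w+)", r"\1\2", text)
```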

Should I trust a single model, or combine approaches?

Combine them. A model and a gazetteer fail in different places, so their union usually has higher recall than either alone, at the cost of more false positives to review.

Approach	Strengths	Weaknesses
Transformer NER	Finds unseen names, uses context	Misses archaic forms, needs a GPU to be fast
Gazetteer match	High recall on known figures whose spelling matches the list	Blind to anyone not on the list
Regex on honorifics	Catches "Mr.", "Mistress", "Sir" patterns	Noisy; flags many false positives

Run all three, tag each mention with its source, then merge. Where two methods agree, confidence is high; where only the regex fires, send it to the review queue.
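One way to sketch the tag-and-merge step, using only the standard library. Everything named here is hypothetical: `GAZETTEER` stands in for a project-specific list, `HONORIFIC` is a deliberately simple pattern, and `model_mentions` stands in for the NER output. The sketch treats overlapping spans as agreement, which is one possible policy among several.

```python
import re

# Hypothetical resources; replace with your own list and patterns.
GAZETTEER = ["Oliver Cromwell", "Thomas Fairfax"]
HONORIFIC = re.compile(r"\b(?:Mr\.|Mistress|Sir)\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?")


def gazetteer_mentions(text):
    """Exact string matches for every name on the list."""
    for name in GAZETTEER:
        for m in re.finditer(re.escape(name), text):
            yield {"start": m.start(), "end": m.end(), "source": "gazetteer"}


def honorific_mentions(text):
    """Honorific followed by one or two capitalised words."""
    for m in HONORIFIC.finditer(text):
        yield {"start": m.start(), "end": m.end(), "source": "regex"}


def merge(text, mentions):
    """Fuse overlapping spans and record which methods found each one."""
    merged = []
    for m in sorted(mentions, key=lambda m: (m["start"], m["end"])):
        if merged and m["start"] < merged[-1]["end"]:
            merged[-1]["end"] = max(merged[-1]["end"], m["end"])
            merged[-1]["sources"].add(m["source"])
        else:
            merged.append({"start": m["start"], "end": m["end"], "sources": {m["source"]}})
    for entry in merged:
        entry["surface"] = text[entry["start"]:entry["end"]]
        # Regex-only hits are the noisiest, so they go to the review queue.
        entry["needs_review"] = entry["sources"] == {"regex"}
    return merged


text = "Sir Thomas Fairfax and Mistress Hale met Oliver Cromwell."
# Stand-in for the NER model's PERSON spans on this sentence.
model_mentions = [{"start": 4, "end": 18, "source": "model"}]

all_mentions = [*model_mentions, *gazetteer_mentions(text), *honorific_mentions(text)]
for entry in merge(text, all_mentions):
    print(entry["surface"], sorted(entry["sources"]), entry["needs_review"])
```

Fusing on overlap means "Sir Thomas Fairfax" from the regex and "Thomas Fairfax" from the gazetteer count as one mention. If you would rather keep the honorific out of the span, merge on exact offsets instead.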

How do I deal with titles and roles instead of names?

Many early-modern people appear as an office, not a personal name. Decide your policy up front and document it: do you record "the Recorder of London" as a person mention, a separate role mention, or both? I recommend capturing the role span with a role flag so you can later resolve it to a named individual using other records, without conflating it with a true name at extraction time.
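A rough sketch of the role-flag idea follows. The `ROLE` pattern is an illustrative assumption that only covers the shapes "the <Office> of <Place>" and "my lord of <Place>"; it will miss multi-word offices and kinship forms, and it will also fire on place phrases such as "the City of London", so its output still needs review.

```python
import re

# Illustrative pattern only; it is not a complete grammar of offices.
ROLE = re.compile(r"\b(?:[Tt]he|[Mm]y)\s+(?:[Ll]ord|[A-Z][a-z]+)\s+of\s+[A-Z][a-z]+")


def role_mentions(text):
    """Return role spans flagged with kind='role' so they stay apart from names."""
    return [
        {"surface": m.group(), "start": m.start(), "end": m.end(), "kind": "role"}
        for m in ROLE.finditer(text)
    ]


sample = "The matter was referred to the Recorder of London and to my lord of Norfolk."
for mention in role_mentions(sample):
    print(mention)
```

Keeping `kind` as a separate field means a later stage can resolve "the Recorder of London" to a named holder of the office without rewriting the extraction output.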

What about the same person spelled five ways?

Resist fixing this during extraction. Capture every surface form verbatim, then cluster afterwards. A petition might contain "Cromwell", "Cromwel", "Crumwell", and "O. C." for one man. Collapsing them too early discards the verbatim forms and offsets you need for verification. Disambiguation is its own discipline; see the related guide below.
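To illustrate the capture-then-cluster order, here is a minimal sketch that groups surface forms by string similarity using `difflib` from the standard library. The greedy strategy and the 0.8 threshold are assumptions chosen for the example, not a recommended disambiguation method; the stored mentions and their offsets are never modified.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Case-insensitive similarity test on two surface forms."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_surface_forms(forms, threshold: float = 0.8):
    """Greedy clustering: join the first cluster whose first member is similar."""
    clusters = []
    for form in forms:
        for cluster in clusters:
            if similar(form, cluster[0], threshold):
                cluster.append(form)
                break
        else:
            clusters.append([form])
    return clusters


forms = ["Cromwell", "Cromwel", "Crumwell", "O. C.", "Fairfax"]
print(cluster_surface_forms(forms))
```

The three spelling variants fall into one cluster, while "O. C." stays on its own: initials cannot be resolved by string similarity and need contextual evidence.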

How do I check the result is good enough?

Sample 50 mentions at random and compute precision against your own judgement: the share of sampled mentions that really are people. Then read one full document, list every person mention you find by eye, and compute recall as the share of those that the pipeline also found. If precision is below about 0.9 you are creating cleanup work; if recall is below about 0.8 you are missing too many people to make claims about who appears.
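The two checks can be sketched as below. The function names and the representation (booleans for reviewer verdicts, sets of `(start, end)` spans for the fully read document) are assumptions made for the example.

```python
import random


def sample_for_review(mentions, k=50, seed=0):
    """Draw up to k mentions at random; a fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(mentions, min(k, len(mentions)))


def precision(judgements):
    """judgements: booleans, True where the reviewer accepted a sampled mention."""
    if not judgements:
        raise ValueError("no judgements to score")
    return sum(judgements) / len(judgements)


def recall(found, gold):
    """found, gold: sets of (start, end) spans for one document read in full."""
    if not gold:
        raise ValueError("gold list is empty")
    return len(found & gold) / len(gold)


# Hypothetical review outcome: 46 of 50 sampled mentions were real people.
print(precision([True] * 46 + [False] * 4))

# Hypothetical full read: the pipeline found 2 of the 3 people listed by eye.
print(recall({(0, 5), (10, 15)}, {(0, 5), (10, 15), (20, 25)}))
```

Matching on exact spans is strict; if your reviewers mark slightly different boundaries, count overlapping spans as hits instead.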

Key Takeaways

Extraction is draft-first: run NER, filter to PERSON, then verify by hand.
Always store character offsets so every name is traceable to the source.
Normalise long s, ligatures, and hyphenation before the model runs.
Combine a model, a gazetteer, and honorific rules for higher recall.
Keep extraction and disambiguation as separate stages.
Record offices and roles with a flag rather than treating them as names.
Validate with a precision sample and a recall read of one full document.

Frequently Asked Questions

What is the fastest way to extract people from a primary source?

Run a transformer NER model such as spaCy's en_core_web_trf over the cleaned text, keep only PERSON spans, then review them by hand. For a few hundred pages this gets you a draft list in minutes, with a couple of hours of correction.

Should I use a rule-based gazetteer or a machine-learning model?

Use both. A name gazetteer catches known recurring figures and honorific patterns reliably, while a model finds people the list misses. Merge the two outputs and deduplicate.

Why does NER miss historical names?

Most off-the-shelf models are trained on modern text, so archaic spelling, OCR errors, and titles like "Goodwife" or "the Lord Privy Seal" fall outside their training distribution. Normalising spelling and adding domain rules recovers many of these.

How do I handle the same person written many different ways?

Extraction and disambiguation are separate steps. First capture every mention as-is with its offset, then cluster variants into person records afterwards using string similarity plus contextual evidence.

What output format should an extraction produce?

Store each mention with the document id, character offsets, the surface form, and a confidence score. Offsets let you trace every name back to the exact spot in the source, which is essential for verification.
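A minimal sketch of such a record, written out as JSON Lines. The class name, field names, and the choice of JSON Lines are assumptions for illustration; how you derive the confidence value is up to you (agreement between extraction methods is one option).

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Mention:
    doc_id: str        # identifier of the source document
    start: int         # character offset where the mention begins
    end: int           # character offset just past the mention
    surface: str       # the text exactly as it appears in the source
    confidence: float  # project-defined score between 0 and 1


def to_jsonl(mentions) -> str:
    """Serialise mentions one per line, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(asdict(m), ensure_ascii=False) for m in mentions)


# Hypothetical record for illustration.
records = [Mention("petition_1647", 23, 34, "Thomas Hale", 0.9)]
print(to_jsonl(records))
```

One record per line keeps the file easy to diff and to append to as review proceeds.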

Can I extract people from handwritten manuscripts?

Yes, but first run handwritten text recognition (HTR), for example with Transkribus, to get text, then treat the output as noisy OCR. Expect lower recall and budget extra correction time for ambiguous letterforms.

Related reading