To extract people from primary sources, run a named entity recognition (NER) model over cleaned, transcribed text, filter to the PERSON label, and then verify every span by hand against the original. The model gives you a fast draft; the human pass turns it into evidence you can cite. For a typical corpus of a few hundred pages, expect a usable list within an afternoon if your text is already digitised.
The hard part is rarely the model. It is the messiness of historical text: variant spellings, honorifics, OCR noise, and people referred to only by office or kinship ("the widow Hale", "my lord of Norfolk"). This guide walks through a workflow that survives all of that.
What does a person extraction pipeline actually look like?
At a high level the pipeline has four stages: prepare text, run extraction, capture mentions with offsets, then review. Keep them separate so you can rerun any one stage without redoing the others.
```python
import spacy

# Load the transformer pipeline once; it is slow to initialise.
nlp = spacy.load("en_core_web_trf")

with open("petition_1647.txt", encoding="utf-8") as f:
    text = f.read()

doc = nlp(text)

# Keep only PERSON spans, with character offsets into `text`.
mentions = [
    {"surface": ent.text, "start": ent.start_char, "end": ent.end_char}
    for ent in doc.ents
    if ent.label_ == "PERSON"
]
```

Storing `start_char` and `end_char` is non-negotiable. Offsets are what let a reviewer jump straight to the mention in context and what let you regenerate a corrected list later without losing provenance.
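To illustrate why offsets matter for review, here is a minimal sketch of a helper that shows a mention in its surrounding text. The function name `context_window`, the `[[...]]` markers, and the sample sentence are all hypothetical, chosen only for this example.

```python
def context_window(text: str, start: int, end: int, width: int = 40) -> str:
    """Return the mention wrapped in [[...]] with `width` characters either side."""
    left = text[max(0, start - width):start]
    right = text[end:end + width]
    return f"{left}[[{text[start:end]}]]{right}"


# Invented sample sentence; in practice `text` is the document you ran NER on.
sample = "The humble petition of Oliver Cromwell, of Ely, sheweth that"
start = sample.index("Oliver Cromwell")
end = start + len("Oliver Cromwell")

print(context_window(sample, start, end, width=12))
```

Because the helper only slices the original string, the reviewer always sees the source text itself rather than a copy that may have drifted from it.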
How do I prepare the text so the model behaves?
Three cheap fixes recover most lost recall:
- Normalise the long s (`ſ` becomes `s`) and resolve common ligatures.
- Repair line-break hyphenation: join `Crom-\nwell` into `Cromwell` before NER, or the model sees two fragments.
- Lowercase nothing. Capitalisation is a strong signal for names; preserve it.
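The first fix in the list can be sketched as a character translation table. The mapping below is an illustrative assumption, not a complete inventory; extend it with whatever glyphs your transcriptions actually contain.

```python
# Hypothetical mapping: long s plus the Latin ligatures in Unicode's
# Alphabetic Presentation Forms block.
LIGATURES = {
    "ſ": "s",    # long s
    "ﬅ": "st",   # long s + t
    "ﬆ": "st",
    "ﬀ": "ff",
    "ﬁ": "fi",
    "ﬂ": "fl",
    "ﬃ": "ffi",
    "ﬄ": "ffl",
}
_TABLE = str.maketrans(LIGATURES)


def normalise_glyphs(text: str) -> str:
    """Replace long s and ligatures with plain letters; case is left untouched."""
    return text.translate(_TABLE)


print(normalise_glyphs("the ſaid Maſter Griﬃn"))
```

Expanding a ligature changes the string length, so run this before extraction and treat the normalised text as the reference for offsets, keeping the raw transcription alongside it.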
A quick regex pass handles hyphenation:
```python
import re

# Join words split by a hyphen at a line break, e.g. "Crom-\nwell" -> "Cromwell".
text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
```

Should I trust a single model, or combine approaches?
Combine them. A model and a gazetteer fail in different places, so their union has higher recall than either alone.
| Approach | Strengths | Weaknesses |
|---|---|---|
| Transformer NER | Finds unseen names, uses context | Misses archaic forms, needs a GPU to be fast |
| Gazetteer match | High recall on known figures in their listed spellings | Blind to anyone, or any spelling, not on the list |
| Regex on honorifics | Catches "Mr.", "Mistress", "Sir" patterns | Noisy; flags many false positives |
Run all three, tag each mention with its source, then merge. Where two methods agree, confidence is high; where only the regex fires, send it to the review queue.
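One way to sketch the merge step, assuming each method produces mention dicts with a `source` key and that two methods "agree" only when their spans match exactly. The function name `merge_mentions` and the `agreed` and `needs_review` fields are hypothetical.

```python
from collections import defaultdict


def merge_mentions(*tagged_lists):
    """Merge mention lists on exact (start, end) span, collecting their sources."""
    by_span = defaultdict(lambda: {"surface": None, "sources": set()})
    for mentions in tagged_lists:
        for m in mentions:
            rec = by_span[(m["start"], m["end"])]
            rec["surface"] = m["surface"]
            rec["sources"].add(m["source"])

    merged = []
    for (start, end), rec in sorted(by_span.items()):
        sources = sorted(rec["sources"])
        merged.append({
            "surface": rec["surface"],
            "start": start,
            "end": end,
            "sources": sources,
            "agreed": len(sources) >= 2,          # two or more methods fired
            "needs_review": sources == ["regex"],  # regex alone is the noisy case
        })
    return merged


# Invented offsets, for illustration only.
ner = [{"surface": "Thomas Fairfax", "start": 10, "end": 24, "source": "ner"}]
gaz = [{"surface": "Thomas Fairfax", "start": 10, "end": 24, "source": "gazetteer"}]
rx = [{"surface": "Mr. Hale", "start": 40, "end": 48, "source": "regex"}]

for row in merge_mentions(ner, gaz, rx):
    print(row)
```

Exact-span matching is the simplest possible rule. It will not merge "Sir Thomas Fairfax" from the regex with "Thomas Fairfax" from the model, so an overlap-based comparison may suit your corpus better.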
How do I deal with titles and roles instead of names?
Many early-modern people appear as an office, not a personal name. Decide your policy up front and document it: do you record "the Recorder of London" as a person mention, a separate role mention, or both? I recommend capturing the role span with a role flag so you can later resolve it to a named individual using other records, without conflating it with a true name at extraction time.
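A minimal sketch of what a role-flagged mention record might look like. The class name `Mention` and the fields `is_role` and `resolved_to` are hypothetical, and the sample sentence is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    surface: str
    start: int
    end: int
    is_role: bool = False               # True for offices and kinship phrases
    resolved_to: Optional[str] = None   # filled in later from other records


text = "presented to the Recorder of London by John Hale"

role = "the Recorder of London"
name = "John Hale"
mentions = [
    Mention(role, text.index(role), text.index(role) + len(role), is_role=True),
    Mention(name, text.index(name), text.index(name) + len(name)),
]

# Roles stay in the data but are easy to separate from true names.
names_only = [m for m in mentions if not m.is_role]
roles_only = [m for m in mentions if m.is_role]
print(names_only)
print(roles_only)
```

Leaving `resolved_to` empty at extraction time keeps the later identification step explicit and reviewable.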
What about the same person spelled five ways?
Resist fixing this during extraction. Capture every surface form verbatim, then cluster afterwards. A petition might contain "Cromwell", "Cromwel", "Crumwell", and "O. C." for one man — collapsing them too early destroys the offsets you need for verification. Disambiguation is its own discipline; see the related guide below.
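As a rough illustration of clustering after the fact, the sketch below groups surface forms greedily by string similarity using the standard library's `difflib`. The 0.8 threshold and the function name `cluster_surface_forms` are assumptions for the example, not tuned values.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two surface forms."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_surface_forms(forms, threshold=0.8):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for form in forms:
        for cluster in clusters:
            if similarity(form, cluster[0]) >= threshold:
                cluster.append(form)
                break
        else:
            clusters.append([form])
    return clusters


forms = ["Cromwell", "Cromwel", "Crumwell", "O. C."]
print(cluster_surface_forms(forms))
```

The spelling variants group together, but the initials "O. C." stay in a cluster of their own. String similarity cannot link them; that takes contextual evidence. The stored mentions and their offsets are never modified, since the clustering only assigns forms to groups.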
How do I check the result is good enough?
Sample 50 mentions at random and compute precision against your own judgement. Then read one full document, list every person you find by eye, and compare that list with the pipeline's output to estimate recall for that document. If precision is below about 0.9 you are creating cleanup work; if recall is below 0.8 you are missing too many people to make claims about who appears.
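The two checks reduce to simple arithmetic. The sketch below assumes reviewer judgements are recorded as booleans and that mentions are compared by `(start, end)` span; the function names and the numbers in the example are made up for illustration.

```python
import random


def sample_for_review(mentions, k=50, seed=0):
    """Draw up to k mentions at random; a fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(mentions, min(k, len(mentions)))


def precision(judgements):
    """judgements: list of bools, True where the reviewer accepted the mention."""
    return sum(judgements) / len(judgements)


def recall(found_by_pipeline, found_by_eye):
    """Both arguments are sets of (start, end) spans from one fully read document."""
    return len(found_by_pipeline & found_by_eye) / len(found_by_eye)


# Invented figures: 46 of 50 sampled mentions accepted by the reviewer.
p = precision([True] * 46 + [False] * 4)

# Invented spans: the pipeline found 3 of the 5 people listed by eye.
pipeline_spans = {(0, 5), (10, 20), (30, 38), (50, 57)}
eye_spans = {(0, 5), (10, 20), (30, 38), (40, 46), (60, 66)}
r = recall(pipeline_spans, eye_spans)

print(f"precision={p:.2f} recall={r:.2f}")
```

With a sample of 50, each misjudged mention moves precision by two percentage points, so treat a result close to the threshold as inconclusive and sample more.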
Key Takeaways
- Extraction is draft-then-verify: run NER, filter PERSON, verify by hand.
- Always store character offsets so every name is traceable to the source.
- Normalise long s, ligatures, and hyphenation before the model runs.
- Combine a model, a gazetteer, and honorific rules for higher recall.
- Keep extraction and disambiguation as separate stages.
- Record offices and roles with a flag rather than treating them as names.
- Validate with a precision sample and a recall read of one full document.
Frequently Asked Questions
What is the fastest way to extract people from a primary source?
Run a transformer NER model such as spaCy's en_core_web_trf over the cleaned text, keep only PERSON spans, then review them by hand. For a few hundred pages this gets you a draft list in minutes, with a couple of hours of correction.
Should I use a rule-based gazetteer or a machine-learning model?
Use both. A name gazetteer catches known recurring figures and honorific patterns reliably, while a model finds people the list misses. Merge the two outputs and deduplicate.
Why does NER miss historical names?
Models are trained on modern text, so archaic spelling, OCR errors, and titles like "Goodwife" or "the Lord Privy Seal" fall outside their distribution. Normalising spelling and adding domain rules recovers most of these.
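A domain rule can be as small as a regular expression on honorifics. The sketch below is a hedged example: the honorific list, the function name `honorific_mentions`, and the sample sentence are illustrative assumptions, and the pattern will produce false positives on real text.

```python
import re

# Illustrative honorifics only; extend the list for your period and region.
HONORIFIC = re.compile(
    r"\b(?:Mr\.|Mrs\.|Sir|Mistress|Goodwife|Widow)"
    r"\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?"
)


def honorific_mentions(text):
    """Return honorific-led spans as mention dicts tagged with their source."""
    return [
        {"surface": m.group(0), "start": m.start(), "end": m.end(), "source": "regex"}
        for m in HONORIFIC.finditer(text)
    ]


sample = "Then came Sir Thomas Fairfax, with Mistress Hale and Goodwife Carter."
for mention in honorific_mentions(sample):
    print(mention)
```

Tagging each hit with `"source": "regex"` lets these spans be reviewed separately from model output.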
How do I handle the same person written many different ways?
Extraction and disambiguation are separate steps. First capture every mention as-is with its offset, then cluster variants into person records afterwards using string similarity plus contextual evidence.
What output format should an extraction produce?
Store each mention with the document id, character offsets, the surface form, and a confidence score. Offsets let you trace every name back to the exact spot in the source, which is essential for verification.
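One possible serialisation is JSON Lines, one mention per line. This is a sketch under assumptions: the field names, the function `to_jsonl`, and the document id are hypothetical, and `confidence` is whatever your extractor or merge step supplies (it is `None` here when absent).

```python
import json


def to_jsonl(doc_id, mentions):
    """Serialise mentions as JSON Lines, one record per mention."""
    lines = []
    for m in mentions:
        record = {
            "doc_id": doc_id,
            "start": m["start"],
            "end": m["end"],
            "surface": m["surface"],
            "confidence": m.get("confidence"),  # None if no score was supplied
        }
        # ensure_ascii=False keeps characters such as the long s readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Invented mentions, for illustration only.
mentions = [
    {"surface": "Oliver Cromwell", "start": 23, "end": 38, "confidence": 0.97},
    {"surface": "the widow Hale", "start": 120, "end": 134},
]
print(to_jsonl("petition_1647", mentions))
```

A line-per-record format appends cleanly when you rerun one stage and diffs well under version control.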
Can I extract people from handwritten manuscripts?
Yes, but run HTR (e.g. Transkribus) first to get text, then treat the output as noisy OCR. Expect lower recall and budget extra correction time for ambiguous letterforms.