Avoid entity extraction pitfalls by assuming, from the first line of input, that your historical source is noisy, multilingual and full of forms your model has never seen — then designing the pipeline to surface its own failures instead of hiding them. The three pitfalls that sink real projects are OCR noise mistaken for signal, hallucinated or out-of-span entities, and aggregate metrics that mask per-class collapse. Everything below is about catching those before they reach a publication or a finding aid.
Why does OCR noise cause the most damage?
A named-entity model learns surface patterns. Feed it Edw ard split across a line break, a long-s rendered as f (so Massachusetts becomes Maffachufetts), or a marginal note bleeding into a column, and entities vanish without warning. The model does not error; it simply predicts nothing.
Do the cheap repairs deterministically before extraction, but stop there:
```python
import re

def light_clean(text: str) -> str:
    """Apply cheap, deterministic OCR repairs without modernising spellings."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # join hyphenated line breaks
    text = text.replace("ſ", "s")  # long-s to s
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    return text
```

Resist the urge to "modernise" spellings here. If your downstream gazetteer expects Salisburie or Marseilles, rewriting it to a modern form destroys the match you need later.
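One way to keep historical spellings intact while still matching a gazetteer is to leave the surface form untouched and derive a separate lookup key from it. The sketch below assumes a small in-memory gazetteer keyed by casefolded variant spellings; `Mention`, `make_mention`, `lookup`, `GAZETTEER` and the place identifiers are hypothetical names chosen for illustration.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Mention:
    surface: str    # exactly as printed in the source, never rewritten
    match_key: str  # derived key used only for lookup

def make_mention(surface: str) -> Mention:
    # NFKC folds compatibility characters (long-s becomes s) and casefold
    # removes case differences; the historical spelling itself is left alone.
    key = unicodedata.normalize("NFKC", surface).casefold()
    key = " ".join(key.split())  # collapse internal whitespace
    return Mention(surface=surface, match_key=key)

# Hypothetical gazetteer: historical variants point at one place identifier.
GAZETTEER = {
    "salisburie": "place:salisbury",
    "salisbury": "place:salisbury",
    "marseilles": "place:marseille",
}

def lookup(mention: Mention) -> Optional[str]:
    """Return the gazetteer identifier for a mention, or None if unmatched."""
    return GAZETTEER.get(mention.match_key)

m = make_mention("Salisburie")
print(m.surface, lookup(m))
```

Because the key is derived rather than substituted, the stored record still shows what the page actually says, and the matching rules can change later without re-running extraction.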
How do I stop the model inventing entities?
Generative extraction (and some seq2seq NER) will happily return a name that is not in the text. Make hallucination structurally impossible by demanding a character offset and verifying it:
```python
def keep_grounded(spans, source):
    """Keep only predictions whose text matches the source at the claimed offsets."""
    out = []
    for s in spans:
        surface = source[s["start"]:s["end"]]
        if surface.strip().lower() == s["text"].strip().lower():
            out.append(s)
    return out  # anything not a real substring is dropped
```

If a prediction has no honest span in the page, it is gone. This one rule eliminates a whole category of embarrassing errors in computational history.
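Some generative models return only strings, with no offsets to verify. In that case the offsets can be recovered by searching the page, and a name that cannot be found simply yields no span. This is a minimal sketch that assumes exact, case-sensitive matching against the same text the model saw; `locate_spans` is a hypothetical helper, and the sample page is invented for illustration.

```python
def locate_spans(entity_text: str, source: str) -> list:
    """Return one span dict per exact occurrence of entity_text in source."""
    spans = []
    needle = entity_text.strip()
    if not needle:
        return spans  # an empty prediction can never be grounded
    start = source.find(needle)
    while start != -1:
        spans.append({"text": needle, "start": start, "end": start + len(needle)})
        start = source.find(needle, start + 1)
    return spans

page = "Edward Winslow sailed from Leyden. Winslow returned in 1623."
print(locate_spans("Winslow", page))        # two grounded occurrences
print(locate_spans("John Winthrop", page))  # empty list: the name is not on the page
```

The spans it returns use the same `text`, `start` and `end` keys as the verification step, so the two can be chained.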
What metrics actually tell you the truth?
Aggregate precision and recall lie by averaging. A model can score 0.91 overall while recall on 18th-century organisations sits at 0.38. Always disaggregate by entity type and by source period.
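A disaggregated report might be computed along the following lines. The sketch assumes gold and predicted spans are dictionaries carrying `doc`, `start`, `end`, `type` and `decade` fields (hypothetical field names), that neither list contains duplicates, and that a prediction counts as correct only on an exact match of document, offsets and type.

```python
from collections import defaultdict

def span_key(s):
    """Identity of a span for exact-match scoring."""
    return (s["doc"], s["start"], s["end"], s["type"])

def scores_by_group(gold, pred):
    """Exact-match precision and recall per (type, decade) group."""
    gold_keys = {span_key(s) for s in gold}
    pred_keys = {span_key(s) for s in pred}
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for s in pred:
        outcome = "tp" if span_key(s) in gold_keys else "fp"
        counts[(s["type"], s["decade"])][outcome] += 1
    for s in gold:
        if span_key(s) not in pred_keys:
            counts[(s["type"], s["decade"])]["fn"] += 1
    report = {}
    for group, c in counts.items():
        p_den = c["tp"] + c["fp"]
        r_den = c["tp"] + c["fn"]
        report[group] = {
            "precision": c["tp"] / p_den if p_den else None,
            "recall": c["tp"] / r_den if r_den else None,
            "support": r_den,  # number of gold spans in this group
        }
    return report
```

Reporting `support` next to each score matters: a recall figure computed from a handful of gold spans deserves less trust than one computed from hundreds.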
| Pitfall | Symptom you see | What it hides | Fix |
|---|---|---|---|
| OCR noise | Suddenly fewer entities on one collection | Whole pages with zero recall | Light clean + sample by collection |
| Hallucination | Plausible names absent from page | Fabricated spans in output | Offset verification |
| Aggregate metrics | "Good enough" F1 | A class at near-zero recall | Per-type, per-decade breakdown |
| Over-normalisation | Gazetteer match rate drops | Lost historical spellings | Keep original surface form |
When does a language model help, and when does it hurt?
LLMs handle rare and variant spellings better than a narrow trained model, which is genuinely useful for early-modern text. But they introduce anachronism: silently translating Constantinople to Istanbul, normalising the Grand Turk into a modern country, or inventing a death date. Pin the model version, log every prompt and response, and never skip the gold-sample check.
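Logging can be as simple as appending one JSON record per call to a file. The sketch below is one possible shape, not a prescribed format: `log_call`, the `MODEL_VERSION` value and the record fields are all assumptions made for illustration, and the call to the model itself is left out.

```python
import hashlib
import json
from datetime import datetime, timezone

MODEL_VERSION = "example-llm@2025-01-15"  # hypothetical pinned identifier

def log_call(log_path, prompt, response, model=MODEL_VERSION):
    """Append one prompt/response pair to a JSON Lines audit log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "response": response,
    }
    # Append mode keeps earlier records; ensure_ascii=False preserves
    # historical characters such as long-s in a readable form.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Keeping the raw response, rather than only the parsed entities, makes it possible to spot later where a name was silently translated or a date was added.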
How do you make results defensible later?
Reproducibility is the difference between a dataset a reviewer trusts and one they reject. Record, per run: the model and version, the cleaning steps, the prompt or config, and a hash of the input corpus. Store predictions with their offsets so any claim can be traced to a page.
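The corpus hash can be derived from the files themselves. This is a minimal sketch assuming the corpus is a directory of files small enough to read into memory one at a time; `corpus_sha256` is a hypothetical helper, and hashing relative paths together with contents is a design choice here, not a standard.

```python
import hashlib
from pathlib import Path

def corpus_sha256(root):
    """Hash the relative path and bytes of every file under root, in sorted order."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort on the POSIX-style relative path so the result does not depend
    # on the order in which the file system lists entries.
    files.sort(key=lambda p: p.relative_to(root).as_posix())
    digest = hashlib.sha256()
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and content
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

The digest changes if any file is edited, renamed, added or removed, which is what makes it useful as a single value to store in a run record.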
```yaml
run:
  model: "xlm-roberta-base-ner-hist@2025-01"
  cleaning: ["join_hyphens", "long_s", "ws_collapse"]
  corpus_sha256: "9c1f..."
  gold_sample: "review/gold_v3.jsonl"
```

How big should the manual sample be?
Draw 200-300 entity spans at random, stratified across periods and types, and adjudicate them by hand against the page image. That sample size is large enough to detect a class collapsing to near-zero recall — the failure that silently ruins downstream prosopography or mapping.
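The draw itself can be scripted so that it is repeatable. The sketch below assumes each span is a dictionary with `type` and `period` fields (hypothetical names) and uses equal allocation across strata, which is one choice among several; `stratified_sample` and its default values are illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(spans, total=250, seed=13):
    """Draw roughly `total` spans, spread evenly across (type, period) strata."""
    strata = defaultdict(list)
    for s in spans:
        strata[(s["type"], s["period"])].append(s)
    if not strata:
        return []
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    per_stratum = max(1, total // len(strata))
    sample = []
    for key in sorted(strata):  # sorted so the draw order is deterministic
        members = strata[key]
        # Small strata contribute everything they have.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

Equal allocation deliberately over-represents rare classes, since those are the ones most likely to collapse unnoticed. The trade-off is that an overall score estimated from this sample would need reweighting by stratum size.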
Key Takeaways
- Treat every historical source as noisy and multilingual by default; design the pipeline to expose its own failures.
- Do cheap, deterministic OCR repairs before extraction, but never over-normalise away spellings your gazetteer needs.
- Verify every predicted span is a real substring of the page to make hallucination impossible.
- Break metrics down by entity type and by decade; aggregate F1 hides class collapse.
- LLMs trade rare-spelling robustness for anachronism risk — pin versions and validate either way.
- Log model, cleaning, prompt and corpus hash so results stay defensible and reproducible.
- Adjudicate 200-300 stratified spans by hand before declaring quality acceptable.
Frequently Asked Questions
What is the single most common entity extraction pitfall in historical sources?
Treating OCR noise as signal. A model trained on clean modern text will silently drop or mangle entities the moment it meets long-s, hyphenated line breaks or column bleed, and those losses never show up unless you sample by hand.
Should I fix OCR before extraction or extract first?
Fix the cheap, deterministic errors first (line-break hyphens, long-s, common confusions) but do not attempt full normalisation. Over-cleaning rewrites historical spellings your gazetteer might actually need for matching.
How do I stop a model inventing entities that are not in the text?
Constrain decoding to offsets in the source, log a character span for every prediction, and reject any entity whose surface form is not a substring of the page. Hallucinated spans are then impossible by construction.
Why do my precision numbers look great but the project still fails?
Aggregate precision hides per-class and per-period collapse. Organisations from 1700 may score 0.4 while modern persons score 0.95, and the average looks fine. Always break metrics down by entity type and by source decade.
Is a large language model safer than a trained NER model for history?
Not automatically. LLMs reduce some pitfalls (rare spellings) but add new ones: anachronistic normalisation, silent translation and fabricated dates. Pin a version, log prompts, and validate against a gold sample either way.
How big should my manual review sample be?
For a defensible estimate, review at least 200-300 randomly drawn entity spans stratified across periods and types. That is enough to catch a class collapsing to near-zero recall, which a 30-span glance will miss.