Best Practices to Avoid Entity Extraction Pitfalls

Avoid entity extraction pitfalls by assuming, from the first line of input, that your historical source is noisy, multilingual and full of forms your model has never seen — then designing the pipeline to surface its own failures instead of hiding them. The three pitfalls that sink real projects are OCR noise mistaken for signal, hallucinated or out-of-span entities, and aggregate metrics that mask per-class collapse. Everything below is about catching those before they reach a publication or a finding aid.

Why does OCR noise cause the most damage?

A named-entity model learns surface patterns. Feed it "Edw ard" split across a line break, a long-s rendered as f (so "Massachusetts" becomes "Maffachufetts"), or a marginal note bleeding into a column, and entities vanish without warning. The model does not error; it simply predicts nothing.

Do the cheap repairs deterministically before extraction, but stop there:

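```python
import re

def light_clean(text: str) -> str:
    """Apply cheap, deterministic OCR repairs; spellings are left untouched."""
    # Join words hyphenated across a line break. Note: this also merges
    # genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = text.replace("ſ", "s")          # long-s to s
    text = re.sub(r"[ \t]+", " ", text)    # collapse runs of spaces and tabs
    return text
```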

Resist the urge to "modernise" spellings here. If your downstream gazetteer expects Salisburie or Marseilles, rewriting it to a modern form destroys the match you need later.
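One way to follow this advice is to leave the text untouched and confine any normalisation to a separate lookup key. The following is a minimal sketch, assuming a hypothetical in-memory gazetteer keyed by case-folded historical spellings; the entries, the identifiers and the name `lookup_place` are illustrative, not taken from any real gazetteer.

```python
# Hypothetical gazetteer: case-folded historical spelling -> canonical ID.
GAZETTEER = {
    "salisburie": "place:salisbury",
    "marseilles": "place:marseille",
}

def lookup_place(surface, gazetteer=GAZETTEER):
    """Match on a derived key; never overwrite the surface form itself."""
    # Only whitespace and case are normalised, never the spelling.
    key = " ".join(surface.split()).casefold()
    return {
        "surface": surface,                   # exactly as it appears on the page
        "gazetteer_id": gazetteer.get(key),   # None when nothing matches
    }
```

Because the surface form travels with the match, a later reviewer can still see what the page actually said.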

How do I stop the model inventing entities?

Generative extraction (and some seq2seq NER) will happily return a name that is not in the text. Make hallucination structurally impossible by demanding a character offset and verifying it:

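```python
def keep_grounded(spans, source):
    """Keep only spans whose offsets point at the text they claim to be.

    Each span is a dict with "text", "start" and "end" (character offsets).
    """
    out = []
    for s in spans:
        surface = source[s["start"]:s["end"]]
        # Compare case-insensitively, ignoring surrounding whitespace.
        if surface.strip().lower() == s["text"].strip().lower():
            out.append(s)
    return out  # spans whose offsets do not match their text are dropped
```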

If a prediction has no honest span in the page, it is gone. This one rule eliminates a whole category of embarrassing errors in computational history.
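When a generative model returns only a string, the offset has to be recovered before it can be verified. A minimal sketch, assuming literal case-insensitive matching is acceptable for your sources; `locate` is a hypothetical helper name, not part of any library.

```python
import re

def locate(text, source):
    """Return a span dict for every literal, case-insensitive occurrence."""
    needle = text.strip()
    if not needle:
        return []
    # re.escape makes the entity text match literally, never as a pattern.
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    return [
        {"text": source[m.start():m.end()], "start": m.start(), "end": m.end()}
        for m in pattern.finditer(source)
    ]
```

An empty result means the model produced a name the page does not contain, so the prediction can be rejected. Several results mean the mention is ambiguous and may need a human decision.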

What metrics actually tell you the truth?

Aggregate precision and recall lie by averaging. A model can score 0.91 overall while recall on 18th-century organisations sits at 0.38. Always disaggregate by entity type and by source period.

| Pitfall | Symptom you see | What it hides | Fix |
| --- | --- | --- | --- |
| OCR noise | Suddenly fewer entities on one collection | Whole pages with zero recall | Light clean + sample by collection |
| Hallucination | Plausible names absent from page | Fabricated spans in output | Offset verification |
| Aggregate metrics | "Good enough" F1 | A class at near-zero recall | Per-type, per-decade breakdown |
| Over-normalisation | Gazetteer match rate drops | Lost historical spellings | Keep original surface form |
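The per-type, per-period breakdown can be sketched as follows. This assumes exact-match scoring and span dicts carrying `doc`, `start`, `end`, `type` and `period` fields; those field names and the function name `breakdown` are illustrative assumptions, and the sketch further assumes neither list contains duplicate spans.

```python
from collections import defaultdict

def breakdown(gold, pred):
    """Precision and recall per (type, period), exact match on offsets."""
    def key(s):
        return (s["doc"], s["start"], s["end"], s["type"])

    gold_keys = {key(s) for s in gold}
    pred_keys = {key(s) for s in pred}
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})

    # A prediction is a true positive only if an identical gold span exists.
    for s in pred:
        outcome = "tp" if key(s) in gold_keys else "fp"
        counts[(s["type"], s["period"])][outcome] += 1
    # Gold spans the model never produced count against recall.
    for s in gold:
        if key(s) not in pred_keys:
            counts[(s["type"], s["period"])]["fn"] += 1

    report = {}
    for group, c in counts.items():
        p_den = c["tp"] + c["fp"]
        r_den = c["tp"] + c["fn"]
        report[group] = {
            "precision": c["tp"] / p_den if p_den else None,
            "recall": c["tp"] / r_den if r_den else None,
            "support": r_den,  # number of gold spans in this group
        }
    return report
```

Reporting `support` next to each score makes it harder to over-read a figure that rests on a handful of spans.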

When does a language model help, and when does it hurt?

LLMs handle rare and variant spellings better than a narrow trained model, which is genuinely useful for early-modern text. But they introduce anachronism: silently translating Constantinople to Istanbul, normalising the Grand Turk into a modern country, or inventing a death date. Pin the model version, log every prompt and response, and never skip the gold-sample check.
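Logging every prompt and response can be as simple as appending one JSON record per call. This is a minimal sketch, assuming a JSON Lines log file is acceptable; `log_llm_call` and its field names are hypothetical, and the call to the model itself is deliberately left out because it depends on your provider.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_llm_call(log_path, model, prompt, response):
    """Append one prompt/response pair to a JSON Lines audit log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,  # the pinned model identifier, recorded verbatim
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "response": response,
    }
    # Append mode keeps earlier records; one JSON object per line.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

The prompt hash lets you group responses by identical prompt without comparing long strings.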

How do you make results defensible later?

Reproducibility is the difference between a dataset a reviewer trusts and one they reject. Record, per run: the model and version, the cleaning steps, the prompt or config, and a hash of the input corpus. Store predictions with their offsets so any claim can be traced to a page.

```yaml
run:
  model: "xlm-roberta-base-ner-hist@2025-01"
  cleaning: ["join_hyphens", "long_s", "ws_collapse"]
  corpus_sha256: "9c1f..."
  gold_sample: "review/gold_v3.jsonl"
```
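A corpus hash like the one recorded above can be computed with the standard library. This is a sketch under assumptions: the corpus is a directory of plain-text files, and hashing relative paths together with file contents is one reasonable definition of "the same corpus". The function name `corpus_sha256` and the `*.txt` default are illustrative.

```python
import hashlib
from pathlib import Path

def corpus_sha256(root, pattern="*.txt"):
    """Hash every matching file under root, in a stable order."""
    root_path = Path(root)
    digest = hashlib.sha256()
    files = [p for p in root_path.rglob(pattern) if p.is_file()]
    # Sort by POSIX-style relative path so the order is platform independent.
    for path in sorted(files, key=lambda p: p.relative_to(root_path).as_posix()):
        digest.update(path.relative_to(root_path).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between file name and contents
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

Any added, removed, renamed or edited file changes the digest, which is the property a reviewer needs.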

How big should the manual sample be?

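Draw 200-300 entity spans at random, stratified across periods and types, and adjudicate them by hand against the page image. Recall can only be measured against entities the model missed, so draw the spans from hand-annotated gold pages rather than from the model's output alone. A sample of that size is large enough to detect a class collapsing to near-zero recall, the failure that silently ruins downstream prosopography or mapping.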
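The stratified draw can be sketched as below. It assumes each span dict carries `period` and `type` fields and uses an equal allocation per stratum, which is only one possible design; `stratified_sample` is a hypothetical helper name.

```python
import random
from collections import defaultdict

def stratified_sample(spans, per_stratum, seed=0):
    """Draw up to per_stratum spans from each (period, type) group."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    strata = defaultdict(list)
    for s in spans:
        strata[(s["period"], s["type"])].append(s)

    sample = []
    # Visit strata in sorted order so the result does not depend on input order
    # of the groups themselves.
    for group in sorted(strata):
        members = strata[group]
        k = min(per_stratum, len(members))  # small strata are taken whole
        sample.extend(rng.sample(members, k))
    return sample
```

Equal allocation deliberately over-samples rare classes, which is where collapse tends to hide. Recording the seed alongside the other run details makes the review sample itself reproducible.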

Key Takeaways

Treat every historical source as noisy and multilingual by default; design the pipeline to expose its own failures.
Do cheap, deterministic OCR repairs before extraction, but never over-normalise away spellings your gazetteer needs.
Verify every predicted span is a real substring of the page to make hallucination impossible.
Break metrics down by entity type and by decade; aggregate F1 hides class collapse.
LLMs trade rare-spelling robustness for anachronism risk — pin versions and validate either way.
Log model, cleaning, prompt and corpus hash so results stay defensible and reproducible.
Adjudicate 200-300 stratified spans by hand before declaring quality acceptable.

Frequently Asked Questions

What is the single most common entity extraction pitfall in historical sources?

Treating OCR noise as signal. A model trained on clean modern text will silently drop or mangle entities the moment it meets long-s, hyphenated line breaks or column bleed, and those losses never show up unless you sample by hand.

Should I fix OCR before extraction or extract first?

Fix the cheap, deterministic errors first (line-break hyphens, long-s, common confusions) but do not attempt full normalisation. Over-cleaning rewrites historical spellings your gazetteer might actually need for matching.

How do I stop a model inventing entities that are not in the text?

Constrain decoding to offsets in the source, log a character span for every prediction, and reject any entity whose surface form is not a substring of the page. Hallucinated spans are then impossible by construction.

Why do my precision numbers look great but the project still fails?

Aggregate precision hides per-class and per-period collapse. Organisations from 1700 may score 0.4 while modern persons score 0.95, and the average looks fine. Always break metrics down by entity type and by source decade.

Is a large language model safer than a trained NER model for history?

Not automatically. LLMs reduce some pitfalls (rare spellings) but add new ones: anachronistic normalisation, silent translation and fabricated dates. Pin a version, log prompts, and validate against a gold sample either way.

How big should my manual review sample be?

For a defensible estimate, review at least 200-300 randomly drawn entity spans stratified across periods and types. That is enough to catch a class collapsing to near-zero recall, which a 30-span glance will miss.
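To see why sample size matters, it helps to put an interval around a per-class score. The sketch below uses the Wilson score interval, a standard choice for proportions from small samples; it is offered as an illustration, not as the method the figures above were derived from, and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion hits/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

With zero correct spans out of 5 in a class, the upper bound is still about 0.43, so collapse cannot be told apart from mediocre performance. With zero out of 40, the upper bound falls below 0.09.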
