When you train NER for historical text and it underperforms, the cause is almost always one of four things: a train/inference text mismatch, too little or unbalanced annotation, tokeniser misalignment, or evaluating on data that is cleaner than production. Diagnose by checking those in order. Most "the model is bad" reports are actually "the data pipeline is inconsistent," and fixing the pipeline recovers most of the lost accuracy without touching the model.
First, reproduce the failure on real data
Before changing anything, run the model on a small batch of genuine production pages, not your clean test set. Historical NER often scores 0.90+ F1 on a curated test split and collapses to 0.60 on raw OCR output. If you see that gap, the problem is distribution shift, and no amount of hyperparameter tuning will close it. Fix the data first.
```bash
# spaCy: evaluate against a gold file drawn from the SAME source as production
python -m spacy evaluate ./model ./gold_production.spacy --output metrics.json
```
Why does training accuracy look great but production fail?
Three usual root causes, in order of frequency:
- Clean-train, noisy-deploy. You annotated tidy transcriptions; production carries OCR garble. Inject noisy examples.
- Document leakage. Your train and test splits share documents, so the model memorises specific names and the test score is inflated. Split by document or archive, never by sentence.
- Over-reliance on capitalisation. Common in English NER; archaic texts capitalise Nouns Liberally, so the signal misleads.
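The document-level split mentioned in the list above can be sketched as follows. This is a minimal illustration, not a prescribed method: the `doc_id` key, the example records, and the function name `split_by_document` are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_document(examples, test_fraction=0.2, seed=13):
    """Split examples so each document lands wholly in train or wholly in test.

    `examples` is a list of dicts with a hypothetical "doc_id" key.
    """
    by_doc = defaultdict(list)
    for ex in examples:
        by_doc[ex["doc_id"]].append(ex)

    # Shuffle document ids, not sentences, so no document straddles the split.
    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)
    n_test = max(1, round(len(doc_ids) * test_fraction))
    test_docs = set(doc_ids[:n_test])

    train = [ex for d in doc_ids if d not in test_docs for ex in by_doc[d]]
    test = [ex for d in doc_ids if d in test_docs for ex in by_doc[d]]
    return train, test


# Illustrative records only; real data would carry annotations as well.
examples = [
    {"doc_id": "register_01", "text": "John Smyth of Yorke"},
    {"doc_id": "register_01", "text": "baptised the same daye"},
    {"doc_id": "register_02", "text": "Anne Warde of Leedes"},
    {"doc_id": "letters_01", "text": "to my lord of Essex"},
    {"doc_id": "letters_02", "text": "the towne of Yorke"},
]
train, test = split_by_document(examples)
```

If one archive contributes many near-identical documents, grouping on an archive identifier instead of a document identifier is the stricter choice.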
How do I fix the "everything capitalised is a PERSON" problem?
Augment the training set so capitalisation stops being a free predictor. Add capitalised non-entities (sentence starts, emphasised common nouns) labelled O, and where your sources include them, lowercase entity mentions. A useful diagnostic is to break errors down by surface feature; the table below maps common symptoms to likely causes and fixes:
| Symptom | Likely cause | Fix |
|---|---|---|
| Sentence-start words tagged PERSON | Capitalisation overfit | Add capitalised O examples |
| Spans off by one token | Tokeniser/tag misalignment | Re-align BILOU to tokens |
| Misses variant spellings | Single-spelling training | Add spelling variants |
| Hallucinated entities on OCR junk | No noisy examples | Inject realistic OCR noise |
| Test F1 high, production F1 low | Document leakage | Split by document |
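The capitalisation fix in the first table row can be sketched as two small augmentation helpers. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, annotations are assumed to be `(start_char, end_char, label)` tuples as in spaCy offset format, and lowercased entities should only be added where your sources actually contain lowercase mentions.

```python
def lowercase_entities(text, ents):
    """Return a copy of `text` with each entity span lowercased.

    Offsets are left unchanged, so a span is skipped in the rare Unicode
    case where lowercasing would change its length.
    """
    chars = list(text)
    for start, end, _label in ents:
        lowered = text[start:end].lower()
        if len(lowered) != end - start:
            continue
        chars[start:end] = lowered
    return "".join(chars), list(ents)


def capitalised_negative(sentence):
    """Build an all-O example from an entity-free sentence.

    The first letter is capitalised and the entity list is empty, which
    teaches the model that a capital letter alone is not an entity.
    """
    return sentence[:1].upper() + sentence[1:], []


aug_text, aug_ents = lowercase_entities("the towne of Yorke", [(13, 18, "GPE")])
neg_text, neg_ents = capitalised_negative("the harvest was poor that year")
```

Mix these in alongside the original examples so the model still sees the normal capitalised forms.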
My spans are off by one. What now?
Span drift comes from a mismatch between how characters map to tokens at annotation time versus inference time. Validate alignment explicitly before training:
```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")  # use the same tokeniser as training and inference
text = "the towne of Yorke"
ents = [(13, 18, "GPE")]  # (start_char, end_char, label) covering "Yorke"
doc = nlp.make_doc(text)
tags = offsets_to_biluo_tags(doc, ents)
print(tags)  # any '-' tag means the span misses a token boundary
```
Any "-" tag in the BILOU output means that annotation cannot be learned cleanly; fix the offset or the tokeniser, not the model.
How do I handle line-break hyphenation and long-s?
Old print introduces two systematic span-breakers: words split across lines (Lon-\ndon) and the long s (ſ, which OCR often reads as f). De-hyphenate in preprocessing, and either map the long s to s or train on the original glyphs, but apply the same choice at training and inference. Crucially, keep an offset map so a predicted span can be traced back to its exact position in the source transcription, and from there to the page image, for verification.
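One way to de-hyphenate while keeping an offset map is sketched below. It is a minimal illustration, not a complete normaliser: the function names are hypothetical, and the rule (a hyphen between a letter and a newline followed by a lowercase letter) is a heuristic assumption that will wrongly join genuine hyphenated compounds that happen to fall at a line end.

```python
def dehyphenate(text):
    """Join words split by a hyphen at a line end.

    Returns (clean_text, offset_map), where offset_map[i] is the index in
    the original text of the character at clean_text[i].
    """
    out = []
    offset_map = []
    i = 0
    while i < len(text):
        is_linebreak_hyphen = (
            text[i] == "-"
            and i > 0
            and text[i - 1].isalpha()
            and i + 2 < len(text)
            and text[i + 1] == "\n"
            and text[i + 2].islower()
        )
        if is_linebreak_hyphen:
            i += 2  # drop the hyphen and the newline
            continue
        out.append(text[i])
        offset_map.append(i)
        i += 1
    return "".join(out), offset_map


def to_original_span(start, end, offset_map):
    """Map a [start, end) span in the clean text back to the original text."""
    return offset_map[start], offset_map[end - 1] + 1


raw = "he came to Lon-\ndon in May"
clean, offset_map = dehyphenate(raw)
start = clean.index("London")
orig_start, orig_end = to_original_span(start, start + len("London"), offset_map)
```

Replacing ſ with s swaps one character for one character, so it can be applied without disturbing the offset map.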
Why is my model worse after adding more data?
Counterintuitive but common: adding inconsistently annotated data hurts. If two annotators disagree on whether "the Crown" is an ORG, the model sees contradictory signal. Measure inter-annotator agreement (Cohen's kappa above 0.8 is a reasonable bar), write a guidelines document, and reconcile disagreements before retraining. Quality of labels beats quantity every time at humanities scale.
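Cohen's kappa for two annotators can be computed in a few lines. The sketch below assumes both annotators labelled the same tokens in the same order; the label sequences are invented for illustration. Libraries such as scikit-learn provide an equivalent `cohen_kappa_score`.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)

    # Observed agreement: share of items both annotators labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented token-level labels; the annotators disagree on one ORG.
annotator_a = ["O", "ORG", "O", "PER", "O", "ORG", "O", "O"]
annotator_b = ["O", "O", "O", "PER", "O", "ORG", "O", "O"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Token-level agreement is dominated by the O class, so it is worth also inspecting agreement on entity tokens alone before deciding the guidelines are good enough.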
A repeatable debugging loop
```text
1. Evaluate on real production sample, not clean test
2. If gap is large -> distribution shift -> match training text
3. Check tokeniser/tag alignment with offsets_to_biluo_tags
4. Inspect errors by category (case, spelling, OCR, leakage)
5. Fix the single biggest error class, retrain, re-evaluate
6. Repeat; change one thing per iteration
```
Key Takeaways
- Reproduce failures on real production pages before tuning anything.
- The most common root cause is clean-train, noisy-deploy distribution shift.
- Split train/test by document or archive to prevent name memorisation.
- Validate that every annotated span lands on a token boundary before training.
- De-hyphenate and normalise glyphs consistently, keeping an offset map back to the source.
- Inconsistent labels hurt; measure inter-annotator agreement before adding data.
- Change one variable per iteration so you know what actually helped.
Frequently Asked Questions
Why does my historical NER model score high on training but fail on real pages?
This is overfitting plus distribution shift. Your training set probably came from clean, modern-transcribed text while real pages carry OCR noise and spelling variation. Add noisy and varied examples to training, and always evaluate on a held-out set drawn from the same messy source as production.
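Adding noisy examples can be sketched as a character-substitution pass over clean training text. This is an illustration only: the confusion pairs are hypothetical and should be replaced with errors observed in your own OCR output. Every substitution swaps one character for one character, so existing entity offsets stay valid.

```python
import random

# Hypothetical confusion pairs; derive real ones from your own OCR errors.
OCR_CONFUSIONS = {"s": "f", "e": "c", "l": "1", "o": "0", "h": "b", "u": "n"}


def inject_ocr_noise(text, rate=0.05, seed=None):
    """Replace confusable characters with probability `rate`.

    The output has the same length as the input, so character offsets of
    annotated spans remain correct.
    """
    rng = random.Random(seed)
    chars = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < rate:
            chars.append(OCR_CONFUSIONS[ch])
        else:
            chars.append(ch)
    return "".join(chars)


noisy = inject_ocr_noise("the towne of Yorke", rate=0.3, seed=7)
```

Real OCR also inserts, deletes, and merges characters; handling those would require adjusting entity offsets, which this sketch deliberately avoids.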
How many annotated examples do I need to train historical NER?
When fine-tuning a pretrained transformer you can get usable results from 200 to 500 annotated sentences per entity type. Training a model from scratch needs thousands. If you have fewer than 100 examples, consider gazetteer-assisted rules instead.
My model tags every capitalised word as a person. How do I fix it?
The model has learned that capitalisation predicts PERSON, which fails on sentence-initial words and archaic capitalisation conventions. Balance the training set with capitalised non-entities, and add lowercase entity examples so the model stops relying on case alone.
Should I normalise spelling before or after NER?
Train and predict on the same representation. If your final corpus is normalised, normalise before NER; if you preserve original spelling, train on original spelling. A mismatch between training and inference text is one of the most common silent accuracy killers.
Why are my entity spans off by one token?
Span offset errors almost always come from tokeniser mismatch or misaligned BILOU/IOB tags. Verify that the tokeniser used at training matches inference, and validate that every annotated span maps cleanly onto a token boundary before training.
How do I handle entities split across line breaks in old print?
Hyphenated line-break splits like Lon-\ndon break spans. De-hyphenate and rejoin words in preprocessing, but log every change so you can map predictions back to the original character offsets in the source text, and from there to the page image.