When to Do NER on Old Languages

You should run named entity recognition (NER) on old languages when three conditions hold: the corpus is large enough that manual tagging is impractical, entities are genuinely your unit of analysis, and your transcription is clean enough to be trustworthy. If any one fails (a tiny corpus, a question that is not about entities, or noisy OCR), a rule-based gazetteer or simple close reading usually wins on both accuracy and time. NER is a tool with real setup costs, not a default.

The temptation is to reach for a transformer because it worked on modern English. On Latin, Old Norse, or Early Modern Dutch, the same model can score so poorly that you spend more time correcting its output than you would have spent tagging by hand.

When is NER actually the right call?

Run the decision against these signals:

Signal	Favours NER	Favours manual / rules
Corpus size	thousands of pages	a handful of documents
Research question	"who/where appears, at scale"	one close argument
Transcription quality	CER under ~10%	heavy OCR/HTR noise
Entity novelty	many unknown names	a fixed, known list
Reusability	ongoing project	one-off task

If most rows land on the left, build the pipeline. If they cluster right, do not.
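The table can be read as a rough checklist. The sketch below assumes each row is reduced to a yes/no answer and applies the "most rows" rule literally; the `Signals` field names and the simple majority vote are illustrative assumptions, not a calibrated method.

```python
from dataclasses import dataclass, fields


@dataclass
class Signals:
    """One boolean per table row; True means the row favours NER."""
    large_corpus: bool         # thousands of pages rather than a handful
    entity_question: bool      # "who/where appears, at scale"
    clean_transcription: bool  # CER under roughly 10%
    many_unknown_names: bool   # not a fixed, known list
    ongoing_project: bool      # the pipeline will be reused


def favours_ner(signals: Signals) -> bool:
    """Return True when more than half of the rows land on the NER side."""
    votes = [getattr(signals, f.name) for f in fields(signals)]
    return sum(votes) > len(votes) / 2


# Hypothetical projects, for illustration only.
charter_study = Signals(False, False, True, False, False)
chronicle_corpus = Signals(True, True, True, True, False)

print(favours_ner(charter_study))     # False
print(favours_ner(chronicle_corpus))  # True
```

A majority vote treats every row as equally important, which is a simplification: in practice a single failed row, such as very noisy transcription, can be enough to rule NER out.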

Why do modern models fail on historical text?

Three distribution shifts break them at once:

Orthography — non-standard, variable spelling that the model never saw in training.
Morphology — heavily inflected languages like Latin mean one name appears in many case forms.
Capitalisation and punctuation — pre-modern conventions differ or are absent, removing signals the model relied on.

A model trained on CoNLL news data has seen little or nothing like "Eboracum", and has no way of knowing that a Latin genitive ending changes a place name's surface form.

What works better for old languages?

Use purpose-built tools and expect to adapt them. The Classical Language Toolkit (CLTK) ships processing for Latin, Greek, and other classical languages; spaCy supports custom historical pipelines; and multilingual transformers can be fine-tuned on a modest amount of in-domain data.

python

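from cltk import NLP

# Load the default CLTK pipeline for Latin ("lat" is the language code).
cltk_nlp = NLP(language="lat")

# Run the pipeline on one sentence; the result holds the analysed tokens.
doc = cltk_nlp.analyze(text="Gaius Iulius Caesar in Galliam profectus est")

# Inspect the lemmatised, morphologically tagged tokens before adding entity logic.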

Lemmatising first is often the unlock: collapse inflected forms to a lemma so "Caesar", "Caesaris", and "Caesarem" become one anchor for matching and tagging.
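To make the "one anchor" idea concrete, here is a minimal sketch of a gazetteer keyed by lemma instead of surface form. The `LEMMA_OF` table is a hand-written stand-in for a real lemmatiser's output, and both it and the `GAZETTEER` entries are assumptions made for illustration.

```python
# Toy form-to-lemma table standing in for a real lemmatiser (assumption).
LEMMA_OF = {
    "Caesar": "Caesar",
    "Caesaris": "Caesar",
    "Caesarem": "Caesar",
    "Galliam": "Gallia",
    "Galliae": "Gallia",
}

# Gazetteer keyed by lemma, so one entry covers every case form.
GAZETTEER = {"Caesar": "PERSON", "Gallia": "PLACE"}


def tag_tokens(tokens):
    """Return (token, lemma, label) for each token whose lemma is in the gazetteer."""
    hits = []
    for token in tokens:
        lemma = LEMMA_OF.get(token, token)  # fall back to the surface form
        label = GAZETTEER.get(lemma)
        if label is not None:
            hits.append((token, lemma, label))
    return hits


sentence = "Caesarem in Galliam profectum esse dicunt".split()
print(tag_tokens(sentence))
```

Neither "Caesarem" nor "Galliam" appears in the gazetteer as written, yet both are tagged because the lookup happens on the lemma. A gazetteer keyed by surface form would need a separate entry for every case ending.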

How much annotated data do I really need?

For fine-tuning a multilingual base model on a single language and entity type, a few hundred well-chosen annotated sentences move the needle, and one to two thousand gives a usable model. Below a couple of hundred examples, a starved model typically underperforms a good gazetteer plus morphological rules — so spend that early effort on a name list and lemmatiser instead of labelling too little data for too weak a model.

Should I combine rules and machine learning?

Almost always, yes — and this is the practical heart of the answer. Build a hybrid:

A gazetteer matches known places and recurring persons with near-perfect precision.
A lemmatiser normalises inflection so the gazetteer hits all case forms.
A trained or fine-tuned model finds the novel and contextual entities the list cannot.
A merge step unions the outputs, preferring the gazetteer where both fire.

Each component covers the other's blind spot, and you can ship the gazetteer layer immediately while the model trains.
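The merge step can be sketched in a few lines. This version assumes both layers emit token-offset spans and resolves any overlap in favour of the gazetteer; the `Span` type and the example spans are hypothetical, and a real pipeline might also want to compare labels or confidence scores.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # index of the first token
    end: int    # index one past the last token
    label: str


def overlaps(a: Span, b: Span) -> bool:
    """True when the two half-open token ranges share at least one token."""
    return a.start < b.end and b.start < a.end


def merge_spans(gazetteer_spans, model_spans):
    """Union both layers, dropping model spans that collide with a gazetteer span."""
    merged = list(gazetteer_spans)
    for span in model_spans:
        if not any(overlaps(span, g) for g in gazetteer_spans):
            merged.append(span)
    return sorted(merged)


# Hypothetical outputs for one sentence.
gazetteer_hits = [Span(0, 3, "PERSON")]
model_hits = [Span(2, 3, "PLACE"), Span(4, 5, "PLACE")]

merged = merge_spans(gazetteer_hits, model_hits)
print(merged)
```

Here the model's span over token 2 is discarded because the gazetteer already covers tokens 0 to 2, while the model's span over token 4 survives as a novel entity.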

How do I know whether it was worth it?

Measure against a held-out, hand-annotated test set in the actual language. Report precision, recall, and F1 per entity type — aggregate numbers hide that places may score 0.9 while persons sit at 0.6. If the model's F1 does not clear what your gazetteer alone achieves, the machine-learning layer is not earning its complexity, and you should drop it.
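Per-type scoring needs no special tooling. The sketch below assumes exact-match evaluation over sets of `(doc_id, start, end, label)` tuples; that tuple layout and the toy gold and predicted sets are illustrative assumptions, and partial-match schemes would score differently.

```python
from collections import Counter


def per_type_scores(gold, predicted):
    """Exact-match precision, recall, and F1 for each entity label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for item in predicted:
        label = item[-1]
        if item in gold:
            tp[label] += 1
        else:
            fp[label] += 1
    for item in gold:
        if item not in predicted:
            fn[item[-1]] += 1

    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        found = tp[label] + fp[label]      # everything the system predicted
        expected = tp[label] + fn[label]   # everything the annotators marked
        precision = tp[label] / found if found else 0.0
        recall = tp[label] / expected if expected else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy data: places are all correct, one person span has a wrong boundary.
gold = {("d1", 0, 1, "PLACE"), ("d1", 3, 4, "PLACE"),
        ("d1", 5, 7, "PERSON"), ("d1", 9, 10, "PERSON")}
predicted = {("d1", 0, 1, "PLACE"), ("d1", 3, 4, "PLACE"),
             ("d1", 5, 7, "PERSON"), ("d1", 8, 10, "PERSON")}

scores = per_type_scores(gold, predicted)
for label, metrics in sorted(scores.items()):
    print(label, metrics)
```

Running the same function once on the gazetteer's output and once on the hybrid's output, against the same gold set, gives the side-by-side comparison the decision depends on.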

Key Takeaways

Use NER on old languages only for large corpora where entities are the unit of analysis.
Off-the-shelf modern models usually fail on archaic spelling and morphology.
Lemmatise first so inflected names collapse to a matchable anchor.
A few hundred annotations help; under that, prefer gazetteers and rules.
Noisy OCR/HTR caps performance — fix transcription before blaming the model.
Hybrid rule-plus-model pipelines beat either approach alone.
Justify the model with per-type F1 against a hand-labelled test set.

Frequently Asked Questions

Is NER worth running on Latin or Middle English sources?

It is worth it when you have enough text that hand-tagging is impractical and when entities are the unit of analysis. For a single short charter, manual annotation is faster and more accurate than building a pipeline.

Do off-the-shelf NER models work on old languages?

Rarely well. Models trained on modern news collapse on archaic spelling and grammar. You usually need a model trained on historical data, such as those from CLTK or a fine-tuned multilingual transformer.

How much annotated data do I need to fine-tune?

A few hundred to a couple of thousand annotated sentences can lift a multilingual base model meaningfully for a single entity type. Below that, rule-based or gazetteer methods often beat a starved model.

When should I avoid NER entirely?

Avoid it when the corpus is tiny, when transcription quality is poor enough that text is unreliable, or when you need exhaustive precision on a closed set — a gazetteer or close reading will serve better.

Does OCR or HTR quality change the decision?

Heavily. NER degrades fast on noisy text. If your character error rate is above roughly 10 percent, fix transcription first or expect recall to suffer regardless of the model.
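Character error rate is straightforward to estimate on a small hand-corrected sample. This sketch assumes CER is defined as Levenshtein edit distance divided by the length of the reference transcription; the function names and the example strings are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "in Galliam profectus est"   # hand-corrected line
hypothesis = "in Galliarn profectus est"  # OCR read "m" as "rn"

cer = character_error_rate(reference, hypothesis)
print(f"CER: {cer:.3f}")
```

The single "m" to "rn" confusion costs two edits over 24 characters, a CER of about 8 percent, and it lands inside the place name, which is exactly where NER is most sensitive.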

Which is better for old languages: rules or machine learning?

It depends on the entity. Places and known persons suit gazetteers; novel or contextual entities suit a trained model. Hybrid pipelines that combine both consistently outperform either alone.
