POS Tagging Historical Languages: A Practical Guide

To POS tag historical languages, pick a tagset (Universal Dependencies is the safe default), normalise spelling so a modern or fine-tuned model can recognise word forms, then run a tagger trained on a historical treebank rather than a modern-only model. The order matters: normalising spelling before tagging can be what separates a result of roughly 75 percent from one of roughly 93 percent.

What makes historical POS tagging hard?

Three things break modern taggers. Spelling varies wildly, so the model sees out-of-vocabulary tokens. Grammar differs: early Middle English, for example, retains case marking and verb-second order that modern training data never shows. And OCR or HTR noise adds garbage tokens. Each compounds the next, so a clean pipeline beats a clever model.
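One way to see the spelling problem before tagging anything is to measure the out-of-vocabulary rate of a sample. The sketch below is a minimal illustration: the function name `oov_rate`, the toy word list, and the example tokens are all assumptions, and in practice the vocabulary would come from the tagger's training data.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of tokens whose lowercased form is not in vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok.lower() not in vocabulary)
    return unknown / len(tokens)


# Toy modern word list (an assumption, for illustration only).
modern_vocab = {"the", "knight", "loved", "chivalry", "truth", "and", "honour"}

original = ["The", "knyght", "loved", "chivalrie", "trouthe", "and", "honour"]
normalised = ["The", "knight", "loved", "chivalry", "truth", "and", "honour"]

print(f"original:   {oov_rate(original, modern_vocab):.2f}")
print(f"normalised: {oov_rate(normalised, modern_vocab):.2f}")
```

A high rate on the original spelling that falls sharply after normalisation suggests that spelling, rather than grammar or OCR noise, is the main obstacle for that text.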

Which tagset should I choose?

Universal Dependencies (UPOS): 17 coarse tags, comparable across languages and periods. Best default.
Penn-Helsinki (PPCME2): designed for historical English, with richer morphological distinctions.
STTS / HiTS / HiNTS: modern German (STTS) and historical German (HiTS; HiNTS for Middle Low German).

Choose one your scholarly community already reads. If you must use a niche scheme, also publish a UD mapping so others can reuse your data.
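A published mapping can be as simple as a lookup table plus a check for tags it does not cover. The sketch below assumes a handful of Penn-Helsinki-style tags and maps them to UPOS. The table is partial and illustrative, not an official conversion, and the names `PPCME2_TO_UPOS`, `to_upos`, and `unmapped` are made up for this example.

```python
# Partial, illustrative mapping from Penn-Helsinki (PPCME2-style) tags to UPOS.
# Assumption: this is a sketch, not an official or complete conversion table.
PPCME2_TO_UPOS = {
    "N": "NOUN", "NS": "NOUN", "NPR": "PROPN",
    "ADJ": "ADJ", "ADV": "ADV", "P": "ADP", "D": "DET",
    "PRO": "PRON", "CONJ": "CCONJ", "NUM": "NUM",
    "VBP": "VERB", "VBD": "VERB", "MD": "AUX",
}


def to_upos(tag, mapping=PPCME2_TO_UPOS, fallback="X"):
    """Map one source tag to UPOS, returning the fallback for unmapped tags."""
    return mapping.get(tag, fallback)


def unmapped(tags, mapping=PPCME2_TO_UPOS):
    """List the distinct tags the mapping does not cover, for manual review."""
    return sorted(set(tags) - set(mapping))


tags = ["D", "N", "VBD", "VAN"]
print([to_upos(t) for t in tags])
print(unmapped(tags))
```

Reporting the unmapped tags, rather than silently assigning the fallback, keeps the gaps in the mapping visible to anyone reusing the data.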

A working end-to-end pipeline

python

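import spacy
from spacy.tokens import Doc

# Modern model shown here; swap in a fine-tuned historical model if you have one.
nlp = spacy.load("en_core_web_trf")

def tag(tokens_with_norm):
    """Tag the normalised forms, but return the original surface forms."""
    words = [t["norm"] for t in tokens_with_norm]
    # Build the Doc from pre-tokenised words so tokens stay aligned with the input.
    doc = Doc(nlp.vocab, words=words)
    # Run every component in order: the tagger needs the transformer's output,
    # and pos_ is filled in by the attribute ruler, not by the tagger itself.
    for _, component in nlp.pipeline:
        doc = component(doc)
    return [(t["orig"], tk.tag_, tk.pos_)
            for t, tk in zip(tokens_with_norm, doc)]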

Feeding the normalised form to the tagger but reporting the original surface form gives you the best of both: model accuracy plus a faithful transcript.
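The list of dictionaries with "orig" and "norm" keys has to come from somewhere. A minimal sketch of a normaliser that produces it, assuming a small lookup table backed by a couple of rewrite rules, is shown below. The lexicon entries, the rules, and the function names are hypothetical, and the rules are deliberately crude; a real project would use a curated lexicon or a trained normalisation model.

```python
import re

# Hypothetical lookup table and rewrite rules, for illustration only.
LEXICON = {"knyght": "knight", "trouthe": "truth", "hadde": "had"}
RULES = [
    (re.compile(r"^v(?=[^aeiou])"), "u"),  # initial v before a consonant: vnder -> under
    (re.compile(r"ie$"), "y"),             # final -ie: chivalrie -> chivalry
]


def normalise(token):
    """Return a modernised spelling: lexicon lookup first, then rewrite rules."""
    lower = token.lower()
    if lower in LEXICON:
        norm = LEXICON[lower]
    else:
        norm = lower
        for pattern, replacement in RULES:
            norm = pattern.sub(replacement, norm)
    # Restore an initial capital so sentence starts and names keep their case.
    if token[:1].isupper():
        norm = norm[:1].upper() + norm[1:]
    return norm


def with_norm(tokens):
    """Pair each original token with its normalised form."""
    return [{"orig": t, "norm": normalise(t)} for t in tokens]


print(with_norm(["Vnder", "the", "knyght", "chivalrie"]))
```

Keeping both forms in one record means the alignment between transcript and tagger input never has to be reconstructed afterwards.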

How do I adapt a model when normalisation is not enough?

For heavily inflected or syntactically distinct languages, normalisation alone leaves you short. Fine-tune. The smallest viable recipe:

bash

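# convert CoNLL-U to spaCy's binary format first (the .conllu names are assumed)
python -m spacy convert ./me_treebank-train.conllu ./ --converter conllu
python -m spacy convert ./me_treebank-dev.conllu ./ --converter conllu
# fine-tune a transformer tagger on the converted treebank
python -m spacy train config.cfg \
  --paths.train ./me_treebank-train.spacy \
  --paths.dev ./me_treebank-dev.spacy \
  --gpu-id 0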

Even 3,000 to 5,000 hand-tagged tokens lift accuracy noticeably; 50,000-plus gets you to publishable quality.
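Before training, it helps to know how many tokens you actually have and to split them by sentence into training and development sets. The sketch below reads CoNLL-U text with the standard library only and keeps just the word form and UPOS columns. The function names, the `row` helper, and the sample sentences are assumptions made for this example.

```python
import random


def read_conllu_sentences(text):
    """Split CoNLL-U text into sentences, each a list of (form, upos) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (1-2) and empty nodes (1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append((cols[1], cols[3]))
    if current:
        sentences.append(current)
    return sentences


def split_train_dev(sentences, dev_fraction=0.1, seed=0):
    """Shuffle whole sentences reproducibly and return (train, dev)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def row(index, form, upos):
    """Build one 10-column CoNLL-U line, leaving unused columns as '_'."""
    return "\t".join([str(index), form, "_", upos] + ["_"] * 6)


sample = "\n".join([
    "# sent_id = 1",
    row(1, "The", "DET"), row(2, "knyght", "NOUN"), row(3, "rood", "VERB"),
    "",
    "# sent_id = 2",
    row(1, "He", "PRON"), row(2, "slepte", "VERB"),
])

sentences = read_conllu_sentences(sample)
train, dev = split_train_dev(sentences, dev_fraction=0.5)
print(len(sentences), "sentences;", sum(len(s) for s in train), "training tokens")
```

Splitting by sentence rather than by token keeps each sentence intact on one side of the split, so the development score is not inflated by context the model has already seen.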

Should I use rule-based, statistical, or neural?

Approach	Data needed	Accuracy ceiling	Best for
Rule-based (hand-written rules)	lexicon	medium	well-described languages
CRF / averaged perceptron / TreeTagger	small treebank	high	low-resource, CPU-only
Fine-tuned transformer	medium treebank	highest	when GPU and data exist

Do not jump straight to a transformer. A CRF on normalised text is often within a few points and far cheaper to retrain when your tagset changes.
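Much of a CRF tagger's strength comes from hand-designed features, which is also why it copes with small treebanks. The sketch below shows one plausible feature function over normalised tokens. The feature names and the choice of features are assumptions, and the one-dictionary-per-token output is the shape that CRF toolkits such as sklearn-crfsuite accept.

```python
def token_features(sentence, i):
    """Build a feature dict for the token at position i of a list of word forms."""
    word = sentence[i]
    lower = word.lower()
    feats = {
        "bias": 1.0,
        "lower": lower,
        "suffix2": lower[-2:],  # inflectional endings carry much of the signal
        "suffix3": lower[-3:],
        "prefix2": lower[:2],
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
    }
    if i > 0:
        feats["prev_lower"] = sentence[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        feats["next_lower"] = sentence[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(sentence):
    """Return one feature dict per token in the sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


for feats in sentence_features(["The", "knight", "rode"]):
    print(feats)
```

Because the features are independent of the label set, changing the tagset only means retraining on relabelled data, not redesigning the model.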

What about Latin, Greek and other classical languages?

Use the CLTK ecosystem and the Universal Dependencies treebanks (PROIEL, Perseus). These languages have mature treebanks, so you almost never train from scratch — you fine-tune or just run the existing model.

How do I check it actually works?

Hold out a test set the model never saw. Report per-tag accuracy, not just the headline number, and read the confusion matrix. Historical text systematically confuses, say, past participles with finite verbs; the aggregate score hides exactly the errors a historian would care about.
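Per-tag accuracy and a confusion count need nothing beyond the standard library. The sketch below assumes gold and predicted tags are already aligned token by token. The function name `per_tag_report` and the toy tag sequences are made up for illustration.

```python
from collections import Counter


def per_tag_report(gold, predicted):
    """Return (per-tag accuracy dict, Counter of (gold, predicted) errors)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    confusions = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    accuracy = {tag: correct[tag] / total[tag] for tag in total}
    return accuracy, confusions


# Toy data: one participle (VBN) mistaken for a finite past-tense verb (VBD).
gold = ["VBN", "VBN", "VBD", "N", "N", "D"]
predicted = ["VBD", "VBN", "VBD", "N", "N", "D"]

accuracy, confusions = per_tag_report(gold, predicted)
overall = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

print(f"overall: {overall:.2f}")
for tag, acc in sorted(accuracy.items(), key=lambda item: item[1]):
    print(f"{tag}: {acc:.2f}")
print(confusions.most_common(3))
```

In this toy case the overall score is five out of six, while the participle tag is right only half the time, which is the kind of gap the headline number conceals.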

Key Takeaways

Normalise spelling before tagging; it is the highest-leverage step.
Default to Universal Dependencies and publish a mapping for any niche tagset.
Feed the normalised form to the model but keep the original surface form in output.
A CRF on normalised text is often within a few points of a transformer and is cheaper to retrain.
Reuse existing treebanks (UD, PROIEL, PPCME2) before annotating anything new.
Evaluate per-tag with a confusion matrix, not aggregate accuracy alone.
Latin and Greek have strong CLTK/UD support; rarely train from scratch.

Frequently Asked Questions

Can I use a modern POS tagger on historical text directly?

You can, but accuracy drops sharply on unnormalised text, often from above 95 percent to the 70s or 80s. Normalise spelling first and you recover most of that gap without retraining anything.

Which tagset should I use for historical languages?

Use Universal Dependencies (UPOS) for cross-period comparability, or a period-specific scheme like the Penn-Helsinki tagset for Middle English if your community already uses it. Map between them rather than inventing your own.

How much annotated data do I need to train a historical tagger?

A few thousand manually tagged tokens can fine-tune a transformer to a usable level, and 50,000-plus tokens approaches publishable quality. Treebanks for many historical languages already exist, so check before annotating.

Do I need a GPU to POS tag historical text?

Not for rule-based or statistical taggers, which run fine on a CPU. A GPU helps if you fine-tune a transformer, but inference on a few thousand pages is still feasible on CPU overnight.

How do I evaluate a historical POS tagger?

Hold out a manually tagged test set the model never saw, report per-tag accuracy and a confusion matrix, and inspect the worst tags. Aggregate accuracy hides systematic errors on rare but important categories.
