Beginner's Guide to Lemmatising Historical Text

Lemmatisation reduces every inflected form of a word to its dictionary headword, so loved, loves and loving all collapse to love. For historical text the catch is that spelling was not standardised, so you almost always normalise variants like loue to love first, then lemmatise. This short guide walks from that idea to a small worked example you can run from a standing start.

What is lemmatisation, and how is it different from stemming?

Both techniques shrink word forms so related words count together, but they work differently. Stemming chops letters off the end with simple rules, often producing non-words: studies becomes studi. Lemmatisation consults a dictionary and the word's part of speech to return a genuine headword: studies becomes study. For historical scholarship, where you want interpretable counts, lemmatisation is almost always the better choice.
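The contrast can be sketched in a few lines of plain Python. This is a toy illustration, not how any real library works: the suffix rules, the tiny lexicon and the function names toy_stem and toy_lemmatise are all invented for the example.

```python
# Toy illustration only: real stemmers and lemmatisers use far larger rule sets and lexicons.

# Suffix rules for the toy stemmer, tried in order (hypothetical, not a real algorithm).
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("es", ""), ("ed", ""), ("s", "")]

# Tiny hand-made lexicon for the toy lemmatiser: (form, part of speech) -> headword.
LEXICON = {
    ("studies", "VERB"): "study",
    ("studies", "NOUN"): "study",
    ("loving", "VERB"): "love",
    ("loved", "VERB"): "love",
    ("loves", "VERB"): "love",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, whether or not the result is a real word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatise(word: str, pos: str) -> str:
    """Look the form up in the lexicon; fall back to the form itself if it is unknown."""
    return LEXICON.get((word, pos), word)


for form in ["studies", "loving", "loved", "loves"]:
    print(f"{form:>8}  stem: {toy_stem(form):<6} lemma: {toy_lemmatise(form, 'VERB')}")
```

The stemmer never checks whether its output is a word, which is why studi and lov appear. The lemmatiser can only return headwords it has been given, which is exactly the weakness that historical spellings expose.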

Why is historical text the hard case?

A modern lemmatiser knows that does is a form of do. It typically does not know that seventeenth-century doth is one too, because doth is not in its vocabulary. Historical material multiplies this problem: variant spellings (publick/public), obsolete inflections (-eth, -est), and scanning errors caused by the long s (ſ) all make familiar words look foreign. So the realistic pipeline has two stages, not one: normalise the spelling, then lemmatise.
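A rough way to see the problem is to check which tokens fall outside a modern vocabulary. The sketch below assumes a tiny hand-made word set (MODERN_VOCAB) as a stand-in for a real lemmatiser's lexicon, and unknown_forms is a hypothetical helper written for this example.

```python
# Hypothetical stand-in for a modern lemmatiser's vocabulary; a real one holds many thousands of forms.
MODERN_VOCAB = {"she", "loves", "love", "him", "and", "does", "do", "speak", "upon", "it", "public"}


def unknown_forms(text: str, vocab: set[str]) -> list[str]:
    """Return the tokens, in order, that the vocabulary does not contain."""
    return [w for w in text.lower().split() if w not in vocab]


modern = "She loves him and does speak upon it"
historical = "She loueth him and doth speake vpon it"

print(unknown_forms(modern, MODERN_VOCAB))      # every token is known
print(unknown_forms(historical, MODERN_VOCAB))  # the variant spellings and -eth forms are not
```

Half of the historical sentence is invisible to the modern word list, even though a human reader has no trouble with it.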

How do I lemmatise a small sample, step by step?

Here is a minimal early-modern-English example with spaCy. First, a tiny normalisation map stands in for a fuller variant list; then spaCy supplies the lemmas.

python

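import spacy

# Load the small modern-English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Tiny stand-in for a fuller variant list: historical spelling -> modern spelling
variants = {"loue": "love", "doth": "does", "hath": "has", "speake": "speak", "vpon": "upon"}

raw = "She loue him and doth speake vpon it"
# Lowercase, split on whitespace, and swap in the modern spelling where one is known
normalised = " ".join(variants.get(w, w) for w in raw.lower().split())

doc = nlp(normalised)
for token in doc:
    # Right-align the token text so the arrows line up
    print(f"{token.text:>10} -> {token.lemma_}")

You should see lines such as love -> love, does -> do and speak -> speak. The lesson is that the normalisation step is what lets the lemmatiser succeed: feed it loue or doth directly and a model trained on modern text will most likely hand the unfamiliar spelling back unchanged.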

Should I always normalise spelling first?

For the great majority of historical English and other vernaculars, yes. Most lemmatisers are trained on modern orthography and stumble on archaic forms. Normalising first, using a variant dictionary or a tool like VARD, lets the lemmatiser do its job. Crucially, keep the original text in a parallel column so you never discard the surface form, which may itself be your evidence later.
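One way to keep the surface form is to store one row per token, with the original and the normalised spelling side by side. This is a minimal sketch using Python's standard csv module; the variant map, the column names and the file name mentioned in the comment are assumptions made for illustration.

```python
import csv
import io

# Hypothetical variant map, as in a small normalisation dictionary.
variants = {"loue": "love", "doth": "does", "vpon": "upon"}

raw = "She loue him and doth speake vpon it"

# One row per token: position, original surface form, normalised form.
rows = [
    {"position": i, "original": w, "normalised": variants.get(w.lower(), w.lower())}
    for i, w in enumerate(raw.split())
]

# Written to an in-memory buffer here; swap in open("tokens.csv", "w", newline="") to save a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["position", "original", "normalised"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because the original column is never overwritten, you can add a lemma column later and still go back to the spelling the source actually used.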

Which tools fit which languages?

Language	Tool	Note
Early modern English	spaCy + normalisation	normalise variants before lemmatising
Latin	CLTK	dedicated classical lemmatiser
Ancient Greek	CLTK	handles rich inflection
Middle High German	period models / RNN taggers	generic modern tools fail

The recurring rule is that a model trained on, or adapted to, the right period and language usually outperforms a general modern one.

When should I not lemmatise at all?

Lemmatisation throws away inflection on purpose, which is sometimes the wrong move. If you are studying rhyme, metre, exact spelling variation, or the evolution of a particular form, lemmatising erases the very thing you care about. Lemmatise when your question is about meaning and frequency ("how often does the concept of love appear"), and leave the text alone when your question is about surface form.
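The trade-off shows up as soon as you count. The sketch below uses a handful of invented (surface form, lemma) pairs, not real corpus data, and tallies them both ways with collections.Counter.

```python
from collections import Counter

# Hypothetical (surface form, lemma) pairs, as a lemmatiser might return them.
tagged = [("loue", "love"), ("love", "love"), ("loueth", "love"),
          ("loue", "love"), ("loved", "love")]

surface_counts = Counter(form for form, _ in tagged)
lemma_counts = Counter(lemma for _, lemma in tagged)

print(surface_counts)  # spelling and inflection are still visible
print(lemma_counts)    # everything has collapsed into one headword
```

The lemma count answers the frequency question in one number. The surface count is the only one of the two that can still tell you how often loue was preferred over love.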

How do I check whether my lemmatisation is any good?

Spot-check, do not trust silently. Take fifty random tokens, look at the lemma assigned, and tally how many are correct. If accuracy is poor, the usual fixes are: expand your normalisation dictionary, supply part-of-speech tags so the lemmatiser can tell leaves (verb, from leave) from leaves (noun, from leaf), or switch to a period-appropriate model. A five-minute manual audit catches systematic errors that would otherwise silently distort every downstream count.
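The audit itself needs very little code. The sketch below assumes your pipeline's output is available as (token, lemma) pairs; the pairs and the judged_wrong set are invented here to stand in for real output and for the judgements you would make by eye.

```python
import random

# Hypothetical (token, assigned lemma) pairs standing in for your pipeline's output.
pairs = [("does", "do"), ("loves", "love"), ("speake", "speake"), ("upon", "upon"),
         ("has", "have"), ("studies", "study"), ("publick", "publick"), ("it", "it")]

random.seed(0)                     # fixed seed so the same sample can be re-audited
sample_size = min(50, len(pairs))  # fifty tokens, or all of them if the text is shorter
sample = random.sample(pairs, sample_size)

# In practice you would mark each pair by eye; this set stands in for those manual judgements.
judged_wrong = {("speake", "speake"), ("publick", "publick")}

correct = sum(1 for pair in sample if pair not in judged_wrong)
accuracy = correct / sample_size
print(f"{correct}/{sample_size} correct ({accuracy:.0%})")
```

Here both errors are unnormalised spellings passed through unchanged, which points to the normalisation dictionary rather than the lemmatiser as the thing to fix.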

Key Takeaways

Lemmatisation maps inflected forms to a real dictionary headword; stemming just chops endings.
Historical spelling variation means you normalise first, then lemmatise.
A small variant dictionary plus spaCy handles much early modern English.
Always keep the original surface form beside the normalised text.
Use period- and language-specific tools (CLTK for Latin/Greek) over generic modern ones.
Do not lemmatise when surface form, rhyme, or exact spelling is your evidence.

Frequently Asked Questions

What is lemmatisation in plain terms?

Lemmatisation reduces an inflected word form to its dictionary headword, so 'running', 'ran' and 'runs' all become 'run'. Unlike stemming, it returns a real word and uses knowledge of grammar rather than chopping off endings.

Why is lemmatising historical text harder than modern text?

Historical spelling is unstandardised, so 'loue' and 'love' or 'doth' and 'does' look like different words to a modern lemmatiser. You usually normalise spelling first, or use a tool trained on the relevant period and language.

Should I normalise spelling before lemmatising?

Usually yes. Most lemmatisers expect modern orthography, so mapping historical variants to a standard form first dramatically improves accuracy. Keep the original alongside the normalised version so nothing is lost.

What is the difference between stemming and lemmatisation?

Stemming crudely strips suffixes and can produce non-words like 'studi'. Lemmatisation uses a dictionary and part-of-speech information to return a valid headword, which is what most historical analysis needs.

Which tools lemmatise historical English or Latin?

For English, spaCy combined with a normalisation step works for early modern text; for Latin and other classical languages, CLTK provides dedicated lemmatisers. Period-specific models usually beat a generic modern one.

Do I always need to lemmatise?

No. If you study exact spellings, rhyme, or surface variation, lemmatising would destroy your evidence. Lemmatise when you want to count meanings rather than forms, such as for topic or frequency analysis.
