When to Lemmatise Latin and Greek

Lemmatise Latin and Greek when your goal is to search, count, or compare concepts across inflected text. A single lemma such as amare surfaces in well over a hundred forms, and without lemmatisation your frequency counts and concordances scatter across all of them. Do not lemmatise when the morphology itself is your research question, or when accuracy on your specific corpus is too low to trust. The decision is about your question, not the technology.

Why does inflection make this matter so much?

Latin and Greek are richly inflected: one verb can have well over a hundred forms. Search for rex and you miss regem, regis, regi, regum. For any analysis that treats words as concepts — topic counts, keyword-in-context, collocations — surface forms fragment the signal. Lemmatisation collapses that variation back to the headword.
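To make the fragmentation concrete, here is a minimal sketch in plain Python. The small `LEMMAS` table is hand-written for illustration and is an assumption of this example; it is not taken from CLTK or any real lexicon.

```python
from collections import Counter

# Hypothetical, hand-written form-to-lemma table (illustration only).
LEMMAS = {
    "rex": "rex", "regem": "rex", "regis": "rex", "regi": "rex", "regum": "rex",
}

tokens = "rex regem videt regis filius regum regi".split()

# Counting surface forms: every inflected form is its own entry.
surface_counts = Counter(tokens)

# Counting lemmas: known forms collapse onto their headword,
# unknown tokens fall back to themselves.
lemma_counts = Counter(LEMMAS.get(t, t) for t in tokens)

print(surface_counts["rex"])  # 1
print(lemma_counts["rex"])    # 5
```

A search for rex on the raw tokens finds one hit; the same search on lemmas finds all five.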

When is lemmatisation the wrong choice?

Skip it when:

Morphology is the object of study. Case, mood, and aspect are your data.
You study an author's specific word choice. Collapsing forms erases the rhetorical point.
Accuracy is too low. On fragmentary or medieval text, a poor lemmatiser introduces more error than it removes.
You only need exact-phrase search. Then a search index with morphological expansion may serve better than rewriting the text.
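The last option, searching with morphological expansion, can be sketched as follows. The `FORMS` table and the `expand_query` helper are hypothetical names invented for this example; a real index would draw its forms from a full morphological lexicon.

```python
import re

# Hypothetical expansion table: lemma -> known surface forms (incomplete).
FORMS = {
    "rex": ["rex", "regis", "regi", "regem", "rege", "reges", "regum", "regibus"],
}

def expand_query(lemma: str) -> re.Pattern:
    """Build a whole-word, case-insensitive pattern matching any listed form."""
    forms = FORMS.get(lemma, [lemma])
    alternatives = "|".join(re.escape(f) for f in forms)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

text = "Regem populus laudat; regis filius regnat."
hits = [m.group(0) for m in expand_query("rex").finditer(text)]
print(hits)  # ['Regem', 'regis']
```

The text itself is never rewritten: the query is expanded, so the original wording stays intact. The word boundaries keep regnat from matching.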

How accurate is it, really?

Corpus type	Expected CLTK accuracy	Note
Well-edited classical	high 80s to low 90s	best case
Medieval Latin	lower	vocabulary and spelling drift
Neo-Latin	variable	new coinages
OCR-noisy	poor	normalise first

Always measure on a sample of your texts. The headline accuracy from a benchmark on Caesar tells you little about a 13th-century cartulary.
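One way to run such a check is to hand-correct a small sample and compare it with the lemmatiser's output. This is a minimal sketch; the function name `lemma_accuracy` and the sample data are assumptions made up for the example, not real measurements.

```python
def lemma_accuracy(gold: list[tuple[str, str]], predicted: list[str]) -> float:
    """Share of tokens whose predicted lemma equals the hand-checked one."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align token by token")
    if not gold:
        raise ValueError("empty sample")
    correct = sum(1 for (_, g), p in zip(gold, predicted) if g.lower() == p.lower())
    return correct / len(gold)

# Hypothetical hand-checked sample: (surface form, correct lemma).
gold_sample = [
    ("regem", "rex"),
    ("gracia", "gratia"),
    ("partes", "pars"),
    ("divisa", "divido"),
]
# Hypothetical lemmatiser output for the same four tokens.
predicted = ["rex", "gracia", "pars", "divido"]

acc = lemma_accuracy(gold_sample, predicted)
print(f"{acc:.2f}")  # 0.75
```

In practice the sample should be far larger than four tokens and drawn from the texts you actually intend to analyse.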

How do I lemmatise with CLTK?

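```python
from cltk import NLP

# Load the Latin pipeline.
nlp_lat = NLP(language="lat")

# Analyse a sentence; the result holds one entry per word.
doc = nlp_lat.analyze(text="Gallia est omnis divisa in partes tres")

# Print each surface form with its lemma and universal POS tag.
for w in doc.words:
    print(w.string, "->", w.lemma, w.upos)
```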

CLTK chains a backoff lemmatiser (dictionary, then rules, then model). For Greek, pass language="grc" instead of "lat". The morphological analysis the pipeline produces also helps disambiguation, which leads to the next point.

Should I tag before I lemmatise?

Usually yes. Identical surface forms can map to different lemmas, and only context decides: legi, for example, may be the dative of lex or a form of lego. Running POS or full morphological analysis first lets the lemmatiser pick the right headword instead of guessing. The CLTK pipeline does this for you, but if you assemble your own, keep that order: analyse, then lemmatise.
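A toy sketch of why the tag has to come first: the form legi can belong to lex (as a noun) or to lego (as a verb), so the lookup below is keyed by form and POS tag together. The table and the `lemmatise` function are hypothetical and stand in for a real lexicon.

```python
# Hypothetical lookup keyed by (surface form, POS tag); entries are illustrative.
LEMMA_BY_FORM_AND_POS = {
    ("legi", "NOUN"): "lex",
    ("legi", "VERB"): "lego",
}

def lemmatise(form: str, upos: str) -> str:
    """Return the lemma for a tagged token, falling back to the form itself."""
    key = (form.lower(), upos)
    return LEMMA_BY_FORM_AND_POS.get(key, form.lower())

print(lemmatise("legi", "NOUN"))  # lex
print(lemmatise("legi", "VERB"))  # lego
```

Without the tag, the lookup would have to guess between the two headwords.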

What about medieval and Neo-Latin?

Classical lexicons miss medieval spellings (ci for ti, e for ae) and later vocabulary. Two practical fixes: normalise spelling toward classical forms before lemmatising, or extend the lexicon with a medieval wordlist, such as one derived from the Dictionary of Medieval Latin from British Sources (DMLBS). Expect to validate more carefully here.
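The normalisation step can be sketched as candidate generation checked against a lexicon. This is one possible approach under stated assumptions, not how CLTK does it: `CLASSICAL_FORMS` is a tiny hypothetical wordlist, and only the two substitutions named above are tried, with ae restored for a single e at a time.

```python
import re

# Hypothetical classical wordlist; a real one would come from your lexicon.
CLASSICAL_FORMS = {"gratia", "caelum", "praemium", "ecclesiae"}

def candidates(form: str) -> set[str]:
    """Generate classical-looking spellings for a medieval form."""
    out = {form}
    # ci before a vowel -> ti (gracia -> gratia)
    out |= {re.sub(r"ci(?=[aeiou])", "ti", f) for f in set(out)}
    # Restore ae for each single e in turn (celum -> caelum)
    for f in set(out):
        for m in re.finditer("e", f):
            out.add(f[:m.start()] + "ae" + f[m.end():])
    return out

def normalise(form: str) -> str:
    """Return the one lexicon spelling that matches, else the form unchanged."""
    form = form.lower()
    if form in CLASSICAL_FORMS:
        return form
    matches = sorted(candidates(form) & CLASSICAL_FORMS)
    return matches[0] if len(matches) == 1 else form

print(normalise("gracia"))  # gratia
print(normalise("celum"))   # caelum
print(normalise("rex"))     # rex
```

Checking candidates against the lexicon matters because e for ae cannot be undone by rule alone: most instances of e are genuine. When several candidates match, the sketch leaves the form untouched rather than guess.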

What is the cost-benefit summary?

Lemmatisation costs you compute time, some loss of accuracy, and an irreversible loss of information if you overwrite the text. It buys you coherent counts, comparable concordances, and far better recall. Keep both the surface form and the lemma in your data so that nothing is lost for good.
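Keeping both layers can be as simple as one row per token. The sketch below uses only the standard library; the `Token` class, its field names, and the sample lemmas are assumptions chosen for illustration.

```python
import csv
import io
from dataclasses import dataclass, asdict

@dataclass
class Token:
    position: int  # index of the token in the original text
    surface: str   # form exactly as it appears in the source
    lemma: str     # headword assigned by the lemmatiser

# Hypothetical lemmatiser output for a short phrase.
tokens = [
    Token(0, "Gallia", "Gallia"),
    Token(1, "est", "sum"),
    Token(2, "omnis", "omnis"),
    Token(3, "divisa", "divido"),
]

# Write a tab-separated table with both layers side by side.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["position", "surface", "lemma"], delimiter="\t"
)
writer.writeheader()
writer.writerows(asdict(t) for t in tokens)
tsv = buffer.getvalue()

# The original wording can always be rebuilt from the surface column.
rebuilt = " ".join(t.surface for t in tokens)
print(rebuilt)  # Gallia est omnis divisa
```

Counts and concordances read the lemma column; anything that depends on the exact wording reads the surface column.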

Key Takeaways

Lemmatise for concept-level search and counting across inflected text.
Do not lemmatise when morphology or specific word choice is your subject.
Use lemmatisation, never stemming, for Latin and Greek.
Measure accuracy on your own corpus, not a published benchmark.
POS/morphological analysis before lemmatising resolves many ambiguities.
Medieval and Neo-Latin need normalisation and extended lexicons.
Always keep the surface form alongside the lemma so nothing is lost.

Frequently Asked Questions

When should I lemmatise Latin or Greek rather than keep surface forms?

Lemmatise when you are searching or counting concepts across heavily inflected text, because Latin and Greek nouns and verbs have many forms per lemma. Keep surface forms when morphology itself is your object of study.

How accurate is CLTK lemmatisation for classical languages?

CLTK's backoff and model-based lemmatisers reach accuracy roughly in the high 80s to low 90s (percent) on well-edited classical texts, but drop on medieval, fragmentary, or OCR-noisy material. Always spot-check on your own corpus.

Is lemmatisation the same as stemming for Latin and Greek?

No. Stemming chops affixes crudely and is nearly useless for these languages, while lemmatisation maps each form to its dictionary headword using morphological knowledge. For Latin and Greek, always lemmatise rather than stem.

Do I need to disambiguate before lemmatising?

Often yes, because identical surface forms can belong to different lemmas depending on context. POS tagging or morphological analysis before lemmatisation resolves many of these ambiguities.

Can I lemmatise medieval or Neo-Latin with the same tools?

Partly. Classical lexicons miss medieval spellings and vocabulary, so expect lower accuracy and supplement with a medieval wordlist or normalisation step before lemmatising.
