Lemmatise Latin and Greek when your goal is to search, count, or compare concepts across inflected text — because a single lemma like amare surfaces as dozens of forms, and without lemmatisation your frequency counts and concordances scatter across all of them. Do not lemmatise when the morphology itself is your research question, or when accuracy on your specific corpus is too low to trust. The decision is about your question, not the technology.
Why does inflection make this matter so much?
Latin and Greek are richly inflected: one verb can have well over a hundred forms. Search for rex and you miss regem, regis, regi, regum. For any analysis that treats words as concepts — topic counts, keyword-in-context, collocations — surface forms fragment the signal. Lemmatisation collapses that variation back to the headword.
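To make the fragmentation concrete, here is a minimal sketch that counts the same tokens twice, once by surface form and once by lemma. The form-to-lemma table `TOY_LEMMAS` and the helper `count_surface_and_lemma` are hypothetical, hand-written for illustration; in practice the lemmas would come from a lemmatiser.

```python
from collections import Counter

# Toy form-to-lemma table, hand-written for illustration only.
# A real project would take lemmas from a lemmatiser, not a dict.
TOY_LEMMAS = {
    "rex": "rex",
    "regem": "rex",
    "regis": "rex",
    "regi": "rex",
    "regum": "rex",
}

def count_surface_and_lemma(tokens):
    """Return (surface_counts, lemma_counts) for a list of lowercase tokens."""
    surface_counts = Counter(tokens)
    # Fall back to the token itself when the toy table has no entry.
    lemma_counts = Counter(TOY_LEMMAS.get(tok, tok) for tok in tokens)
    return surface_counts, lemma_counts

tokens = ["rex", "regem", "regis", "regi", "regum", "gallia"]
surface_counts, lemma_counts = count_surface_and_lemma(tokens)

print(surface_counts["rex"])  # occurrences of the exact form "rex"
print(lemma_counts["rex"])    # occurrences once all listed forms are collapsed
```

An exact-match count sees one occurrence of rex, while the lemma count sees all five, which is the gap that lemmatisation closes.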
When is lemmatisation the wrong choice?
Skip it when:
- Morphology is the object of study. Case, mood, and aspect are your data.
- You study an author's specific word choice. Collapsing forms erases the rhetorical point.
- Accuracy is too low. On fragmentary or medieval text, a poor lemmatiser can introduce more error than it removes.
- You only need exact-phrase search. Then a search index with morphological expansion may serve better than rewriting the text.
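The last option, expanding the query rather than rewriting the text, could look roughly like the sketch below. It assumes a hypothetical table `FORMS_BY_LEMMA` that lists the forms of each headword; the functions `build_index` and `search_lemma` are illustrative names, not part of any library.

```python
from collections import defaultdict

# Hypothetical lemma-to-forms table; a real one might come from a
# morphological generator or a lemmatised reference corpus.
FORMS_BY_LEMMA = {"rex": {"rex", "regem", "regis", "regi", "regum"}}

def build_index(tokens):
    """Map each lowercased surface form to the positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lower()].append(position)
    return index

def search_lemma(index, lemma):
    """Return sorted positions of every listed form of the lemma."""
    # Unknown lemmas fall back to an exact-form search.
    forms = FORMS_BY_LEMMA.get(lemma, {lemma})
    return sorted(pos for form in forms for pos in index.get(form, []))

text_tokens = "regem populus laudat quia regis filius rex erit".split()
index = build_index(text_tokens)
hits = search_lemma(index, "rex")
print([(pos, text_tokens[pos]) for pos in hits])
```

The text itself is never altered here: only the query is expanded, so every hit still points at the original surface form.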
How accurate is it, really?
| Corpus type | Expected CLTK accuracy | Note |
|---|---|---|
| Well-edited classical | high 80s to low 90s (percent) | best case |
| Medieval Latin | lower | vocabulary and spelling drift |
| Neo-Latin | variable | new coinages |
| OCR-noisy | poor | normalise first |
Always measure on a sample of your texts. The headline accuracy from a benchmark on Caesar tells you little about a 13th-century cartulary.
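One way to run that check is to hand-verify a small sample and score the lemmatiser against it. The sketch below is an assumption about how you might organise this: `evaluate_lemmas` and the stand-in predictor `toy_predict` are hypothetical, and it scores tokens one at a time, which ignores the sentence context a real pipeline would use.

```python
def evaluate_lemmas(gold_pairs, predict):
    """Score a lemmatiser against hand-checked (surface, lemma) pairs.

    gold_pairs: list of (surface_form, gold_lemma) tuples verified by hand.
    predict: any callable mapping a surface form to a lemma string.
    Returns (accuracy, errors), where errors lists (form, gold, guess).
    """
    if not gold_pairs:
        raise ValueError("need at least one gold pair")
    errors = []
    for form, gold in gold_pairs:
        guess = predict(form)
        if guess != gold:
            errors.append((form, gold, guess))
    accuracy = 1 - len(errors) / len(gold_pairs)
    return accuracy, errors

# Stand-in predictor for illustration; swap in your real lemmatiser.
toy_table = {"regem": "rex", "partes": "pars"}

def toy_predict(form):
    return toy_table.get(form, form)

gold = [
    ("regem", "rex"),
    ("partes", "pars"),
    ("gracia", "gratia"),  # medieval spelling the toy predictor misses
    ("omnis", "omnis"),
]
accuracy, errors = evaluate_lemmas(gold, toy_predict)
print(f"accuracy: {accuracy:.2f}")
print("errors:", errors)
```

Reading the error list is often more useful than the single number, because it shows whether the failures cluster around spelling variants, proper names, or genuinely ambiguous forms.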
How do I lemmatise with CLTK?
```python
from cltk import NLP

# Load the default Latin pipeline (models may be downloaded on first use).
nlp_lat = NLP(language="lat")
doc = nlp_lat.analyze(text="Gallia est omnis divisa in partes tres")

# Print each token with its lemma and universal POS tag.
for w in doc.words:
    print(w.string, "->", w.lemma, w.upos)
```

CLTK chains a backoff lemmatiser (dictionary, then rules, then model). For Greek, pass `language="grc"` instead. The morphological analysis it produces also helps disambiguation, which leads to the next point.
Should I tag before I lemmatise?
Usually yes. Identical surface forms can map to different lemmas — context decides. Running POS or full morphological analysis first lets the lemmatiser pick the right headword instead of guessing. The CLTK pipeline does this for you, but if you assemble your own, keep that order: analyse, then lemmatise.
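As a toy illustration of why the order matters, the sketch below looks up lemmas by form and POS tag together. The table `LEMMA_BY_FORM_AND_POS` and the function `lemmatise_tagged` are hypothetical and hand-written; a real lemmatiser would use a full lexicon and a trained tagger.

```python
# Toy entries keyed by (surface form, UPOS tag), hand-written for illustration.
LEMMA_BY_FORM_AND_POS = {
    ("regi", "NOUN"): "rex",   # dative singular of rex
    ("regi", "VERB"): "rego",  # present passive infinitive of rego
    ("amor", "NOUN"): "amor",  # the noun "love"
    ("amor", "VERB"): "amo",   # first person passive, "I am loved"
}

def lemmatise_tagged(form, upos):
    """Pick a lemma using the tag; return None when the pair is unknown."""
    return LEMMA_BY_FORM_AND_POS.get((form.lower(), upos))

print(lemmatise_tagged("regi", "NOUN"))
print(lemmatise_tagged("regi", "VERB"))
```

Without the tag, a lookup on regi alone has to guess between rex and rego; with it, the choice is determined by the analysis step that ran first.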
What about medieval and Neo-Latin?
Classical lexicons miss medieval spellings (ci for ti, e for ae) and later vocabulary. Two practical fixes: normalise spelling toward classical forms before lemmatising, or extend the lexicon with a medieval wordlist, such as those derived from the Dictionary of Medieval Latin from British Sources (DMLBS) tradition. Expect to validate more carefully here.
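The normalisation route can be sketched as follows. Blind substitution is risky (rewriting every ci before a vowel would turn facio into fatio), so this version only accepts a respelling that appears in a classical wordlist. The set `CLASSICAL_FORMS` and the functions `candidate_spellings` and `normalise` are hypothetical stand-ins, and the two rules cover only the spellings mentioned above.

```python
import re

# Hypothetical stand-in for a classical wordlist; use a real lexicon in practice.
CLASSICAL_FORMS = {"gratia", "caelum", "facio", "praeda", "bene"}

def candidate_spellings(form):
    """Yield the form itself, then variants undoing two medieval spellings."""
    yield form
    # ci before a vowel may stand for classical ti (gracia -> gratia).
    yield re.sub(r"ci(?=[aeiou])", "ti", form)
    # e may stand for classical ae; try restoring one e at a time.
    for match in re.finditer("e", form):
        yield form[:match.start()] + "ae" + form[match.end():]

def normalise(form):
    """Return the first candidate found in the wordlist, else the form unchanged."""
    form = form.lower()
    for candidate in candidate_spellings(form):
        if candidate in CLASSICAL_FORMS:
            return candidate
    return form

for word in ["gracia", "celum", "preda", "facio", "bene"]:
    print(word, "->", normalise(word))
```

Because the unchanged form is tried first, words that are already classical, such as facio and bene, pass through untouched.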
What is the cost-benefit summary?
Lemmatisation costs you compute time, some accuracy, and, if you overwrite the text, information you cannot recover. It buys you coherent counts, comparable concordances, and far better recall. Keep both the surface form and the lemma in your data so that nothing is lost for good.
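A minimal sketch of storing both layers side by side is shown below. The record layout (`TokenRecord` with position, surface, and lemma) and the helper `to_tsv` are assumptions for illustration, and the lemmas are hand-written rather than produced by CLTK.

```python
import csv
import io
from dataclasses import astuple, dataclass

@dataclass(frozen=True)
class TokenRecord:
    position: int  # token offset in the text
    surface: str   # form exactly as it appears in the edition
    lemma: str     # headword assigned by the lemmatiser

def to_tsv(records):
    """Serialise records as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["position", "surface", "lemma"])
    writer.writerows(astuple(record) for record in records)
    return buffer.getvalue()

# Hand-written lemmas for illustration, not lemmatiser output.
records = [
    TokenRecord(0, "Gallia", "Gallia"),
    TokenRecord(1, "est", "sum"),
    TokenRecord(2, "divisa", "divido"),
]
tsv_text = to_tsv(records)
print(tsv_text)

# The original wording can always be rebuilt from the surface column.
original = " ".join(record.surface for record in records)
```

With this layout you count and search on the lemma column while quoting and close reading from the surface column.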
Key Takeaways
- Lemmatise for concept-level search and counting across inflected text.
- Do not lemmatise when morphology or specific word choice is your subject.
- Use lemmatisation, never stemming, for Latin and Greek.
- Measure accuracy on your own corpus, not a published benchmark.
- POS/morphological analysis before lemmatising resolves many ambiguities.
- Medieval and Neo-Latin need normalisation and extended lexicons.
- Always keep the surface form alongside the lemma so nothing is lost.
Frequently Asked Questions
When should I lemmatise Latin or Greek rather than keep surface forms?
Lemmatise when you are searching or counting concepts across heavily inflected text, because Latin and Greek nouns and verbs have many forms per lemma. Keep surface forms when morphology itself is your object of study.
How accurate is CLTK lemmatisation for classical languages?
CLTK's backoff and model-based lemmatisers reach roughly high-80s to low-90s percent accuracy on well-edited classical texts, but accuracy drops on medieval, fragmentary, or OCR-noisy material. Always spot-check on your own corpus.
Is lemmatisation the same as stemming for Latin and Greek?
No. Stemming chops affixes crudely and is nearly useless for these languages, while lemmatisation maps each form to its dictionary headword using morphological knowledge. For Latin and Greek, always lemmatise rather than stem.
Do I need to disambiguate before lemmatising?
Often yes, because identical surface forms can belong to different lemmas depending on context. POS tagging or morphological analysis before lemmatisation resolves many of these ambiguities.
Can I lemmatise medieval or Neo-Latin with the same tools?
Partly. Classical lexicons miss medieval spellings and vocabulary, so expect lower accuracy and supplement with a medieval wordlist or normalisation step before lemmatising.