Topic Modelling a Historical Corpus: A Practical Guide

To topic model a historical corpus, build a clean document-term matrix, run LDA (MALLET or Gensim) across several topic counts, score the runs with coherence, then validate the winning model by reading the top documents per topic. Budget more time for vocabulary cleaning and interpretation than for the modelling itself — the algorithm runs in minutes; making it trustworthy takes days.

Topic modelling discovers latent themes in a collection by grouping words that co-occur and expressing each document as a mixture of those themes. It is a workhorse of cultural analytics because it scales to thousands of documents and needs no labelled training data.

What does the end-to-end workflow look like?

Five stages, in order:

1. Assemble texts as one file per document with a metadata table.
2. Preprocess: tokenise, remove stopwords, optionally lemmatise.
3. Build a document-term matrix with frequency filtering.
4. Model across a range of topic counts.
5. Validate quantitatively and by close reading.

Do not skip the final stage, validation: a model that has been scored but never read closely is a common weakness in published historical topic models.
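The first stage, one file per document plus a metadata table, can be sketched as below. This is a minimal illustration, not a fixed convention: the function name, the doc_id column, and the <doc_id>.txt naming scheme are all assumptions made for the example.

```python
import csv
from pathlib import Path


def pair_texts_with_metadata(text_dir, metadata_csv, id_column="doc_id"):
    """Join each metadata row to its text file.

    Assumes (hypothetically) one UTF-8 file per document named
    <doc_id>.txt and a CSV with a header row.
    Returns (documents, missing_ids).
    """
    text_dir = Path(text_dir)
    documents, missing = [], []
    with open(metadata_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            path = text_dir / f"{row[id_column]}.txt"
            if path.exists():
                # keep the metadata and the text together in one record
                documents.append({**row, "text": path.read_text(encoding="utf-8")})
            else:
                # record rows that have no matching file instead of failing
                missing.append(row[id_column])
    return documents, missing
```

Keeping the list of missing identifiers makes gaps between the metadata table and the files visible before any modelling starts.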

How do I prepare the document-term matrix?

The matrix is where topic quality is won or lost. Filter aggressively to remove noise and ultra-rare tokens:

python

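from pathlib import Path
from gensim import corpora
from gensim.utils import simple_preprocess

def lemmatise(text):
    # stand-in: lowercases and tokenises only; swap in a real lemmatiser
    return simple_preprocess(text)

files = sorted(Path("corpus").glob("*.txt"))  # placeholder folder, one file per document
texts = [lemmatise(f.read_text(encoding="utf-8")) for f in files]
dictionary = corpora.Dictionary(texts)
# drop words in <5 docs or >50% of docs, cap vocabulary
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=20000)
corpus = [dictionary.doc2bow(t) for t in texts]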

The no_below=5 filter is your main defence against one-off OCR garbage: a random misrecognition rarely recurs in five separate documents, although systematic errors can survive it. Tune no_above to drop boilerplate that appears almost everywhere (page headers, archival stamps).
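To see what the two document-frequency thresholds do, here is a small pure-Python sketch on an invented toy corpus. It mirrors the idea behind no_below and no_above but is not Gensim's implementation; the function name and the toy tokens are made up for illustration.

```python
from collections import Counter


def document_frequency_filter(texts, no_below=5, no_above=0.5):
    """Return the set of tokens that survive both document-frequency cut-offs."""
    n_docs = len(texts)
    doc_freq = Counter()
    for tokens in texts:
        # set() so a token counts once per document, however often it repeats
        doc_freq.update(set(tokens))
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }


# toy corpus: "the" is in every document, "tbe" is a one-off OCR error
toy = [
    ["the", "harvest", "tithe"],
    ["the", "harvest", "tbe"],
    ["the", "tithe", "parish"],
    ["the", "parish", "rent"],
]
kept = document_frequency_filter(toy, no_below=2, no_above=0.5)
print(sorted(kept))
```

On this toy data the ubiquitous "the" and the single-document tokens "tbe" and "rent" are removed, while words shared by exactly two documents survive.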

Which algorithm and settings should I use?

For interpretable historical work, LDA run in MALLET (Java) or Gensim (Python) remains the standard.

Tool	Strength	Use when
MALLET	Best topic quality, hyperparameter optimisation	You want defensible, publishable topics
Gensim LDA	Pure Python, scriptable	You stay inside a Python pipeline
BERTopic	Context-aware, short texts	You have a GPU and a technical audience
scikit-learn NMF	Fast, deterministic	Quick exploratory passes

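MALLET's hyperparameter optimisation (--optimize-interval 20) re-estimates the Dirichlet priors every 20 iterations, so some topics can be more prevalent than others; this usually improves coherence on uneven historical corpora. Always set and record a random seed (--random-seed in MALLET, random_state in Gensim) so the run is reproducible.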
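If you drive MALLET from a Python pipeline, one option is to build the train-topics command in code so the seed and topic count are recorded in the output file names. The sketch below only assembles the argument list; the helper name, the bin/mallet path, and the file names are assumptions, and the input is presumed to be a file already created with MALLET's import step.

```python
def mallet_train_command(input_file, num_topics, seed,
                         optimize_interval=20, mallet_bin="bin/mallet"):
    """Build the argument list for a MALLET train-topics run."""
    return [
        mallet_bin, "train-topics",
        "--input", str(input_file),
        "--num-topics", str(num_topics),
        "--optimize-interval", str(optimize_interval),
        # MALLET's documentation describes seed 0 as "use the clock",
        # so pass a non-zero value if you want a repeatable run
        "--random-seed", str(seed),
        # hypothetical output names that embed the settings used
        "--output-topic-keys", f"keys_k{num_topics}_seed{seed}.txt",
        "--output-doc-topics", f"doctopics_k{num_topics}_seed{seed}.txt",
    ]


command = mallet_train_command("corpus.mallet", num_topics=40, seed=42)
print(" ".join(command))
```

To execute it, pass the list to subprocess.run(command, check=True), assuming MALLET is installed at the path given in mallet_bin.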

How many topics should I pick?

Sweep and score, do not guess. Run several models and compute the C_v coherence metric:

python

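from gensim.models import LdaModel, CoherenceModel

# corpus, dictionary and texts come from the document-term matrix step
for k in [10, 20, 40, 60]:
    # fixed random_state so each run can be reproduced
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary,
                   passes=10, random_state=42)
    # C_v is computed from the tokenised texts, not the bag-of-words corpus
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    print(k, round(cm.get_coherence(), 3))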

Pick the point where the coherence curve peaks or levels off, then sanity-check that you can name each topic. A 60-topic model with the highest coherence is of little use if a quarter of its topics are unnameable.
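One way to turn a coherence sweep into a shortlist is to prefer the smallest topic count that scores close to the best one, since fewer topics are easier to name. The sketch below is a heuristic under that assumption, not a standard rule; the function name, the tolerance, and the scores are invented for illustration.

```python
def smallest_k_near_best(scores, tolerance=0.01):
    """Return the smallest topic count whose coherence is within
    `tolerance` of the best score in the sweep."""
    best = max(scores.values())
    candidates = [k for k, score in scores.items() if best - score <= tolerance]
    return min(candidates)


# illustrative numbers only, not results from a real corpus
sweep = {10: 0.41, 20: 0.48, 40: 0.52, 60: 0.525}
print(smallest_k_near_best(sweep))
```

Treat the result as a candidate to read closely, not as the answer: the naming check still decides.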

How do I read and validate the results?

For every topic, list its top 15 words and its top 10 documents by topic weight, then read those documents. A genuine topic shows the same theme in both. Cross-tabulate topic proportions against your metadata — plot topic weight by decade — to surface the historical story.
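The reading list and the decade cross-tabulation can both be derived from a document-topic weight table. The sketch below assumes you have already collected one row of topic weights per document (for example from Gensim's get_document_topics or MALLET's doc-topics output) and a parallel list of publication years; the function names and the toy values are hypothetical.

```python
from collections import defaultdict


def top_documents(doc_topics, topic_id, n=10):
    """Indices of the n documents with the highest weight for topic_id."""
    ranked = sorted(range(len(doc_topics)),
                    key=lambda i: doc_topics[i][topic_id], reverse=True)
    return ranked[:n]


def mean_weight_by_decade(doc_topics, years, topic_id):
    """Average weight of topic_id across the documents of each decade."""
    buckets = defaultdict(list)
    for weights, year in zip(doc_topics, years):
        # 1847 -> 1840, 1851 -> 1850
        buckets[year // 10 * 10].append(weights[topic_id])
    return {decade: sum(v) / len(v) for decade, v in sorted(buckets.items())}


# toy weights (rows: documents, columns: topics); invented values
doc_topics = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
years = [1843, 1847, 1851, 1858]
print(top_documents(doc_topics, topic_id=0, n=2))
print(mean_weight_by_decade(doc_topics, years, topic_id=1))
```

The indices returned by top_documents are the documents to open and read; the per-decade means are what you would plot.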

Watch for the junk topic that absorbs all the OCR noise and stopword residue; its presence is normal and actually protects the other topics. Document it, exclude it from interpretation, and move on.

What pitfalls should I plan around?

Unbalanced corpora: if 80% of documents are from one decade, topics will reflect that decade, not the period.
Mixed languages in one model produce language topics, not theme topics — split first.
Reading topic labels as truth — the words are evidence, your label is an interpretation.
No seed, so the model is unreproducible and reviewers cannot check it.
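The balance pitfall is cheap to check before modelling. A minimal sketch, assuming each document has a publication year in your metadata; the function names, the 0.5 threshold, and the example years are illustrative choices, not recommendations from a standard.

```python
from collections import Counter


def decade_shares(years):
    """Share of documents falling in each decade, as fractions of the corpus."""
    counts = Counter(year // 10 * 10 for year in years)
    total = sum(counts.values())
    return {decade: count / total for decade, count in sorted(counts.items())}


def dominant_decades(years, threshold=0.5):
    """Decades holding more than `threshold` of all documents."""
    return [d for d, share in decade_shares(years).items() if share > threshold]


# invented publication years: eight of ten documents come from the 1860s
years = [1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1845, 1872]
print(decade_shares(years))
print(dominant_decades(years))
```

If a decade is flagged, consider sampling it down or modelling it separately before drawing conclusions about the whole period.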

Key Takeaways

Topic modelling finds latent themes and expresses each document as a mixture of them.
Vocabulary filtering (no_below, no_above) is your main control over OCR noise.
LDA in MALLET or Gensim remains the standard for interpretable historical work.
Choose topic count by sweeping and scoring coherence, not by guessing.
Validate every topic by reading its top-weighted documents.
Always set a random seed and record it for reproducibility.

Frequently Asked Questions

What is a topic model, plainly?

A topic model is an unsupervised algorithm that groups words that tend to co-occur into 'topics', and describes each document as a mixture of those topics. It finds latent themes without you labelling anything in advance.

How many topics should I choose?

There is no correct number; it is a research choice. Start with a sweep of 10, 20, 40 and 60 topics, score each with coherence (C_v), and pick the model whose topics you can actually name and defend.

Is LDA still the right algorithm in 2025?

LDA via MALLET or Gensim remains the standard for interpretable, reproducible historical work. Newer neural models like BERTopic capture context better but are harder to explain to a humanities audience and, on large corpora, are slow without a GPU.

Why are my topics full of OCR garbage?

Topic models surface whatever co-occurs, including systematic OCR errors. Filter the vocabulary by minimum document frequency, drop tokens shorter than three characters, and remove a custom stopword list of common OCR fragments.
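The two token-level filters mentioned here can be applied before the dictionary is built. A minimal sketch: the function name is hypothetical, and the stoplist entries are invented examples of the kind of fragment you might collect from your own corpus.

```python
def clean_tokens(tokens, ocr_stoplist, min_length=3):
    """Drop tokens shorter than min_length and known OCR fragments."""
    return [
        token for token in tokens
        if len(token) >= min_length and token not in ocr_stoplist
    ]


# invented examples of recurring OCR misreadings
ocr_stoplist = {"tbe", "aud", "ofthe"}
tokens = ["tbe", "parish", "of", "st", "aud", "rent", "ii"]
print(clean_tokens(tokens, ocr_stoplist))
```

The minimum document frequency filter is applied afterwards, on the dictionary, so the two steps complement each other.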

Should I lemmatise before topic modelling?

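Usually yes for heavily inflected languages such as Latin or German, where one lemma can appear in many surface forms; merging them into one feature sharpens topics. For English, where lemmatisation mostly merges forms such as 'king', 'kings' and 'king's', the gains are modest; consistent stopword removal matters more.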

How do I validate a topic model?

Combine quantitative coherence scores with qualitative reading: for each topic, read the top documents by topic weight and confirm they share a real theme. A high coherence score means little if those documents do not hang together.
