Troubleshooting: Build topic models in R

When a topic model in R produces nonsense, the cause is almost always upstream of the model: dirty text, empty documents, a badly chosen K, or a missing seed. Fix preprocessing first, align metadata to the documents that actually survive prepDocuments(), then tune K with searchK(). This guide walks through the failures I hit most often with the stm package on historical corpora, and the fixes that hold.

Why are all my topics just stopwords and noise?

This is a preprocessing failure, full stop. If you feed raw text in, the highest-probability words in every topic will be "the", "and", "of" — or, on OCR'd material, garbage tokens like "tbe" and "wbich".

library(stm)

processed <- textProcessor(
  documents = corpus$text,
  metadata  = corpus,
  lowercase = TRUE,
  removestopwords = TRUE,
  removenumbers = TRUE,
  stem = FALSE,                      # stemming hurts interpretability on historical text
  customstopwords = c("hath", "thy", "thou", "tbe", "wbich")
)
out <- prepDocuments(processed$documents, processed$vocab,
                     processed$meta, lower.thresh = 2)

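prepDocuments() drops every term that does not appear in more than lower.thresh documents. The default of 1 already removes terms found in a single document; lower.thresh = 2, as used here, also removes terms found in only two. Those rare terms inflate the vocabulary and produce incoherent, junk-only topics.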
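
To make the threshold rule concrete, here is a minimal base-R sketch on a toy corpus. It only mimics the "keep terms that occur in more than lower.thresh documents" rule described above; the toy documents and the helper kept_terms() are made up for illustration and are not part of stm.

```r
# Toy corpus: each element is the token vector of one document (illustrative only).
docs <- list(
  c("ship", "cargo", "harbour"),
  c("ship", "cargo", "tbe"),
  c("ship", "hymn", "harbour"),
  c("ship", "psalm", "hymn")
)

# Document frequency: in how many documents does each term occur?
doc_freq <- table(unlist(lapply(docs, unique)))

# Hypothetical helper: terms that occur in MORE than `threshold` documents.
kept_terms <- function(threshold) {
  sort(names(doc_freq)[as.vector(doc_freq) > threshold])
}

kept_terms(1)  # drops the single-document terms ("tbe", "psalm")
kept_terms(2)  # additionally drops terms found in exactly two documents
```

On a real corpus, stm's plotRemoved() can help show how many terms and documents disappear at each candidate threshold before you commit to one.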

How do I fix "contains missing values" from prepDocuments?

After cleaning, some documents become empty and are removed. If you fit stm() against your original metadata frame, the covariate rows no longer line up with the documents.

out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     lower.thresh = 2)
# out$meta is already re-aligned — USE IT, not the original frame
fit <- stm(out$documents, out$vocab, K = 30,
           prevalence = ~ decade, data = out$meta,
           init.type = "Spectral", seed = 1851)

Always pass out$meta to stm(). The processed$docs.removed and out$docs.removed indices tell you exactly which rows vanished if you need to debug a mismatch.
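
If you ever need to re-align a metadata frame by hand, the two removal stages have to be applied in order. The sketch below uses toy stand-ins rather than real stm output, and it assumes the second index vector refers to positions among the documents that were passed to prepDocuments(), that is, after the first stage has already dropped its rows. The helper drop_rows() is hypothetical.

```r
# Toy metadata for six documents (illustrative only).
meta <- data.frame(id = 1:6,
                   decade = c(1850, 1850, 1860, 1860, 1870, 1870))

removed_by_processor <- c(2)  # stand-in for processed$docs.removed
removed_by_prep      <- c(4)  # stand-in for out$docs.removed (assumed to index
                              # the documents that survived the first stage)

# Hypothetical helper: drop rows by index, tolerating an empty index vector.
drop_rows <- function(df, idx) {
  if (length(idx) > 0) df[-idx, , drop = FALSE] else df
}

meta_stage1 <- drop_rows(meta, removed_by_processor)
meta_final  <- drop_rows(meta_stage1, removed_by_prep)

meta_final$id  # ids of the documents that survive both stages
```

A cheap guard before fitting is to check that the number of metadata rows equals the number of documents, for example stopifnot(nrow(out$meta) == length(out$documents)).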

How do I choose K without guessing?

Run searchK() over a range of K values and read two curves: held-out likelihood (predictive fit) and semantic coherence (do a topic's top words co-occur in the same documents?). Coherence trades off against exclusivity, so weigh both rather than maximising either one.

k_search <- searchK(out$documents, out$vocab,
                    K = c(10, 20, 30, 40, 50),
                    prevalence = ~ decade, data = out$meta)
plot(k_search)
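
One possible way to turn the diagnostics into a shortlist is to standardise coherence and exclusivity and rank the candidates by their sum. This is only a heuristic sketch, not something stm prescribes, and the numbers below are invented so the code runs on its own. With a real run, the corresponding columns should be available in k_search$results (they may need unlist() first).

```r
# Invented diagnostics, purely to make the sketch runnable; substitute your own.
k_diag <- data.frame(
  K      = c(10, 20, 30, 40, 50),
  semcoh = c(-58, -63, -66, -78, -91),     # higher (less negative) is better
  exclus = c(9.10, 9.45, 9.70, 9.78, 9.82) # higher is better
)

# Put both metrics on a common scale, then add them with equal weight
# (the equal weighting is an assumption, not a recommendation).
k_diag$score <- as.numeric(scale(k_diag$semcoh)) +
                as.numeric(scale(k_diag$exclus))

# Keep the two best-scoring candidates for manual inspection.
shortlist <- k_diag[order(k_diag$score, decreasing = TRUE)[1:2], ]
shortlist$K
```

Treat the shortlist as a starting point only: fit each candidate and read its topics before settling on a value.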

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Topics merge distinct themes | `K` too low | Increase `K`, re-read coherence |
| Many near-duplicate topics | `K` too high | Lower `K` or raise `lower.thresh` |
| Coherence high, sense low | Boilerplate co-occurrence | Strip headers/OCR junk, refit |
| Different topics each run | Random init, no seed | `init.type = "Spectral"`, set `seed` |

Why do my topics change every time I run the model?

Topic models are stochastic. With random initialisation, each run lands in a different local optimum. Use spectral initialisation, which is deterministic, and still set a seed so the whole pipeline is reproducible. Spectral starts also tend to converge faster and to more stable solutions on medium corpora.

How do I read a model once it fits?

Do not trust the top-words list alone. Use labelTopics() for FREX words (frequent and exclusive), then read real documents with findThoughts().

labelTopics(fit, n = 8)
findThoughts(fit, texts = out$meta$title, topics = 7, n = 3)

Reading three actual documents per topic catches the "coherent but meaningless" trap that pure metrics miss.
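
To see why FREX words can be more informative than raw top words, here is a simplified base-R sketch of a FREX-style score: the harmonic mean of a word's within-topic frequency rank and its exclusivity rank, with equal weights. The topic-word matrix is a toy, and this is an approximation for illustration rather than stm's exact implementation.

```r
# Toy topic-word probabilities (rows sum to 1); illustrative only.
beta <- rbind(
  topic1 = c(the = 0.35, of = 0.25, ship = 0.20, cargo = 0.15, hymn = 0.03, psalm = 0.02),
  topic2 = c(the = 0.35, of = 0.25, ship = 0.02, cargo = 0.03, hymn = 0.20, psalm = 0.15)
)

# Exclusivity: the share of a word's total probability that sits in each topic.
exclusivity <- sweep(beta, 2, colSums(beta), "/")

# Within-topic ranks rescaled to (0, 1]; higher means more frequent / more exclusive.
rank_score <- function(m) t(apply(m, 1, rank)) / ncol(m)
freq_score <- rank_score(beta)
excl_score <- rank_score(exclusivity)

# Equal-weight harmonic mean of the two rank scores.
frex <- 1 / (0.5 / freq_score + 0.5 / excl_score)

top_by_prob <- colnames(beta)[apply(beta, 1, which.max)]
top_by_frex <- colnames(frex)[apply(frex, 1, which.max)]
```

In this toy matrix the most probable word is the same function word in both topics, while the FREX-style score surfaces a different, topic-specific word for each one.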

Key Takeaways

Most bad topic models are bad preprocessing — clean before you blame the model.
The default lower.thresh = 1 already drops single-document terms; raise it to 2 or higher to prune more rare junk.
Always fit against out$meta, the re-aligned metadata from prepDocuments().
Choose K with searchK(), balancing coherence, exclusivity and interpretability.
Use init.type = "Spectral" plus a seed for reproducible, stable topics.
Validate by reading real documents with findThoughts(), not just top-word lists.
On OCR'd corpora, custom stopwords for scanning artefacts dramatically improve results.

Frequently Asked Questions

Why does my stm model produce topics that are all stopwords?

Your preprocessing did not remove function words or low-frequency junk. Run textProcessor() with a stopword list, then prepDocuments(). Its default lower.thresh = 1 drops terms that appear in only one document; raise it to 2 or more to also drop terms confined to a handful of documents.

How many topics (K) should I choose?

There is no single correct K. Use searchK() across a range like 10 to 60, read the held-out likelihood and semantic coherence curves, then pick a value where coherence is high and the topics are humanly interpretable.

Why do I get "Error: contains missing values" from prepDocuments?

Some documents lose every token during preprocessing and are dropped. prepDocuments() returns a docs.removed index and a re-aligned out$meta. Fit against out$meta, or re-align your own metadata to the kept documents, or the covariate rows will not match.

Should I use stm, topicmodels, or text2vec?

Use stm when you have document metadata (date, genre, author) you want as covariates; topicmodels (LDA/CTM) for a classic baseline; text2vec when speed on a large corpus matters more than covariate effects.

Why are my topics unstable between runs?

Topic models are stochastic. Set a seed in stm() and use spectral initialisation (init.type = "Spectral"), which is deterministic and usually converges to better, more reproducible solutions than random starts.

My coherence is high but the topics make no sense — what now?

High coherence with low usefulness usually means residual boilerplate (headers, OCR garbage, catalogue text) co-occurs predictably. Clean those artefacts and consider raising lower.thresh, then refit.
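
As a rough illustration of cleaning such artefacts before textProcessor(), the base-R sketch below removes whole lines that match boilerplate patterns. The two patterns and the helper strip_boilerplate() are hypothetical; real corpora will need patterns written against their own headers and scanner notices.

```r
# Hypothetical boilerplate patterns: page-number lines and a scanner credit line.
boilerplate <- c("^\\s*page\\s+\\d+\\s*$",
                 "^\\s*digiti[sz]ed by\\b.*$")

# Hypothetical helper: split each text into lines, drop lines matching any
# pattern (case-insensitive), and glue the remaining lines back together.
strip_boilerplate <- function(text, patterns) {
  vapply(strsplit(text, "\n", fixed = TRUE), function(lines) {
    hits <- lapply(patterns, grepl, x = lines, ignore.case = TRUE, perl = TRUE)
    is_junk <- Reduce(`|`, hits, FALSE)
    paste(lines[!is_junk], collapse = "\n")
  }, character(1))
}

raw <- c("Page 12\nThe ship sailed at dawn.\nDigitized by Example Library",
         "Cargo lists follow.")
cleaned <- strip_boilerplate(raw, boilerplate)
```

Working line by line keeps the patterns simple and avoids accidentally deleting body text that merely mentions a page number.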
