When a topic model in R produces nonsense, the cause is almost always upstream of the model: dirty text, empty documents, a badly chosen K, or a missing seed. Fix preprocessing first, align metadata to the documents that actually survive prepDocuments(), then tune K with searchK(). This guide walks the failures I hit most often with the stm package on historical corpora and the fixes that hold.
Why are all my topics just stopwords and noise?
This is a preprocessing failure, full stop. If you feed raw text in, the highest-probability words in every topic will be "the", "and", "of" — or, on OCR'd material, garbage tokens like "tbe" and "wbich".
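```r
library(stm)

# Clean the raw text; passing metadata here keeps it attached to the documents
processed <- textProcessor(
  documents = corpus$text,
  metadata = corpus,
  lowercase = TRUE,
  removestopwords = TRUE,
  removenumbers = TRUE,
  stem = FALSE,  # stemming hurts interpretability on historical text
  customstopwords = c("hath", "thy", "thou", "tbe", "wbich")
)

# Drop rare terms and any documents left empty by the cleaning
out <- prepDocuments(processed$documents, processed$vocab,
                     processed$meta, lower.thresh = 2)
```

prepDocuments() keeps a term only if it appears in more than lower.thresh documents. The default of 1 already drops terms found in a single document; lower.thresh = 2 also drops terms found in only two. Those rare terms inflate the vocabulary and produce incoherent, junk-only topics.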
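To see what the threshold does before running it on a real corpus, here is a minimal base R sketch on a toy four-document corpus. The corpus and the names docs, doc_freq, kept_at_1 and kept_at_2 are invented for illustration; the only thing taken from stm is the rule that a term survives when it appears in more than lower.thresh documents.

```r
# Toy corpus: invented for illustration only
docs <- c(
  "the whale and the sea",
  "the sea and the ship",
  "the ship and the captain",
  "a whale of a tale"
)

# Tokenise on whitespace and keep each term once per document
tokens <- lapply(strsplit(tolower(docs), "\\s+"), unique)

# Document frequency: the number of documents each term appears in
doc_freq <- table(unlist(tokens))

# Mimic the rule: keep a term only if its document frequency
# is strictly greater than the threshold
kept_at_1 <- names(doc_freq)[as.vector(doc_freq) > 1]
kept_at_2 <- names(doc_freq)[as.vector(doc_freq) > 2]

print(kept_at_1)
print(kept_at_2)
```

On a corpus this small, a threshold of 2 leaves only the function words, which is a reminder that raising lower.thresh is a vocabulary filter and not a substitute for a stopword list.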
How do I fix "contains missing values" from prepDocuments?
After cleaning, some documents become empty and are removed. If you fit stm() against your original metadata frame, the covariate rows no longer line up with the documents.
```r
out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     lower.thresh = 2)

# out$meta is already re-aligned: use it, not the original frame
fit <- stm(out$documents, out$vocab, K = 30,
           prevalence = ~ decade, data = out$meta,
           init.type = "Spectral", seed = 1851)
```

Always pass out$meta to stm(). The processed$docs.removed and out$docs.removed indices tell you exactly which rows vanished if you need to debug a mismatch.
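If you need to line up something that never went through textProcessor() (raw texts, IDs, a second metadata table), the two removal indices have to be applied in order. The sketch below assumes, as I read the stm documentation, that processed$docs.removed indexes the original corpus while out$docs.removed indexes the documents handed to prepDocuments(). The helper drop_removed() and the example vectors are hypothetical stand-ins, not part of stm.

```r
# Hypothetical helper: drop positions safely, even when nothing was removed.
# In R, x[-integer(0)] returns an EMPTY vector, so the empty case needs a guard.
drop_removed <- function(x, removed) {
  if (length(removed) == 0) x else x[-removed]
}

# Stand-ins for real objects (illustrative values, not real output)
original_ids        <- c("d1", "d2", "d3", "d4", "d5", "d6")
removed_by_textproc <- c(2L)  # plays the role of processed$docs.removed
removed_by_prep     <- c(4L)  # plays the role of out$docs.removed

# Apply the removals in the same order the pipeline did
after_textproc <- drop_removed(original_ids, removed_by_textproc)
kept_ids       <- drop_removed(after_textproc, removed_by_prep)

print(kept_ids)
```

A cheap sanity check after this step is to confirm that length(kept_ids) equals length(out$documents) before fitting anything.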
How do I choose K without guessing?
Run searchK() over a range and read two curves: held-out likelihood (predictive fit) and semantic coherence (do the top words co-occur?). Coherence tends to fall as exclusivity rises, so look for a K that does reasonably well on both rather than maximising either one.
```r
k_search <- searchK(out$documents, out$vocab,
                    K = c(10, 20, 30, 40, 50),
                    prevalence = ~ decade, data = out$meta)
plot(k_search)
```

| Symptom | Likely cause | Fix |
|---|---|---|
| Topics merge distinct themes | K too low | Increase K, re-read coherence |
| Many near-duplicate topics | K too high | Lower K or raise lower.thresh |
| Coherence high, sense low | Boilerplate co-occurrence | Strip headers/OCR junk, refit |
| Different topics each run | Random init, no seed | init.type = "Spectral", set seed |
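One way to narrow the range before reading topics by hand is to discard any K that another K beats on both coherence and exclusivity. The sketch below does that on a made-up score table; the numbers are placeholders, not output from any model. In real use the values would come from k_search$results, where I am assuming the columns are named semcoh and exclus and may need unlist(); check names(k_search$results) in your version.

```r
# Made-up scores for illustration only; not output from any real model
scores <- data.frame(
  K      = c(10, 20, 30, 40, 50),
  semcoh = c(-60, -65, -72, -90, -95),  # higher (closer to 0) is better
  exclus = c(8.9, 9.3, 9.6, 9.5, 9.7)   # higher is better
)

# Row i is dominated if some other row is at least as good on both
# measures and strictly better on at least one
is_dominated <- function(i, df) {
  any(df$semcoh >= df$semcoh[i] & df$exclus >= df$exclus[i] &
        (df$semcoh > df$semcoh[i] | df$exclus > df$exclus[i]))
}

dominated  <- vapply(seq_len(nrow(scores)), is_dominated, logical(1), df = scores)
candidates <- scores$K[!dominated]

print(candidates)
```

This only shortens the list. The surviving values of K still need the interpretability check of reading documents per topic.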
Why do my topics change every time I run the model?
Topic models are stochastic. With random initialisation, each run lands in a different local optimum. Use spectral initialisation, which is deterministic, and still set a seed so the whole pipeline is reproducible. Spectral starts also tend to converge faster and to more stable solutions on medium corpora.
How do I read a model once it fits?
Do not trust the top-words list alone. Use labelTopics() for FREX words (frequent and exclusive), then read real documents with findThoughts().
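```r
# Top words per topic, including FREX labels
labelTopics(fit, n = 8)

# The three documents most strongly associated with topic 7
findThoughts(fit, texts = out$meta$title, topics = 7, n = 3)
```

Reading three actual documents per topic catches the "coherent but meaningless" trap that pure metrics miss.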
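To make "frequent and exclusive" concrete, here is a simplified base R sketch of the FREX idea: rank each word within a topic by its probability and by its exclusivity, then take a weighted harmonic mean of the two ranks. The beta matrix is invented, the weight w = 0.5 is an assumption, and this is an illustration of the idea rather than the package's exact implementation.

```r
# Toy topic-word probabilities (rows = topics, columns = words); made up
beta <- matrix(c(0.40, 0.35, 0.20, 0.05,
                 0.40, 0.05, 0.10, 0.45),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("t1", "t2"),
                               c("the", "whale", "sea", "ship")))

w <- 0.5  # weight on exclusivity (assumed value)

# Exclusivity: share of each word's probability mass owned by each topic
excl <- sweep(beta, 2, colSums(beta), "/")

# Within each topic, turn scores into empirical-CDF ranks in (0, 1]
ecdf_rank <- function(x) rank(x, ties.method = "max") / length(x)
freq_rank <- t(apply(beta, 1, ecdf_rank))
excl_rank <- t(apply(excl, 1, ecdf_rank))

# Weighted harmonic mean of the two ranks
frex <- 1 / (w / excl_rank + (1 - w) / freq_rank)

top_prob_t1 <- colnames(beta)[which.max(beta["t1", ])]
top_frex_t1 <- colnames(beta)[which.max(frex["t1", ])]

print(c(top_prob = top_prob_t1, top_frex = top_frex_t1))
```

In this toy example the highest-probability word in the first topic is a function word shared equally with the other topic, while the top FREX word is the one the topic mostly owns. That is why FREX labels are usually easier to interpret than raw top-word lists.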
Key Takeaways
- Most bad topic models are bad preprocessing — clean before you blame the model.
- Raise lower.thresh (2 or higher) to drop junk terms confined to a handful of documents.
- Always fit against out$meta, the re-aligned metadata from prepDocuments().
- Choose K with searchK(), balancing coherence, exclusivity and interpretability.
- Use init.type = "Spectral" plus a seed for reproducible, stable topics.
- Validate by reading real documents with findThoughts(), not just top-word lists.
- On OCR'd corpora, custom stopwords for scanning artefacts dramatically improve results.
Frequently Asked Questions
Why does my stm model produce topics that are all stopwords?
Your preprocessing did not remove function words or low-frequency junk. Run textProcessor() with a stopword list, then prepDocuments() with a lower.thresh of at least 2, which drops every term that appears in two or fewer documents.
How many topics (K) should I choose?
There is no single correct K. Use searchK() across a range like 10 to 60, read the held-out likelihood and semantic coherence curves, then pick a value where coherence is high and the topics are humanly interpretable.
Why do I get "Error: contains missing values" from prepDocuments?
Some documents lose all their tokens during cleaning and are dropped, so the original metadata frame no longer matches the documents being modelled. prepDocuments() returns a docs.removed index and a re-aligned meta element; fit against that re-aligned metadata, or the covariate rows will not match.
Should I use stm, topicmodels, or text2vec?
Use stm when you have document metadata (date, genre, author) you want as covariates; topicmodels (LDA/CTM) for a classic baseline; text2vec when speed on a large corpus matters more than covariate effects.
Why are my topics unstable between runs?
Topic models are stochastic. Set a seed in stm() and use spectral initialisation (init.type = "Spectral") which is deterministic and usually converges to better, more reproducible solutions than random starts.
My coherence is high but the topics make no sense — what now?
High coherence with low usefulness usually means residual boilerplate (headers, OCR garbage, catalogue text) co-occurs predictably. Clean those artefacts and consider raising lower.thresh, then refit.