Troubleshooting: Mine text with tidytext

Most tidytext problems trace back to four causes: wrong text encoding, missing stop-word removal, OCR noise swamping signal, or an empty or mis-grouped document column breaking tf-idf. Diagnose by inspecting your tokens immediately after unnest_tokens() with count(). If the top tokens are mojibake, stop words, or junk strings, you have found the root cause before writing another line.

Why did unnest_tokens mangle my accented text?

If é, ñ or the long-s arrive as garbage, the file was read in the wrong encoding. tidytext does not corrupt text; the read step did. Re-read explicitly:

library(tidyverse)
library(tidytext)

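# Detect first if unsure; returns candidate encodings with confidence scores
guess_encoding("data/letter.txt")

# Declare the encoding the file was actually saved in. If guess_encoding()
# reports e.g. "ISO-8859-1" or "windows-1252", use that instead of "UTF-8".
raw <- read_file("data/letter.txt",
                 locale = locale(encoding = "UTF-8"))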

Then tokenise and look before trusting:

tibble(text = raw) |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE) |>
  head(20)

This single inspection catches encoding, stop-word and OCR problems at once.
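To see the failure in isolation, the sketch below writes the four Latin-1 bytes of "café" to a temporary file, checks that they are not valid UTF-8, and reads them back with the encoding declared. The sample word and the temporary path are assumptions made for illustration, standing in for a real letter file.

```r
library(readr)

# Hypothetical sample: "café" saved as Latin-1, where é is the single byte 0xE9
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
path <- tempfile(fileext = ".txt")
writeBin(latin1_bytes, path)

# The same bytes are not valid UTF-8, which is why a default read garbles them
is_utf8 <- validUTF8(rawToChar(latin1_bytes))

# Declaring the real encoding lets readr convert the text to UTF-8 on the way in
fixed <- read_file(path, locale = locale(encoding = "latin1"))
```

Once the text is read with the right encoding, unnest_tokens() receives clean UTF-8 and the accented characters survive tokenising.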

Why are my top words all "the" and "and"?

You skipped stop-word removal. The base fix is an anti_join:

data(stop_words)
tokens |>
  anti_join(stop_words, by = "word")

For historical material the default English list is not enough. Archaic function words (hath, thee, unto) and recurring OCR noise need a custom list:

custom_stop <- tibble(word = c("hath", "thee", "thou", "unto", "ye"))
tokens |>
  anti_join(stop_words, by = "word") |>
  anti_join(custom_stop, by = "word")
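If you prefer a single join, one option is to append the custom words to stop_words and filter once. The sketch below assumes a made-up sample sentence and labels the added rows with a hypothetical lexicon value of "custom" so they stay distinguishable from the built-in lists.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample text standing in for a historical document
tokens <- tibble(text = "Thou hath given unto thee the harvest and the plough") |>
  unnest_tokens(word, text)

# Combine the built-in list with archaic function words in one table
all_stop <- bind_rows(
  stop_words,
  tibble(word = c("hath", "thee", "thou", "unto", "ye"), lexicon = "custom")
)

kept <- tokens |>
  anti_join(all_stop, by = "word")
```

Keeping the combined list in one object also makes it easy to save alongside the analysis, so the exact words removed are on record.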

How do you stop OCR garbage dominating?

Optical character recognition of old print produces thousands of unique error tokens. Filter to plausible words and drop the long tail:

clean <- tokens |>
  filter(str_detect(word, "^\\p{L}+$")) |>     # letters only; any Unicode letter, so accents survive
  filter(str_length(word) > 2) |>            # drop 1-2 char fragments
  add_count(word) |>
  filter(n >= 3)                              # drop near-unique noise

The frequency threshold is the workhorse: real vocabulary recurs, OCR errors usually do not.
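A small worked case shows what each filter removes. The token list below is invented for illustration, and the regex uses the Unicode letter class \p{L} as one way to keep accented letters; the escape "\u00e8" is simply è written portably.

```r
library(dplyr)
library(stringr)

# Hypothetical tokens: two real words that recur, plus typical OCR debris
tokens <- tibble(word = c(
  rep("letter", 4), rep("p\u00e8re", 3),   # genuine vocabulary, repeated
  "tbe", "wbich",                          # one-off misreadings
  "l3tter",                                # digit inside a word
  "ii", rep("of", 5)                       # one- and two-character fragments
))

clean <- tokens |>
  filter(str_detect(word, "^\\p{L}+$")) |>  # drops "l3tter"
  filter(str_length(word) > 2) |>           # drops "ii" and "of"
  add_count(word) |>
  filter(n >= 3)                            # drops "tbe" and "wbich"

surviving <- unique(clean$word)
```

Only the two recurring real words remain in surviving. On a real corpus the threshold of 3 is a starting point to tune, since rare but genuine words are dropped along with the noise.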

Why does bind_tf_idf return NaN or Inf?

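bind_tf_idf() computes tf as a term's count divided by its document's total count, and idf as the natural log of the number of documents divided by the number of documents containing the term. Trouble appears in three situations:

Symptom	Root cause	Fix
`NaN` rows	Empty document (token total of 0, so tf is 0/0)	Filter out documents with zero tokens first
Unexpected zeros	Term in every document	Expected; idf is ln(1) = 0, so tf-idf is 0, not an error
All values identical	Wrong grouping column	Pass the correct `document` column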

counts <- tokens |>
  count(document, word) |>
  filter(n > 0)

counts |>
  group_by(document) |>
  filter(sum(n) > 0) |>      # drop empty documents
  ungroup() |>
  bind_tf_idf(word, document, n)
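The first two rows of the table can be reproduced on a toy count table. The sketch below is built on an assumption: a hand-made table in which document "c" is recorded with a zero count, as can happen when counts are completed or imported from elsewhere rather than produced by count().

```r
library(dplyr)
library(tidytext)

# Hypothetical counts; document "c" contributes no tokens at all
counts <- tibble(
  document = c("a", "a", "b", "b", "c"),
  word     = c("ship", "the", "corn", "the", "the"),
  n        = c(2, 4, 1, 3, 0)
)

# tf for document "c" is 0 / 0, so its row comes back as NaN
raw_scores <- counts |>
  bind_tf_idf(word, document, n)

# Dropping empty documents first removes the NaN rows
fixed <- counts |>
  group_by(document) |>
  filter(sum(n) > 0) |>
  ungroup() |>
  bind_tf_idf(word, document, n)
```

In fixed, "the" appears in both remaining documents, so its idf is ln(2/2) = 0 and its tf-idf is 0. That zero is the expected result for a word that does not distinguish any document.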

Why is tokenising painfully slow?

Two anti-patterns dominate: tokenising the entire corpus as one giant frame, and carrying columns you never use. Trim first, then tokenise, and consider a faster backend:

corpus |>
  select(doc_id, text) |>             # keep only what you need
  unnest_tokens(word, text)

For corpora beyond a few million tokens, read with arrow or process documents in chunks rather than loading everything into one in-memory data frame.
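Chunked processing can be sketched with base R's split() and lapply(). The toy corpus, the scanned_by column and the chunk size of 2 are all assumptions for illustration; the point is that each chunk is reduced to counts before the pieces are recombined, so the full token table never exists in memory at once.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus; `scanned_by` stands in for columns you never use
corpus <- tibble(
  doc_id = 1:6,
  text = c("the ship sailed at dawn", "corn prices rose again",
           "the ship returned", "rain spoiled the corn",
           "a letter from the captain", "the harvest was poor"),
  scanned_by = "volunteer"
)

chunk_size <- 2  # assumed value; tune to available memory
chunk_id <- ceiling(seq_len(nrow(corpus)) / chunk_size)

word_counts <- corpus |>
  select(doc_id, text) |>                # trim before tokenising
  split(chunk_id) |>                     # list of small data frames
  lapply(\(chunk) {
    chunk |>
      unnest_tokens(word, text) |>
      count(doc_id, word)                # reduce each chunk to counts
  }) |>
  bind_rows()
```

Because each document sits entirely inside one chunk, the per-document counts match what a single pass over the whole corpus would produce.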

How do you sanity-check a sentiment or n-gram result?

Before reporting, trace a sample back to the source. Sentiment lexicons are modern, so anachronistic scoring is a real risk on historical prose:

tokens |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(word, sentiment, sort = TRUE) |>
  head(15)

If words like gay or awful score with their modern sense, you have surfaced a semantic-change problem the numbers would otherwise hide.
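One way to trace scores back to the source is to keep the original text beside each token with drop = FALSE, so every scored word can be read in context. The three sample sentences and the line column below are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical lines standing in for historical prose
letters_tbl <- tibble(
  line = 1:3,
  text = c("The cathedral was an awful and solemn sight",
           "We passed a gay and merry evening",
           "The harvest failed again this year")
)

scored <- letters_tbl |>
  unnest_tokens(word, text, drop = FALSE) |>   # keep the source sentence
  inner_join(get_sentiments("bing"), by = "word") |>
  select(line, word, sentiment, text)
```

Reading the text column next to each sentiment label shows whether the lexicon's modern sense fits. Here "awful" describes a cathedral, where a period reader would most likely understand awe rather than disapproval.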

Key Takeaways

Inspect tokens with count() right after unnest_tokens() to find the root cause fast.
Mojibake means a wrong read encoding, not a tidytext bug; re-read with the file's actual encoding declared.
Remove stop words and build a custom list for archaic and OCR tokens.
A frequency threshold (n >= 3) clears the long tail of OCR noise.
NaN/Inf in tf-idf usually means empty documents or a wrong grouping column.
Trim columns and chunk large corpora to keep tokenising fast.
Trace sentiment results to the source to catch anachronistic scoring.

Frequently Asked Questions

Why does unnest_tokens lose my accented characters?

Your text was read in the wrong encoding. Run guess_encoding() if you are unsure, then re-read the file with read_file(..., locale = locale(encoding = 'UTF-8')), substituting the encoding the file was actually saved in, so accented and historical glyphs survive before tokenising.

Why are my most frequent words just "the" and "and"?

You have not removed stop words. Use anti_join() against tidytext's stop_words; for historical text, also build a custom list that strips archaic function words and OCR noise tokens.

Why does bind_tf_idf give NaN or Inf?

Most often a document is empty, so its term frequencies are 0/0. A term that appears in every document scores 0, which is expected and not an error. Filter out empty documents before computing tf-idf, and check that your document grouping column is correct.

How do I stop OCR garbage dominating the results?

Filter tokens with a regex that keeps only plausible words, drop one- and two-character tokens, and remove anything with digits inside letters. A frequency-threshold filter then removes the long tail of unique OCR errors.

Why is unnest_tokens so slow on my corpus?

You are probably carrying columns you never use, or holding the whole corpus in one giant data frame. Keep only the columns you need before tokenising, and for very large corpora read with arrow or process documents in chunks.
