Text Mining & Corpora

When collocations or keywords look wrong, the cause is almost always one of four things: function words swamping the list, OCR noise scoring as signal, a mismatched reference corpus, or a raw-frequency measure where you needed an association statistic. Work through the fixes in turn: switch to an association measure such as log-likelihood and add a minimum-frequency threshold, clean the text, then match your reference corpus to the genre and period. After that the list usually becomes interpretable. Results are sensitive to these settings by nature, so lock and document them.

Collocations are words that co-occur more than chance would predict within a span; keywords are words unusually frequent relative to a reference corpus. The two problems share the same failure modes, so the same troubleshooting checklist applies to both.

Why are my collocations just "the", "of" and "and"?

Because you are ranking by raw co-occurrence, and function words co-occur with everything. The fix is an association measure plus a frequency floor:

python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Whitespace tokenisation; the context manager closes the file promptly
with open("derived/corpus.txt", encoding="utf-8") as fh:
    tokens = fh.read().split()

finder = BigramCollocationFinder.from_words(tokens, window_size=4)  # pairs within a 4-token window
finder.apply_freq_filter(5)                 # drop pairs seen < 5 times
bam = BigramAssocMeasures()
# Rank by log-likelihood and show the 15 strongest pairs
for pair, score in finder.score_ngrams(bam.likelihood_ratio)[:15]:
    print(pair, round(score, 1))

apply_freq_filter removes the long tail of noise, and likelihood_ratio (log-likelihood) ranks by how surprising the pairing is, not how common. Both changes are needed; either alone leaves the list distorted.

Why do my keywords look like OCR garbage?

Keyword extraction by design surfaces what is over-represented in your corpus versus a reference. OCR errors like tbe and arid are over-represented in your noisy text and absent from a clean reference, so they shoot to the top. There is no statistical fix — clean upstream:

python
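# Word list, lowercased so the membership test is case-insensitive
with open("wordlist.txt", encoding="utf-8") as fh:
    DICT = {w.lower() for w in fh.read().split()}

def keep(tok):
    # Keep only alphabetic tokens found in the word list. This also drops
    # names and period spellings the list lacks, and it cannot catch
    # real-word errors such as "arid" for "and".
    return tok.isalpha() and tok.lower() in DICT

tokens = [t for t in tokens if keep(t)]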

If garbage persists after cleaning, your text needs re-OCR, not a longer stop-list.
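One rough way to judge whether cleaning is enough is to measure how much of the text falls outside the word list before and after filtering. The sketch below is illustrative only: `oov_rate`, the sample tokens and the tiny word list are all hypothetical, and no particular cut-off is implied.

```python
def oov_rate(tokens, wordlist):
    """Share of alphabetic tokens that are absent from the word list."""
    alpha = [t.lower() for t in tokens if t.isalpha()]
    if not alpha:
        return 0.0
    missing = sum(1 for t in alpha if t not in wordlist)
    return missing / len(alpha)

# Toy data for illustration only (not from any real corpus)
sample = "tbe cat sat on tbe mat arid slept".split()
words = {"the", "cat", "sat", "on", "mat", "and", "slept", "arid"}

rate = oov_rate(sample, words)
print(f"{rate:.2f}")  # 2 of 8 tokens ("tbe" twice) are out of vocabulary
```

Note that "arid" counts as in-vocabulary here because it is a real word, which is why a high rate is a useful warning sign but a low rate is not proof of clean text.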

How do I pick the right reference corpus?

A mismatched reference is the most common cause of misleading keywords. Match it to the genre and period:

| Reference choice | What it surfaces | When to use |
|---|---|---|
| Modern web corpus | Every archaic spelling | Almost never for historical text |
| Contemporaneous balanced corpus | Your topic's distinctive words | Default for historical keywords |
| The rest of your own corpus | What sets one subset apart | Comparing periods/authors within a corpus |

Using a 21st-century reference against 18th-century pamphlets produces a keyword list about language change, not your research question.
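To make the role of the reference concrete, here is a minimal sketch of keyword scoring with one common two-cell form of the log-likelihood statistic. The function names `keyness_ll` and `keywords`, the `min_freq` default and the toy token lists are assumptions for illustration, not part of any library; both corpora are assumed to be non-empty.

```python
import math
from collections import Counter

def keyness_ll(a, b, c, d):
    """Log-likelihood for a word seen a times in a target of c tokens
    and b times in a reference of d tokens (two-cell form)."""
    e1 = c * (a + b) / (c + d)   # expected count in the target
    e2 = d * (a + b) / (c + d)   # expected count in the reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target_tokens, reference_tokens, min_freq=5):
    """Words over-represented in the target, strongest first."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    c, d = sum(t.values()), sum(r.values())
    scored = []
    for w, a in t.items():
        if a < min_freq:
            continue                      # frequency floor
        b = r.get(w, 0)
        if a / c > b / d:                 # keep over-represented words only
            scored.append((w, keyness_ll(a, b, c, d)))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy corpora for illustration only
target = ["liberty"] * 10 + ["the"] * 50 + ["tax"] * 40
reference = ["the"] * 50 + ["tax"] * 40 + ["king"] * 10
result = keywords(target, reference)
print(result)
```

Swapping `reference` for a different token list changes which words clear the over-representation test, which is the whole effect the table above describes.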

Which association measure should I trust?

Log-likelihood is the robust default; PMI surfaces tighter but rarer pairings and over-rewards hapax-like items.

python
# log-likelihood: stable, good general default
finder.score_ngrams(bam.likelihood_ratio)[:10]
# PMI: tight idioms, but watch low-frequency inflation
finder.score_ngrams(bam.pmi)[:10]

A practical rule: report log-likelihood for headline findings, and only reach for PMI when you specifically want fixed phrases and have already applied a frequency floor.
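The low-frequency inflation is easy to see with a worked calculation. The sketch below uses the textbook definition of pointwise mutual information on raw counts; the helper `pmi` and all the counts are made up for illustration, and it ignores the window normalisation a real collocation finder applies.

```python
import math

def pmi(n_xy, n_x, n_y, total):
    """PMI in bits: log2( P(x, y) / (P(x) * P(y)) ) from raw counts."""
    return math.log2(n_xy * total / (n_x * n_y))

total = 1_000_000  # hypothetical corpus size in tokens

# Two words that each occur once, and that one time together
rare = pmi(1, 1, 1, total)

# A well-attested pair: 500 co-occurrences of words seen 2,000 and 1,000 times
common = pmi(500, 2_000, 1_000, total)

print(round(rare, 1), round(common, 1))  # 19.9 8.0
```

The pair seen once outscores the pair seen 500 times, which is why a frequency floor has to come before PMI rather than after it.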

Why do results shift every time I change a setting?

Because they genuinely depend on four knobs: window size, frequency threshold, association measure and reference corpus. This sensitivity is a property of the method, not an error. Manage it instead of fighting it:

  • Fix the four settings in a config and version it.
  • Re-run with one setting nudged (for example, window 4 to 5) and check that the top results survive.
  • Report the settings alongside every list you publish.

A collocation that vanishes when the window changes from 4 to 5 was never robust enough to claim.
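The three steps above could be sketched as follows. Everything here is an assumption for illustration: the `Settings` fields, the reference label, the `overlap` helper (a plain Jaccard index) and the placeholder result lists, which stand in for the top pairs from two real runs.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Settings:
    window: int = 4
    min_freq: int = 5
    measure: str = "log-likelihood"
    reference: str = "contemporaneous-balanced"  # hypothetical label

def overlap(top_a, top_b):
    """Jaccard overlap between two top-N lists (1.0 means identical sets)."""
    a, b = set(top_a), set(top_b)
    return len(a & b) / len(a | b) if a | b else 1.0

base = Settings()
nudged = Settings(window=5)

# Serialise the settings so they can be versioned and published with the list
print(json.dumps(asdict(base), sort_keys=True))

# Placeholder lists standing in for the top pairs of each run
run_w4 = [("civil", "liberty"), ("standing", "army"), ("stamp", "act")]
run_w5 = [("civil", "liberty"), ("standing", "army"), ("public", "credit")]

score = overlap(run_w4, run_w5)
print(score)  # 2 shared pairs out of 4 distinct pairs
```

A frozen dataclass is used so a run's settings cannot be mutated halfway through; the pairs in the intersection are the ones that survived the nudge.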

How do I sanity-check a final list before reporting it?

Trace the top items back to concordance lines. A keyword or collocation you cannot ground in actual sentences is a statistic, not a finding:

python
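def kwic(tokens, target, span=6):
    """Print each occurrence of target with span tokens of context on either side."""
    target = target.lower()                  # match case-insensitively
    for i, t in enumerate(tokens):
        if t.lower() == target:
            left = " ".join(tokens[max(0, i-span):i])
            right = " ".join(tokens[i+1:i+1+span])
            print(f"{left:>40} [{t}] {right}")

kwic(tokens, "liberty")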

If the lines do not support the interpretation, the number is misleading you.

Key Takeaways

  • Function-word collocations mean you need an association measure plus a frequency floor.
  • OCR-garbage keywords are an upstream cleaning problem, not a scoring one.
  • Match your reference corpus to the genre and period of the target.
  • Log-likelihood is the robust default; use PMI cautiously for fixed phrases.
  • Results are legitimately sensitive to settings — lock, document and stress-test them.
  • Ground every top result in concordance lines before claiming it.

Frequently Asked Questions

What is the difference between collocations and keywords?

Collocations are words that co-occur more than chance within a span, revealing fixed phrases and associations; keywords are words unusually frequent in your target corpus compared to a reference corpus. Collocations need one corpus; keywords always need a comparison corpus.

Why are my top collocations all function words?

Because raw co-occurrence counts are dominated by 'the', 'of' and 'and'. Switch from raw frequency to an association measure such as log-likelihood or PMI, and apply a minimum frequency threshold so rare noise does not inflate the scores.

Why do my keywords look like OCR garbage?

Keyword extraction surfaces whatever is unusually frequent, and OCR errors are unusually frequent in your corpus but absent from a clean reference. The fix is upstream: clean the text and exclude non-dictionary tokens before scoring.

What reference corpus should I use for keyword analysis?

Use a reference that matches the genre and period of your target as closely as possible; a modern web corpus will flag every archaic spelling as a keyword. A balanced sample of contemporaneous text gives results about your topic rather than about language change.

Which association measure should I choose?

Log-likelihood is a robust default for both keywords and collocations and handles low frequencies better than chi-squared; PMI surfaces tight, rarer pairings but over-rewards low-frequency items. Report which measure and threshold you used.

Why do results change every time I tweak settings?

Collocation and keyword results are sensitive to window size, frequency cut-off, the association measure and the reference corpus. That sensitivity is real, not a bug — lock the settings, document them, and check findings survive a reasonable change.