When to compare two corpora statistically

Compare two corpora statistically when you have a sharp contrast question (does pamphlet rhetoric differ before and after 1789?), two collections that match in genre and register, and enough text — roughly 100,000 tokens per side — for word frequencies to stabilise. If any of those three conditions is missing, a statistical comparison will produce confident-looking numbers that answer the wrong question. The technique is powerful precisely because it is narrow: it tells you which words distinguish A from B, not what either corpus means.

When does a statistical comparison actually fit?

The method earns its keep when the difference you suspect is real but too diffuse to spot by eye. Reading ten letters, you might sense one author is more formal; reading a thousand, you cannot hold the pattern in your head. Keyness analysis surfaces the over-represented words that drive that impression and ranks them by effect size.

Good fit looks like this:

A binary or near-binary contrast (Whig vs Tory, early vs late, translated vs original).
Comparable units: both sides are sermons, or both are novels, not sermons vs tweets.
Reasonable volume per side so that mid-frequency words occur dozens of times.

Poor fit looks like a handful of documents, an open-ended "what is interesting here?" question, or corpora that differ on three variables at once so you cannot attribute any result.

How do you actually run the comparison?

The workhorse measures are log-likelihood (G2) for significance and log-ratio for effect size. The standard recipe: build a contingency table per word, compute both, then rank by log-ratio and filter by a significance floor.

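python

import math

def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) for one word.

    a, b = frequency of the word in corpus 1, corpus 2
    c, d = total tokens in corpus 1, corpus 2
    """
    # Expected frequencies if the word were spread evenly across both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    # Skip zero counts: the term a * log(a / e1) tends to 0 as a tends to 0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def log_ratio(a, b, c, d):
    """Binary log of the ratio of relative frequencies (effect size)."""
    # +0.5 smoothing avoids log(0) for words absent on one side
    r1 = (a + 0.5) / c
    r2 = (b + 0.5) / d
    return math.log2(r1 / r2)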

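In practice you would reach for a maintained tool rather than hand-rolling: textstat_keyness() from the quanteda family of R packages, or AntConc's Keyword tool. Both compute log-likelihood keyness with built-in normalisation; check each tool's documentation for which effect-size measures it reports, because the options differ between tools and versions.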
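The recipe described above (count, compute both measures, gate on significance, rank by effect size) can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not how any particular tool implements it: `keyness_table` and its `min_ll` parameter are hypothetical names, and the default floor of 15.13 is the chi-square critical value for p < 0.0001 at one degree of freedom, which is one common choice rather than a rule.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in c tokens and b times in d tokens."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def log_ratio(a, b, c, d):
    """Binary log of the ratio of smoothed relative frequencies."""
    return math.log2(((a + 0.5) / c) / ((b + 0.5) / d))

def keyness_table(tokens1, tokens2, min_ll=15.13):
    """Hypothetical helper: rows of (word, log_ratio, G2) that pass the
    significance gate, ranked by absolute log-ratio (largest first)."""
    f1, f2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    rows = []
    for word in f1.keys() | f2.keys():
        a, b = f1[word], f2[word]
        g2 = log_likelihood(a, b, n1, n2)
        if g2 >= min_ll:  # significance is only a gate
            rows.append((word, log_ratio(a, b, n1, n2), g2))
    # effect size decides the ranking
    return sorted(rows, key=lambda r: abs(r[1]), reverse=True)

# Toy corpora (invented data, far too small for real analysis)
corpus_a = ["liberty"] * 40 + ["the"] * 960
corpus_b = ["liberty"] * 5 + ["the"] * 995
rows = keyness_table(corpus_a, corpus_b)
for word, lr, g2 in rows:
    print(f"{word}\tlog-ratio={lr:.2f}\tG2={g2:.2f}")
```

In this toy input only "liberty" clears the gate; "the" differs slightly between the two sides but not significantly, so it never reaches the ranking step.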

Why is log-likelihood not enough on its own?

Because significance scales with size. Feed a million-word corpus to G2 and nearly every function word clears the p < 0.0001 bar: the test is detecting that the difference is real, not that it is large. That is why you sort by log-ratio (the effect size) and use log-likelihood only as a gate. A word with a log-ratio of 3.0 (eight times more frequent on one side, since 2^3 = 8) and a comfortable significance value is worth a sentence in your paper; a word that is significant but has a log-ratio near zero is a real but trivial difference dressed as a finding.

| Measure | Answers | Scales with corpus size? | Use it to |
| --- | --- | --- | --- |
| Raw difference | Which counts differ | Yes (badly) | Almost nothing on its own |
| Log-likelihood (G2) | Is the difference real | Yes | Filter / gate results |
| Log-ratio | How big is the difference | No | Rank and report |
| %DIFF | Relative % change | No | Sanity-check log-ratio |
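The size effect can be checked with a few lines of arithmetic. The sketch below uses invented counts (a hypothetical word at 120 vs 100 occurrences per 100,000 tokens) and then keeps the same rates while making both corpora 100 times larger. The `pct_diff` helper is an assumed, simple formulation of %DIFF as the percentage difference between the two relative frequencies.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in c tokens and b times in d tokens."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def log_ratio(a, b, c, d):
    """Binary log of the ratio of smoothed relative frequencies."""
    return math.log2(((a + 0.5) / c) / ((b + 0.5) / d))

def pct_diff(a, b, c, d):
    """Assumed %DIFF: relative frequency in corpus 1 vs corpus 2, in percent."""
    return ((a / c) - (b / d)) / (b / d) * 100

# Invented counts: same underlying rates, two corpus sizes
small = (120, 100, 100_000, 100_000)
large = tuple(100 * x for x in small)

g2_small, g2_large = log_likelihood(*small), log_likelihood(*large)
lr_small, lr_large = log_ratio(*small), log_ratio(*large)

print(f"G2:        {g2_small:.2f} -> {g2_large:.2f}")
print(f"log-ratio: {lr_small:.3f} -> {lr_large:.3f}")
print(f"%DIFF:     {pct_diff(*small):.1f} -> {pct_diff(*large):.1f}")
```

G2 grows in direct proportion to corpus size, moving from below the conventional 3.84 threshold (p < 0.05) to well above 15.13 (p < 0.0001), while log-ratio and %DIFF stay essentially where they were. The difference did not get bigger; it only got easier to detect.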

What are the hidden costs and trade-offs?

The chief cost is comparability. Two corpora that differ in genre will hand you a keyness list dominated by genre markers, and you will mistake artefacts for findings. The second cost is the rare-word trap: words appearing once or twice produce wild log-ratios that look dramatic and mean nothing. Set a minimum frequency (often 5–10 in each corpus) before ranking.
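A short sketch of the rare-word trap, using invented counts. The helper `frequent_enough` is a hypothetical name, and it encodes one reading of the rule (the floor must be met in each corpus); some analysts apply the floor to the combined count instead so that words absent from one side are not excluded.

```python
import math
from collections import Counter

def log_ratio(a, b, c, d):
    """Binary log of the ratio of smoothed relative frequencies."""
    return math.log2(((a + 0.5) / c) / ((b + 0.5) / d))

def frequent_enough(word, f1, f2, min_freq=5):
    """Hypothetical filter: True if the word meets the floor in each corpus."""
    return f1[word] >= min_freq and f2[word] >= min_freq

# Invented counts in two 100,000-token corpora
f1 = Counter({"whereas": 3, "liberty": 60})
f2 = Counter({"whereas": 0, "liberty": 20})
n1 = n2 = 100_000

for word in ("whereas", "liberty"):
    lr = log_ratio(f1[word], f2[word], n1, n2)
    print(f"{word}\tlog-ratio={lr:.2f}\tkeep={frequent_enough(word, f1, f2)}")
```

Here "whereas", with three occurrences on one side and none on the other, earns a higher log-ratio than "liberty", which is backed by eighty occurrences in total. The frequency floor removes it before it can top the ranking.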

There is also an interpretive cost. A keyness list is a prompt for reading, not a conclusion. Every top word should send you back to concordance lines to confirm it means what you assume — "state" the noun and "state" the verb are very different stories.
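Going back to concordance lines can be as simple as a keyword-in-context listing. The function below is a bare-bones sketch (the name `concordance` and the `window` parameter are illustrative, and whitespace tokenisation is an assumption); dedicated concordancers handle sorting, punctuation, and metadata far better.

```python
def concordance(tokens, keyword, window=4):
    """Return keyword-in-context lines: up to `window` tokens on each side
    of every case-insensitive match of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sentence mixing "state" as noun and as verb
text = "the state must act but ministers state plainly that the state cannot"
lines = concordance(text.split(), "state", window=3)
for line in lines:
    print(line)
```

Even in this one invented sentence, the second hit is the verb and the other two are the noun, which a bare frequency count would merge into a single figure.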

How do you avoid comparing topic when you meant style?

Decide up front. For content comparison, lemmatise and strip stopwords so themes surface. For style comparison, do the opposite: keep only function words or part-of-speech n-grams, because the unconscious grammar of an author or period is what carries stylistic signal. Mixing them lets one dominant subject — a war, a trial — flood the list and disguise itself as a structural difference. If your two corpora also differ in topic, consider matching or stratifying documents first so the topic variable is held roughly constant.
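The two preparations can be pictured as two filters over the same token stream. This is a sketch only: `FUNCTION_WORDS` is a deliberately tiny, hypothetical list (a real study would use a fuller list suited to the language and period), and the lemmatisation step for the content view is omitted.

```python
# Hypothetical, very short function-word list for illustration only
FUNCTION_WORDS = {
    "the", "a", "an", "of", "and", "but", "to", "in",
    "that", "which", "not", "we", "you",
}

def style_view(tokens):
    """Keep only function words: the input for a style comparison."""
    return [t.lower() for t in tokens if t.lower() in FUNCTION_WORDS]

def content_view(tokens):
    """Drop function words: the input for a content comparison
    (lemmatisation would normally follow; it is not shown here)."""
    return [t.lower() for t in tokens if t.lower() not in FUNCTION_WORDS]

sample = "We demand the liberty of the press and not the favour of kings".split()
print(style_view(sample))
print(content_view(sample))
```

Running keyness on the first list asks how the two corpora are written; running it on the second asks what they are about. Feeding the unfiltered tokens answers both questions at once and neither of them cleanly.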

When should you not do this at all?

Skip the statistics when the corpus is small enough to read, when the question is genuinely interpretive ("how does grief feel in these diaries?"), or when your two collections differ on so many axes that no single result is attributable. In those cases, distant reading adds a false veneer of rigour. Reach for close reading, or first invest in better sampling and metadata so a future comparison can be honest.

Key Takeaways

Compare statistically only with a sharp contrast question, comparable genres, and ~100k+ tokens per side.
Use log-likelihood as a significance gate, but rank and report by log-ratio (effect size).
Normalise by size — corpora need not be equal in length, but must match in register.
Set a minimum frequency to dodge the rare-word trap; one-off words produce meaningless ratios.
Choose content vs style deliberately: stopwords in for style, out for content.
Treat keyness lists as reading prompts; verify each top word in its concordance.
If the corpus is small or the question is interpretive, close reading beats statistics.

Frequently Asked Questions

When should I compare two corpora statistically instead of just reading them?

Use a statistical comparison when you have a precise contrast question, two corpora that are comparable in genre and register, and enough text per side (roughly 100,000+ tokens) for frequencies to stabilise. If you only have a few documents or the question is interpretive rather than quantitative, close reading is the better tool.

What is the difference between keyness and a simple frequency difference?

A raw frequency difference just subtracts counts and is dominated by corpus size and a handful of very common words. Keyness pairs a statistical test such as log-likelihood, which asks whether a word is over-represented in one corpus relative to the other beyond what chance would predict, with an effect-size measure such as log-ratio that you can rank and filter.

Why is log-likelihood not enough on its own?

Log-likelihood (G2) measures statistical significance, which grows with corpus size, so in large corpora almost everything looks significant. Always pair it with an effect-size measure such as log-ratio so you can distinguish a large, meaningful difference from a tiny one that is merely detectable.

Do my two corpora need to be the same size to compare them?

No. Proper keyness measures normalise by corpus size, so unequal totals are fine. What must match is the genre, register, and ideally the time span, because differences in document type can swamp the contrast you actually care about.

What sample size do I need for a reliable corpus comparison?

There is no universal threshold, but estimates for low-frequency words are unstable. As a rule of thumb, aim for at least 100,000 tokens per corpus for general keyness, and trust comparisons only for words that clear a minimum frequency (often 5–10 occurrences in each corpus) rather than appearing once or twice.

How do I avoid being fooled by topic instead of style?

Decide in advance whether you are comparing content or style. To compare style, restrict the analysis to function words or part-of-speech patterns; to compare content, lemmatise and remove stopwords. Mixing the two lets a single dominant topic masquerade as a structural difference.
