Text Mining & Corpora

To build a frequency list, tokenise your corpus, count distinct word forms, and record both raw counts and relative frequency (count per total tokens, usually scaled per million words). To compare lists fairly across corpora of different sizes, normalise both to relative frequency, align on shared forms, and rank the differences with log-likelihood rather than eyeballing counts. The recurring mistake is comparing raw counts across unequal corpora — always normalise first.

A frequency list is the simplest summary of a corpus and the foundation for keywords, collocations and trend analysis. Its power is in comparison: one list describes a corpus, but the difference between two lists is where the historical questions live.

How do I build a frequency list?

Count tokens, then store counts and relative frequencies together:

```python
from collections import Counter

# Whitespace tokenisation: punctuation and capitalisation stay attached to each token
with open("derived/corpus_a.txt", encoding="utf-8") as f:
    tokens = f.read().split()
total = len(tokens)
counts = Counter(tokens)

# raw count and relative frequency per million words (pmw)
for word, n in counts.most_common(10):
    rel = n / total * 1_000_000
    print(f"{word:12} {n:6}  {rel:9.1f} pmw")
```

Keep the full list, including function words, as your master. Filtering and stopword removal are per-analysis decisions you make on copies, not on the canonical list.
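To keep raw counts and relative frequencies together on disk, the master list can be written out as a tab-separated file. This is a minimal sketch: the function name `write_master_list`, the column names and the demo file name are illustrative assumptions, and the toy sentence stands in for a real corpus.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def write_master_list(counts, path):
    """Write every word form with its raw count and per-million frequency."""
    total = sum(counts.values())
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["word", "raw_count", "per_million"])
        # most_common() with no argument returns all entries, highest count first
        for word, n in counts.most_common():
            writer.writerow([word, n, round(n / total * 1_000_000, 1)])
    return total


# Toy counts standing in for a real corpus (hypothetical data)
demo = Counter("the cat sat on the mat the end".split())
out_path = Path(tempfile.gettempdir()) / "master_demo.tsv"
total = write_master_list(demo, out_path)
```

Nothing is filtered here, so function words stay in the file. Any later analysis can load it and work on a copy.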

Why must I use relative frequency to compare?

Because raw counts confound usage with size. If corpus A has 2 million words and corpus B has 200,000, A will show higher raw counts for nearly everything — telling you only that A is bigger. Relative frequency removes that:

Word      Corpus A raw   Corpus B raw   A per-million   B per-million
liberty          1,200            180             600             900
trade              900             60             450             300

On raw counts A looks ahead on both words; normalised, B actually uses 'liberty' more intensively (900 against 600 per million). Only the per-million columns support a fair claim.
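The per-million figures in the table can be reproduced with a few lines of arithmetic. The sketch below uses the corpus sizes and raw counts given above; the helper name `per_million` is an assumption made for illustration.

```python
def per_million(count, total_tokens):
    """Relative frequency scaled to occurrences per million tokens."""
    return count * 1_000_000 / total_tokens


SIZE_A, SIZE_B = 2_000_000, 200_000  # corpus sizes from the example above
raw = {"liberty": (1200, 180), "trade": (900, 60)}  # (count in A, count in B)

for word, (a, b) in raw.items():
    print(f"{word:8} A={per_million(a, SIZE_A):6.1f}  B={per_million(b, SIZE_B):6.1f}")
```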

How do I compare two lists statistically?

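Eyeballing relative frequencies cannot tell you which differences are backed by enough evidence: a gap that rests on a handful of occurrences may be nothing more than chance. Log-likelihood weighs the size of each difference against the counts and corpus sizes behind it: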

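```python
import math

def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood score for one word; a, b are raw counts, totals are corpus sizes in tokens."""
    # expected counts if the word were used at the same rate in both corpora
    e_a = total_a * (a + b) / (total_a + total_b)
    e_b = total_b * (a + b) / (total_a + total_b)
    ll = 0.0
    # a zero count contributes nothing, and skipping it avoids log(0)
    if a:
        ll += a * math.log(a / e_a)
    if b:
        ll += b * math.log(b / e_b)
    return 2 * ll

# "liberty": 1,200 in corpus A (2,000,000 tokens), 180 in corpus B (200,000 tokens)
print(round(log_likelihood(1200, 180, 2_000_000, 200_000), 1))
```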

A higher score means the word's frequency differs more than chance would explain. The score is unsigned, so check the relative frequencies to see which corpus uses the word more. Rank every shared word this way to find what genuinely distinguishes the two corpora.
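Ranking every shared word might look like the sketch below. It repeats the log-likelihood calculation so that it runs on its own. The name `rank_differences` and the toy counts are assumptions for illustration: "the" is filler that pads each corpus to the sizes used in the table, not real data.

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    e_a = total_a * (a + b) / (total_a + total_b)
    e_b = total_b * (a + b) / (total_a + total_b)
    ll = 0.0
    if a:
        ll += a * math.log(a / e_a)
    if b:
        ll += b * math.log(b / e_b)
    return 2 * ll


def rank_differences(counts_a, counts_b):
    """Return (word, score, overused_in) tuples, largest score first."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    rows = []
    for word in counts_a.keys() & counts_b.keys():  # shared forms only
        a, b = counts_a[word], counts_b[word]
        score = log_likelihood(a, b, total_a, total_b)
        # the score is unsigned, so record which corpus has the higher rate
        side = "A" if a / total_a > b / total_b else "B"
        rows.append((word, score, side))
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Hypothetical counts: "the" pads the totals to 2,000,000 and 200,000 tokens
counts_a = Counter({"liberty": 1200, "trade": 900, "the": 1_997_900})
counts_b = Counter({"liberty": 180, "trade": 60, "the": 199_760})
ranked = rank_differences(counts_a, counts_b)
for word, score, side in ranked:
    print(f"{word:10} {score:8.2f}  higher in {side}")
```

Recording the side next to the score saves a second lookup when reading the ranking, since the statistic alone does not say which corpus overuses the word.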

Should I remove stopwords first?

Not from the master list. Function words sit at the top of every frequency list and look like noise, but they are signal for stylometry, authorship and some discourse analysis. Build the full list, then filter only when a specific method calls for it:

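```python
# stopword file: whitespace-separated entries
with open("stopwords_en.txt", encoding="utf-8") as f:
    STOP = set(f.read().split())
# filtered copy; the master `counts` is left untouched
content = Counter({w: n for w, n in counts.items() if w not in STOP})
```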

Removing them early is a one-way decision that quietly forecloses analyses you might later want.

How do inflected and archaic forms affect the counts?

A plain frequency list counts surface forms. 'run', 'runs' and 'ran' are three entries; 'olde' and 'old' never merge. If your question is about a word's underlying use, lemmatise or normalise spelling before counting:

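```python
import spacy
nlp = spacy.load("en_core_web_sm")
# nlp() rejects texts longer than nlp.max_length (1,000,000 characters by default),
# so feed the corpus in chunks; chunk edges may split a sentence
chunks = (" ".join(tokens[i:i + 10_000]) for i in range(0, len(tokens), 10_000))
lemmas = [t.lemma_ for doc in nlp.pipe(chunks) for t in doc if t.is_alpha]
lemma_counts = Counter(lemmas)
```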

Decide consciously: surface-form counts answer "which spellings appear", lemma counts answer "which words appear". Report which you used.
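Spelling normalisation can be as simple as a lookup table applied before counting. The sketch below assumes a hand-built variant map; the `VARIANTS` entries, the `normalise` helper and the sample sentence are all hypothetical, and a real map would come from your own sources.

```python
from collections import Counter

# Hypothetical variant map: archaic spelling -> modern form
VARIANTS = {"olde": "old", "shoppe": "shop", "publick": "public"}


def normalise(token):
    """Lowercase a token and map known variant spellings to a modern form."""
    token = token.lower()
    return VARIANTS.get(token, token)


tokens = "The olde shoppe and the old shop served the publick".split()
surface_counts = Counter(t.lower() for t in tokens)        # which spellings appear
normalised_counts = Counter(normalise(t) for t in tokens)  # which words appear

print(surface_counts["old"], normalised_counts["old"])
```

Keeping both counters makes it easy to report which one a given figure came from.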

What should I check before trusting a comparison?

Three quick checks catch most errors:

  • Same tokenisation on both corpora — a different split makes the comparison meaningless.
  • Both normalised to relative frequency before any claim.
  • Top differences traced to concordance lines, so the statistic reflects real usage, not OCR noise.

A word topping your log-likelihood ranking because it is an OCR artefact in one corpus is a data-quality finding, not a historical one.
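The first and third checks can be sketched together: one tokeniser shared by both corpora, and a small keyword-in-context helper for reading the lines behind a score. The names `tokenise` and `concordance`, the regular expression and the sample sentence are assumptions for illustration. The pattern keeps runs of letters only, so it splits on apostrophes and hyphens.

```python
import re


def tokenise(text):
    """One tokeniser for every corpus: lowercase runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def concordance(tokens, target, width=4):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical sample text
sample = "Liberty of trade was praised; yet the liberty of the press was feared."
hits = concordance(tokenise(sample), "liberty", width=3)
for line in hits:
    print(line)
```

Reading even a dozen such lines for each top-ranked word is usually enough to tell real usage from scanning errors.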

Key Takeaways

  • A frequency list ranks word forms by count; keep raw counts and relative frequency together.
  • Always normalise to relative frequency before comparing corpora of different sizes.
  • Use log-likelihood to rank which differences are too large to put down to chance, given the counts and corpus sizes.
  • Keep a full master list; remove stopwords only per analysis, on copies.
  • Plain lists count surface forms — lemmatise or normalise if you need underlying words.
  • Verify identical tokenisation and trace top differences to real lines before claiming them.

Frequently Asked Questions

What is a frequency list?

A frequency list ranks every distinct word in a corpus by how often it occurs, usually as both a raw count and a relative frequency per thousand or million words. It is the most basic and most reusable summary of a text collection.

Why use relative frequency instead of raw counts?

Because raw counts let larger texts or corpora dominate any comparison. Relative frequency — count divided by total tokens, often scaled to per-million-words — puts collections of different sizes on the same footing so the comparison is fair.

How do I compare two frequency lists fairly?

Normalise both to relative frequency, align them on shared word forms, and use a statistic such as log-likelihood to rank differences rather than eyeballing raw counts. That tells you which differences are too large to put down to chance, given the counts and corpus sizes.

Should I remove stopwords before building a frequency list?

Keep a full list first, then filter for specific analyses. Function words dominate the top of every list but are exactly what stylometry and some discourse studies rely on, so removing them too early discards signal you may need.

How do frequency lists handle inflected or archaic forms?

A plain frequency list counts surface forms, so 'run', 'runs' and 'ran' are separate entries and 'olde' and 'old' do not merge. Lemmatise or normalise spelling first if you want counts grouped by underlying word.

What is the most common error when comparing corpora?

Comparing raw counts across corpora of different sizes, which makes the bigger corpus look like it uses every word more. Always normalise to relative frequency before comparing, and report the measure you used.