Best Practices for Running N-gram Analysis

Run n-gram analysis by fixing four decisions before you count a single token: the value of n, your tokenisation rules, whether spelling is normalised, and your minimum frequency threshold. Lock those in writing, apply them identically to every document, and your bigram and trigram tables will be reproducible and defensible across the whole collection rather than an accident of one afternoon's settings.

What exactly is an n-gram, and why pin down the definition?

An n-gram is a contiguous sequence of n tokens. "the king of" is a trigram; "the king" is a bigram. The word "token" is doing heavy lifting, because whether King's counts as one token or two (King + 's) changes every count downstream. For historical material the definition has to handle long-s, hyphenated line breaks and inconsistent casing, so decide once and encode the rule.
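
To see how much the token rule matters, the sketch below compares two regex tokenisers on an invented sentence. Both tokenisers, their names and the sample text are assumptions made for illustration and are not part of any pipeline described here.

```python
import re
from collections import Counter

SAMPLE = "The King's seal and the King's peace"  # invented example text

def tokenize_whole(text):
    # Rule A (assumption): an apostrophe inside a word stays part of the token.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def tokenize_split(text):
    # Rule B (assumption): the possessive 's becomes its own token.
    return re.findall(r"'[a-z]+|[a-z]+", text.lower())

def bigrams(tokens):
    # Contiguous pairs of tokens.
    return list(zip(tokens, tokens[1:]))

counts_whole = Counter(bigrams(tokenize_whole(SAMPLE)))
counts_split = Counter(bigrams(tokenize_split(SAMPLE)))

# Token totals differ, and so do the bigrams that exist at all.
print(len(tokenize_whole(SAMPLE)), len(tokenize_split(SAMPLE)))
print(sorted(set(counts_whole) - set(counts_split)))
```

Under rule A the pair (king's, seal) exists; under rule B it is replaced by (king, 's) and ('s, seal). Whichever rule you pick, the point is to pick it once and record it.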

How do I run n-gram analysis in practice?

Here is a minimal, reproducible pass with NLTK that you can lift into a notebook:

python

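from collections import Counter

from nltk import ngrams
from nltk.tokenize import word_tokenize  # requires NLTK's Punkt tokenizer data

# Read and lowercase the document; the with-block closes the file afterwards.
with open("doc.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Keep only purely alphabetic tokens.
tokens = [t for t in word_tokenize(text) if t.isalpha()]

# Count trigrams, then print the top 25, skipping any seen fewer than 3 times.
trigrams = Counter(ngrams(tokens, 3))
for gram, count in trigrams.most_common(25):
    if count >= 3:
        print(count, " ".join(gram))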

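The t.isalpha() filter removes punctuation, numbers and any token containing a non-letter. That catches much OCR debris, but it also drops hyphenated words and the 's that word_tokenize splits off, and misreadings made entirely of letters pass straight through. Because the filter runs before the n-grams are built, the words on either side of a removed token become adjacent, so some n-grams span punctuation. The count >= 3 threshold suppresses the singleton tail that otherwise floods historical lists.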

Which value of n should I choose?

n	Captures	Best for	Risk
1	single words	frequency baselines	no context
2	word pairs	collocations, names	moderate sparsity
3	short phrases	formulae, idioms	sparsity rises
4-5	fixed strings	legal boilerplate	mostly singletons

In practice bigrams and trigrams do almost all the useful work. Reserve 4- and 5-grams for spotting verbatim repetition, such as recycled charter clauses.
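
One way to check the sparsity column against your own material is to measure what share of distinct n-grams occur exactly once for each n. The sketch below does this in plain Python; the helper names and the toy text are invented for illustration, and a real run would pass in the tokens of your corpus.

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Count every contiguous run of n tokens.
    return Counter(zip(*(tokens[i:] for i in range(n))))

def singleton_share(tokens, n):
    # Fraction of distinct n-grams that occur exactly once.
    counts = ngram_counts(tokens, n)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Invented toy text standing in for a corpus.
tokens = ("know all men by these presents that i john smith "
          "know all men by these presents that i mary smith").split()

for n in range(1, 6):
    print(n, round(singleton_share(tokens, n), 2))
```

If the singleton share is already close to 1 at n=3, longer n-grams will tell you little beyond verbatim repetition.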

What about OCR noise and spelling variation?

Noisy input is the defining problem of historical n-grams. Three defences stack well: a frequency threshold removes one-off scanning errors; a non-alphabetic filter strips junk tokens that contain digits or symbols, such as k1ng; and optional spelling normalisation collapses variants so honour and honor do not split a count in two. Letter-for-letter misreadings, such as rn read as m (modem for modern), are fully alphabetic, so they pass the filter and have to be caught by the threshold or a dictionary check. Run the analysis both before and after normalisation and keep both outputs, because the unnormalised surface form is itself evidence.
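
Running the count before and after normalisation can be sketched as follows. The variant map, the function names and the token stream are all hypothetical; a real project would build or borrow a much fuller map for its period.

```python
from collections import Counter

# Hypothetical variant map (assumption), far smaller than a real one.
VARIANTS = {"honour": "honor", "publick": "public"}

def normalise(tokens, variants=VARIANTS):
    # Replace known variants and leave every other token untouched.
    return [variants.get(t, t) for t in tokens]

def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

# Invented token stream for illustration.
raw_tokens = "the publick honour and the public honor of the publick honour".split()

raw = bigram_counts(raw_tokens)                  # surface forms preserved
norm = bigram_counts(normalise(raw_tokens))      # variants collapsed

print(raw[("publick", "honour")], raw[("public", "honor")])
print(norm[("public", "honor")])
```

Keeping both counters means the split in the raw table stays available as evidence while the normalised table gives the aggregate.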

Should I weight raw counts or use statistical association?

Raw frequency answers "what is common"; association measures answer "what belongs together". For the latter, score bigrams with pointwise mutual information (PMI) or log-likelihood rather than counting alone:

python

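from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# `tokens` is the filtered token list built in the trigram example.
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)  # drop bigrams seen fewer than 5 times

# Print the 20 bigrams with the highest log-likelihood ratio.
for pair in finder.nbest(BigramAssocMeasures.likelihood_ratio, 20):
    print(pair)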

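The apply_freq_filter(5) call is essential, above all if you swap in BigramAssocMeasures.pmi: without a floor, PMI ranks pairs of rare words that happen to co-occur once above well-attested phrases, which is exactly the noise you want to exclude. The log-likelihood ratio used in the example is less prone to this, but the floor still keeps one-off OCR pairs out of the ranking.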
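
The effect of rare pairs on PMI is easy to verify by hand. The sketch below uses the standard definition, PMI = log2(P(x, y) / (P(x) P(y))), with every probability estimated as a count over the same total, which is a simplifying assumption. All counts and the corpus size are invented.

```python
import math

def pmi(pair_count, left_count, right_count, total):
    # PMI = log2( P(x, y) / (P(x) * P(y)) ), with each probability estimated
    # as a count divided by the same total (simplifying assumption).
    return math.log2(pair_count * total / (left_count * right_count))

def passes_floor(pair_count, floor=5):
    # Same idea as a frequency filter: drop pairs seen fewer than `floor` times.
    return pair_count >= floor

TOTAL = 100_000  # hypothetical corpus size in tokens

# An OCR fluke: two junk tokens that each occur once, side by side.
rare = pmi(pair_count=1, left_count=1, right_count=1, total=TOTAL)

# A genuine phrase: 50 co-occurrences of two fairly common words.
solid = pmi(pair_count=50, left_count=400, right_count=300, total=TOTAL)

print(round(rare, 2), round(solid, 2))
print(passes_floor(1), passes_floor(50))
```

The one-off pair scores far higher than the well-attested phrase, and only the frequency floor removes it from the ranking.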

How do I document the run so it is defensible?

Write a short provenance note alongside every table: corpus name and version, total token count, tokenizer and version, casing policy, normalisation flag, value of n, and the frequency threshold. When a reviewer asks why "ye olde" appears, you can point to the exact pipeline that produced it. A run you cannot reconstruct is an anecdote, not a finding.
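
A provenance note can be as simple as a small JSON record saved next to each table. The sketch below is one possible shape: the field names are illustrative rather than a standard schema, and every value is a placeholder.

```python
import json

def provenance_note(corpus, corpus_version, token_count, tokenizer,
                    tokenizer_version, casing, normalised, n, min_freq):
    # Field names are illustrative (assumption), not a standard schema.
    return {
        "corpus": corpus,
        "corpus_version": corpus_version,
        "token_count": token_count,
        "tokenizer": tokenizer,
        "tokenizer_version": tokenizer_version,
        "casing": casing,
        "normalised": normalised,
        "n": n,
        "min_freq": min_freq,
    }

# Every value below is a placeholder for illustration.
note = provenance_note(
    corpus="example-charters",
    corpus_version="v1",
    token_count=250_000,
    tokenizer="nltk.word_tokenize",
    tokenizer_version="x.y.z",
    casing="lowercased",
    normalised=False,
    n=3,
    min_freq=3,
)

serialised = json.dumps(note, indent=2, sort_keys=True)
print(serialised)
```

Writing the record from the same script that produces the table makes it harder for the two to drift apart.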

Key Takeaways

Decide n, tokenisation, normalisation and the frequency threshold before counting, and apply them uniformly.
Bigrams and trigrams carry most of the signal; treat 4- and 5-grams as repetition detectors.
A minimum frequency of 3-5 plus an alphabetic-token filter removes much OCR noise; misreadings made only of letters need a dictionary check as well.
Keep both raw and normalised outputs, since the historical surface form is evidence.
Use PMI or log-likelihood with a frequency floor when you want association, not just counts.
Record the full provenance of every table so results can be reconstructed and defended.

Frequently Asked Questions

What value of n should I use for historical corpora?

Bigrams and trigrams (n=2, n=3) carry most of the signal for collocation and phrase work. Go to 4-grams or 5-grams only for fixed formulae like legal boilerplate, and expect data sparsity to bite quickly above n=3.

Should I count n-grams before or after normalising spelling?

Run both. Raw counts preserve the historical surface form you may want to study, while normalised counts let you aggregate variants like 'publick' and 'public'. Always record which version produced which table.

How do I stop OCR garbage from dominating my n-gram lists?

Apply a minimum frequency threshold (often 3-5), filter tokens that contain non-alphabetic characters, and drop n-grams whose component tokens fall below a confidence or dictionary check. Inspect the long tail manually before publishing.
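
The three filters can be combined in one pass over a table of counts. This is a minimal sketch: the lexicon is a tiny stand-in, the function name is hypothetical, and the counts are invented. A real check would load a period-appropriate word list.

```python
from collections import Counter

# Tiny stand-in lexicon (assumption).
LEXICON = {"the", "king", "of", "england", "seal"}

def clean_ngrams(counts, min_freq=3, lexicon=LEXICON):
    # Keep n-grams that meet the frequency floor and whose tokens are all
    # alphabetic and present in the lexicon.
    return Counter({
        gram: c for gram, c in counts.items()
        if c >= min_freq and all(t.isalpha() and t in lexicon for t in gram)
    })

# Invented counts for illustration.
counts = Counter({
    ("the", "king", "of"): 7,
    ("king", "of", "england"): 5,
    ("tbe", "king", "of"): 4,   # alphabetic misreading, caught only by the lexicon
    ("th3", "king", "of"): 3,   # digit inside a token
    ("of", "the", "seal"): 1,   # below the frequency floor
})

kept = clean_ngrams(counts)
rejected = sorted(g for g in counts if g not in kept)

print(sorted(kept))
print(rejected)  # the list to inspect by hand before publishing
```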

Do I need to remove stop words before counting n-grams?

Not always. Stop words are part of genuine phrases ('king of England'), so removing them distorts collocations. Keep them for phrase discovery and remove them only for topic-oriented keyword work, documenting the choice.
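
The distortion is easy to reproduce. In the sketch below, the stop list and the sentence are invented; removing stop words before pairing creates bigrams that never stood side by side in the text.

```python
STOP_WORDS = {"of", "the", "and"}  # tiny illustrative list (assumption)

def bigrams(tokens):
    # Contiguous pairs of tokens.
    return list(zip(tokens, tokens[1:]))

# Invented sentence for illustration.
tokens = "the king of england and the king of france".split()

with_stops = bigrams(tokens)
without_stops = bigrams([t for t in tokens if t not in STOP_WORDS])

print(("king", "of") in with_stops)
print(without_stops)
```

After filtering, (england, king) appears as a pair even though the two words are separated by "and the" in the source, while the genuine phrase "king of" is gone.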

What is the difference between an n-gram and a collocation?

An n-gram is simply n adjacent tokens counted by position. A collocation is an n-gram whose co-occurrence is statistically stronger than chance, measured with scores like PMI or log-likelihood.

How large a corpus do I need for trigram analysis?

There is no hard floor, but below roughly 100,000 tokens trigram counts become noisy and most appear only once. Report your token count alongside any frequency table so readers can judge reliability.
