Best Practices to Tokenise historical text

To tokenise historical text well, use a real tokeniser (spaCy, NLTK or a custom regex) with one documented policy for punctuation, contractions, hyphenation and archaic characters, then apply that exact policy to every document in the corpus. The single biggest risk is not which tokeniser you choose but inconsistency: when half your texts split contractions and half do not, every frequency and collocation you report is quietly wrong. Lock the policy in versioned code and run it over everything.

Tokenisation is the foundation every count sits on. Get it wrong and the error propagates into frequencies, keywords, collocations and models without ever announcing itself — which is why it deserves a deliberate, written-down policy rather than a default you never examined.

Why is whitespace splitting not enough?

text.split() looks tempting and fails in predictable ways:

python

"Liberty, equality; and th' people's right.".split()
# ['Liberty,', 'equality;', 'and', "th'", "people's", 'right.']

Now Liberty, and Liberty are different tokens, the semicolon is glued to equality, and th' is ambiguous. Counts built on this conflate punctuation with words. Whitespace splitting is acceptable for a 30-second glance and nothing you would publish.

Which tokeniser should I use?

For most historical English, spaCy or NLTK gives sane defaults you can then override:

python

import spacy
nlp = spacy.blank("en")            # blank pipeline = tokeniser only, fast
doc = nlp("th' people's right—won.")
print([t.text for t in doc])
# ["th'", 'people', "'s", 'right', '—', 'won', '.']

The point is not perfection on the first pass but a consistent, inspectable split you can adjust. When you need full control over archaic forms, a documented regex tokeniser is legitimate — just keep it in one place.

How do I handle the awkward cases in historical text?

Four cases recur and each needs an explicit decision:

Case	Example	Common policies
Archaic chars	`ſ` long-s	Normalise to `s` \| keep as-is
Contractions	`th'`, `'d`	Single token \| expand
Hyphenation	`sea-coast`	Keep \| split \| rejoin line-breaks only
Possessives	`people's`	Split `'s` \| keep whole

There is no universally right answer; there is only a written-down answer applied to every document. Record the choice in your README so a reader can reproduce your counts.

Should I normalise before or after tokenising?

Generally tokenise first on lightly cleaned text, then normalise the resulting tokens. That way your spelling-normalisation map keys on clean word units (olde to old) rather than punctuation-laden strings (olde,). Pin the order and never let it vary mid-corpus.

python

def normalise(tok: str) -> str:
    return SPELLING_MAP.get(tok.lower(), tok.lower())

tokens = [normalise(t.text) for t in nlp(text) if not t.is_space]

How do I make tokenisation reproducible across the whole collection?

Reproducibility comes from one function applied identically everywhere:

python

import spacy, json
nlp = spacy.blank("en")

def tokenise(text: str) -> list[str]:
    return [t.text for t in nlp(text) if not t.is_space]

# record the contract so others can reproduce it
config = {"library": "spacy", "version": spacy.__version__,
          "pipeline": "blank-en", "policy": "keep-contractions"}
json.dump(config, open("tokeniser_config.json", "w"), indent=2)

Pin the library version. A spaCy minor upgrade can change tokenisation subtly, and an undated config makes a divergence impossible to trace.

What is a quick quality check before I trust the output?

Spot-check the rare and the frequent:

python

from collections import Counter
counts = Counter(tokenise(open("derived/doc_0007.txt", encoding="utf-8").read()))
print(counts.most_common(10))      # are stopwords and punctuation as expected?
print([t for t in counts if len(t) == 1][:20])  # stray single chars = trouble

A flood of single-character tokens or punctuation in the top ten usually means a policy bug, not a finding.

Key Takeaways

Tokenisation underpins every count; treat it as a deliberate, documented policy.
Avoid bare whitespace splitting for anything you intend to report.
Use spaCy or NLTK defaults, then override for archaic forms.
Decide explicitly on long-s, contractions, hyphens and possessives.
Tokenise first on light-cleaned text, then normalise tokens.
Wrap the tokeniser in one versioned function and run it over every document.

Frequently Asked Questions

What is tokenisation and why does it matter for historical text?

Tokenisation splits a stream of text into the units you count — usually words and punctuation. It matters because every downstream number, from frequencies to collocations, depends on it, so an inconsistent tokeniser silently corrupts every result built on top of it.

Can I just split on whitespace?

Only for the crudest exploration. Whitespace splitting keeps trailing punctuation attached, mishandles contractions and hyphens, and treats 'word.' and 'word' as different tokens — fine for a first glance, unsafe for any published count.

How should I handle the long-s and other archaic characters?

Decide explicitly and document it: either normalise the long-s to 's' before tokenising, or preserve it as a distinct character if the orthography is part of your study. The danger is doing it inconsistently across the corpus.

What do I do with contractions and elisions like 'd or th'?

Choose one policy and apply it everywhere: either keep them as single tokens or expand them with a documented rule. Historical elisions are frequent and a silent mix of policies will distort verb and pronoun counts.

Should tokenisation come before or after spelling normalisation?

Usually tokenise first on lightly cleaned text, then normalise tokens, so the normalisation map operates on clean word units rather than punctuation-laden strings. Whatever the order, fix it and apply it identically across the whole corpus.

How do I make tokenisation reproducible across a collection?

Wrap your tokeniser and its settings in a single versioned function, record the library version, and run the same function over every document. Consistency across the collection matters more than which tokeniser you pick.

Why is whitespace splitting not enough? ​

Which tokeniser should I use? ​

How do I handle the awkward cases in historical text? ​

Should I normalise before or after tokenising? ​

How do I make tokenisation reproducible across the whole collection? ​

What is a quick quality check before I trust the output? ​

Key Takeaways ​

Frequently Asked Questions ​

What is tokenisation and why does it matter for historical text? ​

Can I just split on whitespace? ​

How should I handle the long-s and other archaic characters? ​

What do I do with contractions and elisions like 'd or th'? ​

Should tokenisation come before or after spelling normalisation? ​

How do I make tokenisation reproducible across a collection? ​

Related reading ​

Why is whitespace splitting not enough?

Which tokeniser should I use?

How do I handle the awkward cases in historical text?

Should I normalise before or after tokenising?

How do I make tokenisation reproducible across the whole collection?

What is a quick quality check before I trust the output?

Key Takeaways

Frequently Asked Questions

What is tokenisation and why does it matter for historical text?

Can I just split on whitespace?

How should I handle the long-s and other archaic characters?

What do I do with contractions and elisions like 'd or th'?

Should tokenisation come before or after spelling normalisation?

How do I make tokenisation reproducible across a collection?

Related reading