To normalise historical spelling, map each variant surface form (for example vpon, loue, ſhe) to a consistent target form using a lookup dictionary first, then a learned or rule-based fallback for words the dictionary misses. The single most important rule is to keep the original token alongside the normalised one so nothing is destroyed. Everything else is tuning.
Why normalise at all?
Historical orthography was not standardised. Early Modern English alone might spell publicly a dozen ways. Downstream tools (POS taggers, NER models, search indexes) were trained on modern spelling, so an unnormalised corpus silently degrades every step after it. In my experience, normalising lifts search recall and improves tagger accuracy by 10 to 25 percentage points on noisy Early Modern material.
But normalisation is lossy. The choice is not whether to lose information but where to record it so it is recoverable.
What should the target form be?
Decide this before you write a line of code:
- Canonical historical form: collapse loue/love to one period-appropriate spelling. Best for linguistic study.
- Modern form: map everything to today's spelling. Best for search and for feeding modern NLP models.
Pick one and document it in your project README. Mixing them is the most common cause of irreproducible pipelines.
A minimal, auditable pipeline
Start with a transparent lookup table. This is boring and it works.
```python
import csv
import json

# variant -> normalised, built from a gold list you trust
with open("norm_dict.csv", encoding="utf-8") as f:
    table = {r["variant"]: r["target"] for r in csv.DictReader(f)}

def normalise(tokens):
    out = []
    for tok in tokens:
        low = tok.lower()
        out.append({"orig": tok, "norm": table.get(low, tok)})
    return out

print(json.dumps(normalise(["Vpon", "loue", "she"]), ensure_ascii=False))
```

Note the output keeps orig and norm side by side. Never collapse them into one string.
Which tools are worth knowing?
| Tool | Approach | When it fits |
|---|---|---|
| Norma | trainable, dictionary + rules + distance | German/medieval, well-documented gold data |
| VARD 2 | interactive + batch, Early Modern English | EME plays, letters, sermons |
| cltk | classical languages | Latin, Ancient Greek |
| custom lookup | exact mapping | small, well-known variant sets |
For unseen tokens, a Levenshtein-distance fallback against a modern wordlist catches many cases, but cap the edit distance at 1 or 2. Beyond that you start "correcting" rare proper nouns into common words.
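The capped fallback described above could be sketched as follows. This is a minimal illustration, not a reference implementation: `levenshtein`, `fuzzy_fallback`, the `protected` stop-list parameter, and the tiny `WORDS` list are all hypothetical names and data chosen for the example. The sketch refuses to guess when two candidates tie at the same distance, which is one conservative way to limit over-correction.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_fallback(token, wordlist, max_dist=1, protected=frozenset()):
    """Return the unique closest wordlist entry within max_dist, else the token unchanged."""
    low = token.lower()
    # Known words and protected names (e.g. from a gazetteer) are never touched.
    if low in protected or low in wordlist:
        return token
    best, best_dist, tie = None, max_dist + 1, False
    for cand in wordlist:
        # A length gap larger than the cap already rules the candidate out.
        if abs(len(cand) - len(low)) > max_dist:
            continue
        d = levenshtein(low, cand)
        if d < best_dist:
            best, best_dist, tie = cand, d, False
        elif d == best_dist and best is not None:
            tie = True
    return best if best is not None and not tie else token


# Toy modern wordlist, assumed for illustration only.
WORDS = {"upon", "love", "both", "she"}
print(fuzzy_fallback("vpon", WORDS))                      # within distance 1 of "upon"
print(fuzzy_fallback("Bath", WORDS))                      # over-corrected to "both"
print(fuzzy_fallback("Bath", WORDS, protected={"bath"}))  # left alone
```

The second call shows why the cap alone is not enough: Bath is one substitution away from both, so a stop-list is still needed for names.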
How do I handle the long-s and ligatures?
Pre-clean glyph-level noise before spelling normalisation: replace ſ with s, expand æ/œ if your target requires it, and decide on u/v and i/j interchange (a separate, rule-based pass). Doing these as their own stage keeps the spelling dictionary small and your changes traceable.
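```python
# Per-character glyph map; "\ufb05" is the single-codepoint long-s + t ligature.
GLYPH = {"ſ": "s", "\ufb05": "st"}
raw = "ſhe loſt"  # example input
text = "".join(GLYPH.get(c, c) for c in raw)
```

What pitfalls bite people most?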
- Over-correction. A greedy fuzzy match turns the place name Bath into both. Always keep proper nouns out of the fuzzy fallback using a gazetteer stop-list.
- Case loss. Lowercasing before normalisation is fine for matching, but reapply the original casing to the surface form.
- One date for everything. Spelling conventions shift by decade; a 1590s table misfires on 1680s text. Segment your corpus by period.
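For the case-loss pitfall, one possible approach is to match on the lowercased token and then copy the broad casing pattern of the original back onto the normalised form. The helper below is a sketch under that assumption; `match_case` is a hypothetical name, and it only distinguishes all-caps, initial-cap, and everything else.

```python
def match_case(orig: str, norm: str) -> str:
    """Copy the broad casing pattern of orig onto norm."""
    # All-caps words of more than one letter stay all-caps.
    if len(orig) > 1 and orig.isupper():
        return norm.upper()
    # Initial capital is carried over; the rest of norm is left as is.
    if orig[:1].isupper():
        return norm[:1].upper() + norm[1:]
    return norm


print(match_case("Vpon", "upon"))
print(match_case("LOUE", "love"))
print(match_case("loue", "love"))
```

Mixed-case tokens beyond these three patterns fall through to the unchanged normalised form, so they are worth logging for manual review.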
How do I evaluate it?
Hand-correct 300 to 500 tokens as gold data. Report two numbers: accuracy, and tokens made worse (correct originals that normalisation broke). The second number is the one that protects your scholarly credibility.
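The two numbers can be computed from token triples of original, predicted, and gold forms. The sketch below assumes that layout; `evaluate` and the four sample rows are hypothetical and far smaller than a real gold set.

```python
def evaluate(rows):
    """rows: iterable of (orig, predicted, gold) token triples."""
    rows = list(rows)
    if not rows:
        raise ValueError("gold set is empty")
    correct = sum(pred == gold for _, pred, gold in rows)
    # "Made worse": the original was already right and normalisation broke it.
    made_worse = sum(orig == gold and pred != gold for orig, pred, gold in rows)
    return {"accuracy": correct / len(rows), "made_worse": made_worse, "n": len(rows)}


gold_rows = [
    ("vpon", "upon", "upon"),  # fixed
    ("loue", "love", "love"),  # fixed
    ("Bath", "both", "Bath"),  # over-corrected
    ("she", "she", "she"),     # untouched and correct
]
report = evaluate(gold_rows)
print(report)
```

Here three of four predictions match the gold form, and one token that was correct to begin with was damaged, which accuracy alone would not reveal.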
Key Takeaways
- Always keep the original token; normalisation must be reversible.
- Choose canonical-historical or modern target form explicitly and write it down.
- Lookup table first, fuzzy fallback second, capped at edit distance 1 to 2.
- Clean glyphs (long-s, ligatures, u/v, i/j) in a separate, earlier pass.
- Protect proper nouns from the fuzzy stage with a gazetteer stop-list.
- Segment by period; spelling conventions are not stable across a century.
- Evaluate with a gold set and report over-correction, not just accuracy.
Frequently Asked Questions
Should I normalise spelling before or after tokenisation?
Normalise after tokenisation in almost all cases. You want a token-aligned mapping so each modern form points back to its original surface form, which keeps your transformations reversible and auditable.
Does normalising spelling lose linguistic information?
Yes, which is why you should never overwrite the original text. Keep the historical form in a parallel column or annotation layer so dialectologists and editors can still study variation.
What is the difference between normalisation and modernisation?
Normalisation maps variant spellings to a single canonical historical form, while modernisation maps them to present-day spelling. Pick one explicitly; mixing them silently makes your output impossible to reproduce.
Which tool should a beginner start with?
Start with Norma or a simple lookup-table approach in Python. They are transparent and easy to audit, which matters far more than raw accuracy when you are learning the failure modes.
How do I evaluate normalisation quality?
Build a small gold-standard set of 300 to 500 hand-corrected token pairs and report accuracy plus the count of tokens you changed for the worse. Word accuracy alone hides damaging over-correction.