Skip to content
NLP for Historical Text

To build a historical lexicon well, seed it semi-automatically from your corpus and existing dictionaries, then hand-curate the frequent and ambiguous entries, storing one variant per row with a headword, frequency, source reference, and a date-stamped sense. The goal is not a big word list but a documented, versioned, reusable resource where every mapping can be traced to where you saw it. Editorial guidelines and schema validation, not raw size, are what make a lexicon defensible across a whole collection.

What is a historical lexicon for?

A historical lexicon links the messy variety of historical word forms to something a machine can act on: a canonical headword, a lemma, a modern equivalent, or a sense. It underpins spelling normalisation, lemmatisation, search-term expansion, and gazetteer-assisted NER. In practice a well-curated lexicon often lifts downstream accuracy more than swapping in a larger model, because it injects exactly the domain knowledge the model lacks.

How should I structure the data?

Use one variant per row, each pointing to a shared headword, with provenance and frequency. Flat, auditable, and easy to query:

csv
variant,headword,lemma,modern,first_seen,last_seen,freq,source
loue,love,love,love,1500,1650,842,EEBO-TCP
louue,love,love,love,1520,1590,17,EEBO-TCP
vpon,upon,upon,upon,1480,1700,3104,EEBO-TCP
Yorke,York,York,York,1400,1750,560,gazetteer-v2

The source and frequency columns are not optional decoration — they let a reviewer judge how much weight a mapping deserves and trace it back to evidence.

Should I build it by hand or automatically?

Both, in sequence. Bootstrap coverage automatically, then curate where it matters:

python
from collections import Counter

# 1. harvest candidate variants from the corpus
freqs = Counter(tok.lower() for doc in corpus for tok in doc.split())

# 2. propose a headword via a modern wordlist + edit distance (capped!)
def propose(variant, wordlist):
    cands = [w for w in wordlist if abs(len(w) - len(variant)) <= 2]
    return min(cands, key=lambda w: edit_distance(variant, w), default=variant)

# 3. write high-frequency / ambiguous rows to a HUMAN review queue
review = [(v, propose(v, wordlist)) for v, c in freqs.most_common(2000)]

Cap the edit distance at 1 or 2, and never auto-accept proper nouns — that is how Bath silently becomes both. The frequent and ambiguous entries go to a human; the long tail of obvious cases can be machine-confirmed.

How do I handle meaning that changed over time?

Attach a date range and sense to each entry instead of a single timeless gloss. A form can carry one sense in 1500 and another in 1700; collapsing them produces anachronistic mappings. The first_seen/last_seen columns above and a sense field let you keep period-correct entries side by side and select the right one by document date.

Which format keeps it reusable?

FormatBest forTrade-off
TEI dictionary XMLScholarly editions, interoperabilityVerbose, steeper learning curve
Documented CSV/TSVPipelines, quick reuseNeeds a separate schema doc
SKOS / linked dataLinking to external vocabulariesHeavier tooling

Whatever you pick, version it (Git), document every column, and give each entry a persistent identifier so others can cite and reuse it. An undocumented spreadsheet is not a lexicon; it is a liability.

How do I keep it consistent across a collection?

Consistency comes from rules and checks, not from individual judgement. Write editorial guidelines (how to pick a headword, how to treat capitalisation, what counts as a separate sense), validate every entry against a schema, and review a sample for inter-annotator agreement. Re-run the schema check in CI so a malformed entry never reaches users.

python
REQUIRED = {"variant", "headword", "first_seen", "source"}
def validate(row):
    missing = REQUIRED - row.keys()
    assert not missing, f"row {row.get('variant')} missing {missing}"

A working build checklist

text
1. Define scope: language, period, which fields you will record.
2. Harvest variants + frequencies from the actual corpus.
3. Auto-propose headwords (edit distance capped, proper nouns excluded).
4. Hand-curate high-frequency and ambiguous entries.
5. Add date ranges and senses; flag meaning shifts.
6. Validate against a schema; assign persistent IDs; version in Git.
7. Document fields + editorial guidelines; check IAA on a sample.

Key Takeaways

  • A historical lexicon maps variant forms to headwords, lemmas, senses or modern equivalents.
  • Store one variant per row with frequency and a source reference for full auditability.
  • Combine automatic harvesting for coverage with hand curation for the entries that matter.
  • Cap edit-distance matching and exclude proper nouns to avoid silent over-correction.
  • Stamp entries with date ranges and senses so meaning shifts do not cause anachronisms.
  • Use TEI or a documented CSV, version it, and give each entry a persistent identifier.
  • Enforce consistency with editorial guidelines, schema validation and IAA checks.

Frequently Asked Questions

What is a historical lexicon used for in NLP?

It maps historical word forms to canonical headwords, lemmas, modern equivalents or senses, and powers normalisation, lemmatisation, search expansion and gazetteer-assisted NER. A good lexicon often improves downstream accuracy more than a bigger model does.

Should I build the lexicon by hand or extract it automatically?

Use both. Seed it automatically from your corpus and existing dictionaries to get coverage, then curate the high-frequency and ambiguous entries by hand. Pure automation is noisy; pure hand-work does not scale.

How do I record spelling variants?

Store each variant as its own row pointing to a shared headword, with a frequency count and a source reference. A one-variant-per-row structure keeps the data auditable and lets you trace any mapping back to where you saw it.

How do I handle words whose meaning changed over time?

Attach a date range and sense to each entry rather than a single timeless definition. The same form can mean different things in 1500 and 1700, so a period-stamped sense field prevents anachronistic mappings.

What format should a historical lexicon use?

For interoperability use TEI dictionary encoding or at minimum a documented CSV/TSV with stable columns. Whatever you choose, version it, document every field, and assign each entry a persistent identifier so others can cite and reuse it.

How do I keep the lexicon consistent across a whole collection?

Write explicit editorial guidelines, validate entries against a schema, and review a sample for inter-annotator agreement. Consistency comes from documented rules and automated checks, not from individual judgement applied ad hoc.