Appearance
To build a historical lexicon well, seed it semi-automatically from your corpus and existing dictionaries, then hand-curate the frequent and ambiguous entries, storing one variant per row with a headword, frequency, source reference, and a date-stamped sense. The goal is not a big word list but a documented, versioned, reusable resource where every mapping can be traced to where you saw it. Editorial guidelines and schema validation, not raw size, are what make a lexicon defensible across a whole collection.
What is a historical lexicon for?
A historical lexicon links the messy variety of historical word forms to something a machine can act on: a canonical headword, a lemma, a modern equivalent, or a sense. It underpins spelling normalisation, lemmatisation, search-term expansion, and gazetteer-assisted NER. In practice a well-curated lexicon often lifts downstream accuracy more than swapping in a larger model, because it injects exactly the domain knowledge the model lacks.
How should I structure the data?
Use one variant per row, each pointing to a shared headword, with provenance and frequency. Flat, auditable, and easy to query:
csv
variant,headword,lemma,modern,first_seen,last_seen,freq,source
loue,love,love,love,1500,1650,842,EEBO-TCP
louue,love,love,love,1520,1590,17,EEBO-TCP
vpon,upon,upon,upon,1480,1700,3104,EEBO-TCP
Yorke,York,York,York,1400,1750,560,gazetteer-v2The source and frequency columns are not optional decoration — they let a reviewer judge how much weight a mapping deserves and trace it back to evidence.
Should I build it by hand or automatically?
Both, in sequence. Bootstrap coverage automatically, then curate where it matters:
python
from collections import Counter
# 1. harvest candidate variants from the corpus
freqs = Counter(tok.lower() for doc in corpus for tok in doc.split())
# 2. propose a headword via a modern wordlist + edit distance (capped!)
def propose(variant, wordlist):
cands = [w for w in wordlist if abs(len(w) - len(variant)) <= 2]
return min(cands, key=lambda w: edit_distance(variant, w), default=variant)
# 3. write high-frequency / ambiguous rows to a HUMAN review queue
review = [(v, propose(v, wordlist)) for v, c in freqs.most_common(2000)]Cap the edit distance at 1 or 2, and never auto-accept proper nouns — that is how Bath silently becomes both. The frequent and ambiguous entries go to a human; the long tail of obvious cases can be machine-confirmed.
How do I handle meaning that changed over time?
Attach a date range and sense to each entry instead of a single timeless gloss. A form can carry one sense in 1500 and another in 1700; collapsing them produces anachronistic mappings. The first_seen/last_seen columns above and a sense field let you keep period-correct entries side by side and select the right one by document date.
Which format keeps it reusable?
| Format | Best for | Trade-off |
|---|---|---|
| TEI dictionary XML | Scholarly editions, interoperability | Verbose, steeper learning curve |
| Documented CSV/TSV | Pipelines, quick reuse | Needs a separate schema doc |
| SKOS / linked data | Linking to external vocabularies | Heavier tooling |
Whatever you pick, version it (Git), document every column, and give each entry a persistent identifier so others can cite and reuse it. An undocumented spreadsheet is not a lexicon; it is a liability.
How do I keep it consistent across a collection?
Consistency comes from rules and checks, not from individual judgement. Write editorial guidelines (how to pick a headword, how to treat capitalisation, what counts as a separate sense), validate every entry against a schema, and review a sample for inter-annotator agreement. Re-run the schema check in CI so a malformed entry never reaches users.
python
REQUIRED = {"variant", "headword", "first_seen", "source"}
def validate(row):
missing = REQUIRED - row.keys()
assert not missing, f"row {row.get('variant')} missing {missing}"A working build checklist
text
1. Define scope: language, period, which fields you will record.
2. Harvest variants + frequencies from the actual corpus.
3. Auto-propose headwords (edit distance capped, proper nouns excluded).
4. Hand-curate high-frequency and ambiguous entries.
5. Add date ranges and senses; flag meaning shifts.
6. Validate against a schema; assign persistent IDs; version in Git.
7. Document fields + editorial guidelines; check IAA on a sample.Key Takeaways
- A historical lexicon maps variant forms to headwords, lemmas, senses or modern equivalents.
- Store one variant per row with frequency and a source reference for full auditability.
- Combine automatic harvesting for coverage with hand curation for the entries that matter.
- Cap edit-distance matching and exclude proper nouns to avoid silent over-correction.
- Stamp entries with date ranges and senses so meaning shifts do not cause anachronisms.
- Use TEI or a documented CSV, version it, and give each entry a persistent identifier.
- Enforce consistency with editorial guidelines, schema validation and IAA checks.
Frequently Asked Questions
What is a historical lexicon used for in NLP?
It maps historical word forms to canonical headwords, lemmas, modern equivalents or senses, and powers normalisation, lemmatisation, search expansion and gazetteer-assisted NER. A good lexicon often improves downstream accuracy more than a bigger model does.
Should I build the lexicon by hand or extract it automatically?
Use both. Seed it automatically from your corpus and existing dictionaries to get coverage, then curate the high-frequency and ambiguous entries by hand. Pure automation is noisy; pure hand-work does not scale.
How do I record spelling variants?
Store each variant as its own row pointing to a shared headword, with a frequency count and a source reference. A one-variant-per-row structure keeps the data auditable and lets you trace any mapping back to where you saw it.
How do I handle words whose meaning changed over time?
Attach a date range and sense to each entry rather than a single timeless definition. The same form can mean different things in 1500 and 1700, so a period-stamped sense field prevents anachronistic mappings.
What format should a historical lexicon use?
For interoperability use TEI dictionary encoding or at minimum a documented CSV/TSV with stable columns. Whatever you choose, version it, document every field, and assign each entry a persistent identifier so others can cite and reuse it.
How do I keep the lexicon consistent across a whole collection?
Write explicit editorial guidelines, validate entries against a schema, and review a sample for inter-annotator agreement. Consistency comes from documented rules and automated checks, not from individual judgement applied ad hoc.