To adapt spaCy to historical text, customise the tokenizer for period punctuation and abbreviations, insert a normalisation step before the tagger, disable components you do not need, and pin every version so the pipeline is reproducible across an entire collection. spaCy is highly adaptable, but the defaults assume modern, clean prose — and historical sources are neither.
Why not just run the default English pipeline?
Because the default tokenizer mis-splits early-modern contractions, the tagger has never seen thou hast, and the NER model invents entities from OCR noise. Each wrong token propagates: a bad split breaks the tag, the tag breaks the parse. Adapting spaCy is mostly about controlling that cascade at the earliest possible stage.
Where should normalisation live in the pipeline?
Put it before tagging, as a custom component, and store both forms on the token.
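```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

# Custom attribute that holds the modernised spelling of each token.
Token.set_extension("norm", default="", force=True)

# Toy lookup table for illustration.
NORM = {"vpon": "upon", "loue": "love", "hath": "has"}

@Language.component("historical_norm")
def historical_norm(doc):
    for t in doc:
        # Fall back to the original spelling when no mapping exists.
        t._.norm = NORM.get(t.text.lower(), t.text)
    return doc

nlp = spacy.load("en_core_web_sm")
# first=True places the component ahead of the tagger and everything else.
nlp.add_pipe("historical_norm", first=True)
```

Downstream components can read `token._.norm` while the original `token.text` stays intact for display and citation.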
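One caveat is worth a sketch. As far as the shipped pipelines go, the pretrained tagger and parser do not consult custom extensions, so `token._.norm` on its own mainly helps code you write yourself. The variant below also writes the modern form into the built-in `token.norm_` attribute when a mapping exists. It assumes the model you load actually uses `NORM` as an embedding feature, which is worth confirming in that model's `config.cfg`. The component name `historical_norm_builtin` and the blank pipeline are illustrative choices, not part of the setup above.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

Token.set_extension("norm", default="", force=True)

# Toy mapping, same idea as the NORM table above.
NORM = {"vpon": "upon", "loue": "love", "hath": "has"}

@Language.component("historical_norm_builtin")  # hypothetical name
def historical_norm_builtin(doc):
    for t in doc:
        modern = NORM.get(t.text.lower())
        # Extension attribute: always filled, for your own downstream code.
        t._.norm = modern if modern is not None else t.text
        # Built-in attribute: only overwritten when a mapping exists,
        # so spaCy's default norm survives for every other token.
        if modern is not None:
            t.norm_ = modern
    return doc

# A blank pipeline is enough to see the effect; no trained model is needed.
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("historical_norm_builtin", first=True)

doc = nlp_demo("He hath come vpon the town")
for t in doc:
    print(t.text, t._.norm, t.norm_)
```

Whether the built-in attribute changes tagging quality on your material is an empirical question, so compare both variants on a hand-checked sample before committing to one.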
How do I fix the tokenizer?
Add special cases and adjust the infix/suffix rules so abbreviation marks and old punctuation behave.
```python
from spacy.symbols import ORTH

# Keep scribal abbreviations together as single tokens.
nlp.tokenizer.add_special_case("y^e", [{ORTH: "y^e"}])  # = "the"
nlp.tokenizer.add_special_case("&c.", [{ORTH: "&c."}])  # = "et cetera"
```

Test the tokenizer in isolation on a sample page before you trust the rest of the pipeline. A five-minute check here saves hours of confusing downstream errors.
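Special cases cover fixed strings; punctuation that attaches to arbitrary words needs a rule instead. As a minimal sketch, assume your sources use the virgule (`/`) as a comma, as many early printed books do, and that you want it split off the preceding word. The sample sentence is invented, and a blank English pipeline stands in for the real one so the tokenizer can be checked on its own.

```python
import spacy
from spacy.symbols import ORTH
from spacy.util import compile_suffix_regex

# Tokenizer only: a blank pipeline has no trained components.
nlp_tok = spacy.blank("en")

nlp_tok.tokenizer.add_special_case("&c.", [{ORTH: "&c."}])

# Extend the default suffix patterns with a trailing virgule.
suffixes = list(nlp_tok.Defaults.suffixes) + [r"/"]
nlp_tok.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

# Invented sample line; replace it with text from a real page.
sample = "the kynge/ and the quene &c."
tokens = [t.text for t in nlp_tok.tokenizer(sample)]
print(tokens)

# Cheap regression checks to rerun whenever the rules change.
assert "kynge/" not in tokens
assert "&c." in tokens
```

Infix rules follow the same pattern with `compile_infix_regex` and `tokenizer.infix_finditer`.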
Which components should I disable?
Most historical projects do not need every default component.
| Task | Keep | Disable |
|---|---|---|
| Word frequencies | tokenizer | tagger, parser, NER |
| POS analysis | tagger | parser, NER |
| Entity extraction | tagger, NER | parser |
| Full syntax | tagger, parser | NER (unless needed) |
```python
import spacy
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])
```

Disabling unused pipes can cut runtime by a factor of 3 to 5 on a large corpus, which matters when you process tens of thousands of pages.
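If one script needs different subsets at different points, `nlp.select_pipes` can switch components off temporarily instead of loading several pipelines. A sketch under simple assumptions: a blank pipeline with the built-in `sentencizer` stands in for a trained model, so that component name is a placeholder for `parser`, `ner` and so on.

```python
import spacy

nlp_sel = spacy.blank("en")
nlp_sel.add_pipe("sentencizer")

text = "Thou hast seen it. I have not."

# Inside the block the component is skipped; it is restored on exit.
with nlp_sel.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp_sel.pipe_names)
    doc_fast = nlp_sel(text)

active_after = list(nlp_sel.pipe_names)
doc_full = nlp_sel(text)

print(active_inside, doc_fast.has_annotation("SENT_START"))
print(active_after, doc_full.has_annotation("SENT_START"))
```

Unlike `exclude`, which never loads the component, this keeps it in memory, so it suits mixed workloads more than memory-constrained batch jobs.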
How do I handle classical languages?
For Latin and Ancient Greek, use CLTK's spaCy-compatible models rather than forcing an English pipeline. They ship with lemmatisers and treebank-trained taggers tuned to those languages.
What does a reproducibility checklist look like?
- Pin the spaCy version (`spacy==3.7.x`) in `requirements.txt`.
- Pin the model version, not just the name.
- Commit `config.cfg` if you trained or fine-tuned anything.
- Version-control the normalisation dictionary.
- Record the exact command and a seed for any training run.
Freeze these together. A pipeline that gives different output next year is not a finding, it is an anecdote.
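One way to freeze these together is a small manifest written next to each batch of output. The sketch below uses only the standard library; the function names, the manifest layout and the default package list are assumptions for illustration, and a package that is not installed is simply recorded as `None`.

```python
import hashlib
import json
from importlib import metadata

def dict_fingerprint(mapping):
    """SHA-256 of the mapping serialised with sorted keys."""
    canonical = json.dumps(mapping, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def installed_version(package):
    """Installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def build_manifest(norm_table, packages=("spacy", "en_core_web_sm"), seed=0):
    """Collect everything needed to rerun the pipeline identically."""
    return {
        "packages": {name: installed_version(name) for name in packages},
        "norm_sha256": dict_fingerprint(norm_table),
        "seed": seed,
    }

manifest = build_manifest({"vpon": "upon", "loue": "love", "hath": "has"})
print(json.dumps(manifest, indent=2))
```

Because the keys are sorted before hashing, reordering entries in the dictionary file does not change the fingerprint, while any edited mapping does.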
How do I validate the adapted pipeline?
Run it on a 200-token gold page you tagged by hand and diff the output. Track three metrics over time: tokenisation accuracy, tag accuracy, and entity precision. If a metric drops after a change, you know exactly which stage regressed.
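The three metrics can be computed without spaCy once gold and predicted annotations are reduced to character offsets. The sketch below is one possible definition, not a standard: tokenisation accuracy is the share of gold token spans the pipeline reproduced exactly, tag accuracy is measured only on those shared spans, and entity precision requires an exact span and label match. The `Tok` type and the sample annotations are invented for illustration.

```python
from typing import NamedTuple

class Tok(NamedTuple):
    start: int  # character offset where the token begins
    end: int    # character offset just past the token
    tag: str

def stage_metrics(gold_toks, pred_toks, gold_ents, pred_ents):
    """Entities are (start, end, label) tuples."""
    gold = {(t.start, t.end): t.tag for t in gold_toks}
    pred = {(t.start, t.end): t.tag for t in pred_toks}
    shared = gold.keys() & pred.keys()  # spans tokenised identically
    correct_tags = sum(gold[s] == pred[s] for s in shared)
    pred_set = set(pred_ents)
    correct_ents = pred_set & set(gold_ents)
    return {
        "token_acc": len(shared) / len(gold) if gold else 0.0,
        "tag_acc": correct_tags / len(shared) if shared else 0.0,
        "ent_precision": len(correct_ents) / len(pred_set) if pred_set else 0.0,
    }

# Invented example for the text "Thou hast seen London".
gold_toks = [Tok(0, 4, "PRON"), Tok(5, 9, "AUX"),
             Tok(10, 14, "VERB"), Tok(15, 21, "PROPN")]
# The prediction mis-tags "hast" and wrongly splits "London" in two.
pred_toks = [Tok(0, 4, "PRON"), Tok(5, 9, "NOUN"), Tok(10, 14, "VERB"),
             Tok(15, 18, "PROPN"), Tok(18, 21, "PROPN")]
gold_ents = [(15, 21, "GPE")]
pred_ents = [(15, 21, "GPE"), (0, 4, "PERSON")]

scores = stage_metrics(gold_toks, pred_toks, gold_ents, pred_ents)
print(scores)
```

Keeping the three numbers separate is what lets a drop point at a single stage: a fall in `token_acc` alone implicates the tokenizer rules, while a fall in `tag_acc` with stable `token_acc` points at normalisation or the tagger.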
Key Takeaways
- Fix the tokenizer first; bad splits poison every later component.
- Add normalisation as a custom component before the tagger, keeping both forms.
- Disable parser/NER when unused to cut runtime by a factor of 3 to 5.
- Use CLTK spaCy models for Latin and Greek rather than English pipelines.
- Pin spaCy version, model version, config, and normalisation dictionary together.
- Validate on a hand-tagged gold page and track per-stage metrics over time.
- Keep `token.text` for citation and `token._.norm` for processing.
Frequently Asked Questions
Should I write a custom spaCy tokenizer for historical text?
Often yes. Historical text uses different punctuation, abbreviation marks and word-break conventions, so adding special cases and a custom infix pattern to the tokenizer prevents errors that cascade through the whole pipeline.
Can I disable spaCy components I do not need?
Yes, and you should. Use `nlp.select_pipes` or load with `exclude` to skip the parser or NER when you only need tagging; it speeds up processing several-fold on large collections.
How do I add a normalisation step inside a spaCy pipeline?
Add a custom component before the tagger that sets a token extension such as `token._.norm`, or pre-normalise text before it enters the pipe. Keep the original text on the `Doc` so nothing is lost.
Does spaCy have historical language models?
Not officially for most periods, but the community trains them and CLTK integrates with spaCy for classical languages. For most projects you fine-tune a modern model on a historical treebank.
How do I keep results reproducible across a collection?
Pin the spaCy version and model version, record your `config.cfg`, and store the normalisation dictionary in version control. Reproducibility in DH depends on freezing all of these together.