To parse historical syntax, normalise and POS-tag the text first, choose a Universal Dependencies parser, and — wherever a historical treebank exists — train or fine-tune on it rather than relying on a modern-only model. Dependency parsing assigns each word a grammatical head and relation, but historical word order and case marking trip up modern parsers, so adaptation is the difference between a usable parse and a misleading one.
What is dependency parsing, briefly?
A dependency parse links each word to its syntactic head with a labelled relation: subject, object, modifier, and so on. For historians it powers questions like "what verbs took this noun as object" or "how did relative-clause structure change over time". The output is a tree per sentence, usually in CoNLL-U format.
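To make the idea concrete, here is a minimal sketch in plain Python that stores one parsed sentence as head-and-relation records and answers the "what verbs took this noun as object" question. The `Token` class, the `verbs_governing_object` helper, and the toy sentence are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    id: int      # 1-based position in the sentence
    form: str    # surface form
    lemma: str
    upos: str    # universal POS tag
    head: int    # id of the syntactic head; 0 marks the root
    deprel: str  # UD relation label


# Hand-built toy analysis of "puella rosam amat" ("the girl loves the rose").
SENTENCE = [
    Token(1, "puella", "puella", "NOUN", 3, "nsubj"),
    Token(2, "rosam", "rosa", "NOUN", 3, "obj"),
    Token(3, "amat", "amo", "VERB", 0, "root"),
]


def verbs_governing_object(sentence, noun_lemma):
    """Return the lemmas of verbs whose direct object has the given lemma."""
    by_id = {tok.id: tok for tok in sentence}
    verbs = []
    for tok in sentence:
        if tok.lemma == noun_lemma and tok.deprel == "obj":
            head = by_id.get(tok.head)
            if head is not None and head.upos == "VERB":
                verbs.append(head.lemma)
    return verbs


print(verbs_governing_object(SENTENCE, "rosa"))  # ['amo']
```

Run over a whole corpus, the same lookup yields the verb inventory for a noun, which is only as reliable as the `obj` labels the parser produced.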
Why is historical syntax hard to parse?
Modern parsers are trained on modern word order. Historical languages differ in ways the training data never shows:
- Verb-second / verb-final order in older Germanic and Latin.
- Free word order enabled by case marking.
- Discontinuous constituents (hyperbaton) common in Latin and Greek verse.
- OCR and spelling noise layered on top.
A parser trained only on modern text handles broad structure tolerably but mislabels exactly the constructions a syntactician cares about, so never trust fine-grained relations without checking.
Steps 1 to 3: prepare the input
Parsing inherits every upstream error, so prepare carefully.
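```python
import spacy

# "la_core_web_lg" is not bundled with spaCy and must be installed separately;
# swap in a fine-tuned historical model if you have one.
nlp = spacy.load("la_core_web_lg")

def normalise_and_tag(raw: str) -> str:
    """Placeholder for your own normalisation + tagging step (here: trim only)."""
    return raw.strip()

raw = "Gallia est omnis divisa in partes tres."
doc = nlp(normalise_and_tag(raw))
for tok in doc:
    # token, dependency relation, head token
    print(tok.text, tok.dep_, tok.head.text)
```

- Normalise spelling so tokens are recognisable.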
- Tokenise with period-aware rules.
- POS tag (and lemmatise) — parsers use tags as features, so weak tags mean weak parses.
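The first two steps can be sketched with the standard library alone. The character rules below (long s, ligatures, i/j) and the sample line are illustrative assumptions for early printed Latin; a real project needs rules drawn from its own period and sources, and the function names are hypothetical.

```python
import re
import unicodedata

# Illustrative character-level rules; replace with period-specific ones.
CHAR_MAP = str.maketrans({"ſ": "s", "æ": "ae", "œ": "oe", "j": "i", "J": "I"})


def normalise(raw: str) -> str:
    """Fold a few early-print spellings, rejoin hyphenated words, tidy whitespace."""
    text = unicodedata.normalize("NFC", raw)
    text = text.translate(CHAR_MAP)
    text = re.sub(r"-\s*\n\s*", "", text)  # rejoin words split at line ends
    return re.sub(r"\s+", " ", text).strip()


def tokenise(text: str) -> list[str]:
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Gallia eſt omnis divi-\nſa in partes tres, quarum unam incolunt Belgæ."
print(tokenise(normalise(sample)))
```

Keeping normalisation as a separate, pure function makes it easy to log the original form next to the normalised one, so a suspicious parse can be traced back to the source spelling.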
Which parser and scheme should I use?
Default to Universal Dependencies. It is the lingua franca: parsers, treebanks and evaluation tools all speak it, and it lets you compare across periods.
| Language / period | Treebank / model |
|---|---|
| Latin | PROIEL, Perseus, ITTB (UD) |
| Ancient Greek | PROIEL, Perseus (UD) |
| Old/Middle English | YCOE-derived, PPCME2 (convertible) |
| Historical German | HiTS-based resources |
Step 4: parse and inspect the tree
Export to CoNLL-U and view it. Reading the actual tree, not just metrics, is how you catch systematic failures.
```text
# columns (simplified from CoNLL-U): ID FORM LEMMA UPOS DEPREL HEAD
1 Gallia Gallia PROPN nsubj:pass 4
2 est sum AUX aux:pass 4
3 omnis omnis DET det 1
4 divisa divido VERB root 0
```

Look specifically at long-distance dependencies and coordinated structures, which are where historical parsers most often go wrong.
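One way to direct that inspection is to list the arcs worth reading first. The sketch below parses full ten-column CoNLL-U by hand and flags long arcs and coordination; the toy sentence, the `min_distance` threshold of 3, and the function names are assumptions chosen for illustration, not an established heuristic.

```python
# Hand-written toy analysis of "rex magnam urbem et castra cepit".
ROWS = [
    "1\trex\trex\tNOUN\t_\t_\t6\tnsubj\t_\t_",
    "2\tmagnam\tmagnus\tADJ\t_\t_\t3\tamod\t_\t_",
    "3\turbem\turbs\tNOUN\t_\t_\t6\tobj\t_\t_",
    "4\tet\tet\tCCONJ\t_\t_\t5\tcc\t_\t_",
    "5\tcastra\tcastra\tNOUN\t_\t_\t3\tconj\t_\t_",
    "6\tcepit\tcapio\tVERB\t_\t_\t0\troot\t_\t_",
]
CONLLU = "# text = rex magnam urbem et castra cepit\n" + "\n".join(ROWS) + "\n"


def read_conllu(block):
    """Parse one CoNLL-U sentence into dicts, skipping comments and non-word lines."""
    rows = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # multiword ranges (1-2) and empty nodes (1.1)
            continue
        rows.append({"id": int(cols[0]), "form": cols[1],
                     "head": int(cols[6]), "deprel": cols[7]})
    return rows


def arcs_to_review(rows, min_distance=3):
    """Return (form, deprel, distance) for long arcs and coordination arcs."""
    flagged = []
    for r in rows:
        if r["head"] == 0:  # the root has no incoming arc to measure
            continue
        distance = abs(r["id"] - r["head"])
        if distance >= min_distance or r["deprel"] in {"conj", "cc"}:
            flagged.append((r["form"], r["deprel"], distance))
    return flagged


for form, deprel, distance in arcs_to_review(read_conllu(CONLLU)):
    print(f"{form:10} {deprel:8} distance={distance}")
```

The flagged list is a reading order, not a verdict: a long arc can be perfectly correct, but it is where a misattachment is most likely to hide.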
Should I train my own parser?
If a period treebank exists for your language, yes: fine-tuning on it beats a modern model substantially, often by double-digit LAS points on Latin and Greek. Training a UD parser is close to a one-command job with spaCy or Stanza once you have the treebank in CoNLL-U; spaCy additionally needs a training config and the treebank converted to its binary .spacy format.
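```bash
# The .spacy files are produced from the CoNLL-U treebank with `python -m spacy convert`.
python -m spacy train config.cfg \
  --output ./output \
  --paths.train ud_latin_proiel-train.spacy \
  --paths.dev ud_latin_proiel-dev.spacy
```

How do I evaluate the result?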
Report UAS (unlabelled attachment score: is the head right?) and LAS (labelled attachment score: are the head and the relation both right?) against a hand-checked gold set. Then read the errors. Historical parsers fail in patterns: they misattach displaced modifiers or mislabel non-canonical subjects. The error analysis, not the single number, is what you report to a syntactician.
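Both scores are simple proportions, as the sketch below shows. It assumes gold and predicted analyses share identical tokenisation and are given as aligned `(head, deprel)` pairs; the function names and the toy data are hypothetical, and established evaluation scripts additionally handle tokenisation mismatches.

```python
from collections import Counter


def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (head, deprel) pairs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and the same length")
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return head_ok / len(gold), both_ok / len(gold)


def label_confusions(gold, predicted):
    """Count (gold label, predicted label) pairs for tokens the parser got wrong."""
    return Counter((g[1], p[1]) for g, p in zip(gold, predicted) if g != p)


# Toy data: four tokens, one relabelled object and one misattached oblique.
gold = [(3, "nsubj"), (3, "obj"), (0, "root"), (3, "obl")]
pred = [(3, "nsubj"), (3, "nsubj"), (0, "root"), (2, "nmod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=0.75 LAS=0.50
print(label_confusions(gold, pred).most_common())
```

The confusion counts are the starting point for the error analysis: a pair such as `("obj", "nsubj")` recurring across the corpus points to a systematic problem rather than noise.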
Key Takeaways
- Normalise, tokenise, and POS-tag before parsing; errors cascade.
- Use Universal Dependencies for tool support and cross-period comparison.
- Modern parsers handle broad structure but mislabel non-modern constructions.
- Fine-tune on a period treebank when one exists — large gains for Latin/Greek.
- Inspect the actual tree, focusing on long-distance and coordinated relations.
- Evaluate with UAS and LAS plus a real error analysis, not just one score.
- Treat fine-grained relation labels as provisional until hand-checked.
Frequently Asked Questions
Can a modern dependency parser handle historical word order?
Partly. Modern parsers struggle with verb-second order, object-verb constructions and case-marked arguments that modern training data lacks. Accuracy is acceptable for broad structure but unreliable for fine-grained relations without adaptation.
What annotation scheme should I use for historical parsing?
Universal Dependencies is the practical default because parsers and treebanks support it and it allows cross-period comparison. Historical treebanks such as PROIEL have their own native annotation schemes but are also released in UD conversions.
Do I need POS tags before dependency parsing?
Yes. Dependency parsers rely on POS tags as features, so tag quality directly limits parse quality. Tag and ideally lemmatise the text first, on normalised forms.
How do I evaluate a historical dependency parse?
Use labelled and unlabelled attachment scores (LAS and UAS) against a hand-checked gold set. Read the errors, since historical parsers fail systematically on specific constructions rather than randomly.
Is it worth training a parser on historical treebanks?
Yes when one exists for your language, since training or fine-tuning on a period treebank substantially outperforms a modern-only model. For Latin and Ancient Greek, mature UD treebanks make this straightforward.