To detect language in mixed historical sources, segment the text finely (per line or clause, not per document), restrict the candidate languages to the ones your collection actually contains, and combine an n-gram detector with archive metadata and a confidence threshold. Off-the-shelf detectors assume modern, monolingual input. Historical sources break both assumptions, so the workflow matters more than the model.
Why is this harder than it looks?
A single early-modern letter may switch between Latin, French and English in one paragraph. Detectors trained on clean modern web text stumble on archaic spelling, often lack labels for languages like Church Slavonic or Occitan, and need more characters than a one-line register entry provides. Treating language ID as a solved API call is the core mistake.
How should I segment the text?
Match segmentation to the granularity of switching:
- Document level — only when each item is reliably monolingual.
- Paragraph / line level — the sensible default for mixed archives.
- Clause level — for genuine intra-sentential code-switching.
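For the clause-level case, a minimal sketch of a splitter is shown here. The punctuation set in `CLAUSE_BREAK` and the `min_chars` default are assumptions for illustration, not values from any particular corpus. Historical punctuation is irregular, so expect to tune both.

```python
import re

# Break after clause-ending punctuation followed by whitespace.
# The delimiter set is an assumption; adjust it to your sources.
CLAUSE_BREAK = re.compile(r"(?<=[.;:!?,])\s+")

def split_clauses(text, min_chars=20):
    """Split text into clauses, merging any fragment shorter than
    min_chars into the clause before it (a short first fragment is kept)."""
    clauses = []
    for piece in CLAUSE_BREAK.split(text.strip()):
        piece = piece.strip()
        if not piece:
            continue
        if clauses and len(piece) < min_chars:
            clauses[-1] = clauses[-1] + " " + piece
        else:
            clauses.append(piece)
    return clauses
```

Merging short fragments back into their neighbour keeps each segment long enough for a detector to say something useful, at the cost of occasionally gluing a two-word Latin tag onto a vernacular clause.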
```python
import fasttext

model = fasttext.load_model("lid.176.bin")

def detect_lines(text, k=1):
    out = []
    for line in text.splitlines():
        line = line.strip()
        if len(line) < 20:  # too short to trust
            out.append((line, "UNKNOWN", 0.0))
            continue
        labels, probs = model.predict(line, k=k)
        out.append((line, labels[0].replace("__label__", ""), float(probs[0])))
    return out
```
Which tools should I reach for?
| Tool | Strength | Watch out for |
|---|---|---|
| fastText lid.176 | 176 languages, fast | modern-trained |
| CLD3 | robust on short-ish text | fixed language set |
| langid.py | constrain to a candidate set | needs tuning |
| custom char n-gram | any languages you train | needs sample data |
The most useful trick across all of them is to restrict the candidate set. If your collection is only ever Latin, French and English, telling the detector that alone removes most errors.
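One detector-agnostic way to apply that restriction is to ask the detector for its full ranking and then discard every label outside the collection's language set. The sketch below assumes parallel `labels` and `probs` sequences in the fastText style (labels prefixed with `__label__`). The function name `restrict_candidates` is hypothetical.

```python
def restrict_candidates(labels, probs, allowed):
    """Keep only allowed languages and renormalise their scores.

    labels, probs: parallel sequences from a detector's full ranking.
    allowed: set of language codes present in the collection.
    Returns (best_lang, renormalised_prob), or ("UNKNOWN", 0.0) when
    no allowed language received any score.
    """
    scored = [
        (label.replace("__label__", ""), float(p))
        for label, p in zip(labels, probs)
    ]
    kept = [(lang, p) for lang, p in scored if lang in allowed]
    total = sum(p for _, p in kept)
    if total <= 0.0:
        return "UNKNOWN", 0.0
    best_lang, best_p = max(kept, key=lambda pair: pair[1])
    return best_lang, best_p / total

# With fastText, the full ranking can be requested with k=-1, e.g.:
#   labels, probs = model.predict(line, k=-1)
#   lang, prob = restrict_candidates(labels, probs, {"la", "fr", "en"})
```

Note that renormalising inflates the winning score, so a threshold tuned on raw detector probabilities may need revisiting after this step.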
How do I detect languages modern tools do not know?
Train a small character n-gram classifier. A few hundred lines per language usually suffices.
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

clf = Pipeline([
    ("vec", CountVectorizer(analyzer="char", ngram_range=(2, 4))),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(train_lines, train_langs)  # e.g. ["lat","fro","ang", ...]
```
Character n-grams are spelling-robust, which suits noisy historical text far better than word-based models.
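To make the spelling-robustness point concrete, here is a small dependency-free sketch that compares word overlap with character-trigram overlap for a modern and an archaic spelling of the same phrase. The example strings and helper names are illustrative assumptions, and Jaccard overlap is used only as a simple stand-in for what a classifier's features would share.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in text, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

modern = "public music"
archaic = "publick musick"

# Word-level features share nothing: every token is spelled differently.
word_overlap = jaccard(set(modern.split()), set(archaic.split()))

# Character trigrams still share most of their features.
char_overlap = jaccard(char_ngrams(modern), char_ngrams(archaic))

print(f"word overlap: {word_overlap:.2f}, char trigram overlap: {char_overlap:.2f}")
```

A word-based model sees two unrelated vocabularies here, while a character-level model sees largely the same evidence, which is the property that matters for unstandardised orthography.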
How do I use confidence and metadata together?
Never accept a low-confidence guess silently. Set a threshold (for example, 0.65) and route anything below it to a review queue. Then fuse with what the archive already tells you: a catalogue noting "in French and Latin" constrains the candidate set per item, lifting accuracy with zero extra modelling.
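Turning a catalogue note into a candidate set can be as simple as a lookup. The sketch below assumes free-text English notes and a hand-maintained `NAME_TO_CODE` table. Both the table and the function name `allowed_from_note` are hypothetical, and the codes should match whatever labels your detector emits.

```python
import re

# Hypothetical mapping from language names found in catalogue notes to
# detector label codes; extend it to match your own finding aids.
NAME_TO_CODE = {
    "latin": "la",
    "french": "fr",
    "english": "en",
}

def allowed_from_note(note):
    """Return the set of language codes named in a free-text catalogue note.

    An empty set means the note named no known language, which callers
    can treat as "no constraint".
    """
    words = re.findall(r"[a-z]+", note.lower())
    return {NAME_TO_CODE[w] for w in words if w in NAME_TO_CODE}
```

For a note such as "in French and Latin" this yields a two-language set that can be passed as the per-item `allowed` argument when deciding whether to accept a prediction.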
```python
THRESH = 0.65

def decide(line, lang, prob, allowed):
    if prob < THRESH or (allowed and lang not in allowed):
        return "REVIEW"
    return lang
```
How do I evaluate language ID on historical text?
Hand-label a few hundred segments stratified by language and length. Report accuracy per language and per length bucket — overall accuracy hides that short Latin tags or rare-language lines are where the system fails. That breakdown is what tells you whether to trust automated tagging or fall back to review.
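A minimal sketch of that breakdown follows, assuming the hand-labelled sample is available as `(text, gold_lang, predicted_lang)` tuples. The bucket boundaries (20 and 50 characters) and the function names are illustrative assumptions.

```python
from collections import defaultdict

def length_bucket(text):
    """Assign a segment to a coarse length bucket (boundaries are assumptions)."""
    n = len(text)
    if n < 20:
        return "<20"
    if n < 50:
        return "20-49"
    return "50+"

def accuracy_breakdown(rows):
    """rows: iterable of (text, gold_lang, predicted_lang).

    Returns {(gold_lang, bucket): (correct, total, accuracy)}.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for text, gold, pred in rows:
        key = (gold, length_bucket(text))
        counts[key][0] += int(gold == pred)
        counts[key][1] += 1
    return {k: (c, t, c / t) for k, (c, t) in counts.items()}
```

Keeping the raw counts next to each accuracy figure matters: a bucket with three segments in it should not be read with the same confidence as one with three hundred.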
Key Takeaways
- Segment per line or clause; document-level detection hides code-switching.
- Restrict the candidate set to languages your collection actually contains.
- fastText lid.176, CLD3 and langid.py are solid starting points.
- Character n-gram classifiers handle unsupported and noisy languages well.
- Reject segments under roughly 20 characters or below a confidence threshold.
- Fuse detection with archive metadata to constrain candidates per item.
- Evaluate per language and per length bucket, not just overall accuracy.
Frequently Asked Questions
Why do modern language detectors fail on historical sources?
They are trained largely on modern, monolingual text, so archaic spelling, code-switching within a sentence, and very short segments confuse them. They also often lack labels for dead or regional languages such as Church Slavonic or Occitan.
What is the smallest text a language detector can handle reliably?
Most detectors need at least 20 to 50 characters for a confident guess, and accuracy falls sharply below that. For short historical entries, combine detection with metadata and manual review.
How do I detect language when it switches mid-sentence?
Segment below the document level, ideally per line or per clause, and run detection on each segment. Document-level detection only reports the majority language and hides the code-switching you often care about.
Which library should I use for historical language identification?
fastText's lid.176 model and CLD3 are strong general starting points, and langid.py is easy to constrain to a fixed language set. For historical languages you usually need to restrict candidates and add custom rules.
Can I detect historical languages that modern tools do not support?
Yes, by training a lightweight character n-gram classifier on samples of those languages. A few hundred lines per language is often enough for usable accuracy on well-defined sets.