Detect language in mixed sources: A Practical Guide

To detect language in mixed historical sources, segment the text into small units (per line or clause, not per document), restrict the candidate languages to the ones your collection actually contains, and combine an n-gram detector with archive metadata and a confidence threshold. Off-the-shelf detectors assume modern, monolingual input. Historical sources break both assumptions, so the workflow matters more than the model.

Why is this harder than it looks?

A single early-modern letter may switch between Latin, French and English in one paragraph. Detectors trained on clean modern web text stumble on archaic spelling, lack labels for languages like Church Slavonic or Occitan, and need more characters than a one-line register entry provides. Treating language ID as a solved API call is the core mistake.

How should I segment the text?

Match segmentation to the granularity of switching:

Document level — only when each item is reliably monolingual.
Paragraph / line level — the sensible default for mixed archives.
Clause level — for genuine intra-sentential code-switching.

python

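import fasttext

# Pretrained fastText language-ID model; the file must be downloaded separately.
model = fasttext.load_model("lid.176.bin")

MIN_CHARS = 20  # lines shorter than this are too short to trust

def detect_lines(text, k=1):
    """Return a (line, language, probability) tuple for every line of text."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if len(line) < MIN_CHARS:
            out.append((line, "UNKNOWN", 0.0))
            continue
        labels, probs = model.predict(line, k=k)
        # fastText labels look like "__label__fr"; keep only the top guess.
        out.append((line, labels[0].replace("__label__", ""), float(probs[0])))
    return out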
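
For clause-level segmentation, one possible starting point is to split on clause-ending punctuation and merge fragments that are too short to classify. The sketch below is an assumption, not part of any library: the name `split_clauses`, the punctuation set, and the 20-character floor are all illustrative, and real code-switching often ignores punctuation entirely.

```python
import re

# Split on whitespace that follows clause-ending punctuation,
# so the punctuation stays attached to the clause it closes.
CLAUSE_BOUNDARY = re.compile(r"(?<=[,;:.!?])\s+")

def split_clauses(line, min_chars=20):
    """Split a line into clauses, merging fragments shorter than min_chars."""
    clauses = []
    for part in CLAUSE_BOUNDARY.split(line.strip()):
        if not part:
            continue
        # Merge when either this fragment or the previous one is too short.
        if clauses and (len(part) < min_chars or len(clauses[-1]) < min_chars):
            clauses[-1] = f"{clauses[-1]} {part}"
        else:
            clauses.append(part)
    return clauses
```

Each returned clause can then be passed to the detector in place of a whole line. Merging short fragments trades some switching resolution for segments long enough to classify.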

Which tools should I reach for?

Tool	Strength	Watch out for
fastText `lid.176`	176 languages, fast	modern-trained
CLD3	robust on short-ish text	fixed language set
`langid.py`	constrain to a candidate set	needs tuning
custom char n-gram	any languages you train	needs sample data

The most useful trick across all of them is to restrict the candidate set. If your collection is only ever Latin, French and English, telling the detector so removes a large class of errors on its own, because it can no longer pick a language that cannot be present.
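
When a detector has no built-in way to limit its languages, the restriction can be approximated by post-filtering its ranked output. The following is a minimal sketch under that assumption: `restrict_candidates` is a hypothetical helper, and it expects labels in the fastText `__label__xx` style with matching probabilities.

```python
def restrict_candidates(labels, probs, allowed):
    """Pick the best allowed language from a ranked prediction.

    Returns (language, probability renormalised over the allowed set),
    or ("UNKNOWN", 0.0) when no allowed language appears in the ranking.
    """
    kept = []
    for label, prob in zip(labels, probs):
        lang = label.replace("__label__", "")
        if lang in allowed:
            kept.append((lang, float(prob)))
    total = sum(p for _, p in kept)
    if total == 0.0:
        return ("UNKNOWN", 0.0)
    best_lang, best_prob = max(kept, key=lambda item: item[1])
    return (best_lang, best_prob / total)
```

The inputs would typically come from asking the detector for several guesses rather than one. Note that the renormalised value is relative to the allowed set only, so it tends to be higher than the raw probability and any confidence threshold may need adjusting.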

How do I detect languages modern tools do not know?

Train a small character n-gram classifier. A few hundred lines per language usually suffices.

python

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Character 2- to 4-gram counts feed a logistic-regression classifier.
clf = Pipeline([
    ("vec", CountVectorizer(analyzer="char", ngram_range=(2, 4))),
    ("lr", LogisticRegression(max_iter=1000)),
])

# train_lines: list of text lines you have labelled by hand.
# train_langs: the matching labels, e.g. ["lat", "fro", "ang", ...]
clf.fit(train_lines, train_langs)

Character n-grams are spelling-robust, which suits noisy historical text far better than word-based models.
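
To illustrate why, the sketch below compares two spellings of the same word by the overlap of their character n-grams. The helper names `char_ngrams` and `jaccard` are illustrative, and the 2-to-4 range simply mirrors the classifier settings above.

```python
def char_ngrams(text, n_min=2, n_max=4):
    """Return the set of character n-grams of length n_min..n_max."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

modern, archaic = "receive", "receyve"

# As whole words the two spellings share nothing...
word_overlap = jaccard({modern}, {archaic})
# ...but many of their character n-grams are identical.
ngram_overlap = jaccard(char_ngrams(modern), char_ngrams(archaic))
```

A word-based model sees `receyve` as an unknown token, while a character model still receives most of the signal it learned from `receive`.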

How do I use confidence and metadata together?

Never accept a low-confidence guess silently. Set a threshold (for example, 0.65) and route anything below it to a review queue. Then fuse with what the archive already tells you: a catalogue noting "in French and Latin" constrains the candidate set per item, lifting accuracy with zero extra modelling.

python

THRESH = 0.65
def decide(line, lang, prob, allowed):
    if prob < THRESH or (allowed and lang not in allowed):
        return "REVIEW"
    return lang
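
The `allowed` argument has to come from somewhere. One option, sketched here under assumptions, is to pull language names out of the free-text catalogue note. The `NAME_TO_CODE` table and the function `allowed_from_note` are hypothetical and would need extending to match your catalogue's vocabulary and your detector's label set.

```python
import re

# Hypothetical mapping from names used in catalogue notes to detector codes.
NAME_TO_CODE = {
    "latin": "la",
    "french": "fr",
    "english": "en",
    "german": "de",
}

def allowed_from_note(note):
    """Return the set of language codes named in a catalogue note.

    An empty set means the note names no known language, in which
    case no per-item restriction should be applied.
    """
    words = re.findall(r"[a-z]+", note.lower())
    return {NAME_TO_CODE[word] for word in words if word in NAME_TO_CODE}
```

An empty set is falsy in Python, so passing it as `allowed` to `decide` leaves the confidence threshold as the only check for items whose notes mention no language.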

How do I evaluate language ID on historical text?

Hand-label a few hundred segments stratified by language and length. Report accuracy per language and per length bucket — overall accuracy hides that short Latin tags or rare-language lines are where the system fails. That breakdown is what tells you whether to trust automated tagging or fall back to review.
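
A breakdown of this kind can be computed in a few lines. This is a minimal sketch: the bucket boundaries (20 and 50 characters) and the function names are assumptions chosen for illustration, and `rows` stands for your own hand-labelled segments paired with the detector's output.

```python
from collections import defaultdict

def length_bucket(text):
    """Assign a segment to a coarse length bucket (boundaries are illustrative)."""
    n = len(text)
    if n < 20:
        return "<20"
    if n < 50:
        return "20-49"
    return "50+"

def accuracy_breakdown(rows):
    """Compute accuracy per (gold language, length bucket).

    rows: iterable of (text, gold_lang, predicted_lang) tuples.
    Returns {(gold_lang, bucket): (correct, total, accuracy)}.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for text, gold, pred in rows:
        key = (gold, length_bucket(text))
        counts[key][1] += 1
        if gold == pred:
            counts[key][0] += 1
    return {key: (c, t, c / t) for key, (c, t) in counts.items()}
```

Buckets with only a handful of segments give unstable percentages, so it helps to read the `total` alongside each accuracy figure.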

Key Takeaways

Segment per line or clause; document-level detection hides code-switching.
Restrict the candidate set to languages your collection actually contains.
fastText lid.176, CLD3 and langid.py are solid starting points.
Character n-gram classifiers handle unsupported and noisy languages well.
Reject segments under roughly 20 characters or below a confidence threshold.
Fuse detection with archive metadata to constrain candidates per item.
Evaluate per language and per length bucket, not just overall accuracy.

Frequently Asked Questions

Why do modern language detectors fail on historical sources?

They are trained on modern, monolingual web text, so archaic spelling, code-switching within a sentence, and very short segments confuse them. They also lack labels for dead or regional languages such as Church Slavonic or Occitan.

What is the smallest text a language detector can handle reliably?

Most detectors need at least 20 to 50 characters for a confident guess, and accuracy falls sharply below that. For short historical entries, combine detection with metadata and manual review.

How do I detect language when it switches mid-sentence?

Segment below the document level, ideally per line or per clause, and run detection on each segment. Document-level detection only reports the majority language and hides the code-switching you often care about.

Which library should I use for historical language identification?

fastText's lid.176 model and CLD3 are strong general starting points, and langid.py is easy to constrain to a fixed language set. For historical languages you usually need to restrict candidates and add custom rules.

Can I detect historical languages that modern tools do not support?

Yes, by training a lightweight character n-gram classifier on samples of those languages. A few hundred lines per language is often enough for usable accuracy on well-defined sets.
