OCR multilingual and code-switched pages by segmenting first, identifying the script per region, then routing each region to a model that actually covers that script — rather than loading every language into one engine and hoping. The single biggest mistake is treating a page as one language: the moment Latin marginalia sit beside a Greek quotation or a German body text breaks into Hebrew, a monolingual model silently corrupts the minority language. Below is a workflow that keeps each language at full accuracy.
Why does adding a second language make OCR worse?
When you pass -l lat+grc to Tesseract, you are not telling it "Latin here, Greek there"; you are merging both character sets and dictionaries into one decoder. On a line of Latin, the Greek dictionary now competes for ambiguous glyphs, and language-model biasing pulls plausible Latin words toward Greek lookalikes. Character error rate (CER) on the majority language can rise by several points for no gain.
The rule: only co-load languages that genuinely share lines. Reserve multi-language flags for true code-switching; otherwise route.
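One way to check whether co-loading is costing you anything is to measure character error rate on a handful of ground-truth lines under each configuration. The following is a minimal sketch in plain Python; `levenshtein` and `cer` are hypothetical helpers, and the sample strings are invented for illustration rather than real engine output.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: a Latin line where a co-loaded Greek model swapped
# two Latin letters for Greek lookalikes (nu for v, omicron for o).
truth = "arma virumque cano"
latin_only = "arma virumque cano"
co_loaded = "arma \u03bdirumque can\u03bf"

print(f"latin only: {cer(truth, latin_only):.3f}")  # 0.000
print(f"co-loaded:  {cer(truth, co_loaded):.3f}")   # 0.111
```

Running both configurations over the same small ground-truth sample before committing to a multi-language flag makes the trade-off visible instead of assumed.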
Segment before you recognise
Layout analysis comes first because script identification needs clean regions. Use Kraken's segmenter or your existing layout step to produce line or block polygons, then classify each one:
```bash
# Kraken: segment the page, then recognise each region with the matching model.
# region_grc.png and region_lat.png are assumed to be crops of the classified regions.
kraken -i page.png lines.json segment -bl
kraken -i region_grc.png region_grc.txt segment -bl ocr -m greek_polytonic.mlmodel
kraken -i region_lat.png region_lat.txt segment -bl ocr -m latin_print.mlmodel
```

For Tesseract pipelines, run orientation and script detection (OSD) per block:
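```bash
# OSD only (--psm 0): reports orientation and script; needs osd.traineddata
tesseract block.png stdout --psm 0
# Then recognise the block with the model for the detected script
tesseract block.png stdout --psm 6 -c tessedit_do_invert=0 -l script/Latin
```

How do I detect which language a region is in?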
Three layers, cheapest first:
- Script ID (Latin vs Greek vs Cyrillic vs Arabic) — a fast CNN or Tesseract's OSD. This alone resolves most mixed-script pages.
- Language ID within a script (French vs Latin) — needed only when dictionaries differ; a tiny n-gram classifier on a first-pass transcript works.
- Word-level routing for code-switched lines — classify each token's script and assemble with a shared Unicode output.
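The word-level layer can be sketched with nothing more than the standard library. The example below guesses each token's script from Unicode character names; this is an approximation rather than a proper script-property lookup, and `char_script`, `token_script` and `route_tokens` are hypothetical names chosen for illustration.

```python
import unicodedata

SCRIPTS = ("LATIN", "GREEK", "CYRILLIC", "HEBREW", "ARABIC")


def char_script(ch: str) -> str:
    """Rough script guess from the Unicode character name."""
    name = unicodedata.name(ch, "")
    for script in SCRIPTS:
        if name.startswith(script):
            return script
    return "OTHER"  # digits, punctuation, unnamed characters


def token_script(token: str) -> str:
    """Majority script among the letters of a token."""
    counts = {}
    for ch in token:
        if ch.isalpha():  # skips digits, punctuation, combining marks
            script = char_script(ch)
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "OTHER"


def route_tokens(line: str):
    """Pair every whitespace-separated token with its guessed script."""
    return [(token, token_script(token)) for token in line.split()]


# First-pass transcript of a code-switched line, in logical order.
for token, script in route_tokens("dixit λόγος et שלום"):
    print(script, token)
```

Each token's label can then select the model or dictionary for a second pass, while the output stays in one shared Unicode string.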
| Situation | Best strategy | Engine fit |
|---|---|---|
| Distinct blocks per language | Segment + route per block | Tesseract, Kraken |
| Greek quotation inside Latin line | Multilingual line model | PyLaia, Transkribus |
| Latin + Hebrew, opposite directions | Per-region BiDi handling | Kraken + manual reorder |
| Same script, two languages | One model, per-region dictionary | Kraken |
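The table can be read as a small decision procedure. The sketch below is one possible encoding of it, assuming you already have the set of scripts detected on each line of a region; `choose_strategy` is a hypothetical helper, and the returned strings simply echo the "Best strategy" column.

```python
RTL_SCRIPTS = {"HEBREW", "ARABIC"}


def choose_strategy(line_scripts: list[set[str]]) -> str:
    """Pick a strategy from the scripts seen on each line of a region."""
    all_scripts = set().union(*line_scripts)
    mixed_lines = [s for s in line_scripts if len(s) > 1]

    if not mixed_lines:
        if len(all_scripts) > 1:
            # Every line is single-script, but the lines disagree.
            return "Segment + route per block"
        # One script throughout; language ID decides the dictionary.
        return "One model, per-region dictionary"

    if any(s & RTL_SCRIPTS and s - RTL_SCRIPTS for s in mixed_lines):
        # A line mixes right-to-left and left-to-right scripts.
        return "Per-region BiDi handling"

    return "Multilingual line model"


print(choose_strategy([{"LATIN"}, {"GREEK"}]))
print(choose_strategy([{"LATIN", "GREEK"}]))
print(choose_strategy([{"LATIN", "HEBREW"}, {"LATIN"}]))
print(choose_strategy([{"LATIN"}, {"LATIN"}]))
```

The ordering of the checks is a design choice: mixed-direction lines are tested before generic mixed lines because they need reading-order handling on top of a multilingual model.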
Handling Latin and Greek together
Polytonic Greek is the classic trap: diacritics (ά ῆ ᾳ ὧ) multiply the character set, and a Latin model lacks the codepoints entirely, so it substitutes nearest Latin shapes. Train or pick a model whose character set is the union of both scripts and whose ground truth includes mixed lines. Normalise output to NFC so that precomposed and combining forms do not split your search index later.
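To see why normalisation matters, here is a short sketch using Python's standard `unicodedata` module. It shows the same accented alpha arriving in three encodings and collapsing to one form under NFC; `normalise_transcript` is a hypothetical helper name.

```python
import unicodedata

precomposed = "\u03ac"      # alpha with tonos, one codepoint
combining = "\u03b1\u0301"  # alpha followed by a combining acute accent
oxia = "\u1f71"             # alpha with oxia, from the Greek Extended block

print(precomposed == combining)  # False: same glyph, different codepoints


def normalise_transcript(text: str) -> str:
    """Normalise OCR output to NFC before indexing or comparing."""
    return unicodedata.normalize("NFC", text)


forms = {normalise_transcript(s) for s in (precomposed, combining, oxia)}
print(len(forms))  # 1: all three collapse to the same string
```

Applying the same normalisation to ground truth, model output and search queries keeps all three comparable.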
Building a code-switch-aware training set
If you must train, your ground truth must contain the switching itself — lines that genuinely mix scripts, transcribed in correct Unicode and reading order. A few hundred such lines, added to monolingual baselines, teach the model the transitions. Keep a single shared alphabet file:
```text
# alphabet.txt: union of all scripts, one codepoint per line
a b c ... ω ή ᾳ ‘ ’ … ſ
```

Don't let BiDi and reading order scramble output
Right-to-left scripts (Hebrew, Arabic) interleaved with left-to-right Latin produce logically correct but visually confusing strings. Store text in logical order with Unicode BiDi control characters, and let the display layer reorder — never bake visual order into the transcript, or downstream search breaks.
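As an illustration of logical-order storage, the sketch below wraps each run of right-to-left tokens in Unicode directional isolates (U+2067 RIGHT-TO-LEFT ISOLATE and U+2069 POP DIRECTIONAL ISOLATE) without reordering anything. It assumes whitespace-separated tokens; `is_rtl_token` and `isolate_rtl_runs` are hypothetical helpers, and whether your display layer and search index honour or ignore these controls is something to verify for your own stack.

```python
import unicodedata

RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def is_rtl_token(token: str) -> bool:
    """True if the token contains a strong right-to-left character."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in token)


def isolate_rtl_runs(line: str) -> str:
    """Wrap runs of consecutive RTL tokens in RLI ... PDI, in logical order."""
    out, run = [], []
    for token in line.split():
        if is_rtl_token(token):
            run.append(token)
            continue
        if run:  # an RTL run just ended
            out.append(RLI + " ".join(run) + PDI)
            run = []
        out.append(token)
    if run:  # line ended inside an RTL run
        out.append(RLI + " ".join(run) + PDI)
    return " ".join(out)


line = "dixit \u05e9\u05dc\u05d5\u05dd \u05e2\u05dc\u05d9\u05db\u05dd et abiit"
stored = isolate_rtl_runs(line)

# Stripping the controls gives back the original logical-order text.
print(stored.replace(RLI, "").replace(PDI, "") == line)  # True
```

Because the characters never leave logical order, a substring search for the Hebrew phrase still matches once the index strips or ignores the control characters.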
Key Takeaways
- Segment and identify script per region before recognition; never assume one page is one language.
- Co-loading languages in one engine usually lowers accuracy — route distinct blocks to dedicated models instead.
- For true code-switching, use a multilingual model trained on mixed lines with a shared Unicode alphabet.
- Polytonic Greek and Hebrew need full character-set coverage; a Latin model will silently substitute glyphs.
- Normalise to NFC and store RTL text in logical order to keep search and BiDi intact.
- Kraken and PyLaia handle per-line multilingual routing better than vanilla Tesseract.
Frequently Asked Questions
Can a single OCR model read Latin and Greek on the same page?
A model whose training data and character set cover both scripts can, but most off-the-shelf single-language models cannot. For mixed Latin/Greek pages, either use a polytonic-aware multilingual model or detect and route each region to a script-specific engine.
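A quick way to test coverage before committing to a model is to compare a sample transcript against the model's character set. The sketch below assumes you can export that character set as a Python set by some means; how you obtain it depends on the engine and is not shown here, and `missing_characters` is a hypothetical helper.

```python
import string
import unicodedata


def missing_characters(sample_text: str, model_charset: set[str]) -> list[str]:
    """Characters in the sample that the model cannot emit (whitespace ignored)."""
    normalised = unicodedata.normalize("NFC", sample_text)
    needed = {ch for ch in normalised if not ch.isspace()}
    return sorted(needed - model_charset)


# Assumed stand-in for a Latin-only model's character set.
latin_charset = set(string.ascii_lowercase)

gaps = missing_characters("et ait λόγος", latin_charset)
print([unicodedata.name(ch) for ch in gaps])
```

A non-empty result means the model has no way to output those characters, so substitution errors are guaranteed rather than merely likely.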
Why does adding a second language make my OCR worse?
Loading multiple Tesseract language packs widens the search space and lets the wrong language's dictionary "win" on ambiguous characters. Restrict languages per region and disable dictionary biasing where scripts diverge sharply.
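Per-region restriction can be sketched as a small command builder. The example assumes the `tesseract` CLI and uses its `load_system_dawg` and `load_freq_dawg` configuration variables to switch off the word lists; `tesseract_args` is a hypothetical helper, and how much disabling the dictionaries helps will depend on the model and the material.

```python
def tesseract_args(image: str, languages: list[str],
                   use_dictionary: bool = True) -> list[str]:
    """Build a Tesseract command line for one region."""
    args = ["tesseract", image, "stdout",
            "--psm", "6",                # treat the crop as one text block
            "-l", "+".join(languages)]   # only the languages this region needs
    if not use_dictionary:
        # Turn off the system and frequent-word dictionaries.
        args += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return args


# A Latin-only block keeps its dictionary; a genuinely mixed line
# co-loads both languages but drops dictionary biasing.
print(" ".join(tesseract_args("block_lat.png", ["lat"])))
print(" ".join(tesseract_args("line_mixed.png", ["lat", "grc"],
                              use_dictionary=False)))
```

The resulting list can be passed to `subprocess.run` once the region crops exist.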
How do I detect which language a text region is in?
Run script identification (Tesseract OSD, or a CNN script classifier) on each segmented line or block before recognition, then route the region to the matching model. Glyph-level script detection beats whole-page language guessing for code-switched text.
Does code-switching within a single line break OCR?
It can, because line-level language assumptions fail mid-line. The robust fix is a multilingual recognition model trained on mixed lines, or word-level routing with a shared Unicode character set.
Which engines handle multilingual pages best?
Kraken and PyLaia support custom multilingual character sets and per-line script routing well; Tesseract works for additive Latin-script languages but struggles when scripts share glyph shapes. Transkribus multilingual models suit manuscript material.