Fix OCR errors on special characters by identifying which errors are systematic glyph confusions (and fixable with rules) versus genuine recognition failures (which need a model that covers the character set). The long s read as f, ligatures dropped, diacritics stripped — these are the signature failures of running modern-type OCR over early-modern printing. The good news: most are predictable, so a mix of the right model and a handful of targeted post-correction rules recovers the typography cleanly.
Why does OCR read the long s as an f?
The long s (ſ) and lowercase f differ only by a crossbar that, in the long s, stops at the stem (or is absent). Models trained on modern type have never seen ſ, so they snap it to its nearest neighbour, f. You get ſuſpicion rendered as fufpicion.
Three fixes, in order of preference:
- Use an early-modern model whose alphabet contains `ſ` (Transkribus' "Noscemus" or "Print M" families; community Kraken models for 17th–18th c. print).
- Post-correct by context: a long-s error is almost never word-final and rarely doubles as `ff` in modern words.
- Normalise `ſ` → `s` if your edition does not need the original glyph.

```python
import re

# Conservative: only flip 'f' to 's' where an early-modern long s is plausible.
# Skip word-final 'f' and 'ff'/'ft' clusters, which are usually genuine 'f'.
LONG_S_CANDIDATE = re.compile(r'(?<!f)f(?=[aeiouhl])')

def restore_long_s(word):
    # The lookahead requires a following letter, so a word-final 'f' never matches.
    return LONG_S_CANDIDATE.sub('s', word)
```

Always validate rules against a sample: over-eager substitution turns real f words into nonsense.
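
One way to do that validation is to accept a substitution only when it turns a non-word into a known word. The following is a minimal sketch, not a definitive implementation: `LEXICON` and `restore_long_s_checked` are hypothetical names, and the toy wordlist stands in for a period-appropriate lexicon you would load yourself.

```python
from itertools import product

# Hypothetical toy lexicon; in practice load a wordlist that matches your corpus.
LEXICON = {"suspicion", "most", "self", "offer", "after", "for"}

def restore_long_s_checked(word, lexicon=LEXICON, max_f=6):
    """Try 'f' -> 's' swaps and keep one only if it yields a unique known word."""
    if word.lower() in lexicon:
        return word  # already a known word: leave genuine 'f' alone
    # Candidates: every lowercase 'f' except a word-final one (long s is not word-final).
    positions = [i for i, ch in enumerate(word[:-1]) if ch == "f"]
    if not positions or len(positions) > max_f:
        return word  # nothing to try, or too many combinations to enumerate
    hits = set()
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for i, ch in zip(positions, choice):
            chars[i] = ch
        candidate = "".join(chars)
        if candidate.lower() in lexicon:
            hits.add(candidate)
    # Correct only when exactly one attested reading exists; otherwise stay put.
    return hits.pop() if len(hits) == 1 else word

for token in ["fufpicion", "moft", "felf", "offer", "after"]:
    print(token, "->", restore_long_s_checked(token))
```

Because the function refuses to act when the token is already a word or when several readings are possible, real f words such as "offer" and "after" pass through untouched. The cost is that it can only fix words the lexicon contains.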
Should I keep ligatures or expand them?
Distinguish linguistic ligatures from typographic ones:
| Glyph | Type | Recommended handling |
|---|---|---|
| æ œ | Linguistic | Keep as codepoint (carries meaning) |
| ﬁ ﬂ ﬀ | Typographic | Expand to fi fl ff for search |
| ﬆ ﬅ | Typographic | Expand to st ſt |
| & (et) | Abbreviation | Keep, but index "and" as a variant |
For a searchable archive, expand typographic ligatures but store the original in a diplomatic layer if you maintain one. Expanding is a normalisation, not data loss, when you record both forms.
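
A minimal sketch of that two-layer approach follows, assuming the transcript is a plain Python string. The names `TYPOGRAPHIC_LIGATURES` and `expand_ligatures` are illustrative, not from any library. An explicit table is used rather than `unicodedata.normalize("NFKC", ...)` because NFKC would also fold ſ to s and rewrite other compatibility characters.

```python
# Typographic ligatures only; linguistic ones (æ, œ) are deliberately absent.
TYPOGRAPHIC_LIGATURES = {
    "\uFB00": "ff",        # ﬀ
    "\uFB01": "fi",        # ﬁ
    "\uFB02": "fl",        # ﬂ
    "\uFB03": "ffi",       # ﬃ
    "\uFB04": "ffl",       # ﬄ
    "\uFB05": "\u017Ft",   # ﬅ -> ſt (the long s itself is kept)
    "\uFB06": "st",        # ﬆ
}
_TABLE = str.maketrans(TYPOGRAPHIC_LIGATURES)

def expand_ligatures(diplomatic):
    """Return both layers: the untouched transcript and a search-friendly form."""
    return {"diplomatic": diplomatic, "search": diplomatic.translate(_TABLE)}

layers = expand_ligatures("\uFB01sh and \u00E6ther")
print(layers["diplomatic"])
print(layers["search"])
```

Storing both keys side by side is what makes the expansion a normalisation rather than data loss: the search index reads `search`, while the edition renders `diplomatic`.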
How do I stop OCR dropping diacritics?
Diacritics vanish for one of two reasons: the codepoint is absent from the model's alphabet, or the output is in NFD and a later step strips combining marks. Check the alphabet file first:

```bash
# List the unique codepoints your model can emit
python -c "print(sorted(set(open('alphabet.txt', encoding='utf-8').read())))"
```

If é à ñ ç are missing, no amount of post-processing recovers them; fine-tune on accented ground truth. Then normalise:

```python
import unicodedata

# raw_ocr is the string returned by your OCR engine
text = unicodedata.normalize("NFC", raw_ocr)
```

Fixing punctuation: dashes, quotes and hyphens
Em and en dashes (— –), curly quotes (“ ” ‘ ’) and the soft hyphen are chronically under-represented in training data, so OCR flattens them to - and ". Because these are deterministic, a punctuation pass is safe:

```python
fixes = {"--": "—", "''": "”"}  # tune to your corpus conventions
```

Be careful with line-end hyphens: a hyphen splitting a word across lines (infor- / mation) should be de-hyphenated and joined, but a real compound hyphen must survive.
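
The line-end case can be sketched with a lexicon lookup: join the two halves only when the result is a known word, and otherwise keep the hyphen. This is an illustration under stated assumptions; `LEXICON` and `join_line_end_hyphens` are hypothetical names, and the wordlist is a toy.

```python
import re

# Hypothetical toy lexicon; a real pass would use a wordlist matching the corpus.
LEXICON = {"information", "well", "known"}

# A word fragment, a hyphen, a line break, then the rest of the word.
LINE_END_HYPHEN = re.compile(r"(\w+)-\n(\w+)")

def join_line_end_hyphens(text, lexicon=LEXICON):
    def decide(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in lexicon:
            return joined          # split word: drop the hyphen and the line break
        return f"{left}-{right}"   # likely a real compound: keep the hyphen
    return LINE_END_HYPHEN.sub(decide, text)

print(join_line_end_hyphens("more infor-\nmation on a well-\nknown press"))
```

Note that the line break is removed in both branches, so the token stays searchable either way. Defaulting to "keep the hyphen" for unknown words is the conservative choice, since a wrongly joined compound is harder to spot later than a leftover hyphen.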
When is retraining worth it over rules?
Rules win when errors are systematic and context-light. Retraining wins when a special character appears thousands of times in shifting contexts where rules misfire. A practical threshold: if a single rule reaches over 95% precision on a validation sample, ship the rule; if you are writing your tenth special-case exception, fine-tune instead.
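
To make the precision threshold concrete, here is a small sketch that measures how often a rule is right when it fires. It assumes you have a validation sample of (OCR token, ground-truth token) pairs; `rule_precision`, `ft_to_st` and the sample data are invented for illustration.

```python
def rule_precision(rule, sample):
    """sample: iterable of (ocr_token, truth_token) pairs from a validation set."""
    changed = 0
    correct = 0
    for ocr_token, truth_token in sample:
        fixed = rule(ocr_token)
        if fixed != ocr_token:          # the rule fired on this token
            changed += 1
            if fixed == truth_token:    # and produced the right answer
                correct += 1
    return correct / changed if changed else None

# Hypothetical rule and toy validation sample.
def ft_to_st(token):
    return token.replace("ft", "st")

sample = [("moft", "most"), ("firft", "first"), ("beft", "best"),
          ("after", "after"), ("house", "house")]

precision = rule_precision(ft_to_st, sample)
ship_rule = precision is not None and precision > 0.95
print(precision, ship_rule)
```

Here the rule fires four times and is wrong once (it corrupts "after"), so it falls well short of the 95% bar and would need a context guard before shipping.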
Verifying the fix
Measure character error rate on a held-out page before and after each rule, and log which rule changed which token. A correction pass that lowers global CER but corrupts a rare-but-correct form is a regression you only catch with per-rule auditing.
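
The audit described here can be sketched as follows. CER is taken as Levenshtein distance divided by reference length, which is one common definition; the function names, the two rules and the sample strings are all assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (two-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

def cer(hypothesis, reference):
    """Character error rate of hypothesis against reference."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

def apply_with_audit(tokens, rules):
    """Apply named rules in order; log (rule_name, before, after) for every change."""
    log, out = [], []
    for token in tokens:
        for name, rule in rules:
            new = rule(token)
            if new != token:
                log.append((name, token, new))
                token = new
        out.append(token)
    return out, log

# Hypothetical rules and a toy held-out line.
rules = [
    ("ft_to_st", lambda t: t.replace("ft", "st")),
    ("ff_to_ss", lambda t: t.replace("ff", "ss")),
]
reference = "the most blessed after"
ocr = "the moft bleffed after"

fixed_tokens, audit_log = apply_with_audit(ocr.split(), rules)
fixed = " ".join(fixed_tokens)
print("CER before:", cer(ocr, reference), "after:", cer(fixed, reference))
for entry in audit_log:
    print(entry)
```

In this toy run the global CER falls, yet the log shows `ft_to_st` rewriting the correct token "after". That is exactly the kind of regression the per-rule log is there to expose.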
Key Takeaways
- The long s → `f` error is systematic; fix it with an early-modern model or a context-aware rule, not blanket substitution.
- Keep linguistic ligatures (`æ`, `œ`) as codepoints; expand typographic ones (`ﬁ`, `ﬂ`) for searchability.
- Dropped diacritics usually mean the codepoint is missing from the model's alphabet: fine-tune, don't post-process.
- Normalise to NFC for storage and search; keep a diplomatic layer if you need original glyphs.
- Punctuation and dash errors are deterministic and safe to fix with a normalisation pass.
- Audit every correction rule against held-out CER so a fix never quietly introduces new errors.
Frequently Asked Questions
Why does OCR read the long s as an f?
The long s (ſ) and lowercase f are nearly identical glyphs — the long s lacks the full crossbar — so models trained on modern type map it to "f". Fix it with an early-modern model, a targeted post-correction rule, or by normalising ſ to s explicitly.
Should I keep ligatures as Unicode or expand them?
Keep meaningful ligatures (æ, œ) as their own codepoints because they carry linguistic value; expand purely typographic ligatures (ﬁ, ﬂ) to their letters for searchability, ideally recording both forms in your transcript.
How do I stop OCR from dropping diacritics?
Ensure the model's character set includes every accented codepoint you expect, train or fine-tune on accented examples, and normalise output to NFC so combining marks do not split. Missing codepoints in the alphabet are the usual cause.
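
As a rough illustration of the character-set check, the sketch below compares the accented characters you expect against the text of a model's alphabet file, normalising both to NFC first so that precomposed and decomposed forms compare equal. The function names and the sample alphabet string are hypothetical.

```python
import unicodedata

def missing_from_alphabet(expected_chars, alphabet_text):
    """Return expected characters the model cannot emit, comparing in NFC."""
    alphabet = set(unicodedata.normalize("NFC", alphabet_text))
    wanted = set(unicodedata.normalize("NFC", expected_chars))
    return sorted(wanted - alphabet)

def has_bare_combining_marks(text):
    """True if combining marks remain after NFC (no precomposed form, or stray marks)."""
    nfc = unicodedata.normalize("NFC", text)
    return any(unicodedata.combining(ch) for ch in nfc)

# Hypothetical alphabet contents; in practice read them from your model's alphabet file.
alphabet_text = "abcdefghijklmnopqrstuvwxyz e\u0301 \u00F1"
print(missing_from_alphabet("\u00E9\u00E0\u00F1\u00E7", alphabet_text))
print(has_bare_combining_marks("cafe\u0301"))
```

Anything the first function reports has to be fixed on the model side; the second is a quick sanity check that NFC output contains no leftover combining marks for a later step to strip.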
Can I fix special-character errors without retraining?
Yes. Many are systematic (ſ→f, ﬁ→fi failures) and respond to regex post-correction rules keyed to context. Retraining only pays off when errors are frequent and context-dependent enough that rules misfire.
What Unicode form should historical transcripts use?
Use NFC (precomposed) for storage and search interoperability, but keep a record of original glyph forms (long s, ligatures) if your edition needs diplomatic fidelity. Normalise consistently across the whole corpus.
Why do em dashes and quotation marks come out wrong?
OCR often maps en/em dashes and curly quotes to hyphens or straight quotes because training data under-represents them. A punctuation normalisation pass fixes most cases without touching the model.