OCR & HTR Pipelines

To improve OCR accuracy on old printed books, work in three layers: clean the image (deskew, crop, gentle denoise), pick a model trained on historical typography rather than a generic modern one, and constrain output with a period-appropriate dictionary or language model. Then measure character error rate (CER) on a ground-truth sample so each change is judged by numbers, not impressions. On degraded 17th- and 18th-century type, these steps routinely move CER from double digits down to 2-4%.

Why does generic OCR struggle with old type?

Modern OCR models are trained on contemporary fonts and assume modern orthography. Historical print breaks both assumptions: the long-s (ſ) looks like an f, ligatures like st and ff fuse glyphs, ink spread thickens strokes, and uneven inking leaves letters half-printed. A model that has never seen these patterns guesses wrong confidently. The fix is a model with historical priors plus preprocessing that recovers contrast without destroying fine serifs.

Which preprocessing steps matter most?

Order matters. Crop to the text block first so layout analysis is not distracted by gutters and fingers, then deskew, then handle contrast.

```bash
# ImageMagick: trim borders first, then deskew, then a gentle contrast stretch
magick page.tif -fuzz 10% -trim +repage -deskew 40% +repage \
  -colorspace Gray -normalize page_clean.tif
```

Avoid the temptation to over-process. Aggressive thresholding shatters thin serifs and turns the long-s into noise. A useful rule: change one thing, re-run OCR on a fixed test page, and keep the change only if CER drops.

How do I pick the right OCR model?

This is where the largest gains hide. For Latin-script print before about 1800, a model that includes period letterforms beats the default by a wide margin.

| Material | Better Tesseract model | Why |
| --- | --- | --- |
| English print 1700-1900 | eng from tessdata_best (LSTM) | Stronger language model |
| Fraktur / blackletter | frk or script/Fraktur | Knows blackletter shapes |
| Early Latin print | lat or script/Latin | Handles long-s, ligatures |
| Multilingual page | combine with + | e.g. -l eng+lat+fra |

For the most demanding pages, training a small fine-tuned model on a few hundred lines of your own ground truth often outperforms any off-the-shelf option.

Can dictionaries and language models reduce errors?

Yes, but carefully. A modern dictionary will "correct" archaic spellings into wrong modern words — turning publick into public or shew into show. Build a period wordlist or feed Tesseract a custom user-words file:

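```bash
# Assumes eng.traineddata from tessdata_best is in the (hypothetical) directory below
tesseract page_clean.tif out --tessdata-dir /path/to/tessdata_best --psm 4 \
  -l eng --user-words period.lst
```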
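One way to bootstrap the period.lst file used above is to harvest words from existing transcriptions of the same era. The sketch below is an illustration, not a prescribed method: build_wordlist is a hypothetical helper name, a UTF-8 locale is assumed so that grep's [[:alpha:]] treats ſ as a letter, and the minimum-count filter is only a simple guard against one-off transcription typos.

```bash
# build_wordlist MIN_COUNT FILE...
# Print, one per line, every word that occurs at least MIN_COUNT times
# in the given plain-text transcriptions (hypothetical helper).
build_wordlist() {
  local min_count=$1
  shift
  cat -- "$@" |
    grep -oE '[[:alpha:]]+' |   # one word per line, spelling left untouched
    sort | uniq -c |            # count occurrences of each distinct word
    awk -v min="$min_count" '$1 >= min { print $2 }'
}
```

A call such as `build_wordlist 2 transcriptions/*.txt > period.lst` (the directory name is an assumption) produces one word per line, the layout Tesseract expects for a user-words file.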

A character n-gram language model trained on similar-era text is even better, because it scores letter sequences the page is likely to contain rather than forcing dictionary words.
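To make the idea concrete, here is a toy character-bigram scorer with add-one smoothing. It is a minimal sketch under stated assumptions: ngram_score is a hypothetical name, the training file is plain text from a similar era, and a character-aware awk (for example gawk in a UTF-8 locale) is assumed so that ſ counts as one character. A production setup would train on far more text and use the score to rank alternative readings from the OCR engine.

```bash
# ngram_score TRAIN_FILE CANDIDATE
# Print the mean log-probability per character bigram of CANDIDATE under a
# bigram model estimated from TRAIN_FILE with add-one smoothing.
ngram_score() {
  CANDIDATE=$2 awk '
    # Count bigrams in every training line, padded with spaces as boundaries.
    {
      line = " " $0 " "
      for (i = 1; i < length(line); i++) {
        a = substr(line, i, 1)
        b = substr(line, i + 1, 1)
        pair[a b]++
        first[a]++
        if (!(a in seen)) { seen[a] = 1; vocab++ }
        if (!(b in seen)) { seen[b] = 1; vocab++ }
      }
    }
    END {
      cand = " " ENVIRON["CANDIDATE"] " "
      v = vocab + 1                      # one extra slot for unseen characters
      for (i = 1; i < length(cand); i++) {
        a = substr(cand, i, 1)
        b = substr(cand, i + 1, 1)
        total += log((pair[a b] + 1) / (first[a] + v))
        n++
      }
      printf "%.4f\n", total / n
    }
  ' "$1"
}
```

Scores are negative; values closer to zero mean the letter sequence looks more like the training text. With period text as training data, an archaic spelling such as shew should outscore a modern reading the corpus never contains, without any dictionary lookup.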

What does a measurable improvement loop look like?

  1. Transcribe 5-10 representative pages by hand as ground truth.
  2. Run a baseline OCR pass and compute CER and WER.
  3. Apply one change (model, preprocessing or dictionary).
  4. Re-run and compare CER on the same pages.
  5. Keep changes that help, revert those that do not, and document the winning recipe so it is reproducible across the volume.
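The CER used in steps 2 and 4 can be computed with nothing more than awk. The following is a rough sketch rather than a reference implementation: cer is a hypothetical helper, it compares whole files (line breaks inside the text count as characters), and it assumes a character-aware awk such as gawk in a UTF-8 locale so that multi-byte glyphs are not counted as several errors. The quadratic loop is fine for single pages; dedicated evaluation tools are a better fit for whole volumes.

```bash
# cer REF_FILE HYP_FILE
# Print CER = Levenshtein distance / number of reference characters.
cer() {
  REF=$(cat -- "$1") HYP=$(cat -- "$2") awk '
    BEGIN {
      ref = ENVIRON["REF"]; hyp = ENVIRON["HYP"]
      n = length(ref); m = length(hyp)
      if (n == 0) { print "cer: empty reference" > "/dev/stderr"; exit 1 }
      for (j = 0; j <= m; j++) prev[j] = j        # row for the empty prefix
      for (i = 1; i <= n; i++) {
        cur[0] = i
        r = substr(ref, i, 1)
        for (j = 1; j <= m; j++) {
          best = prev[j - 1] + (r != substr(hyp, j, 1))     # match or substitution
          if (prev[j] + 1 < best) best = prev[j] + 1        # deletion
          if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1  # insertion
          cur[j] = best
        }
        for (j = 0; j <= m; j++) prev[j] = cur[j]
      }
      printf "%.4f\n", prev[m] / n
    }
  '
}
```

Running it on the same ground-truth page before and after a change, for example `cer gt/page01.txt baseline/page01.txt` and then `cer gt/page01.txt trial/page01.txt` (paths are illustrative), gives the two numbers the loop compares.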

How much resolution and what file format?

Scan at 300-400 ppi in a lossless format (TIFF or lossless PNG). JPEG compression artefacts around letter edges actively hurt recognition. Going above 400 ppi rarely helps print and slows every downstream step; reserve very high resolution for tiny footnote type or fragile detail.

Key Takeaways

  • The biggest single win is a model trained on historical typography, not generic OCR.
  • Crop, then deskew, then adjust contrast — and avoid over-aggressive binarisation.
  • Scan at 300-400 ppi, lossless TIFF/PNG; JPEG edge artefacts cost accuracy.
  • Use period-appropriate dictionaries; modern ones "correct" archaic spellings wrongly.
  • Normalise the long-s and ligatures in post-processing, never before recognition.
  • Always measure CER on a fixed ground-truth sample before keeping any change.

Frequently Asked Questions

What single change improves OCR accuracy the most on old print?

Using a model trained on historical typography (long-s, ligatures, period fonts) instead of a generic modern model usually delivers the biggest jump, often several percentage points of character accuracy.

Should I binarise scans before OCR?

For clean, high-contrast print, adaptive binarisation helps. For faded or show-through pages, modern LSTM engines often do better on grayscale, so test both and measure CER.

Does higher scan resolution always help?

Up to a point. Around 300-400 ppi is the sweet spot for most print; beyond that you mostly add file size and noise without recognition gains.

How do I handle the long-s and old ligatures?

Use a model that knows them (Tesseract's script/Latin or a Fraktur model) and normalise in post-processing, mapping the long-s to a regular s only after recognition, not before.
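As a small illustration of that post-processing step, the sed filter below maps the long-s and the common Unicode ligature code points to plain letters. The function name normalise_glyphs and the exact mapping list are assumptions for this sketch; extend the list with whatever glyphs your model actually emits, and keep the raw OCR output alongside the normalised copy.

```bash
# normalise_glyphs: read OCR text on stdin, write modernised text to stdout.
# Run this only on recognised text, never on the images or before OCR.
normalise_glyphs() {
  sed -e 's/ﬃ/ffi/g' -e 's/ﬄ/ffl/g' \
      -e 's/ﬀ/ff/g' -e 's/ﬁ/fi/g' -e 's/ﬂ/fl/g' \
      -e 's/ﬅ/st/g' -e 's/ﬆ/st/g' \
      -e 's/ſ/s/g'
}
```

Typical use would be `normalise_glyphs < out.txt > out.modern.txt`, which leaves out.txt untouched for later comparison against ground truth.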

Is a custom dictionary worth building?

Yes for specialised vocabularies, place names and archaic spellings. A targeted wordlist or language model reduces plausible-but-wrong substitutions that a generic dictionary introduces.

How do I know my changes actually helped?

Measure character error rate against a small ground-truth sample before and after each change. Trust numbers, not the eyeball impression of a single page.