For historical scans, use OCR when the page is mechanically printed or typewritten with separated glyphs, and use HTR (Handwritten Text Recognition) when the page is cursive, joined or otherwise handwritten. The two share a modern neural architecture, but they are trained on different shapes of data, so matching the engine to the material is the single biggest accuracy decision you make before any transcription begins.
What actually separates OCR from HTR?
Classic OCR was built on the assumption that text is a row of discrete, identically shaped glyphs from a known font. It segments characters, classifies each one, then reassembles words. That works beautifully for a 1960 paperback and falls apart on a 1640 secretary hand where letters merge into continuous ink.
HTR sidesteps per-character segmentation. Modern engines feed a whole text line into a recurrent or transformer network and predict a sequence, so they tolerate joined strokes, variable spacing and ligatures. The irony is that today's best "OCR" engines (Kraken, Tesseract 5's LSTM) use the same line-based sequence modelling, so the distinction is increasingly about the training corpus rather than the algorithm.
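To make "predict a sequence" concrete: many line recognisers are trained with CTC (connectionist temporal classification), where the network emits one label per horizontal step across the line image, including a special blank, and a decoder collapses that stream into text. The sketch below is a minimal greedy decoder under that assumption. The function name, the `-` blank symbol and the sample frames are illustrative and not taken from any particular engine.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol; real engines reserve a class index instead


def ctc_greedy_decode(frame_labels):
    """Collapse per-frame labels into text: merge repeats, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != BLANK)


# One best label per time step across a line image (hypothetical network output).
frames = list("--hh-ee-ll-lll--oo-")
print(ctc_greedy_decode(frames))
```

Because no step requires knowing where one letter ends and the next begins, joined strokes and ligatures do not break the decoder the way they break glyph segmentation. The blank between the two runs of `l` is what lets a doubled letter survive the merge.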
When should you reach for HTR instead of OCR?
Route the material, not the era. A printed 1700 broadside is an OCR job; a handwritten 1990 fieldwork notebook is an HTR job.
| Material | Recommended engine | Typical character accuracy |
|---|---|---|
| Clean modern print | OCR (Tesseract) | 98-99% |
| Degraded historical print | OCR with a period model | 90-97% |
| Typewritten documents | OCR | 95-99% |
| Single-scribe manuscript | HTR (trained model) | 92-97% |
| Multi-scribe archival series | HTR (generic + tuned) | 80-92% |
| Mixed print + marginalia | Both, by region | varies |
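The table above can be expressed as a small lookup, which is one way to keep routing decisions explicit in a batch workflow. This is a sketch only: the material keys and the `Route` type are hypothetical labels invented for the example, and a real project would derive the material type from catalogue metadata or a page classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    engine: str  # "ocr", "htr" or "both"
    note: str


# Keys are hypothetical labels for the rows of the table above.
ROUTES = {
    "clean_print": Route("ocr", "general print model"),
    "degraded_print": Route("ocr", "period-specific print model"),
    "typescript": Route("ocr", "general print model"),
    "single_scribe": Route("htr", "model trained on that hand"),
    "multi_scribe": Route("htr", "generic model, then tuned"),
    "mixed": Route("both", "route by region"),
}


def route(material: str) -> Route:
    """Pick an engine from the material type, never from the document's date."""
    try:
        return ROUTES[material]
    except KeyError:
        raise ValueError(f"unknown material type: {material!r}") from None


print(route("degraded_print").engine)  # e.g. a printed 1700 broadside
print(route("single_scribe").engine)   # e.g. a handwritten 1990 notebook
```

Raising on an unknown material, rather than falling back to a default engine, forces someone to look at the page before a whole batch goes through the wrong model.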
Do print and handwriting need different pipelines?
Yes, beyond the recognition model itself. Layout analysis differs: printed books have predictable columns and lines, while manuscripts need baseline detection that copes with slanting, crowded and interlinear writing. Preprocessing differs too — heavy binarisation that helps faded print can destroy the faint pen strokes HTR relies on. Keep grayscale for handwriting.
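The binarisation warning is easy to see on a toy example. The sketch below applies a single global threshold to one row of grayscale pixels; the pixel values and the threshold are invented for illustration, and real preprocessing tools use more adaptive methods that reduce this problem without removing it.

```python
# One row of 8-bit grayscale pixels (0 = black ink, 255 = white paper).
# All values are invented for illustration.
PAPER, PRINT_INK, FAINT_PEN = 225, 40, 170
row = [PAPER, PRINT_INK, PRINT_INK, PAPER, FAINT_PEN, FAINT_PEN, PAPER]


def binarise(pixels, threshold=128):
    """Global threshold: anything darker than `threshold` becomes ink (1)."""
    return [1 if value < threshold else 0 for value in pixels]


def ink_pixels(pixels, paper_level=PAPER, margin=20):
    """Count pixels noticeably darker than the paper in the grayscale row."""
    return sum(1 for value in pixels if value < paper_level - margin)


print(binarise(row))    # the two faint pen pixels are classified as paper
print(ink_pixels(row))  # the grayscale row still separates all ink from paper
```

The printed stroke survives thresholding, while the faint pen stroke is erased before the recogniser ever sees it.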
```bash
# Print page → Tesseract with a historical model
# (eng_best stands in for whichever traineddata file you have installed)
tesseract page.tif out --psm 4 -l eng_best
# Manuscript page → Kraken: segment baselines, then recognise with an HTR model.
# The two steps are chained so the recogniser receives the segmentation.
kraken -i page.tif out.txt segment -bl ocr -m my_hand.mlmodel
```
How much does choosing wrong cost you?
A lot of wasted effort. Running glyph-segmenting OCR on cursive can return character error rates above 50%, which is worse than useless because correcting it takes longer than transcribing from scratch. Conversely, throwing an HTR model at clean print is slower and sometimes less accurate than a dictionary-backed print engine, because the HTR model lacks the print engine's typographic priors.
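Character error rate is worth measuring on a few hand-transcribed sample pages before committing to an engine. A minimal sketch follows, assuming the usual definition of CER as Levenshtein edit distance divided by the length of the reference transcription. The function names and sample strings are made up for the example.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


truth = "received of John Smith"
print(f"{cer(truth, 'received of John Smith'):.2f}")
print(f"{cer(truth, 'recaved of Iohn Smth'):.2f}")
```

Character accuracy as quoted in the table is roughly one minus CER. Because insertions count as errors, CER can exceed 100% when an engine hallucinates extra characters, which is one reason a wrong-engine run can be worse than a blank page.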
Can one model handle a mixed collection?
For a collection that blends printed forms with handwritten entries — parish registers, ledgers, completed questionnaires — you have three options:
- Region routing: layout analysis tags printed zones and handwritten zones, then each is sent to the appropriate model. Most robust, more setup.
- A combined model: train HTR on ground truth that includes both print and hand. Simpler to operate, slightly lower peaks.
- Two passes: run print OCR, then HTR, and merge by confidence. Expensive, occasionally worth it for high-value material.
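The two-pass option can be sketched as a per-line merge. The example assumes both passes ran over the same line segmentation and that each engine reports a mean confidence between 0 and 1 per line. The `LineResult` type, the function name and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LineResult:
    text: str
    confidence: float  # assumed: a 0..1 mean score reported per line


def merge_by_confidence(ocr_lines, htr_lines):
    """For each line, keep whichever pass reports the higher confidence."""
    if len(ocr_lines) != len(htr_lines):
        raise ValueError("both passes must share the same line segmentation")
    return [
        ocr if ocr.confidence >= htr.confidence else htr
        for ocr, htr in zip(ocr_lines, htr_lines)
    ]


# Invented results for a printed form label and a handwritten entry.
ocr_pass = [LineResult("Name of parish:", 0.97), LineResult("Sl. Mczy'5", 0.31)]
htr_pass = [LineResult("Name of parish", 0.82), LineResult("St. Mary's", 0.90)]
for line in merge_by_confidence(ocr_pass, htr_pass):
    print(line.text)
```

One caveat: confidence scores from different engines are not necessarily calibrated to the same scale, so a raw comparison like this is only a starting point. Checking the merge against a few ground-truth pages shows whether one engine's scores need rescaling first.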
Key Takeaways
- Match the engine to the material: print and typescript to OCR, genuine handwriting to HTR.
- Modern OCR and HTR share neural line-recognition; the real difference is training data.
- Skip aggressive binarisation for handwriting — it erodes the faint strokes HTR needs.
- Expect 98-99% character accuracy on clean print but 92-97% from a well-trained model on a single hand.
- Running OCR on cursive can exceed a 50% character error rate (CER), costing more than transcribing by hand.
- For mixed collections, region-routing usually beats a single do-everything model.
Frequently Asked Questions
Is HTR just OCR for handwriting?
Functionally yes, and technically too: the two share the same deep-learning core today. The label HTR simply signals a model trained on cursive, joined and variable letterforms rather than discrete printed glyphs.
Can a single model do both print and handwriting?
Mixed models exist and engines like Kraken and Transkribus can recognise both, but accuracy is usually higher when you route printed pages to a print model and manuscript pages to an HTR model.
Why does OCR fail so badly on old handwriting?
Classic OCR assumes separated characters and a fixed typeface. Handwriting has connected strokes, variable spacing and idiosyncratic letterforms, so a glyph-by-glyph approach collapses.
Do I always need to train an HTR model myself?
No. Public models for common scripts (English court hand, German Kurrent, Latin minuscule) often reach usable accuracy. Train only when your hand or language is poorly covered.
What accuracy should I expect from HTR versus OCR?
Good print OCR reaches 98-99% character accuracy; a well-trained single-hand HTR model typically lands at 92-97%, and messy multi-scribe material can sit lower.
Is HTR worth it for typewritten documents?
Usually no. Typescript is regular enough that a print OCR engine handles it cheaply; reserve HTR for genuine manuscript hands.