Building an HTR pipeline from scratch means chaining six stages: acquire and clean the images, run layout analysis and line segmentation, transcribe ground truth, train a recognition model, run recognition, then post-correct and export. Get each stage right in order — and validate segmentation before training — and a single-scribe collection can reach 92-97% character accuracy. This guide uses Kraken as the open-source backbone, but the same stages apply in Transkribus or eScriptorium.
Stage 1: How do you prepare the images?
Start with lossless TIFFs at 300-400 ppi. Crop to the page, deskew, and keep grayscale — do not binarise handwriting, because thresholding erases faint pen strokes that the model needs.
```bash
magick raw.tif -deskew 40% -trim +repage \
  -colorspace Gray -normalize clean.tif
```
Consistency matters more than perfection. A uniform input (same crop logic, same contrast handling) lets the model learn the hand rather than the scanning noise.
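One way to keep the input uniform is to generate the same ImageMagick command for every page from a single place. The Python sketch below does that under a few assumptions: ImageMagick 7 (`magick`) is on the PATH, pages are `*.tif` files in one folder, and the helper names `build_clean_command` and `clean_directory` are illustrative rather than part of any tool.

```python
from pathlib import Path
import subprocess


def build_clean_command(src: Path, dst: Path) -> list[str]:
    """Return the ImageMagick arguments applied identically to every page."""
    return [
        "magick", str(src),
        "-deskew", "40%", "-trim", "+repage",
        "-colorspace", "Gray", "-normalize",
        str(dst),
    ]


def clean_directory(raw_dir: Path, clean_dir: Path, dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally run) one command per TIFF found in raw_dir."""
    clean_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for src in sorted(raw_dir.glob("*.tif")):
        cmd = build_clean_command(src, clean_dir / src.name)
        commands.append(cmd)
        if not dry_run:
            # Requires ImageMagick 7 to be installed; raises if a page fails.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `clean_directory(Path("raw"), Path("clean"), dry_run=False)` would process a hypothetical `raw/` folder; leaving `dry_run=True` only returns the commands so they can be reviewed first.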
Stage 2: How does layout analysis and segmentation work?
This is the stage that makes or breaks everything downstream. First detect text regions to exclude marginalia and decoration, then detect baselines within each region.
```bash
# Baseline segmentation with Kraken
kraken -i clean.tif lines.json segment -bl
```
Open the output and check that every line of writing has exactly one baseline and that interlinear additions are handled the way you want. If baselines are wrong, fix segmentation now: a perfect recognition model cannot rescue broken lines.
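Visual inspection can be backed by a quick automated screen. The Python sketch below flags degenerate baselines and pairs of baselines that nearly coincide, which often indicates one written line detected twice. It assumes the segmentation JSON has a top-level `lines` list whose entries carry a `baseline` list of `[x, y]` points; that layout is an assumption to check against your own `lines.json`, and the `min_gap` threshold is an arbitrary starting value in pixels.

```python
def baseline_problems(seg: dict, min_gap: float = 10.0) -> list[str]:
    """Return human-readable warnings about suspicious baselines."""
    problems = []
    summaries = []  # (line index, x_min, x_max, mean_y)
    for i, line in enumerate(seg.get("lines", [])):
        points = line.get("baseline") or []
        if len(points) < 2:
            problems.append(f"line {i}: baseline has fewer than 2 points")
            continue
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        summaries.append((i, min(xs), max(xs), sum(ys) / len(ys)))

    # Two baselines that overlap horizontally and sit at almost the same
    # height probably describe the same line of writing.
    for a in range(len(summaries)):
        for b in range(a + 1, len(summaries)):
            i, ax0, ax1, ay = summaries[a]
            j, bx0, bx1, by = summaries[b]
            overlap = min(ax1, bx1) - max(ax0, bx0)
            if overlap > 0 and abs(ay - by) < min_gap:
                problems.append(f"lines {i} and {j}: baselines nearly coincide")
    return problems

# Usage (hypothetical file name):
#   import json
#   with open("lines.json") as fh:
#       print(baseline_problems(json.load(fh)))
```

A clean report does not prove the segmentation is right; it only narrows down which pages need a human look first.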
Stage 3: How do you create ground truth?
Ground truth is line images paired with their correct transcription. Transcribe consistently using written guidelines: decide how to handle abbreviations, capitalisation, line breaks and damaged text before anyone transcribes a word, or your training data will contradict itself.
| Target | Words of ground truth | Notes |
|---|---|---|
| Fine-tune existing model | 500-2,000 | Fastest path if a base model fits |
| New single-hand model | 5,000-15,000 | ~50-150 pages |
| Multi-scribe series | 20,000+ | Diversity matters more than volume |
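Both the guidelines and the word budget can be checked mechanically. The Python sketch below counts words across transcribed lines and tallies any character that falls outside an agreed alphabet, a cheap signal that transcribers are diverging from the guidelines. The function name `audit_ground_truth` and the sample alphabet are illustrative assumptions; a real project would load its own lines and character set.

```python
from collections import Counter


def audit_ground_truth(lines: list[str], allowed: set[str]) -> tuple[int, Counter]:
    """Return (total word count, counts of characters outside `allowed`)."""
    words = sum(len(line.split()) for line in lines)
    unexpected = Counter(
        ch for line in lines for ch in line if ch not in allowed
    )
    return words, unexpected


# Example with a deliberately tiny alphabet: lowercase letters and space.
sample = ["the quick brown", "fox & hound"]
word_count, strays = audit_ground_truth(sample, set("abcdefghijklmnopqrstuvwxyz "))
print(word_count, dict(strays))
```

Comparing `word_count` against the targets in the table shows how far the corpus is from the chosen budget, and a non-empty `strays` counter points to the lines that need a guideline decision.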
Stage 4: How do you train the model?
Split your data: roughly 90% for training, 10% held out for validation. Train on a GPU if at all possible — CPU training is many times slower.
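If you prefer a fixed, reproducible split over a random one drawn at training time, the file list can be divided up front. The Python sketch below shuffles page files with a fixed seed and splits them by page, so lines from one page never land on both sides. The function name `split_pages`, the seed, and the file names are illustrative assumptions; whether and how your Kraken version accepts explicit training and evaluation lists should be checked with `ketos train --help`.

```python
import random


def split_pages(paths: list[str], train_fraction: float = 0.9,
                seed: int = 42) -> tuple[list[str], list[str]]:
    """Deterministically split page files into (training, validation)."""
    shuffled = sorted(paths)               # stable starting order
    random.Random(seed).shuffle(shuffled)  # same seed, same split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


pages = [f"page_{i:03d}.xml" for i in range(50)]  # hypothetical file names
train, val = split_pages(pages)
print(len(train), len(val))
```

Keeping the validation pages identical across runs makes accuracy figures from different training sessions comparable.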
```bash
# Train a Kraken recognition model from ground truth
ketos train -o my_model -f alto train/*.xml \
  --partition 0.9 --device cuda:0
```
Watch the validation accuracy curve. Stop when it plateaus; training past that point overfits to your sample and hurts generalisation. Save the best-performing checkpoint, not the last one.
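The stopping rule can be made concrete with a small patience check. The Python sketch below takes a list of per-epoch validation accuracies and returns the epoch of the best checkpoint and the epoch at which training would stop. Kraken's trainer has its own stopping options, so this only illustrates the rule; the function name `best_checkpoint` and the `patience` and `min_delta` values are assumptions.

```python
def best_checkpoint(val_accuracy: list[float], patience: int = 5,
                    min_delta: float = 0.0) -> tuple[int, int]:
    """Return (best_epoch, stop_epoch) for a validation accuracy curve.

    Training stops once `patience` epochs pass without an improvement
    larger than `min_delta` over the best accuracy seen so far.
    """
    if not val_accuracy:
        raise ValueError("need at least one epoch of validation accuracy")
    best_epoch, best = 0, float("-inf")
    for epoch, acc in enumerate(val_accuracy):
        if acc > best + min_delta:
            best_epoch, best = epoch, acc
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # The curve never plateaued long enough; stop at the final epoch.
    return best_epoch, len(val_accuracy) - 1


curve = [0.70, 0.82, 0.90, 0.93, 0.92, 0.93, 0.925, 0.93, 0.92]
print(best_checkpoint(curve, patience=3))
```

With the made-up curve above, the best checkpoint is epoch 3 and training stops at epoch 6, three epochs after the last real improvement.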
Stage 5: How do you run recognition and check quality?
Apply the trained model to fresh, unseen pages and measure CER against a small ground-truth set you did not train on.
```bash
kraken -i new_page.tif out.txt segment -bl ocr -m my_model_best.mlmodel
```
If accuracy disappoints, diagnose before retraining: inspect segmentation first, then check which characters the errors cluster around. Often a handful of confusable letterforms (long-s/f, u/n) account for most of the error.
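CER is the character-level edit distance between the model output and the reference, divided by the number of reference characters. The Python sketch below computes it with a standard Levenshtein alignment and also counts which (reference, output) character pairs go wrong, which is one way to see where errors cluster. The function names and the sample strings are illustrative; it assumes reference and recognised lines are already paired one to one.

```python
from collections import Counter


def align(ref: str, hyp: str) -> list[tuple[str, str]]:
    """Minimum-edit alignment; '' marks a deleted or inserted character."""
    n, m = len(ref), len(hyp)
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost)  # match / substitution
    # Walk back from the corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], ""))
            i -= 1
        else:
            pairs.append(("", hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def cer_and_confusions(refs: list[str], hyps: list[str]) -> tuple[float, Counter]:
    """Return (CER over all lines, counts of mismatched character pairs)."""
    errors, total, confusions = 0, 0, Counter()
    for ref, hyp in zip(refs, hyps):
        total += len(ref)
        for r, h in align(ref, hyp):
            if r != h:
                errors += 1
                confusions[(r, h)] += 1
    return (errors / total if total else 0.0), confusions


cer, confusions = cer_and_confusions(["sun and moon"], ["snn and moou"])
print(f"CER = {cer:.3f}", confusions.most_common(3))
```

Sorting the confusion counts with `most_common()` surfaces the few letterform pairs worth targeting with extra ground truth or rule-based fixes.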
Stage 6: How do you post-correct and export?
No HTR output is publication-ready raw. Build a lightweight correction step:
- Use confidence scores to flag low-certainty lines for human review.
- Apply rule-based fixes for systematic substitutions the model repeats.
- Export to a structured format — ALTO or PAGE XML preserves coordinates, or TEI for scholarly editions.
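The first two steps can be sketched in a few lines of Python. The example below applies regular-expression substitutions and flags lines whose mean character confidence falls below a threshold. Everything specific here is an assumption: the `(text, confidences)` input shape, the 0.85 threshold, and the two sample rules are placeholders to replace with values derived from your own model's output and error patterns.

```python
import re

# Hypothetical substitutions for errors a model might repeat.
RULES = [
    (re.compile(r"\bhoufe\b"), "house"),
    (re.compile(r"\btbe\b"), "the"),
]


def post_correct(lines: list[tuple[str, list[float]]],
                 threshold: float = 0.85,
                 rules=RULES) -> list[tuple[str, bool]]:
    """Return (corrected_text, needs_review) for each recognised line."""
    results = []
    for text, confidences in lines:
        for pattern, replacement in rules:
            text = pattern.sub(replacement, text)
        # Lines with no confidence data are always sent to review.
        mean_conf = sum(confidences) / len(confidences) if confidences else 0.0
        results.append((text, mean_conf < threshold))
    return results


recognised = [
    ("tbe houfe is old", [0.99] * 16),
    ("illegible ??", [0.4, 0.5, 0.6]),
]
for text, needs_review in post_correct(recognised):
    print(("REVIEW  " if needs_review else "ok      ") + text)
```

Using the mean confidence is a design choice; flagging on the minimum character confidence instead catches lines with a single doubtful word at the cost of more review work.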
What does the whole pipeline look like end to end?
Acquire → Clean → Segment regions → Detect baselines
→ Transcribe ground truth → Train → Recognise
→ Post-correct → Export (ALTO / PAGE / TEI)

Treat it as a loop, not a line: each round of corrections becomes more ground truth, and periodically retraining on the growing corpus steadily lifts accuracy across the collection.
Key Takeaways
- The six stages are clean, segment, transcribe, train, recognise, post-correct/export.
- Keep handwriting in grayscale — binarisation destroys the strokes HTR needs.
- Segmentation quality dominates results; validate baselines before training.
- Budget 5,000-15,000 words of ground truth for a solid single-hand model.
- Train on GPU, hold out 10% for validation, and keep the best checkpoint.
- Feed corrections back as new ground truth and retrain to keep improving.
Frequently Asked Questions
How much ground truth do I need to train an HTR model?
For a single consistent hand, 5,000-15,000 transcribed words (roughly 50-150 pages) usually gives a usable model. Fine-tuning an existing model can work with far less, sometimes a few hundred lines.
What are the core stages of an HTR pipeline?
Image acquisition and cleanup, layout analysis and line segmentation, transcription of ground truth, model training, recognition, and post-correction with export. Each stage feeds the next.
Can I build a pipeline without coding?
Yes. Transkribus and eScriptorium provide graphical end-to-end workflows. Coding with Kraken gives more control and reproducibility but is optional for many projects.
How long does training an HTR model take?
On a modern GPU, training a Kraken model on a few thousand lines often takes a few hours. On CPU it can take much longer, which is why GPU is recommended for training.
What is the most common cause of poor HTR results?
Bad line segmentation. If baselines are wrong, even a good recognition model produces garbage, so validate segmentation before blaming the model.
Should I segment by lines or regions first?
Detect text regions first to exclude marginalia and decoration, then detect baselines or lines within each region. This ordering keeps unrelated text out of your transcription stream.