To detect handwritten versus typed regions reliably, classify at the region or line level — not the whole page — and base the decision on stroke-width variance, baseline regularity and connected-component spacing. Most production failures come from one of three things: low scan resolution destroying stroke texture, a binary classifier that has no class for stamps and graphics, or applying a page-level label to a page that mixes both modalities. Fix those three and accuracy on real archival material typically jumps from the 70s into the low 90s.
This guide is a troubleshooting walkthrough: each section names a symptom, the root cause, and the concrete fix.
Why does my classifier label printed text as handwriting?
This is the single most common error, and it is almost always a scan-quality problem, not a model problem. Clean print has near-constant stroke width and a dead-straight baseline. But degraded print — foxed paper, broken type, ink bleed, JPEG compression artefacts — produces irregular stroke widths that look statistically like handwriting.
Check the histogram of stroke widths in a misclassified region before blaming the model:
```python
import numpy as np
from skimage.morphology import skeletonize
from scipy.ndimage import distance_transform_edt


def stroke_width_stats(binary):
    """Return (mean stroke width, coefficient of variation); binary has ink=True."""
    # Distance from every ink pixel to the nearest background pixel
    dt = distance_transform_edt(binary)
    # On the skeleton (stroke centreline) that distance is roughly half the stroke width
    widths = dt[skeletonize(binary)] * 2
    widths = widths[widths > 0]
    return widths.mean(), widths.std() / max(widths.mean(), 1e-6)


# low coefficient of variation (<~0.35) => likely print
mean_w, cv = stroke_width_stats(region_mask)
```

If the coefficient of variation is borderline, re-scan at 300-400 dpi and binarise (Sauvola or Otsu) before classifying. The texture you need lives in those pixels.
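A minimal sketch of the binarisation step, assuming a greyscale region with dark ink on light paper. It uses scikit-image's `threshold_sauvola` and `threshold_otsu` to produce the ink=True mask that the stroke-width check expects. The helper name `binarise` and the window size of 25 are illustrative assumptions, not tuned values.

```python
import numpy as np
from skimage.filters import threshold_otsu, threshold_sauvola


def binarise(gray, method="sauvola", window_size=25):
    """Return a boolean mask with ink=True from a greyscale image (dark ink, light paper).

    `window_size` is an illustrative default; it must be odd for Sauvola.
    """
    if method == "otsu":
        # One global threshold: fine for evenly lit scans
        thresh = threshold_otsu(gray)
    else:
        # Per-pixel local threshold: more tolerant of stains and uneven lighting
        thresh = threshold_sauvola(gray, window_size=window_size)
    # Pixels at or below the threshold are treated as ink
    return gray <= thresh
```

Sauvola is usually the safer choice on foxed or unevenly lit paper because the threshold adapts locally; Otsu is quicker when the background is uniform.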
Should I classify whole pages or individual regions?
Region-level, for anything real. A printed form with handwritten answers, a typed letter with a signature, a ledger with a printed rule and inked entries — all are single pages with both modalities. A page-level label is wrong by construction.
The robust pipeline is two stages:
- Segment the page into regions/lines with a layout model.
- Classify each region's modality independently.
| Approach | Granularity | Strength | Weakness |
|---|---|---|---|
| Page-level CNN | Whole page | Fast, simple | Fails on mixed pages |
| Region classifier | Text region | Handles forms, annotations | Needs good segmentation first |
| Line-level classifier | Single line | Best for interlinear notes | More regions to label, slower |
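The two-stage pipeline can be sketched as a small function that takes the segmenter and the modality classifier as plug-in callables. Everything named here (`Region`, `classify_page`, the `segment` and `classify` callables) is hypothetical scaffolding for illustration, not the API of any particular layout tool.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Region:
    region_id: str
    box: Box
    modality: str = "unknown"


def classify_page(
    page: Any,
    segment: Callable[[Any], Iterable[Region]],
    classify: Callable[[Any, Box], str],
) -> List[Region]:
    """Stage 1: segment the page. Stage 2: label each region independently."""
    regions = list(segment(page))
    for region in regions:
        # Each region gets its own label, so a mixed page is never forced
        # into a single page-level modality
        region.modality = classify(page, region.box)
    return regions
```

Keeping the two stages behind separate callables means the segmenter can be swapped (region-level or line-level) without touching the modality classifier.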
Can layout tools detect modality automatically?
Partly. Kraken's blla baseline model and Transkribus's P2PaLA can emit region-type labels, and you can train them to output a printed vs handwritten region type. But generic segmentation models only draw boxes; they do not reliably tag modality on their own. Treat modality as a separate classification head you train, not a free side-effect of segmentation.
A pragmatic recipe: segment with Kraken, then run a small CNN (a fine-tuned MobileNetV3 on 64×256 region crops) as the modality classifier. A few thousand labelled region crops per modality is usually enough.
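Before crops reach the classifier they need a fixed 64×256 shape. The sketch below is one way to do that with NumPy only, assuming greyscale crops; the aspect-preserving resize, nearest-neighbour sampling and white padding are assumptions for illustration, and the helper name `to_fixed_crop` is hypothetical.

```python
import numpy as np


def to_fixed_crop(gray, height=64, width=256, fill=255):
    """Resize a 2-D greyscale crop to height x width, keeping aspect ratio and padding."""
    h, w = gray.shape
    # Largest scale at which the crop still fits inside the target canvas
    scale = min(height / h, width / w)
    new_h = max(1, min(height, int(round(h * scale))))
    new_w = max(1, min(width, int(round(w * scale))))
    # Nearest-neighbour sampling: map each output row/column back to a source index
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = gray[np.ix_(rows, cols)]
    # Centre the resized crop on a canvas filled with the paper colour
    canvas = np.full((height, width), fill, dtype=gray.dtype)
    top = (height - new_h) // 2
    left = (width - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas
```

Padding rather than stretching keeps stroke proportions intact, which matters when the classifier is meant to pick up on stroke texture.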
How do I route regions to the right engine?
Once each region carries a modality tag, write it into your PAGE or ALTO XML and split the workload:
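```python
# Routing sketch after classification. `page.regions`, `region.modality` and the
# per-region file names built from `region.id` are placeholders for your own data model.
import subprocess

MODELS = {"printed": "printed_model.mlmodel", "handwritten": "htr_model.mlmodel"}

for region in page.regions:
    key = "printed" if region.modality == "printed" else "handwritten"
    # Equivalent to: kraken -i <crop>.png <out>.txt ocr -m <model>
    subprocess.run(
        ["kraken", "-i", f"{region.id}.png", f"{region.id}.txt", "ocr", "-m", MODELS[key]],
        check=True,
    )
```

Then merge recognised text back by region ID so the reading order is preserved. Sending handwriting to a print OCR engine is where character error rates explode; the routing step is the whole point of detecting modality.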
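The merge step can be as small as the sketch below, assuming you already hold the page's reading order as a list of region IDs and a mapping from region ID to recognised text. Both inputs and the function name `merge_by_region_id` are hypothetical.

```python
from typing import Dict, Iterable


def merge_by_region_id(reading_order: Iterable[str], recognised: Dict[str, str]) -> str:
    """Join per-region text in the page's reading order, whichever engine produced it."""
    lines = []
    for region_id in reading_order:
        text = recognised.get(region_id)
        # Regions with no output (e.g. flagged for review) are skipped, not reordered
        if text is not None:
            lines.append(text.strip())
    return "\n".join(lines)
```

Because the order comes from the layout step rather than from the order in which the engines finish, printed and handwritten output interleave correctly.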
Why are stamps, signatures and marginalia misclassified?
Because a binary classifier has nowhere to put them. Stamps have heavy uniform ink, signatures are extreme cursive, marginalia mixes scripts. Add a third class — other/graphic — or set a confidence threshold (say 0.85) below which a region is flagged for human review rather than force-labelled. On archival forms this single change removed most of the embarrassing errors in our pipeline.
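A minimal sketch of the thresholding rule, assuming the classifier returns a probability per class. The class names, the `"review"` label and the function name `decide` are illustrative; 0.85 is the example threshold from the text, not a tuned value.

```python
from typing import Dict


def decide(probs: Dict[str, float], threshold: float = 0.85) -> str:
    """Return the top class, or "review" when the classifier is not confident enough."""
    label, confidence = max(probs.items(), key=lambda item: item[1])
    # Below the threshold, flag for a human instead of force-labelling
    return label if confidence >= threshold else "review"


print(decide({"printed": 0.91, "handwritten": 0.06, "other": 0.03}))
print(decide({"printed": 0.52, "handwritten": 0.43, "other": 0.05}))
```

The first call yields `printed`; the second falls below the threshold and is routed to review.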
What about mixed-modality lines?
Interlinear corrections and a printed line with a handwritten insertion break region-level logic too. Drop to line-level segmentation and, for the rare line that is genuinely mixed, split it at the connected-component gap where stroke statistics change. It is more work, but for annotated print it is the only honest answer.
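One way to locate the split is a simple change-point search over the line's connected components. The sketch assumes each component has already been reduced to `(x_start, x_end, stat)`, sorted left to right, where `stat` is a per-component stroke statistic such as stroke-width coefficient of variation. The scoring rule (largest difference between left and right means) and the name `find_split` are assumptions for illustration, not an established method.

```python
from statistics import mean
from typing import List, Optional, Tuple

Component = Tuple[float, float, float]  # (x_start, x_end, stroke statistic)


def find_split(components: List[Component], min_side: int = 2) -> Optional[Tuple[float, float]]:
    """Return (split_x, score) for the gap where the stroke statistic changes most.

    Returns None when there are too few components to leave `min_side` on each side.
    """
    best = None
    for i in range(min_side, len(components) - min_side + 1):
        left = mean(c[2] for c in components[:i])
        right = mean(c[2] for c in components[i:])
        score = abs(left - right)
        if best is None or score > best[1]:
            # Cut in the middle of the gap between neighbouring components
            split_x = (components[i - 1][1] + components[i][0]) / 2
            best = (split_x, score)
    return best
```

In practice you would only accept the split when the score clears some margin, and send the line to review otherwise.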
Key Takeaways
- Classify modality at the region or line level; page-level labels fail on real mixed documents.
- Most "print read as handwriting" errors are scan-quality problems — re-scan at 300-400 dpi and binarise first.
- Use stroke-width coefficient of variation and baseline regularity as cheap, interpretable features.
- Add a third class (stamps/signatures/graphics) or a confidence threshold; never force a binary choice.
- Segmentation does not equal modality labelling — train a separate classifier head.
- Tag modality in PAGE/ALTO XML and route printed vs handwritten regions to different engines.
- Below ~200 dpi the stroke texture that separates the two modalities is gone — fix the source, not the model.
Frequently Asked Questions
Why does my classifier label printed text as handwriting?
The usual cause is degraded print — broken type, ink bleed or low-resolution scans — whose stroke-width variance mimics handwriting. Re-scan at 300-400 dpi and binarise before classifying, and the false positives usually fall away.
Can layout analysis tools tell handwritten and typed apart automatically?
Some can. Kraken's blla model and Transkribus P2PaLA emit region types, and you can train a region classifier to tag zones as 'printed' or 'handwritten'. Out of the box, generic layout models only segment regions; they do not always label modality reliably on mixed pages.
What resolution do I need for reliable modality detection?
Aim for at least 300 dpi at the original page size. Below ~200 dpi the stroke micro-texture that separates print from script is lost, and both rule-based and CNN classifiers degrade sharply.
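If the scan metadata is missing, effective resolution can be estimated from the pixel dimension and the physical page size. The sketch below applies the 200 and 300 dpi figures from this answer; the function names and the `"marginal"` label are hypothetical.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Effective resolution along one page dimension."""
    return pixels / inches


def resolution_verdict(pixels: int, inches: float, minimum: float = 200, target: float = 300) -> str:
    """Gate a scan before modality classification."""
    dpi = effective_dpi(pixels, inches)
    if dpi < minimum:
        return "rescan"    # stroke texture is likely gone
    if dpi < target:
        return "marginal"  # usable, but expect weaker separation
    return "ok"


# An A4 page is about 8.27 in wide
print(resolution_verdict(1240, 8.27))
```

The example scan is about 150 dpi, so it is sent back for re-scanning rather than to the classifier.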
Should I classify whole pages or individual regions?
Classify at the region or line level for mixed documents such as forms, annotated print and ledgers. Page-level classification only works when each page is purely one modality, which is rare in real archives.
How do I route each region to the right engine after classification?
Tag each region with its predicted modality in your ALTO or PAGE XML, then send printed regions to an OCR engine (Tesseract, Kraken's printed models) and handwritten regions to an HTR engine, merging the recognised text back by region ID.
Why are stamps and signatures misclassified?
Stamps, signatures and marginalia have stroke characteristics unlike both clean print and running hand, so binary classifiers guess. Add a third 'other/graphic' class or a confidence threshold that flags ambiguous regions for human review.