Good OCR preprocessing follows a fixed order — crop to the text block, deskew, then carefully adjust contrast and noise — and it does the least manipulation that still helps. The single most reliable gain is deskewing, while the most overdone step is heavy binarisation, which can erase the faint strokes on faded archival pages. Modern LSTM engines often prefer clean grayscale to harsh black-and-white, so every change should be validated by measuring character error rate rather than judged by how "clean" the page looks.
Why does preprocessing order matter?
Each step depends on the last. If you binarise before deskewing, the threshold is computed on tilted text and bleeds across line edges. If you denoise before cropping, you waste effort on margins you will discard. The dependable sequence is:
- Crop to the page or text block.
- Deskew so lines are horizontal.
- Normalise illumination (fix uneven lighting / show-through).
- Optionally binarise.
- Optionally denoise — lightly.
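One way to make that order hard to break by accident is to encode it once and let optional steps be switched on explicitly. The sketch below is only an illustration: the step names and the `run_pipeline` helper are hypothetical, and the stand-in steps simply tag a list so the execution order is visible.

```python
from typing import Callable, Iterable

# Hypothetical step names; the order mirrors the list above.
STEP_ORDER = ["crop", "deskew", "normalise", "binarise", "denoise"]
OPTIONAL_STEPS = {"binarise", "denoise"}

def run_pipeline(image, steps: dict[str, Callable], enabled: Iterable[str] = ()):
    """Apply steps in the fixed order, skipping optional ones unless enabled."""
    enabled = set(enabled)
    applied = []
    for name in STEP_ORDER:
        if name in OPTIONAL_STEPS and name not in enabled:
            continue
        image = steps[name](image)
        applied.append(name)
    return image, applied

# Stand-in steps that only tag the "image" so the order is visible.
steps = {name: (lambda img, n=name: img + [n]) for name in STEP_ORDER}
result, applied = run_pipeline([], steps, enabled={"denoise"})
print(applied)  # ['crop', 'deskew', 'normalise', 'denoise']
```

In a real recipe each entry in `steps` would wrap an actual image operation; the point here is only that the sequence lives in one place and the optional steps are opt-in.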
How do you deskew reliably?
Deskewing is the highest-value, lowest-risk step. A 3-degree tilt is enough to push descenders into the line below and confuse segmentation.
```bash
# ImageMagick automatic deskew, then trim the border and reset the canvas
magick scan.tif -deskew 40% -fuzz 5% -trim +repage deskewed.tif
```

For batches with consistent tilt, measure the angle once and apply it across the set; for mixed material, per-image automatic deskew is safer. Always check a few results, because automatic deskew can over-rotate pages dominated by tables or images.
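To see why a small tilt matters, it helps to work out how far a line drifts vertically from one end to the other. The sketch below assumes a 2000 px wide text line (an illustrative figure, not a measurement) and also shows one simple way to measure the tilt of a batch by hand from two points picked on a baseline. Both helper names are hypothetical.

```python
import math

def vertical_drift(line_width_px: float, tilt_deg: float) -> float:
    """Vertical offset between the two ends of a text line tilted by tilt_deg."""
    return line_width_px * math.tan(math.radians(tilt_deg))

def tilt_from_baseline(x0: float, y0: float, x1: float, y1: float) -> float:
    """Tilt in degrees of a baseline through two hand-picked points (image coordinates)."""
    return math.degrees(math.atan2(y1 - y0, x1 - x0))

drift = vertical_drift(2000, 3.0)
print(f"{drift:.0f} px drift across a 2000 px line")  # about 105 px

angle = tilt_from_baseline(100, 500, 2100, 605)
print(f"measured tilt: {angle:.2f} degrees")
```

At 300 ppi, 12 pt line spacing is 50 px, so a drift of roughly 105 px spans about two lines of text. The sign of the measured angle depends on the coordinate convention of whichever tool applies the rotation, so confirm the direction on a single page before rotating a whole batch.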
Should you binarise, and how?
Binarisation — converting to pure black and white — used to be mandatory but is now optional and frequently harmful on historical pages. Test before committing.
| Page type | Recommended approach |
|---|---|
| Clean high-contrast print | Adaptive (Sauvola) binarisation |
| Faded or low-contrast ink | Grayscale, no binarisation |
| Show-through / bleed | Illumination correction first |
| Handwriting | Grayscale (never harsh threshold) |
Sauvola and other adaptive methods compute a local threshold per region and beat a single global threshold on uneven pages:
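```python
from skimage.filters import threshold_sauvola
from skimage.io import imread, imsave

# Load as grayscale; Sauvola computes one threshold per pixel neighbourhood
img = imread("deskewed.tif", as_gray=True)
t = threshold_sauvola(img, window_size=25)  # window_size must be odd

# Pixels brighter than the local threshold become white (255); ink becomes black (0)
imsave("binary.tif", (img > t).astype("uint8") * 255)
```

How do you fix faded ink and show-through?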
Faded archival pages and bleed-through from the verso are the two classic enemies. Background normalisation (dividing the page by a blurred estimate of its own background) lifts faint foreground without crushing it:
```bash
# Estimate the background with a large blur, then divide it out
magick deskewed.tif \( +clone -blur 0x30 \) \
  -compose Divide_Src -composite normalised.tif
```

This recovers pale text far more gently than thresholding, which tends to drop the faintest strokes entirely. Reserve aggressive contrast stretching for genuinely high-contrast print.
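The effect of dividing out the background is easiest to see on a single scanline. The following is a minimal sketch in plain Python, not a replacement for the ImageMagick command: it uses a moving average as a crude stand-in for the large blur, and the synthetic row, the 80% stroke darkness, and the helper names are all assumptions made for illustration.

```python
def box_blur(row, radius):
    """Moving average used as a crude stand-in for a large Gaussian blur."""
    out = []
    for i in range(len(row)):
        lo, hi = max(0, i - radius), min(len(row), i + radius + 1)
        out.append(sum(row[lo:hi]) / (hi - lo))
    return out

def normalise_background(row, radius=8):
    """Divide each pixel by the local background estimate, clipping to [0, 1]."""
    background = box_blur(row, radius)
    return [min(1.0, p / b) if b > 0 else 1.0 for p, b in zip(row, background)]

# Synthetic scanline: paper darkens from 0.9 to 0.5 left to right,
# with a faint stroke (80% of paper brightness) at two positions.
width = 40
paper = [0.9 - 0.4 * i / (width - 1) for i in range(width)]
row = list(paper)
for x in (8, 30):
    row[x] = paper[x] * 0.8

flat = normalise_background(row)
print(round(row[8], 2), round(row[30], 2))    # raw stroke values differ with lighting
print(round(flat[8], 2), round(flat[30], 2))  # normalised stroke values are close together
```

On the raw row, a single global threshold of 0.6 would treat the left-hand stroke as paper and the darker paper at the right edge as ink. After normalisation the paper sits near 1.0 everywhere and both strokes land at about the same value, so the faint text survives whatever step comes next.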
How much denoising is too much?
Denoising is where good intentions damage accuracy. A light median or non-local-means filter removes speckle from dusty scans; turn it up and it sands the serifs off letters and thins handwriting strokes until the model can no longer read them. The rule: apply the lowest setting that visibly removes specks, then stop. If denoising lowers CER, keep it; if it does not, leave the page alone.
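The trade-off can be shown on a toy binary image. This sketch applies a 3x3 median filter, written in plain Python for illustration, to a small grid containing a 3 px wide stroke, a 1 px hairline, and one isolated speck. The grid and the `median3x3` helper are made up for the example; real scans are grayscale and filters such as non-local means behave differently in detail, but the direction of the effect is the same.

```python
from statistics import median

def median3x3(img):
    """3x3 median filter on a list-of-lists image; border pixels are left unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out

def show(img):
    """Render ink as '#' and paper as '.'."""
    return "\n".join("".join("#" if v else "." for v in row) for row in img)

# 1 = ink, 0 = paper.
H, W = 7, 12
img = [[0] * W for _ in range(H)]
for y in range(H):
    for x in (2, 3, 4):
        img[y][x] = 1      # 3 px wide stroke
    img[y][7] = 1          # 1 px hairline
img[3][10] = 1             # isolated speck

filtered = median3x3(img)
print(show(img))
print()
print(show(filtered))
```

Away from the untouched border rows, the speck disappears and the wide stroke is preserved, but the hairline is erased as well. A filter just strong enough to remove dust is already strong enough to remove the thinnest strokes, which is why the setting has to be checked against CER rather than by eye.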
How do you validate the recipe?
Treat preprocessing as an experiment, not an art project:
- Transcribe a handful of representative pages as ground truth.
- OCR the raw scans; record the baseline CER.
- Add one preprocessing step; re-OCR the same pages.
- Keep the step only if CER drops; otherwise revert.
- Lock the winning recipe and apply it to the full batch.
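The loop above needs only a CER function and a keep-or-revert rule. Here is a minimal sketch in plain Python, assuming CER is defined as Levenshtein edit distance divided by the length of the ground-truth text. The function names and the sample transcriptions are invented for illustration and do not come from a real OCR run.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)

def mean_cer(truth: dict[str, str], ocr_output: dict[str, str]) -> float:
    """Average CER over the ground-truth pages."""
    return sum(cer(truth[p], ocr_output.get(p, "")) for p in truth) / len(truth)

def keep_step(baseline_cer: float, candidate_cer: float) -> bool:
    """Keep a step only if it strictly lowers CER."""
    return candidate_cer < baseline_cer

# Made-up transcriptions, for illustration only.
truth = {"p1": "faded archival page"}
raw = {"p1": "fadcd archiva1 page"}       # 2 substitutions
deskewed = {"p1": "faded archiva1 page"}  # 1 substitution

baseline = mean_cer(truth, raw)
candidate = mean_cer(truth, deskewed)
print(f"baseline {baseline:.3f}, with deskew {candidate:.3f}, keep: {keep_step(baseline, candidate)}")
```

Averaging per page gives short pages the same weight as long ones. Pooling all edits and dividing by the total character count is a reasonable alternative; whichever you choose, use the same definition for the baseline and for every candidate step.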
Key Takeaways
- Fixed order: crop, deskew, normalise illumination, then optionally binarise and denoise.
- Deskewing is the highest-value, lowest-risk step — do it first after cropping.
- Modern LSTM engines often prefer grayscale; binarise only if CER improves.
- Use adaptive (Sauvola) thresholding, not a global threshold, when you do binarise.
- Fix faded ink and show-through with background normalisation, not harsh thresholds.
- Validate every step against CER on ground truth; cleaner-looking is not always better.
Frequently Asked Questions
What is the correct order of preprocessing steps?
Crop to the text block, deskew, then handle contrast and noise. Cropping first stops layout analysis being distracted, and deskewing before binarisation keeps thresholds accurate.
Should I binarise images for modern OCR engines?
Not always. LSTM engines like Tesseract 5 and Kraken often perform best on grayscale, especially for faded or show-through pages. Binarise only if it measurably lowers CER.
How does deskewing improve OCR?
Even a few degrees of rotation confuses line detection and shifts characters across line boundaries. Deskewing straightens text so segmentation and recognition both work better.
What removes show-through and bleed-through?
Gentle background normalisation or a high-pass/illumination correction reduces show-through. Aggressive thresholding can hide it but often damages faint foreground text too.
Is denoising always beneficial?
No. Light denoising helps speckled scans, but heavy denoising erodes thin strokes and serifs, which hurts recognition. Use the lightest setting that visibly cleans the page.
What resolution should the scans be before OCR?
300-400 ppi lossless (TIFF or PNG) is the practical sweet spot for print. Avoid JPEG, whose compression artefacts around letter edges reduce accuracy.