OCR & HTR Pipelines

Batch OCR thousands of pages reliably by making the job idempotent, checkpointed and crash-tolerant: every page is an independent unit, its output is written atomically, and re-running the command resumes rather than restarts. The failure mode that ruins long digitisation runs is the all-or-nothing script — it processes 9,400 pages, hits a corrupt scan, throws, and you lose hours. The architecture below ensures any single page can fail or any machine can crash without costing you the rest of the batch.

How do I resume after a crash?

Idempotency is the whole game. Before processing a page, check whether its output already exists; if it does, skip it. Now re-running the exact same command continues from where it died, no bookkeeping required:

python
from pathlib import Path

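# run_ocr() and log_failure() stand in for your own OCR call and logger.
def process_page(img_path, out_dir):
    img_path = Path(img_path)              # accept str or Path
    out = Path(out_dir) / (img_path.stem + ".txt")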
    if out.exists():
        return "skip"                      # already done — resume-safe
    try:
        text = run_ocr(img_path)
        tmp = out.with_suffix(".tmp")
        tmp.write_text(text, encoding="utf-8")
        tmp.replace(out)                   # atomic: never a half-written file
        return "ok"
    except Exception as e:
        log_failure(img_path, e)
        return "fail"

The .tmp + replace() pattern is critical: replace() is an atomic rename as long as both paths sit on the same filesystem, so a crash mid-write leaves at worst a stray .tmp file, never a truncated output that the resume logic would mistake for "done".
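
Where power loss or a full disk is a realistic risk, the same pattern can be hardened a little. The following is a minimal sketch rather than the snippet above rewritten: `atomic_write_text` is a hypothetical helper name, and it assumes the temporary file can be created in the same directory as the final output so the rename stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(out_path, text, encoding="utf-8"):
    """Write text so readers see either no file/the old file or the complete new one."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target: os.replace is only atomic
    # when source and destination are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())       # push the bytes to disk before renaming
        os.replace(tmp_name, out_path)  # atomic rename over any existing file
    except BaseException:
        # Remove the partial temp file so it never lingers next to real output.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return out_path
```

The unique temp name from `mkstemp` also avoids two processes colliding on the same `.tmp` path, which could matter if a page is ever submitted twice.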

A manifest is your source of truth

Track every page in a manifest — SQLite or even a CSV — with a status column. It gives you resumability, live progress, and a failure list in one place:

sql
CREATE TABLE pages (
  path TEXT PRIMARY KEY,
  status TEXT DEFAULT 'pending',   -- pending | done | failed
  error TEXT,
  ms INTEGER
);

Query WHERE status='pending' to feed workers and WHERE status='failed' to build a retry pass. Never infer progress by counting files alone — the manifest distinguishes "not yet attempted" from "attempted and failed".
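
As an illustration of how the manifest might be driven from Python, here is a small sketch using the standard-library `sqlite3` module and the schema above. The helper names (`open_manifest`, `register_pages`, `mark`, `paths_with_status`) are hypothetical, chosen for this example only.

```python
import sqlite3


def open_manifest(db_path=":memory:"):
    """Open (or create) the manifest using the schema shown above."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               path TEXT PRIMARY KEY,
               status TEXT DEFAULT 'pending',
               error TEXT,
               ms INTEGER
           )"""
    )
    return conn


def register_pages(conn, paths):
    # INSERT OR IGNORE keeps existing rows, so re-registering is idempotent.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO pages (path) VALUES (?)",
            [(str(p),) for p in paths],
        )


def mark(conn, path, status, error=None, ms=None):
    # 'with conn' commits on success, so each status change is durable.
    with conn:
        conn.execute(
            "UPDATE pages SET status = ?, error = ?, ms = ? WHERE path = ?",
            (status, error, ms, str(path)),
        )


def paths_with_status(conn, status):
    rows = conn.execute(
        "SELECT path FROM pages WHERE status = ? ORDER BY path", (status,)
    )
    return [r[0] for r in rows]
```

A typical loop would call `paths_with_status(conn, "pending")` to build the work list and `mark(...)` as each result comes back. Keeping all writes in the parent process, rather than in the workers, sidesteps SQLite write contention.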

How do I stop one bad image killing the run?

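Isolate every page. The try/except in the snippet above logs the offender and moves on, so a corrupt JPEG or a zero-byte scan costs you exactly one page. A hard crash, such as an out-of-memory kill or a segfault in a native library, bypasses try/except and takes the worker down with it, but because outputs are idempotent a re-run picks up where the batch stopped. Collect failures and retry them separately, often with different settings (re-decode, downscale, alternate engine).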

text
Run 1: process all pending → 9,830 ok, 170 failed (logged)
Run 2: process only failed  → fix decoder, 160 recover, 10 truly bad → human review
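
The second run can be sketched as a loop over fallback strategies. This is one possible shape, not a prescribed design: `retry_failed` is a hypothetical name, and each strategy is assumed to be a callable that takes a page path and either returns OCR text or raises.

```python
def retry_failed(failed_paths, strategies):
    """Try each fallback strategy in order; return (recovered, still_failing).

    strategies is a list of (name, callable) pairs; each callable takes a
    path and returns OCR text or raises.
    """
    recovered = {}       # path -> (strategy name, text)
    still_failing = {}   # path -> last error message
    for path in failed_paths:
        last_error = "no strategies supplied"
        for name, ocr_fn in strategies:
            try:
                recovered[path] = (name, ocr_fn(path))
                break                      # first success wins
            except Exception as exc:
                last_error = f"{name}: {exc}"
        else:
            # Every strategy raised: queue the page for human review.
            still_failing[path] = last_error
    return recovered, still_failing
```

Recording which strategy recovered each page makes it easier to spot a systematic fix (for example, a decoder change) that should be promoted into the main pass.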

Should I process pages in parallel?

Yes — per-page OCR is embarrassingly parallel. Use a worker pool, keep workers stateless, and size the pool to your hardware:

python
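from concurrent.futures import ProcessPoolExecutor
from functools import partial
import os

# pending_pages: image paths still marked 'pending' in the manifest
process_with_dir = partial(process_page, out_dir="out")

if __name__ == "__main__":                  # needed where workers are spawned (Windows, macOS)
    workers = max(1, (os.cpu_count() or 2) - 1)   # CPU OCR: leave one core free
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so zip pairs each page with its result
        for page, result in zip(pending_pages, pool.map(process_with_dir, pending_pages)):
            update_manifest(page, result)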

| Workload | Worker sizing | Watch for |
| --- | --- | --- |
| CPU OCR (Tesseract) | cores − 1 | memory per worker |
| GPU HTR inference | 1–2 per GPU, batch internally | VRAM thrashing |
| Mixed I/O-bound | more workers than cores | disk/network saturation |

Because each worker is stateless and writes its own output, one crashing worker never corrupts shared state.

Throttling, memory and the long tail

Long runs leak: image libraries hold buffers, and some scans are enormous. Defend against it by capping per-page memory, downscaling oversized images before OCR, and recycling worker processes periodically so leaked memory is reclaimed (Python's ProcessPoolExecutor accepts max_tasks_per_child from 3.11 onward; multiprocessing.Pool calls the same option maxtasksperchild). Set a per-page timeout too: a single pathological image can otherwise hang a worker indefinitely.
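
One way to get both a hard timeout and full memory reclamation is to run each page's OCR in its own child process. The sketch below swaps in that approach instead of a worker pool; `run_with_timeout` is a hypothetical helper, and the command and timeout value in the comment are assumptions for illustration.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run an OCR command in a child process and give up after timeout_s seconds.

    Returns ("ok", stdout), ("fail", stderr) or ("timeout", "").
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising,
        # so the hung OCR process does not linger.
        return "timeout", ""
    if proc.returncode != 0:
        return "fail", proc.stderr.strip()
    return "ok", proc.stdout


# Example call (assumes the tesseract CLI is installed; 120 s is arbitrary):
#   status, text = run_with_timeout(["tesseract", "page_0001.tif", "stdout"], 120)
```

Because the child exits after every page, any memory it leaked is returned to the operating system, at the cost of paying process start-up time per page.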

Per-page output now, aggregate later

Write one output file per page during the run for isolation and resumability; aggregate into a corpus file or database only as a final step:

bash
# Final aggregation once the batch is complete and verified
python aggregate.py --in out/ --out corpus.sqlite

A single giant output file accumulated during processing is fragile — one crash and it is ambiguous how much is valid.
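
What such an aggregation step might look like is sketched below. The article does not show the contents of `aggregate.py`, so this is an assumed design: the `aggregate` function, the `corpus` table and its two columns are all hypothetical.

```python
import sqlite3
from pathlib import Path


def aggregate(in_dir, out_db):
    """Load every per-page .txt file in in_dir into one SQLite table; return the row count."""
    conn = sqlite3.connect(out_db)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS corpus (page TEXT PRIMARY KEY, text TEXT)"
        )
        with conn:  # inserts commit together, or roll back together on error
            # Only *.txt is read, so leftover .tmp files are ignored.
            for txt in sorted(Path(in_dir).glob("*.txt")):
                conn.execute(
                    "INSERT OR REPLACE INTO corpus (page, text) VALUES (?, ?)",
                    (txt.stem, txt.read_text(encoding="utf-8")),
                )
        return conn.execute("SELECT COUNT(*) FROM corpus").fetchone()[0]
    finally:
        conn.close()
```

Since the per-page files remain the primary record, the corpus database can be deleted and rebuilt at any time without re-running OCR.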

Verifying a completed batch

A finished run is not a correct run. Cross-check counts (input pages == done + failed), sample random outputs for empty or garbage text (a sign of a silently misconfigured engine), and confirm every page in the manifest reached a terminal status. Only then aggregate and report.
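
These checks can be automated in a few lines. The sketch below assumes the `pages` manifest table from earlier and per-page outputs named `<stem>.txt`; `verify_batch` is a hypothetical helper, and it only flags empty or missing output, so detecting garbage text still needs a human or a language-specific heuristic.

```python
import random
from pathlib import Path


def verify_batch(conn, out_dir, total_inputs, sample_size=20, seed=0):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    counts = dict(conn.execute("SELECT status, COUNT(*) FROM pages GROUP BY status"))
    done, failed = counts.get("done", 0), counts.get("failed", 0)

    # 1. Counts: every input page must be accounted for as done or failed.
    if done + failed != total_inputs:
        problems.append(
            f"count mismatch: {done} done + {failed} failed != {total_inputs} inputs"
        )

    # 2. Terminal status: nothing may still be pending or in an unknown state.
    for status, n in counts.items():
        if status not in ("done", "failed"):
            problems.append(f"{n} page(s) left in non-terminal status {status!r}")

    # 3. Spot check: sample 'done' pages and flag missing or empty output.
    done_paths = [
        r[0]
        for r in conn.execute(
            "SELECT path FROM pages WHERE status = 'done' ORDER BY path"
        )
    ]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    for path in rng.sample(done_paths, min(sample_size, len(done_paths))):
        out = Path(out_dir) / (Path(path).stem + ".txt")
        if not out.exists() or not out.read_text(encoding="utf-8").strip():
            problems.append(f"empty or missing output for {path}")
    return problems
```

Running aggregation only when the returned list is empty turns "verify, then aggregate" into an enforced gate rather than a habit.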

Key Takeaways

  • Make every page idempotent — check for existing output and skip — so re-running resumes automatically.
  • Write outputs atomically (.tmp then rename) so a crash never leaves a half-written "done" file.
  • Keep a manifest (SQLite/CSV) as the single source of truth for progress, resumability and retries.
  • Isolate each page in try/except; collect failures into a separate retry pass instead of aborting.
  • Parallelise with stateless workers sized to CPU cores or GPU throughput; recycle workers to cap memory.
  • Store per-page output during the run and aggregate only at the end, after verifying counts and samples.

Frequently Asked Questions

How do I resume an OCR batch after a crash?

Make the job idempotent: check for the output file before processing each page and skip it if present. With per-page outputs and a manifest of completed work, re-running the same command simply continues from where it stopped.

Should I process pages in parallel?

Yes. OCR is embarrassingly parallel per page, so a worker pool sized to your CPU cores or GPU throughput typically scales close to linearly until memory or disk I/O becomes the bottleneck. Keep workers stateless so one crash never corrupts the whole run.

How do I stop one bad image from killing the whole run?

Wrap each page in try/except, log the failure with the filename and error, mark the page as failed in the manifest, and continue. Collect failures into a retry list rather than aborting the batch.

What's the best way to track progress on a long run?

Maintain a manifest (CSV or SQLite) with one row per page and a status column, updated as each finishes. It gives you resumability, progress counts and a failure list in one place.

How many OCR workers should I run?

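For CPU OCR, start from the core count minus one (os.cpu_count() reports logical cores, so on hyperthreaded machines the physical core count may be the better ceiling); for GPU inference, batch pages per GPU and run one or two workers per card to keep it saturated without thrashing memory.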

Should I store OCR output per page or in one big file?

Store per page during processing for resumability and isolation, then aggregate into a corpus file or database at the end. One giant output file is fragile and impossible to resume cleanly.