OCR tables and registers accurately by treating structure recognition as a separate, first-class step before text recognition — detect the grid of rows, columns and cells, then read text into that grid. Run a normal OCR engine over a ruled ledger and it reads left-to-right across the whole page, interleaving columns and collapsing your neat table into an unusable wall of text. The fix is to recover geometry first, so that every value lands in the right cell and empty cells stay empty.
Why does OCR scramble tables into one block?
A line-based OCR engine assumes a single reading order: read this line, then the next. A table has multiple parallel reading orders — one per column — plus row groupings the engine cannot see. Without a structure model, "John Smith | 1847 | Baptism" and the row below it merge, and a blank middle cell silently shifts every following value one column left.
Detect structure, then read cells
The reliable pipeline has three stages:
```text
1. Table detection       → find the table region on the page
2. Structure recognition → detect row/column separators → cell grid
3. Cell-level OCR/HTR    → recognise text within each cell, write to grid[r][c]
```
For printed tables, table-transformer models or Amazon Textract produce a cell grid directly. For handwritten registers, Transkribus' table mode or a custom layout model gives you the grid, then per-cell HTR fills it.
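Stage 2 can be illustrated with a small sketch. It assumes the structure model has already returned row and column separator positions as pixel coordinates; the function name `build_cell_grid` and the sample coordinates are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def build_cell_grid(row_seps: List[int], col_seps: List[int]) -> List[List[BBox]]:
    """Turn N+1 row and M+1 column separators into an N x M grid of cell boxes."""
    ys = sorted(row_seps)
    xs = sorted(col_seps)
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


# Three horizontal rulings and four vertical rulings give a 2 x 3 grid.
grid = build_cell_grid(row_seps=[100, 140, 180], col_seps=[50, 300, 420, 600])
```

Each box in `grid[r][c]` can then be cropped and passed to whichever OCR or HTR engine handles stage 3.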
How do I OCR a parish register with fixed columns?
Parish registers, census returns and account books are a gift: every page shares the same ruled columns. Don't re-detect structure on every page; define a template once and apply it:
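```python
# Template-based extraction for a uniform register layout
import csv

# detect_row_separators() and ocr_cell() are placeholders for your own layout
# and recognition code; `page` is a loaded page image.
columns = {
    "surname": (0.05, 0.28),  # fractional x-bounds of each column
    "forename": (0.28, 0.46),
    "date": (0.46, 0.62),
    "event": (0.62, 0.80),
    "notes": (0.80, 0.98),
}

# "register.csv" is an example output path
with open("register.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(columns))
    writer.writeheader()
    rows = detect_row_separators(page)  # horizontal rulings or baselines
    for r in rows:
        record = {c: ocr_cell(page, r, xb) for c, xb in columns.items()}
        writer.writerow(record)
```
On uniform layouts, template extraction is far more robust than generic detection because it constrains where every value must be, and it makes empty cells explicit rather than dropped.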
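The fractional column bounds in the template have to become pixel boxes before a cell can be cropped. A minimal sketch of that conversion follows; the helper names `column_pixel_bounds` and `cell_crop_box` and the 2000-pixel page width are assumptions for illustration, not part of any library.

```python
from typing import Dict, Tuple


def column_pixel_bounds(
    columns: Dict[str, Tuple[float, float]], page_width: int
) -> Dict[str, Tuple[int, int]]:
    """Scale fractional (x0, x1) column bounds to pixel coordinates."""
    return {
        name: (round(x0 * page_width), round(x1 * page_width))
        for name, (x0, x1) in columns.items()
    }


def cell_crop_box(x_bounds: Tuple[int, int], row_bounds: Tuple[int, int]):
    """Combine a column's x-range and a row's y-range into (left, top, right, bottom)."""
    left, right = x_bounds
    top, bottom = row_bounds
    return (left, top, right, bottom)


template = {"surname": (0.05, 0.28), "date": (0.46, 0.62)}
pixel_cols = column_pixel_bounds(template, page_width=2000)
box = cell_crop_box(pixel_cols["date"], row_bounds=(880, 912))
```

The `(left, top, right, bottom)` order matches the box convention used by Pillow's `Image.crop`, so the result can be handed to it directly if Pillow is your imaging library.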
Keeping rows aligned when cells are empty
The single most common corruption is the shifted column: a blank cell produces no text, the reader skips it, and every subsequent value moves over by one. Anchor to the grid, not to text presence:
| Approach | Empty-cell behaviour | Robustness |
|---|---|---|
| Read by text flow | Blank dropped, columns shift | Poor |
| Detect cells, read each | Blank → empty value, alignment kept | Strong |
| Template grid | Blank → empty value, fixed schema | Strongest (uniform layouts) |
Always emit a value for every defined cell, even if it is the empty string.
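Grid anchoring can be sketched in a few lines. The example assumes the OCR engine returns words with centre coordinates and that cells hold a single line of text; `assign_words_to_grid` and the sample data are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, x_center, y_center)


def assign_words_to_grid(
    words: List[Word], row_seps: List[float], col_seps: List[float]
) -> List[List[str]]:
    """Place each word in the cell containing its centre; untouched cells stay ''."""
    n_rows, n_cols = len(row_seps) - 1, len(col_seps) - 1
    cells = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    for text, x, y in words:
        r = bisect_right(row_seps, y) - 1
        c = bisect_right(col_seps, x) - 1
        if 0 <= r < n_rows and 0 <= c < n_cols:  # ignore words outside the table
            cells[r][c].append((x, text))
    # Order words left to right within each cell (single-line cells assumed).
    return [[" ".join(t for _, t in sorted(cell)) for cell in row] for row in cells]


words = [
    ("John", 90, 120), ("Smith", 180, 121), ("Baptism", 480, 119),  # date cell blank
    ("Mary", 95, 160), ("Jones", 190, 161), ("1847", 350, 160), ("Burial", 470, 159),
]
table = assign_words_to_grid(words, row_seps=[100, 140, 180], col_seps=[50, 300, 420, 600])
```

The first row keeps an empty string in the date column, so "Baptism" stays in the third column instead of sliding left.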
Numbers, currency and the £-s-d trap
Historical ledgers use pre-decimal currency (£ s d), fractions, and ditto marks (″, "do."). OCR mangles these, and a single misread digit corrupts a balance. Treat numeric columns specially: validate against column sums where the register itself totals them, expand ditto marks by carrying the value above, and flag any cell whose OCR confidence is low for human review. Numeric columns deserve manual verification far more than free-text ones.
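Two of these checks are easy to sketch: expanding ditto marks and validating a £-s-d column against its stated total. The example assumes amounts have already been parsed into `(pounds, shillings, pence)` integer triples and ignores halfpennies and farthings. All function names are hypothetical, and the ditto set contains only the two marks named above, so extend it for your material.

```python
from typing import List, Tuple

DITTO_MARKS = {"″", "do."}  # extend with the marks your registers actually use

LSD = Tuple[int, int, int]  # (pounds, shillings, pence)


def expand_dittos(column: List[str]) -> List[str]:
    """Replace each ditto mark with the nearest non-ditto value above it."""
    out, last = [], ""
    for value in column:
        if value.strip().lower() in DITTO_MARKS:
            out.append(last)
        else:
            out.append(value)
            last = value
    return out


def to_pence(pounds: int, shillings: int, pence: int) -> int:
    """Pre-decimal sterling: 12 pence to the shilling, 20 shillings to the pound."""
    return pounds * 240 + shillings * 12 + pence


def from_pence(total: int) -> LSD:
    pounds, rem = divmod(total, 240)
    shillings, pence = divmod(rem, 12)
    return pounds, shillings, pence


def check_column_total(amounts: List[LSD], stated_total: LSD):
    """Return (matches, computed_total) for a column the register itself totals."""
    computed = sum(to_pence(*a) for a in amounts)
    return computed == to_pence(*stated_total), from_pence(computed)


events = expand_dittos(["Baptism", "″", "do.", "Burial", "Do."])
ok, computed = check_column_total([(1, 15, 6), (0, 7, 9), (2, 0, 0)], stated_total=(4, 3, 3))
bad, _ = check_column_total([(1, 15, 6), (0, 7, 9), (2, 0, 0)], stated_total=(4, 8, 3))
```

A failed total does not say which cell is wrong, only that the column needs review, so it pairs naturally with per-cell confidence flags.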
What output format should table OCR produce?
CSV is fine for a flat, single-value-per-cell register. But keep richer output when you can:
```json
{
  "row": 12, "col": "date",
  "text": "14 Mar 1847", "confidence": 0.71,
  "bbox": [412, 880, 530, 912]
}
```
The cell-to-pixel bbox is gold: it lets a reviewer jump straight to the image region for any suspect value, and it preserves provenance. ALTO XML or TEI tables serve the same purpose when you need an archival standard.
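Records in this shape make a review queue straightforward. The sketch below assumes `bbox` is `[left, top, right, bottom]` in pixels and uses an arbitrary 0.80 confidence threshold. The function names, the page dimensions and the second sample record are illustrative assumptions.

```python
import json


def review_queue(cells, threshold=0.80):
    """Return cells below the confidence threshold, least confident first."""
    flagged = [c for c in cells if c["confidence"] < threshold]
    return sorted(flagged, key=lambda c: c["confidence"])


def padded_bbox(bbox, pad, page_width, page_height):
    """Grow a [left, top, right, bottom] box by `pad` pixels, clamped to the page."""
    left, top, right, bottom = bbox
    return [
        max(0, left - pad),
        max(0, top - pad),
        min(page_width, right + pad),
        min(page_height, bottom + pad),
    ]


cells = [
    {"row": 12, "col": "surname", "text": "Smith", "confidence": 0.96,
     "bbox": [60, 880, 300, 912]},
    {"row": 12, "col": "date", "text": "14 Mar 1847", "confidence": 0.71,
     "bbox": [412, 880, 530, 912]},
]

queue = review_queue(cells)
# A little context around the cell helps a reviewer read a cramped hand.
crop = padded_bbox(queue[0]["bbox"], pad=10, page_width=1200, page_height=1600)
# One record per line (JSON Lines) keeps the output easy to stream and diff.
line = json.dumps(queue[0], ensure_ascii=False)
```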
Verifying a digitised register
Spot-check by reconstructing: render your extracted grid back over the page image and confirm cells align with the ruling. Then compute per-column error rates separately — a register can be 99% accurate on surnames and 90% on dates, and a single global figure hides exactly the column you most need to trust.
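Per-column error rates can be computed against a hand-checked sample. The sketch below uses character error rate (edit distance divided by reference length) as the metric, which is one reasonable choice rather than the only one; `per_column_cer` and the sample rows are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def per_column_cer(extracted, truth):
    """Character error rate per column; None where the reference column is empty."""
    edits, chars = {}, {}
    for got, want in zip(extracted, truth):
        for col, ref in want.items():
            edits[col] = edits.get(col, 0) + levenshtein(got.get(col, ""), ref)
            chars[col] = chars.get(col, 0) + len(ref)
    return {col: (edits[col] / chars[col] if chars[col] else None) for col in edits}


truth = [
    {"surname": "Smith", "date": "14 Mar 1847"},
    {"surname": "Jones", "date": "02 Apr 1847"},
]
extracted = [
    {"surname": "Smith", "date": "14 Mar 1841"},  # one misread digit
    {"surname": "Jones", "date": "02 Apr 1847"},
]
cer = per_column_cer(extracted, truth)
```

Here the surname column scores zero errors while the date column does not, which is exactly the difference a single global figure would blur.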
Key Takeaways
- Recover row/column/cell structure before OCR; a plain text flow scrambles every table.
- Use template-based extraction for uniform registers — it constrains values and preserves blanks.
- Anchor to the detected grid so empty cells stay empty and columns never shift.
- Treat numeric and currency columns specially: validate totals, expand ditto marks, flag low confidence.
- Keep cell-to-pixel coordinates so reviewers can verify any value against the image.
- Report per-column error rates; a global accuracy number hides the column you rely on most.
Frequently Asked Questions
Why does OCR scramble tables into one block of text?
Standard OCR reads in a single text flow and ignores cell boundaries, so columns interleave and rows merge. You need a table-structure recognition step that detects rows, columns and cells before reading text into them.
How do I OCR a parish register with consistent columns?
Exploit the fixed layout: define a column template once, segment each page to that grid, then OCR each cell region independently. Template-based extraction beats generic table detection when every page shares the same ruled structure.
What tools recover table structure from scans?
For printed tables, Amazon Textract or table-transformer models detect cells directly, while Tesseract's hOCR or TSV layout output supplies word boxes that you can group into cells; for handwritten registers, Transkribus table mode or a custom layout model plus per-cell HTR works better.
How do I keep rows aligned when cells are empty?
Anchor extraction to the detected grid, not to text presence, so empty cells produce empty values that preserve column alignment. Reading by text flow alone drops blanks and shifts every later value.
Can I extract handwritten ledgers into a spreadsheet?
Yes, by combining table-layout detection with handwritten text recognition per cell, then exporting to CSV. Accuracy depends on ruling clarity and hand consistency; expect manual verification of numeric columns.
Should table OCR output CSV or something richer?
Use CSV for simple flat tables, but prefer a structured format (JSON with cell coordinates, or TEI/ALTO) when you need provenance, confidence scores or multi-line cells. Keep the cell-to-pixel mapping for later correction.