Transkribus can transcribe tables and registers through its dedicated table mode: you draw a table region, define its rows and columns, recognise each cell, and the result exports with the grid intact — so a parish register or account book becomes structured data rather than a flat wall of text. The key is to define the cell structure explicitly before recognition, because that is what carries row-and-column position into the PAGE XML and lets you export clean CSV or a spreadsheet downstream.
Why not just use ordinary line detection?
Standard layout analysis draws baselines across the page, ignoring column boundaries. On a table that is a disaster: values from three columns get strung onto one line, and the relationship between "name," "date," and "amount" is lost.
```text
Plain line detection on a register:
"Maria Schmidt 1742 3 fl 12 kr"   ← one undifferentiated line

Table mode:
| name          | year | sum        |
| Maria Schmidt | 1742 | 3 fl 12 kr |
```

Table mode preserves which cell each value belongs to. That is the whole difference between a transcript and structured data.
How do I set up a table in Transkribus?
The workflow on a single page:
- Draw a table region over the ruled area.
- Add column separators along the vertical rulings.
- Add row separators along the horizontals.
- Let recognition run per cell so each cell gets its own baseline(s) and text.
- Check that every cell maps to the right row/column index.
Automatic table detection can do the grid for you on cleanly ruled tables; for faint or hand-drawn rulings, place the separators manually.
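The last step of the workflow, checking that every cell maps to the right row/column index, can be partly automated once a page has been exported. The sketch below is one way to do it, not a Transkribus feature: `check_grid` is a hypothetical helper, and it assumes the exported PAGE XML marks each `TableCell` with `row` and `col` attributes. It ignores spans, so positions covered by a merged cell will show up as missing.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# PAGE namespace; assumed to match the one used in your export.
NS = {"p": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def check_grid(page_xml: str) -> dict:
    """Report duplicated and missing (row, col) positions among a page's table cells."""
    root = ET.fromstring(page_xml)
    # Count how many cells claim each (row, col) position.
    positions = Counter(
        (int(cell.get("row")), int(cell.get("col")))
        for cell in root.iterfind(".//p:TableCell", NS)
    )
    if not positions:
        return {"duplicates": [], "missing": []}
    n_rows = max(r for r, _ in positions) + 1
    n_cols = max(c for _, c in positions) + 1
    duplicates = sorted(pos for pos, n in positions.items() if n > 1)
    missing = sorted(
        (r, c)
        for r in range(n_rows)
        for c in range(n_cols)
        if (r, c) not in positions
    )
    return {"duplicates": duplicates, "missing": missing}
```

An empty result for both lists suggests the grid is complete and unambiguous; anything else points at the cells to inspect in the editor.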
Should I detect automatically or build a template?
It depends on the regularity of the source.
| Source type | Best approach | Why |
|---|---|---|
| Printed, cleanly ruled table | Automatic detection | Rulings are crisp; little correction |
| Consistent register, faint rules | Manual grid as a template | Same columns every page |
| Irregular / merged cells | Manual, per page | Structure varies too much |
Most historical registers keep the same columns page after page — births, marriages, tax rolls. Build the grid once on a representative page and reuse it as a template across the document; only the contents change.
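When a template is reused across many pages, it helps to confirm afterwards that every page really ended up with the template's column count. The following is a minimal sketch under assumptions: `column_count` and `pages_off_template` are hypothetical helpers, the pages are passed in as PAGE XML strings, and cells are assumed to carry `col` and (optionally) `colSpan` attributes.

```python
import xml.etree.ElementTree as ET

# PAGE namespace; assumed to match the one used in your export.
NS = {"p": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def column_count(page_xml: str) -> int:
    """Number of columns implied by the right-most cell edge on the page."""
    root = ET.fromstring(page_xml)
    edges = [
        int(cell.get("col")) + int(cell.get("colSpan", "1"))
        for cell in root.iterfind(".//p:TableCell", NS)
    ]
    return max(edges, default=0)


def pages_off_template(template_xml: str, pages: dict[str, str]) -> list[str]:
    """Names of pages whose column count differs from the template page."""
    expected = column_count(template_xml)
    return sorted(
        name for name, xml in pages.items() if column_count(xml) != expected
    )
```

Pages returned by `pages_off_template` are the ones where the template probably did not fit and the grid needs a manual look.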
How do I handle merged and empty cells?
These break naive transcription, so handle them deliberately:
- Merged cells: define one cell that spans the rows or columns it physically covers — do not split a merged value into phantom cells.
- Empty cells: leave them empty; never delete the cell. Deleting collapses the column and misaligns every row below it.
- Spanning headers: model a header row as its own row of cells, or as a region above the table.
Keeping the grid geometrically honest is what makes the export trustworthy.
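To see why spans and empty cells matter downstream, here is a small sketch of how a list of cells could be laid out into a rectangular grid. It is an illustration only: `expand_spans` is a hypothetical helper, and the `(row, col, row_span, col_span, text)` tuples stand in for whatever your export provides. Repeating a merged value in every position it covers is one possible design choice; writing it only to the top-left position would be equally defensible.

```python
def expand_spans(cells):
    """Lay out (row, col, row_span, col_span, text) cells as a rectangular grid.

    A merged cell's text is copied into every position it covers;
    positions with no text stay as empty strings so columns remain aligned.
    """
    cells = list(cells)
    if not cells:
        return []
    n_rows = max(r + rs for r, _, rs, _, _ in cells)
    n_cols = max(c + cs for _, c, _, cs, _ in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for r, c, rs, cs, text in cells:
        for rr in range(r, r + rs):
            for cc in range(c, c + cs):
                grid[rr][cc] = text
    return grid
```

Because empty cells are kept as empty strings rather than dropped, every row comes out with the same number of columns.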
How do I get the data out as a spreadsheet?
Export PAGE XML, which carries the full cell/row/column model, then transform. Plain-text export throws the grid away and is useless for tabular data.
```python
# Sketch: PAGE XML cells -> CSV rows preserving column order
import csv
import xml.etree.ElementTree as ET

ns = {"p": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}
tree = ET.parse("page0007.xml")

rows = {}
for cell in tree.iterfind(".//p:TableCell", ns):
    r, c = int(cell.get("row")), int(cell.get("col"))
    # Read line-level text only, so word-level copies are not counted twice,
    # and join multi-line cells with a space.
    lines = []
    for line in cell.iterfind(".//p:TextLine", ns):
        u = line.find("p:TextEquiv/p:Unicode", ns)
        if u is not None and u.text:
            lines.append(u.text.strip())
    rows.setdefault(r, {})[c] = " ".join(lines)

# Pad every row to the full column count so empty cells keep their position.
n_cols = max((c for cells in rows.values() for c in cells), default=-1) + 1
with open("register.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    for r in sorted(rows):
        w.writerow([rows[r].get(c, "") for c in range(n_cols)])
```

From CSV the register flows into a database, a pandas dataframe, or a visualisation: the payoff for getting the grid right at transcription time.
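The hand-off from CSV to a database can be sketched with the Python standard library alone. This is an illustrative example under assumptions, not a prescribed pipeline: `load_register` is a hypothetical helper, the table name `register` and the columns `name`, `year`, `amount` are invented to mirror the three-column example earlier, and the CSV is assumed to have exactly three fields per row and no header row.

```python
import csv
import io
import sqlite3


def load_register(csv_text: str, conn: sqlite3.Connection) -> int:
    """Insert three-column CSV rows into a 'register' table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS register (name TEXT, year INTEGER, amount TEXT)"
    )
    # Skip blank lines; each remaining row must have exactly three fields.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    conn.executemany("INSERT INTO register VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

With the rows in SQLite, a question such as "all entries for 1742" becomes a single `SELECT ... WHERE year = 1742` rather than a manual search through the transcript.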
Key Takeaways
- Use Transkribus table mode, not plain line detection, for any register or ledger.
- Define cells, rows, and columns explicitly so position survives into PAGE XML.
- Automatic detection suits clean printed tables; build a reusable template for consistent registers.
- Model merged cells as single spanning cells and keep empty cells in place to preserve alignment.
- Export PAGE XML (not plain text), then transform to CSV/spreadsheet/database.
- Getting the grid right at transcription time is what turns images into queryable structured data.
Frequently Asked Questions
Can Transkribus transcribe tables and registers?
Yes. Transkribus has a dedicated table mode that lets you draw a table region, define rows and columns, and recognise each cell, so registers and ledgers export with their column structure intact rather than as a flat block of text.
How do I get columns to stay separate on export?
Define the table structure explicitly with cells and columns rather than relying on plain line detection. Properly defined cells carry their row and column position into the PAGE XML, which you can then export to a CSV or spreadsheet.
Should I use automatic or manual table detection?
Use automatic table detection on cleanly ruled, printed tables to save time, and correct the grid manually where rulings are faint. Registers that keep the same columns on every page are best served by a manual grid reused as a template; pages with irregular or merged cells need the grid built manually, page by page.
How do I handle a column layout that repeats across the whole collection?
Build the table structure once on a representative page and reuse it as a template across the document, since registers usually keep the same columns page after page. Then only the cell contents change per page.
What format preserves the table structure on export?
PAGE XML preserves the full cell, row, and column model. From there you can transform to CSV, a spreadsheet, or a database; a plain-text export discards the grid and is unsuitable for tabular data.
How are merged or empty cells handled?
Define a merged cell as a single cell spanning the rows or columns it covers, and leave genuinely empty cells empty rather than deleting them, so the column alignment of every other row stays correct on export.