To quality-control Transkribus output, triage instead of reading everything: sample pages to estimate the error rate, hunt for systematic errors that repeat across the collection, sort pages by recognition confidence to surface the worst offenders, and fully correct only the text that actually feeds your edition or database. Most QC time is wasted reading good pages; a disciplined checklist concentrates effort on the errors that matter and the patterns you can fix once for the whole batch.
What should a QC pass actually check?
A good pass moves from cheap-and-broad to expensive-and-narrow:
- Estimate: measure the character error rate (CER) on a small random sample to know roughly how bad things are.
- Pattern-hunt: find systematic, repeating errors (one fix, many corrections).
- Confidence triage: review the lowest-confidence pages first.
- Structural check: verify that line order, regions, and tags are intact.
- Targeted correction: fully proofread only the must-be-perfect text.
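The estimate step can be sketched in a few lines of Python. This is a minimal illustration, not a Transkribus feature: it assumes you have hand-corrected a few sampled pages and hold each one as a (corrected reference, HTR output) pair of strings. The function names `sample_pages`, `cer`, and `pooled_cer` are hypothetical.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def sample_pages(page_ids: list[str], k: int, seed: int = 0) -> list[str]:
    """Pick k pages at random; a fixed seed keeps the sample reproducible."""
    return random.Random(seed).sample(page_ids, min(k, len(page_ids)))


def pooled_cer(pairs: list[tuple[str, str]]) -> float:
    """CER over several (reference, hypothesis) pairs, weighted by reference length."""
    errors = sum(levenshtein(ref, hyp) for ref, hyp in pairs)
    return errors / sum(len(ref) for ref, _ in pairs)


# Hypothetical sample: (corrected reference, HTR output) for two lines.
pairs = [("the horse stood", "the house stood"), ("by the gate", "by the gate")]
print(f"Estimated CER: {pooled_cer(pairs):.1%}")  # Estimated CER: 3.8%
```

Pooling the errors over the whole sample, rather than averaging per-page CER values, keeps a short page from weighing as much as a long one.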
How do I tell systematic from random errors?
This distinction is the heart of efficient QC.
| | Systematic error | Random error |
|---|---|---|
| Pattern | Repeats predictably | One-off |
| Example | Long-s ſ always read as f | A smudge misread once |
| Fix | One find-and-replace, or retrain | Manual, page by page |
| Priority | Find these first | Mop up after |
One systematic find-and-replace can correct thousands of instances in seconds. Spend your first hour looking for patterns, not correcting individual words.
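One way to surface such patterns, sketched here under the assumption that you already hold a small set of (HTR output, corrected) line pairs, is to align each pair and count which characters were swapped for which. The sketch uses Python's standard `difflib`; the function name `substitution_counts` and the sample lines are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitution_counts(pairs):
    """Count (recognised, corrected) replacements across aligned line pairs."""
    counts = Counter()
    for recognised, corrected in pairs:
        matcher = SequenceMatcher(None, recognised, corrected, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "replace":
                continue
            wrong, right = recognised[i1:i2], corrected[j1:j2]
            if len(wrong) == len(right):
                # Same length: pair the characters up one by one.
                counts.update(zip(wrong, right))
            else:
                # Different length: keep the whole span as one replacement.
                counts[(wrong, right)] += 1
    return counts


# Hypothetical (HTR output, corrected) line pairs.
pairs = [
    ("poffible", "possible"),
    ("paffage", "passage"),
    ("a houfe", "a house"),
    ("the horse", "the house"),
]
for (wrong, right), n in substitution_counts(pairs).most_common():
    print(f"{wrong!r} -> {right!r}: {n}")
```

A replacement that tops the list many times over is a candidate systematic error; pairs that appear once are more likely random slips.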
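```bash
# Example: fix a systematic long-s misread (ſ read as f) in an exported text dump.
# Target specific misread words; a blanket f -> s replacement would corrupt every genuine f.
# -i.bak edits in place and keeps a backup copy of the original file.
sed -i.bak 's/poffible/possible/g; s/paffage/passage/g' collection_export.txt
# (verify on a sample first; never blind-replace a whole corpus)
```
How do I find the worst pages without reading them all?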
Use the model's confidence scores. Transkribus records a confidence per line; pages with low mean confidence are overwhelmingly the badly-recognised ones.
```text
Sort pages by mean line confidence (ascending):
p.0233 0.61 ← review first
p.0118 0.66
...
p.0044 0.98 ← almost certainly fine, review last
```
Reviewing the bottom 15% of pages by confidence typically catches the large majority of serious errors for a fraction of the reading effort.
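The sort itself is simple once the confidences are in hand. The sketch below assumes the per-line confidence values have already been extracted from your export into a dictionary keyed by page ID; how you extract them depends on the export format and is not shown. The function names and the sample values are hypothetical.

```python
import math
from statistics import mean


def rank_pages_by_confidence(line_confidences):
    """Return (page_id, mean_confidence) tuples, lowest confidence first."""
    scored = [(page, mean(confs)) for page, confs in line_confidences.items() if confs]
    return sorted(scored, key=lambda item: item[1])


def review_queue(line_confidences, fraction=0.15):
    """Page IDs for the lowest-confidence fraction of pages (at least one page)."""
    ranked = rank_pages_by_confidence(line_confidences)
    if not ranked:
        return []
    cutoff = max(1, math.ceil(len(ranked) * fraction))
    return [page for page, _ in ranked[:cutoff]]


# Hypothetical per-line confidences for three pages.
confidences = {
    "p.0044": [0.97, 0.99],
    "p.0233": [0.58, 0.64],
    "p.0118": [0.70, 0.62],
}
for page, score in rank_pages_by_confidence(confidences):
    print(f"{page} {score:.2f}")
print("Review first:", review_queue(confidences))
```

The `fraction` default mirrors the 15% rule of thumb; raise it if the sample CER suggests the collection is worse than expected.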
What CER is "good enough"?
There is no universal threshold — it depends on the destination of the text.
- Keyword discovery / search: 10-15% CER is often acceptable.
- Statistical text mining: aim for under ~5%.
- A published critical or documentary edition: under 1-2% after correction.
Match effort to purpose. Polishing search-only material to edition quality is wasted labour; shipping edition material at 8% is malpractice.
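To make the decision explicit, the thresholds can be written down as a small lookup. This sketch takes the upper end of each range given above as the cut-off; the names `TARGET_CER` and `needs_correction` are hypothetical, and the numbers are starting points to adjust for your project, not standards.

```python
# Upper CER bounds taken from the ranges above; adjust per project.
TARGET_CER = {
    "search": 0.15,       # keyword discovery / search
    "text_mining": 0.05,  # statistical text mining
    "edition": 0.02,      # published edition, after correction
}


def needs_correction(measured_cer: float, purpose: str) -> bool:
    """True if the measured CER is above the target for this purpose."""
    return measured_cer > TARGET_CER[purpose]


# The same 8% CER is acceptable for search but not for an edition.
print(needs_correction(0.08, "search"))   # False
print(needs_correction(0.08, "edition"))  # True
```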
Why proofread against the image, not the transcript?
HTR errors are often plausible — the model produces a real word that is simply the wrong word. Reading the transcript alone, your eye glides over "house" where the manuscript says "horse." Always work side by side: image left, text right, comparing line for line. It is slower per page, which is exactly why confidence triage matters — you only do it where it counts.
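A short illustration of why automated checks do not replace the image comparison: a dictionary lookup flags non-words but passes a wrong word that happens to be real. The vocabulary and sentences below are invented for the example, and `flag_non_words` is a hypothetical stand-in for a spellchecker.

```python
def flag_non_words(text, vocabulary):
    """Return tokens not found in the vocabulary (what a spellchecker would flag)."""
    return [w for w in text.lower().split() if w.strip(".,;:!?") not in vocabulary]


vocabulary = {"the", "horse", "house", "stood", "by", "gate"}

garbled = "the hovse stood by the gate"    # non-word: caught
plausible = "the house stood by the gate"  # real word, wrong word: missed

print(flag_non_words(garbled, vocabulary))    # ['hovse']
print(flag_non_words(plausible, vocabulary))  # []
```

Only the manuscript image can show that the second line should read "horse".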
How do I stop an error coming back next batch?
If QC keeps catching the same systematic misread, fix the cause, not the symptom. Add corrected lines containing that character to your ground truth and retrain or fine-tune the model. The next batch arrives already correct, and you stop paying the find-and-replace tax forever.
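Picking the lines to add can be partly automated. The sketch below assumes your corrected transcriptions are available as plain strings and that your transcription convention keeps the problem character (here the long-s) in the corrected text; `lines_for_ground_truth` is a hypothetical helper, and loading the lines back into Transkribus as ground truth is not shown.

```python
def lines_for_ground_truth(corrected_lines, target_chars, limit=None):
    """Select corrected lines containing any character the model keeps misreading."""
    targets = set(target_chars)
    selected = [line for line in corrected_lines if targets & set(line)]
    return selected if limit is None else selected[:limit]


# Hypothetical corrected lines from the current batch.
corrected = ["a poſſible paſſage", "the horse stood", "ſo be it"]
print(lines_for_ground_truth(corrected, "ſ"))
# ['a poſſible paſſage', 'ſo be it']
```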
Key Takeaways
- QC is triage: sample, pattern-hunt, confidence-sort, then correct selectively.
- Hunt systematic errors first — one fix corrects thousands of instances.
- Sort pages by recognition confidence and review the lowest-scoring first.
- Set your target CER by the text's destination, not an abstract ideal.
- Always proofread image against text, because HTR errors are often plausible wrong words.
- Cure recurring systematic errors by retraining, not by repeated find-and-replace.
Frequently Asked Questions
How do I quality control Transkribus output efficiently?
Triage rather than read everything. Sample pages to estimate CER, hunt for systematic errors that repeat, sort by recognition confidence to find the worst pages, and fully correct only the material that feeds your edition or database.
What is the difference between systematic and random errors?
A systematic error repeats predictably — the model always reads a long-s as f, for instance — and can be fixed with one find-and-replace. A random error is a one-off slip that needs manual correction. QC should prioritise finding systematic ones.
What CER is good enough to skip full correction?
It depends on use. For keyword discovery, 10-15% CER may be fine; for a published edition you want under 1-2% after correction. Match the effort to the intended use of the text, not an abstract ideal.
Can I use confidence scores to find bad pages?
Yes. Sort or filter pages by the model's mean line confidence and review the lowest-scoring pages first. Low confidence strongly correlates with poor recognition, so it concentrates your correction effort where it pays off.
Should I proofread against the image or the transcript alone?
Always against the image. Reading only the transcript hides errors that produce plausible but wrong words; side-by-side image-and-text comparison is the only reliable way to catch substantive mistakes.
How do I stop the same error recurring across a collection?
Fix it at the source: if the model systematically misreads a character, add corrected examples to the ground truth and retrain, rather than repeatedly find-and-replacing the same mistake on every batch.