Improve a Weak Transkribus Model

A weak Transkribus model is almost always fixable, but only if you diagnose before you retrain. A high character error rate (CER) usually comes from one of four causes: too little ground truth, inconsistent transcription, broken line segmentation, or training without a matching base model. The cure is rarely "more data" in general; it is the right data aimed at the model's specific failures. Here is how to find the fault and push accuracy up.

How do you tell a layout problem from a text-model problem?

This is the first fork, and skipping it wastes weeks. If the baselines are wrong — lines merged, split, skipped or out of order — the recogniser produces garbage regardless of how good the HTR model is. Open a few error-heavy pages and look at the detected lines:

Lines merged into one, or a line split into two pieces.
Marginalia swallowed into the main text block.
Lines read out of reading order.

If you see these, the fix is segmentation, not the text model. Re-run layout with a different line-detection model, correct baselines on the worst pages, and only then measure the text model's true CER.
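
As a rough first screen before opening pages by eye, line counts can be compared between ground truth and detected layout. The sketch below assumes you have already exported both as lists of line strings per page; the `pages` dictionary and the function name are hypothetical and not part of Transkribus. It catches merged, split and skipped lines, but not reading-order problems, which still need a visual check.

```python
def flag_layout_suspects(pages, tolerance=0):
    """Return ids of pages whose detected line count differs from ground truth.

    `pages` maps a page id to (ground_truth_lines, detected_lines).
    """
    suspects = []
    for page_id, (gt_lines, detected_lines) in pages.items():
        if abs(len(gt_lines) - len(detected_lines)) > tolerance:
            suspects.append(page_id)
    return suspects


# Hypothetical sample data for illustration only.
pages = {
    "p001": (["line a", "line b", "line c"], ["line a", "line b", "line c"]),
    "p002": (["line a", "line b", "line c"], ["line a line b", "line c"]),  # merged
    "p003": (["line a", "line b"], ["line a", "li", "ne b"]),  # split
}

print(flag_layout_suspects(pages))
```

Pages flagged this way are candidates for baseline correction before any text-model CER is trusted.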

How do you find where the model actually fails?

Open the model's validation results and the side-by-side compare view that shows machine output against ground truth. Look for patterns, not single mistakes:

Symptom	Likely cause	Fix
One letterform always wrong (e.g. long-s)	Underrepresented in training	Add pages rich in that form
One scribe's pages bad	Hand absent from training	Add ground truth from that hand
Abbreviations expanded inconsistently	Transcription rules not fixed	Standardise, re-transcribe
Whole page gibberish	Segmentation, not text	Fix baselines first

Cluster the errors by character and by page/hand. Those clusters tell you exactly which ground truth to add.
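
Clustering by character can be approximated outside the platform with a character-level alignment. This is a minimal sketch that assumes you have pairs of ground-truth and model-output lines as plain strings; it uses Python's standard `difflib` rather than any Transkribus feature, and the sample lines are invented.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(ground_truth, prediction):
    """Count (expected, got) substring pairs where the two strings differ."""
    confusions = Counter()
    matcher = SequenceMatcher(None, ground_truth, prediction, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            confusions[(ground_truth[i1:i2], prediction[j1:j2])] += 1
    return confusions


# Invented example lines: the model reads long-s as f.
pairs = [
    ("ſo ſaid the miſtreſs", "fo faid the miftrefs"),
    ("the ſame day", "the fame day"),
]

totals = Counter()
for gt, pred in pairs:
    totals += char_confusions(gt, pred)

for (expected, got), count in totals.most_common(5):
    print(f"{expected!r} read as {got!r}: {count}")
```

Running the same tally per hand or per page, rather than over the whole set, shows whether a confusion is general or confined to one scribe.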

Why does adding random ground truth barely help?

CER is dominated by the cases the model gets wrong. Adding more easy pages that it already reads at 99% accuracy barely moves the needle. The high-value additions are pages that contain the failing letterforms, hands and constructions. A focused 5,000 words from the hard cases beats 20,000 words of more of the same.
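
The arithmetic behind this can be sketched with a character-weighted average. All figures below are invented for illustration: a validation set of 80,000 easy characters at 1% error and 20,000 hard characters at 40% error.

```python
def overall_cer(groups):
    """Character-weighted CER over groups of (character_count, error_rate)."""
    total_chars = sum(chars for chars, _ in groups)
    total_errors = sum(chars * rate for chars, rate in groups)
    return total_errors / total_chars


# Illustrative numbers only, not measurements.
baseline = overall_cer([(80_000, 0.01), (20_000, 0.40)])
easy_halved = overall_cer([(80_000, 0.005), (20_000, 0.40)])
hard_halved = overall_cer([(80_000, 0.01), (20_000, 0.20)])

print(f"baseline:         {baseline:.1%}")
print(f"easy errors /2:   {easy_halved:.1%}")
print(f"hard errors /2:   {hard_halved:.1%}")
```

Under these assumptions, halving the error rate on easy pages takes overall CER from 8.8% to 8.4%, while halving it on hard pages takes it to 4.8%.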

Targeting strategy
  1. List the 10 worst-CER pages from validation
  2. Identify the shared difficulty (script, ink, hand, abbreviation)
  3. Transcribe new ground truth that is rich in that difficulty
  4. Fine-tune as a new version, re-measure CER
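
Step 1 can be sketched offline if you have ground truth and model output per page as plain strings. The code below computes CER as edit distance divided by ground-truth length, a common definition that may differ in detail from the figure the platform reports; the function names and sample data are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ground_truth, prediction):
    """Character error rate relative to the ground-truth length."""
    if not ground_truth:
        return 0.0 if not prediction else 1.0
    return levenshtein(ground_truth, prediction) / len(ground_truth)


def worst_pages(pages, n=10):
    """Return the n pages with the highest CER as (page_id, cer) pairs."""
    scored = [(cer(gt, pred), page_id) for page_id, (gt, pred) in pages.items()]
    scored.sort(reverse=True)
    return [(page_id, score) for score, page_id in scored[:n]]


# Invented sample: page id -> (ground truth, model output).
sample = {
    "a": ("abcd", "abcd"),
    "b": ("abcd", "abxd"),
    "c": ("abcd", "xxxx"),
}
print(worst_pages(sample, n=2))
```

The pages this returns are the ones to inspect for a shared difficulty in step 2.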

How do you fix inconsistent transcription?

Inconsistent ground truth actively teaches the model wrong answers. Decide explicit rules and apply them everywhere:

Expand or keep abbreviations — pick one and be consistent.
Decide how to render u/v, i/j, long-s and capitalisation.
Normalise whitespace and line-break handling.

Re-check existing ground truth against the rules before adding more. A clean, consistent 12,000 words will out-train a messy 30,000.
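
One way to re-check existing ground truth is to encode the rules as a normalisation function and flag every line it would change. The rules below (long-s rendered as "s", runs of whitespace collapsed) are example choices made only for this sketch; substitute whatever your project has decided.

```python
import re
import unicodedata


def normalise_line(text):
    """Apply the example transcription rules to one line of ground truth."""
    text = unicodedata.normalize("NFC", text)  # one canonical form for accents
    text = text.replace("ſ", "s")              # example rule: long-s becomes s
    text = re.sub(r"\s+", " ", text).strip()   # collapse and trim whitespace
    return text


def find_violations(lines):
    """Return (line_number, original, normalised) for lines that break the rules."""
    violations = []
    for number, line in enumerate(lines, start=1):
        fixed = normalise_line(line)
        if fixed != line:
            violations.append((number, line, fixed))
    return violations


# Invented sample lines.
lines = ["so said the mistress", "the  ſame day ", "and no more"]
for number, original, fixed in find_violations(lines):
    print(f"line {number}: {original!r} -> {fixed!r}")
```

A rule such as "expand abbreviations" cannot be checked this mechanically and still needs a human pass.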

How do you retrain and confirm it actually improved?

Always fine-tune your existing model on the combined old-plus-new ground truth, and save it as a new version so you can compare:

Models
  Diary_1640s_v1   CER 11.4%   (baseline)
  Diary_1640s_v2   CER  6.8%   (+4,500 words targeting long-s & abbreviations)

Compare v2 against v1 on the same validation set. If CER dropped, keep it; if it regressed, the new data was most likely inconsistent or off-target. This versioned, measured loop is the whole discipline of improving a model.
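
The comparison itself is simple arithmetic. The sketch below uses the two CER figures from the example versions above and assumes both were measured on the same validation set; the function name and the "keep"/"investigate" labels are illustrative.

```python
def compare_versions(old_cer, new_cer):
    """Compare two CER percentages measured on the same validation set.

    Returns (verdict, absolute_gain_in_points, relative_gain_as_fraction).
    """
    absolute = old_cer - new_cer
    relative = absolute / old_cer if old_cer else 0.0
    verdict = "keep" if absolute > 0 else "investigate"
    return verdict, absolute, relative


verdict, absolute, relative = compare_versions(11.4, 6.8)
print(f"{verdict}: {absolute:.1f} points lower, {relative:.0%} relative reduction")
```

For v1 and v2 this gives a drop of 4.6 points, roughly a 40% relative reduction in errors.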

When should you stop and just correct by hand?

Improvement has diminishing returns. Once validation CER meets your threshold — under 10% for keyword search, under 5% for editing — further training costs more effort than it saves. At that point, human correction in the editor is the cheaper path to a finished transcription. Knowing when to stop is as important as knowing how to improve.
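
The stopping rule can be written down as a small check. The thresholds are the two quoted above; the dictionary keys and function name are made up for this sketch.

```python
# Thresholds in percent, taken from the guidance above.
THRESHOLDS = {"keyword search": 10.0, "editing": 5.0}


def good_enough(cer_percent, purpose):
    """True when validation CER is under the threshold for the given purpose."""
    return cer_percent < THRESHOLDS[purpose]


for purpose in THRESHOLDS:
    print(purpose, good_enough(6.8, purpose))
```

A model at 6.8% CER would pass for keyword search but not yet for editing.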

Key Takeaways

Diagnose first: distinguish a layout/segmentation fault from a text-model fault.
Garbled whole pages usually mean broken baselines, not a bad HTR model.
Use validation results to cluster errors by character, hand and page.
Add ground truth that targets the specific failures, not more easy pages.
Fix transcription inconsistencies before adding more data.
Always fine-tune as a new version and compare CER to prove improvement.
Stop when CER meets your threshold; beyond that, correct by hand.

Frequently Asked Questions

Why is my Transkribus model so inaccurate?

The usual causes are too little ground truth, inconsistent transcription, a layout step that mis-segments lines, or training without a matching base model. Diagnose which one applies before adding data, because more bad data will not help.

Does adding more ground truth always improve a model?

Only if the new data is consistent and covers the cases the model gets wrong. Adding more of the same easy pages barely moves CER; adding examples from the hardest pages and rarest letterforms is what raises accuracy.

How do I find where my model makes mistakes?

Use the validation set's error breakdown and the sample compare view to see which characters and pages have the highest error. Cluster the errors, then target ground truth at those specific weak spots.

Could the problem be layout rather than the text model?

Yes, very often. If baselines are merged, split or missing, the recogniser reads garbled or duplicated text no matter how good the HTR model is. Fix segmentation first, then re-evaluate the text model's true CER.

Should I retrain or fine-tune the existing model?

Fine-tune your existing model on the combined old plus new ground truth as a new version. Keep each version so you can compare CER and confirm the change actually helped rather than regressed.

When is a model good enough to stop improving?

When its validation CER meets the threshold for your use: under 10% for keyword search, under 5% for editing. Past that, extra effort yields diminishing returns and human correction is more cost-effective.
