OCR & HTR Pipelines

Ground truth transcription guidelines are the rulebook that keeps every transcriber producing the same answer for the same mark on the page, because an HTR model learns exactly what your data shows it. Effective guidelines fix policies for abbreviations, capitalisation, punctuation, line breaks, special characters and damaged text, illustrate each with a real image, and stay short enough that people actually read them. Inconsistent ground truth is the quiet reason many HTR projects plateau below their potential.

Why does consistency matter more than scholarly perfection?

A recognition model has no judgement; it generalises from patterns. If one transcriber expands "wch" to "which" and another keeps "wch", the model sees the same letterforms mapped to two different targets and learns neither well. For training purposes, a consistent convention beats a correct but variable one. The goal of ground truth is reproducibility, not the perfect critical edition; you can normalise later.
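One way to spot this kind of split convention is to count both forms of a known abbreviation across the corpus. The sketch below is only an illustration: the `VARIANT_PAIRS` table and the function name are assumptions, not part of any HTR toolkit, and a hit is a prompt for review rather than proof of an error, since a scribe may genuinely write both forms.

```python
import re
from collections import Counter

# Hypothetical pairs to audit: abbreviated form -> expanded form.
VARIANT_PAIRS = {"wch": "which", "wth": "with"}


def find_mixed_conventions(lines, pairs=VARIANT_PAIRS):
    """Return the pairs for which both the short and the full form occur."""
    counts = Counter(
        token.lower()
        for line in lines
        for token in re.findall(r"\w+", line)
    )
    mixed = {}
    for short, full in pairs.items():
        if counts[short] and counts[full]:
            mixed[(short, full)] = (counts[short], counts[full])
    return mixed


corpus = [
    "the house wch stands by the river",
    "the field which lies beyond",
    "he came wth his brother",
]
print(find_mixed_conventions(corpus))
```

Here only the "wch" / "which" pair is reported, because "wth" never appears alongside a written-out "with" in the sample.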

What policies must your guidelines cover?

At minimum, settle these before anyone transcribes:

Decision                  Options                   Common choice for HTR
Abbreviations             Keep | Expand             Keep (diplomatic), mark if expanded
Capitalisation            As written | Modernised   As written
Punctuation               As written | Normalised   As written
Line breaks               Physical | Logical        Physical
Long-s / archaic glyphs   Preserve | Modernise      Preserve at training time
Illegible text            Skip | Placeholder        Placeholder per character
Uncertain readings        Plain | Bracketed         Bracketed [...]
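Some teams find it useful to record these decisions in a machine-readable form next to the written guidelines, so that tooling can refuse an undefined choice. The following is a minimal sketch under that assumption; the class name, field names and option strings are invented here to mirror the table and do not come from any existing library.

```python
from dataclasses import dataclass

# Allowed values per decision, mirroring the "Options" column above.
ALLOWED = {
    "abbreviations": {"keep", "expand"},
    "capitalisation": {"as_written", "modernised"},
    "punctuation": {"as_written", "normalised"},
    "line_breaks": {"physical", "logical"},
    "archaic_glyphs": {"preserve", "modernise"},
    "illegible_text": {"skip", "placeholder"},
    "uncertain_readings": {"plain", "bracketed"},
}


@dataclass(frozen=True)
class TranscriptionPolicy:
    """Hypothetical policy record; defaults follow the 'common choice' column."""

    abbreviations: str = "keep"
    capitalisation: str = "as_written"
    punctuation: str = "as_written"
    line_breaks: str = "physical"
    archaic_glyphs: str = "preserve"
    illegible_text: str = "placeholder"
    uncertain_readings: str = "bracketed"

    def __post_init__(self):
        # Reject any value that the guidelines do not define.
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")


policy = TranscriptionPolicy()
print(policy)
```

Freezing the dataclass means the policy cannot be changed quietly halfway through a project; a change has to be a deliberate new record.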

How should you handle damaged and uncertain text?

Give transcribers fixed symbols and never let them silently guess. A workable scheme:

```text
[abc]   = uncertain reading
[...]   = illegible, length unknown
⟨⟨ ⟩⟩  = editorial addition (avoid in ground truth)
¶       = literal mark present on page
```

Crucially, decide whether uncertain text goes into training at all. Many projects exclude bracketed and illegible passages from the training set so the model never learns to associate ink it cannot read with confident output.
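If you do exclude such passages, the filter can be very small. This sketch assumes the bracket scheme above, line-level training samples, and that square brackets never occur as literal marks on the page; the function name is illustrative.

```python
import re

# Matches any bracketed span: "[abc]" (uncertain) or "[...]" (illegible).
BRACKETED = re.compile(r"\[[^\]]*\]")


def split_for_training(lines):
    """Separate lines that are safe to train on from lines held back."""
    clean, held_out = [], []
    for line in lines:
        if BRACKETED.search(line):
            held_out.append(line)
        else:
            clean.append(line)
    return clean, held_out


sample = [
    "the house wch stands by the river",
    "and the [mill] beyond it",
    "was sold to [...] in that year",
]
clean, held_out = split_for_training(sample)
print(len(clean), "lines kept,", len(held_out), "held out")
```

Keeping the held-out lines rather than deleting them leaves the option of revisiting them once a reading has been confirmed.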

How do you write rules people will actually follow?

Three principles keep guidelines usable:

  1. Show, don't only tell. Every rule needs a cropped image of the real case beside the expected transcription.
  2. Order by frequency. Put the rules that come up on every page first; bury rare edge cases in an appendix.
  3. Keep it short. Two to five pages with examples beats twenty pages nobody finishes.

A good test: hand the guidelines and five fresh pages to a new transcriber. If their output matches an experienced transcriber's within a few characters per line, the document works.
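The comparison in this test can be made concrete with a per-line character edit distance. The sketch below uses plain Levenshtein distance; the threshold `max_edits=3` is an assumed stand-in for "a few characters" and should be set by the project.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lines_over_threshold(new, reference, max_edits=3):
    """Return (line number, distance) for lines that differ by more than max_edits."""
    if len(new) != len(reference):
        raise ValueError("both transcriptions must have the same number of lines")
    report = []
    for number, (n, r) in enumerate(zip(new, reference), start=1):
        distance = edit_distance(n, r)
        if distance > max_edits:
            report.append((number, distance))
    return report


new = ["the house wch stands", "by the riuer"]
reference = ["the house which stands", "by the riuer"]
print(lines_over_threshold(new, reference, max_edits=1))
```

A mismatch in line counts is raised as an error on purpose: it usually means the two transcribers disagreed about line breaks, which is itself worth discussing.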

How do you keep a team aligned over time?

Consistency decays without maintenance. Run a short calibration: have everyone transcribe the same two pages, compare results line by line, and discuss every disagreement. Review the first batch of each transcriber's real work closely, then spot-check. When a genuinely new case appears, add it to the guidelines and tell the team — do not let people invent private conventions that fragment the data.
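For the calibration meeting itself, it helps to list exactly where two transcriptions diverge rather than only how much. A possible sketch using `difflib.SequenceMatcher` from the Python standard library follows; the function names and report shape are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def disagreements(line_a, line_b):
    """Return (text in A, text in B) for every span where the lines differ."""
    matcher = SequenceMatcher(None, line_a, line_b, autojunk=False)
    return [
        (line_a[i1:i2], line_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def calibration_report(page_a, page_b):
    """List (line number, differing spans) for each line that disagrees."""
    report = []
    for number, (a, b) in enumerate(zip(page_a, page_b), start=1):
        diffs = disagreements(a, b)
        if diffs:
            report.append((number, diffs))
    return report


page_a = ["the house wch stands", "by the riuer"]
page_b = ["the house which stands", "by the riuer"]
for number, diffs in calibration_report(page_a, page_b):
    print(f"line {number}: {diffs}")
```

Each reported span is a candidate for a new or clarified rule; lines that match are left out so the discussion stays on the disagreements.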

```text
# Lightweight change log to append to the guidelines
2025-05-12  Added rule: superscript abbreviation marks
            transcribed inline, no special character.
2025-05-28  Clarified: catchwords belong to a separate
            region, not the main text stream.
```

How do guidelines connect to the wider pipeline?

Guidelines are not just a training-data document; they shape downstream normalisation and search. If you preserve the long-s and abbreviations in ground truth, plan a post-processing step that produces a normalised, searchable layer separately. This keeps the training data faithful while still giving readers and full-text search a clean modern form.
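Such a post-processing step might look like the sketch below, which keeps the diplomatic text untouched and derives a normalised layer beside it. The expansion table is a hypothetical, project-specific lookup, and the matching is deliberately naive (lower-case whole words only).

```python
import re

# Hypothetical, project-specific tables; extend them from your own guidelines.
EXPANSIONS = {"wch": "which"}
GLYPHS = str.maketrans({"ſ": "s"})  # long-s to modern s


def normalise(line, expansions=EXPANSIONS):
    """Return both layers: the line as transcribed and a searchable form."""
    modern = line.translate(GLYPHS)
    modern = re.sub(
        r"\w+",
        lambda match: expansions.get(match.group(0), match.group(0)),
        modern,
    )
    return {"diplomatic": line, "normalised": modern}


layers = normalise("the houſe wch ſtands")
print(layers["diplomatic"])
print(layers["normalised"])
```

Because the diplomatic layer is stored unchanged, the normalisation rules can be revised later without touching the ground truth.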

Key Takeaways

  • A model learns your data, so consistency outranks scholarly perfection in ground truth.
  • Decide abbreviations, capitalisation, punctuation, line breaks and damaged-text rules up front.
  • Use fixed symbols for uncertain and illegible text; consider excluding them from training.
  • Illustrate every rule with a real cropped image and an expected transcription.
  • Calibrate transcribers on shared pages and review early output to catch drift.
  • Maintain a change log and update guidelines when genuine new cases appear.

Frequently Asked Questions

Why do ground truth guidelines matter for HTR?

A model learns whatever your transcribers do. If two people transcribe the same abbreviation differently, the model receives contradictory signals and accuracy suffers, so consistent rules are essential.

Should I expand abbreviations in ground truth?

Pick one policy and apply it everywhere. Diplomatic transcription keeps abbreviations as written; expanding them helps readability but must be done uniformly, ideally marked so it can be reversed.

How do I handle damaged or illegible text?

Define a fixed convention, such as a placeholder for each illegible character and brackets for uncertain readings, so the model is not trained on guesses presented as certainties.

Do line breaks need rules?

Yes. Decide whether transcription follows physical lines on the page, which most HTR training requires, and whether you record hyphenation at line ends.

How long should transcription guidelines be?

Short enough to be read fully — often two to five pages — with worked examples. A wall of edge cases nobody reads is worse than a focused set of rules with images.

How do I keep multiple transcribers consistent?

Give worked examples, run a short calibration exercise on shared pages, and review early output. Update the guidelines when genuine new cases appear rather than letting people improvise.