Handle Abbreviations in Transkribus

To handle abbreviations in Transkribus, decide one policy for the whole project — keep them as written (diplomatic) or expand them (reading) — and apply it consistently in your ground truth, because the HTR model simply learns the mapping you demonstrate. If you expand dñs to dominus the same way on every training page, the model will expand it automatically on new pages; if you waver, it produces unusable, inconsistent output. The abbreviation tag then lets you preserve both the original mark and its expansion so nothing is lost on export.

Expand or keep as written — which should I choose?

This is an editorial decision, not a technical one, and it drives everything downstream.

Approach	What you record	Best for	Trade-off
Diplomatic	The mark as written (dñs)	Paleographic study, manuscript fidelity	Harder to search/read
Expanded	The full word (dominus)	Reading editions, indexing, search	Loses visual form unless tagged
Tagged both	Mark and expansion via tag	Editions needing both	More tagging effort

The professional default for a searchable scholarly edition is expanded, but tagged — you get readability and the original mark survives in the markup.

How does the model learn abbreviations?

The recognition model has no dictionary of medieval shorthand; it learns from your ground truth. Whatever transcription sits beside an image of & or ꝯ or a macron is what it will reproduce.

```text
Ground truth line A:  "dominus noster"   ← image shows  dñs nr̄
Ground truth line B:  "dñs noster"        ← inconsistent!
```

Feed it line A consistently and the abbreviated dñs reliably becomes dominus. Feed it both styles and the model guesses, badly. Consistency in ground truth is the whole game.
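One way to catch wavering before training is to scan the ground truth for marks that an expanded policy says should never appear. The sketch below assumes the ground-truth lines are already available as plain strings and that the project chose the expanded policy. `FORBIDDEN_MARKS` and `find_policy_violations` are illustrative names, not part of Transkribus.

```python
import unicodedata

# Hypothetical project convention: under an "expanded" policy these
# marks should never appear in ground-truth text.
FORBIDDEN_MARKS = {
    "\u0303": "combining tilde",
    "\u0304": "combining macron",
    "\ua76f": "con/us sign",
}

def find_policy_violations(lines):
    """Return (line_number, mark_name, text) for each forbidden mark found."""
    violations = []
    for number, text in enumerate(lines, start=1):
        # NFD splits precomposed letters such as ñ into n + combining tilde,
        # so both encodings of the same mark are caught.
        decomposed = unicodedata.normalize("NFD", text)
        for mark, name in FORBIDDEN_MARKS.items():
            if mark in decomposed:
                violations.append((number, name, text))
    return violations

ground_truth = ["dominus noster", "d\u00f1s noster"]  # second line keeps dñs
for number, name, text in find_policy_violations(ground_truth):
    print(f"line {number}: {name} in {text!r}")
```

A check like this only suits texts where the listed marks are never ordinary spelling. In a language that uses ñ as a regular letter, the tilde entry would need to be dropped from the list.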

How do I tag abbreviations so both forms survive?

Use Transkribus's abbreviation tag (abbrev, a textual tag rather than a structural one) on the span. It marks the abbreviated reading and stores the expansion as a property of the tag in the PAGE XML, and that pair maps cleanly to TEI on export.

```xml
<!-- After export to TEI, a tagged abbreviation becomes: -->
<choice>
  <abbr>dñs</abbr>
  <expan>dominus</expan>
</choice>
```

A publishing stylesheet can then italicise the supplied letters, hide one form, or show both — your choice at render time, not transcription time.
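To make the mapping from tag to TEI concrete, here is a minimal sketch that turns one tagged line into the `choice` structure. It assumes the tag is stored in the text line's `custom` attribute in the form `abbrev {offset:…; length:…; expansion:…;}`, and that offsets count characters in the line text. Check both assumptions against your own PAGE XML export. `line_to_tei` is a hypothetical helper, not the Transkribus exporter, and it does not decode any escape sequences inside property values.

```python
import re
from xml.sax.saxutils import escape

# Assumed shape of an abbreviation tag inside the `custom` attribute.
ABBREV_TAG = re.compile(
    r"abbrev\s*\{offset:(\d+);\s*length:(\d+);[^}]*?expansion:([^;}]*);"
)

def line_to_tei(text, custom):
    """Wrap each tagged abbreviation in <choice><abbr/><expan/></choice>."""
    spans = sorted(
        (int(offset), int(length), expansion)
        for offset, length, expansion in ABBREV_TAG.findall(custom)
    )
    parts, cursor = [], 0
    for offset, length, expansion in spans:
        # Untagged text between abbreviations is copied through, XML-escaped.
        parts.append(escape(text[cursor:offset]))
        abbr = escape(text[offset:offset + length])
        parts.append(
            f"<choice><abbr>{abbr}</abbr>"
            f"<expan>{escape(expansion)}</expan></choice>"
        )
        cursor = offset + length
    parts.append(escape(text[cursor:]))
    return "".join(parts)

text = "d\u00f1s noster"  # dñs noster, with a precomposed ñ
custom = "readingOrder {index:0;} abbrev {offset:0; length:3; expansion:dominus;}"
print(line_to_tei(text, custom))
```

Because the offsets depend on how characters are encoded, a line that stores ñ as n plus a combining tilde would need a length of 4 rather than 3 for the same span.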

Why is my model producing gibberish on brevigraphs?

Two usual culprits:

1. Mixed policy in training: the model saw the same mark expanded two ways.
2. Too few examples: a rare brevigraph like the con-/-us sign (ꝯ) appears only a handful of times.

Fix the first by standardising existing ground truth; fix the second by adding more lines containing that mark before retraining. A model needs repeated, consistent exposure to learn an abbreviation reliably.
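A quick count shows which marks are under-represented before you retrain. The sketch below assumes diplomatic ground truth, where the marks themselves appear in the text. Under an expanded policy you would count tagged spans instead. The mark list and the `MIN_EXAMPLES` threshold are arbitrary placeholders, not a figure recommended by Transkribus.

```python
from collections import Counter

# Hypothetical marks to track: the con/us sign and the ampersand.
MARKS = ["\ua76f", "&"]
# Arbitrary placeholder threshold; tune it for your own material.
MIN_EXAMPLES = 20

def count_marks(lines, marks=MARKS):
    """Count how often each tracked mark occurs across all lines."""
    counts = Counter({mark: 0 for mark in marks})
    for text in lines:
        for mark in marks:
            counts[mark] += text.count(mark)
    return counts

def under_represented(counts, minimum=MIN_EXAMPLES):
    """Return the marks that occur fewer than `minimum` times."""
    return sorted(mark for mark, n in counts.items() if n < minimum)

lines = ["\ua76ftra & \ua76ftra", "et & cetera"]
counts = count_marks(lines)
for mark in under_represented(counts):
    print(f"{mark!r}: only {counts[mark]} examples")
```

Marks that show up in this report are the ones to look for when choosing additional pages to transcribe.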

What is the right workflow on a medieval collection?

A practical order of operations:

1. Write a short transcription convention document: list each common abbreviation and its agreed expansion.
2. Transcribe ground truth strictly to that convention.
3. Apply the abbreviation tag where you need to preserve the original mark.
4. Train (or fine-tune) and run recognition.
5. On export to TEI, verify abbr/expan (or am/ex) elements came through.
6. Spot-check that keyword search finds the expanded word.
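The last two checks can be scripted. The sketch below assumes the export is a TEI file in the standard TEI namespace that uses `choice`, `abbr` and `expan`. If your export uses `am`/`ex` instead, the element names need adjusting. `check_tei_export` is a hypothetical helper, and its keyword test is a simple exact match on the expansions, not a real search index.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def check_tei_export(tei_xml, keyword):
    """Collect abbr/expan pairs and test whether `keyword` is an expansion."""
    root = ET.fromstring(tei_xml)
    choices = root.findall(".//tei:choice", TEI_NS)
    pairs = []
    for choice in choices:
        abbr = choice.find("tei:abbr", TEI_NS)
        expan = choice.find("tei:expan", TEI_NS)
        # Only count a choice when both halves survived the export.
        if abbr is not None and expan is not None:
            pairs.append(("".join(abbr.itertext()), "".join(expan.itertext())))
    expanded_words = {expan.lower() for _, expan in pairs}
    return {
        "pairs": pairs,
        "incomplete": len(choices) - len(pairs),
        "keyword_found": keyword.lower() in expanded_words,
    }

sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    "<choice><abbr>d\u00f1s</abbr><expan>dominus</expan></choice> noster"
    "</p></body></text></TEI>"
)
report = check_tei_export(sample, "dominus")
print(report["pairs"], report["incomplete"], report["keyword_found"])
```

A non-zero `incomplete` count points to tags that lost either the original mark or the expansion on the way out.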

This keeps medieval shorthand captured consistently — readable for editing, searchable for users, and faithful for paleographers.

Key Takeaways

Pick one abbreviation policy per project; consistency beats any individual choice.
The model learns abbreviation handling from your ground truth — it has no built-in expander.
Tag abbreviations to keep both the original mark and the expansion through to TEI.
Tagged abbreviations export to TEI abbr/expan, which a stylesheet can render in italics.
Gibberish usually means mixed ground truth or too few examples of a rare brevigraph.
Storing expansions makes documents searchable while preserving paleographic fidelity.

Frequently Asked Questions

Should I expand abbreviations or transcribe them as written in Transkribus?

Decide once, at project level. For a diplomatic transcription, keep abbreviation marks as written; for a reading edition, expand them. The critical rule is consistency, because the HTR model learns whatever you teach it.

Can a Transkribus model learn to expand abbreviations automatically?

Yes. If your ground truth consistently expands a brevigraph to its full letters, the model learns that mapping and will expand it on new pages. Mixed ground truth produces unpredictable, unusable output.

How do I record both the abbreviated and expanded forms?

Use the abbreviation tag (a textual tag) to mark the span and store the expansion, so the original and resolved forms are both preserved in the PAGE XML and survive export to TEI.

Why does my model output gibberish on abbreviation marks?

Usually the training data mixed expansion styles, or the brevigraph is rare. Standardise the ground truth to one policy and add more lines containing the mark so the model has enough examples to learn it.

How are expanded letters distinguished in a scholarly edition?

Conventionally, expanded letters are italicised or wrapped in editorial markup. Transkribus abbreviation tags export to TEI abbr/expan (or am/ex) elements, which a publishing stylesheet can then render in italics.

Do abbreviations affect keyword search?

Strongly. A reader searching a modern spelling will miss an unexpanded brevigraph. Storing the expansion makes the full word findable while the original mark stays visible for paleographic accuracy.
