Transkribus Workflows

Keyword spotting in Transkribus lets you search un-corrected transcriptions and still find names or terms the visible transcript got wrong, because it searches the model's full probability lattice rather than only the single best reading. That means you can run a model over thousands of pages, never correct a word, and still type a surname or place-name to get ranked, scored hits pointing you to the right pages. It turns raw HTR output into a finding aid before any human editing happens.

A handwriting model does not produce one answer per word — it produces a lattice of weighted alternatives. The visible transcript is just the single most probable path through that lattice. Keyword spotting queries the whole lattice.

Lattice for one written word:
  "Iohannes" (0.41)   ← top-1 reading; the only one exact search sees
  "Lohannes" (0.22)
  "Johannes" (0.19)   ← the reading you are looking for
  "Iobannes" (0.11)

A plain string search for "Johannes" finds nothing here, because the top-1 reading was "Iohannes". Keyword spotting checks all branches and returns the page with a confidence score.
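The difference between the two kinds of search can be sketched in a few lines of Python. This is a toy model only: the list-of-tuples layout and the function names `top1_search` and `lattice_search` are assumptions made for illustration, not how Transkribus stores or queries lattices, and real hit scores are computed by the system rather than read directly from a single weight.

```python
from typing import List, Optional, Tuple

# Toy lattice for one written word, using the weights from the example above.
# The data layout is hypothetical and chosen only for illustration.
Lattice = List[Tuple[str, float]]

lattice: Lattice = [
    ("Iohannes", 0.41),
    ("Lohannes", 0.22),
    ("Johannes", 0.19),
    ("Iobannes", 0.11),
]


def top1_search(alternatives: Lattice, query: str) -> bool:
    """Exact search on the visible transcript: only the most probable reading counts."""
    best_reading, _ = max(alternatives, key=lambda alt: alt[1])
    return best_reading == query


def lattice_search(alternatives: Lattice, query: str) -> Optional[float]:
    """Lattice-style search: any alternative may match; return its weight, or None."""
    for reading, weight in alternatives:
        if reading == query:
            return weight
    return None


print(top1_search(lattice, "Johannes"))     # False
print(lattice_search(lattice, "Johannes"))  # 0.19
```

The exact search fails because it only ever sees "Iohannes", while the lattice search still recovers "Johannes" together with a weight that can be used for ranking.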

Full-text search and keyword spotting are both "search," but they look at different things:

| | Full-text search | Keyword spotting |
|---|---|---|
| Searches | Top-1 transcript only | Full probability lattice |
| Needs correction? | Effectively yes | No |
| Recall on raw HTR | Low | High |
| Returns | Exact matches | Ranked, scored candidates |
| Best for | Edited, finished text | Discovery across un-corrected pages |

Use full-text search once a collection is corrected; use keyword spotting during the messy middle, when you have machine output and need to know where to look.

When should I reach for keyword spotting?

It shines in a few situations:

  • Triaging a huge collection — find the 40 pages mentioning a person among 8,000.
  • Prosopography — locating every occurrence of a family across un-edited registers.
  • Deciding what to correct — only hand-correct the pages that actually matter to your question.
  • Spelling variation — historical orthography means one name appears a dozen ways; approximate matching catches them.
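To make the spelling-variation point concrete, here is a minimal sketch of approximate matching using Python's standard `difflib` module. The similarity measure, the `0.8` cutoff, the sample readings, and the helper name `approximate_matches` are all assumptions for illustration; the approximate matching inside Transkribus may work quite differently.

```python
from difflib import get_close_matches
from typing import List


def approximate_matches(query: str, readings: List[str], cutoff: float = 0.8) -> List[str]:
    """Return readings whose similarity to `query` is at least `cutoff`, best first."""
    # get_close_matches requires n > 0, so guard against an empty list.
    return get_close_matches(query, readings, n=max(1, len(readings)), cutoff=cutoff)


# Hypothetical readings collected from several pages.
readings = ["Magdalene", "Madalena", "Magdalena", "Margaretha"]

matches = approximate_matches("Magdalena", readings)
print(matches)
```

With this cutoff the variant spellings of "Magdalena" are kept while the unrelated name "Margaretha" is dropped. Lowering the cutoff catches more variants at the cost of more false leads.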

How do I run a keyword search step by step?

  1. Run a recognition model over the collection so each page has a lattice (the standard HTR run produces this).
  2. Open the keyword spotting / smart search panel for the collection.
  3. Enter your term; enable approximate matching for variable spelling.
  4. Sort the result list by confidence score.
  5. Click each hit to jump to the highlighted region on the page image.
  6. Verify high-confidence hits visually; treat low-confidence hits as leads.

Example result list:
Query: "Magdalena"  (fuzzy = on)
 → p.0142  score 0.93  ✓ verified
 → p.0317  score 0.88  ✓ verified
 → p.0509  score 0.41  ? check the image

Always confirm against the manuscript image — keyword spotting points you to evidence, it does not replace your eyes.
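If you copy a result list into a script, the sort-and-triage steps above can be sketched as follows. The `Hit` class, the `triage` function, and the `0.8` threshold are hypothetical choices for illustration, not part of Transkribus; pick a threshold that suits your own material.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    page: str
    score: float


def triage(hits: List[Hit], verify_threshold: float = 0.8) -> Tuple[List[Hit], List[Hit]]:
    """Sort hits by score (highest first) and split them into
    hits to verify first and low-confidence leads."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    verify_first = [h for h in ranked if h.score >= verify_threshold]
    leads = [h for h in ranked if h.score < verify_threshold]
    return verify_first, leads


# Scores taken from the "Magdalena" example above.
hits = [Hit("p.0509", 0.41), Hit("p.0142", 0.93), Hit("p.0317", 0.88)]

verify_first, leads = triage(hits)
print([h.page for h in verify_first])  # ['p.0142', 'p.0317']
print([h.page for h in leads])         # ['p.0509']
```

Both groups still need checking against the page image; the split only decides the order in which you look.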

What are the limits I should plan for?

Keyword spotting inherits the underlying model's quality. On a hand the model reads poorly, the lattice is noisy and hits are weaker. It is also probabilistic: a low score is a hint, not proof, so do not report a name as "present" from a 0.4 hit alone. And it finds occurrences, not meaning — disambiguating which "Johannes" you found is still your job.

Key Takeaways

  • Keyword spotting searches the whole lattice, so it finds words the visible transcript got wrong.
  • It is built for un-corrected HTR output — no editing needed before you search.
  • Enable approximate matching to absorb historical spelling variation.
  • Sort by confidence; verify high-confidence hits, chase low-confidence ones as leads.
  • It is a discovery and triage tool: it tells you where to correct, not the final answer.
  • Hit quality tracks the underlying model, so a better-matched model means better spotting.

Frequently Asked Questions

What is keyword spotting in Transkribus?

Keyword spotting searches the probability lattice the HTR model produces, not just the single best transcription. It can surface a word even where the visible transcript guessed wrong, returning ranked hits with confidence scores across a whole collection.

How is keyword spotting different from plain text search?

Plain text search only matches the model's top-1 output, so a misrecognised word is unfindable. Keyword spotting searches alternative readings in the lattice, recovering hits that exact search misses on un-corrected pages.

Do I need to correct transcriptions before searching?

No — that is the whole point. Keyword spotting is designed for raw, un-corrected HTR output so you can locate relevant pages across thousands before deciding what to correct by hand.

Can I search for fuzzy or partial matches?

Yes. Keyword spotting supports approximate matching, so spelling variation and minor recognition slips still return ranked candidate hits rather than nothing.

How do I read the confidence score on a hit?

Each hit carries a score reflecting how strongly the lattice supports that word at that location. Sort by score, verify the high-confidence hits first, and treat low-confidence hits as leads to check rather than facts.

Does keyword spotting work in other languages and scripts?

It works for any language the recognition model handles, including historical hands, because it operates on the model's own lattice. Accuracy of hits follows the underlying model's quality for that script.