When to Use OCR Confidence Scores

Use OCR confidence scores when you need to triage and prioritise, that is, to rank which pages, lines or words a human should check first, and avoid relying on them as an absolute measure of accuracy. Confidence is the engine's own estimate of the probability that its output is right; it correlates with correctness but does not guarantee it. Scores are therefore excellent for answering "where do I spend my limited correction budget?" and poor for answering "is this transcription good enough to publish?" That second question needs character error rate (CER) measured against ground truth, not confidence.
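For reference, CER is conventionally computed as the character edit distance between the OCR output and the ground truth, divided by the ground-truth length. The sketch below is a minimal pure-Python version; the function names `levenshtein` and `cer` are illustrative, and a production pipeline would more likely use an established evaluation tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(ocr_text: str, ground_truth: str) -> float:
    """Character error rate: edit distance divided by ground-truth length."""
    if not ground_truth:
        # Convention chosen for this sketch: empty truth scores 0 only if OCR is empty too.
        return 0.0 if not ocr_text else 1.0
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


# One substituted character out of six: CER = 1/6
print(round(cer("parifh", "parish"), 3))  # 0.167
```

Note that CER can exceed 1.0 when the OCR output contains many spurious insertions, so it is an error rate rather than a bounded percentage.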

Are confidence scores the same as accuracy?

No, and conflating them is the costliest mistake. Confidence is internal and self-reported; accuracy is external and measured against truth. An engine applied to out-of-domain material, say a Tesseract model trained on modern print and run on an 18th-century text set with the long s, can be confidently wrong, returning 0.95 on systematically misread characters.

The practical rule: confidence tells you the engine's belief, CER tells you the reality. You validate the relationship once, on a labelled sample, then trust confidence only within that calibrated envelope.
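One way to run that one-off validation is a reliability table: bin the labelled sample by confidence and compare each bin's mean confidence with its observed accuracy. The sketch below assumes scores normalised to 0-1 and a boolean `correct` flag per item; `reliability_table` and its parameters are illustrative names, not part of any OCR engine's API.

```python
import numpy as np


def reliability_table(conf, correct, n_bins=5):
    """Return (bin_low, bin_high, count, mean_confidence, accuracy) per non-empty bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so a score of exactly 1.0 lands in the last bin.
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((float(edges[b]), float(edges[b + 1]), int(mask.sum()),
                         float(conf[mask].mean()), float(correct[mask].mean())))
    return rows


# Toy labelled sample: the top bin claims ~0.95 but only half of it is right.
conf = [0.95, 0.95, 0.95, 0.95, 0.50, 0.50]
correct = [True, True, False, False, True, False]
for low, high, n, mean_conf, acc in reliability_table(conf, correct):
    print(f"[{low:.1f}, {high:.1f})  n={n}  mean_conf={mean_conf:.2f}  accuracy={acc:.2f}")
```

Bins where accuracy sits well below mean confidence mark the regions where the engine is over-confident on this corpus, which is where its scores should be trusted least.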

When should I actually use confidence scores?

They earn their keep in a few clear situations:

Prioritised human correction. Sort lines by confidence ascending; reviewers fix the worst first and stop when the budget runs out. Because effort goes to the lines most likely to contain errors, this alone can cut review time markedly.
Adaptive routing. Send low-confidence regions to a second, slower engine or to HTR.
Quality dashboards. Track median page confidence across a digitisation batch to spot a scanner drifting or a bad model.
Downstream filtering. Exclude sub-threshold lines from a named-entity extraction run where noise hurts more than missing data.

They are the wrong tool when you need a publication-grade guarantee, when comparing two different engines (their scales are not comparable), or as a stand-in for a ground-truth evaluation.
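The prioritised-correction pattern from the list above can be sketched in a few lines. The record layout here (dicts with `id`, `text` and `conf` keys) and the `review_queue` helper are assumptions made for illustration; real pipelines will have their own line records.

```python
def review_queue(lines, budget):
    """Return at most `budget` lines, least confident first."""
    ranked = sorted(lines, key=lambda line: line["conf"])
    return ranked[:budget]


# Hypothetical line records with a per-line confidence on a 0-1 scale.
lines = [
    {"id": "p1-l1", "text": "Baptisms in the parish", "conf": 0.97},
    {"id": "p1-l2", "text": "of St Mary, anno 1742", "conf": 0.41},
    {"id": "p1-l3", "text": "John fon of Wm. Hart", "conf": 0.63},
    {"id": "p1-l4", "text": "Mary daughter of Tho.", "conf": 0.88},
]

# With a budget of two lines, reviewers see the two least-confident lines.
for line in review_queue(lines, budget=2):
    print(line["id"], line["conf"])
```

The budget is expressed as a line count here for simplicity; a time-based budget would need an estimate of review time per line.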

What threshold should I use to flag text for review?

There is no universal number — thresholds are engine- and corpus-specific — so calibrate. Take a few hundred lines with known ground truth, sweep the threshold, and find where flagged lines capture most real errors without flagging everything:

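```python
import numpy as np

def threshold_sweep(conf, has_error, grid=np.arange(0.5, 0.99, 0.02)):
    """Print recall and precision of the 'flag if conf < t' rule for each t."""
    conf = np.asarray(conf, dtype=float)           # per-line confidence, 0-1
    has_error = np.asarray(has_error, dtype=bool)  # True where ground truth shows an error
    for t in grid:
        flagged = conf < t
        hits = (flagged & has_error).sum()         # flagged lines that really contain errors
        recall = hits / max(has_error.sum(), 1)    # share of all erroneous lines caught
        precision = hits / max(flagged.sum(), 1)   # share of flagged lines that are erroneous
        print(f"t={t:.2f}  recall={recall:.2f}  precision={precision:.2f}")

# pick the t where recall is high but you aren't flagging the whole batch
```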

In practice the useful band is often 0.70 to 0.90, but let the sweep, not folklore, choose it.
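If you would rather have the sweep return a choice than read it off a printout, one option is to take the lowest threshold that reaches a target recall and report how much of the batch it flags. This is a sketch under that assumption; `pick_threshold`, `target_recall` and the default grid are illustrative, not a standard recipe.

```python
import numpy as np


def pick_threshold(conf, has_error, target_recall=0.9, grid=None):
    """Lowest t on the grid whose 'conf < t' rule catches target_recall of the errors.

    Returns (threshold, fraction_of_lines_flagged), or None if no grid value qualifies.
    """
    conf = np.asarray(conf, dtype=float)
    has_error = np.asarray(has_error, dtype=bool)
    if grid is None:
        grid = np.arange(0.50, 0.99, 0.02)
    total_errors = max(int(has_error.sum()), 1)
    for t in grid:  # ascending, so the first hit flags the fewest lines
        flagged = conf < t
        recall = (flagged & has_error).sum() / total_errors
        if recall >= target_recall:
            return float(t), float(flagged.mean())
    return None


# Toy labelled sample: the three erroneous lines all score below 0.66.
conf = [0.55, 0.60, 0.65, 0.95, 0.97, 0.99]
has_error = [True, True, True, False, False, False]
print(pick_threshold(conf, has_error, target_recall=1.0))
```

A `None` result means the grid never reaches the target recall, which is itself a useful signal that confidence separates errors poorly on this corpus.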

Word-level versus character-level: which to trust?

Character-level confidence is finer but noisier; word-level is smoother and routes whole lines more stably. For archival triage, the sweet spot is usually to aggregate character confidence up to the line, e.g. the line's minimum or 10th-percentile character confidence, which surfaces lines with even a single bad character.

Granularity	Good for	Watch out for
Character	Spotting single bad glyphs	Very noisy; many false alarms
Word	Routing tokens to a lexicon check	Hides intra-word errors
Line	Human-review triage	One bad char can sink a whole good line
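The line-level aggregation described above might look like the following. It assumes you already have one 0-1 confidence per character for each line; `line_score` and its `method` names are illustrative.

```python
import numpy as np


def line_score(char_conf, method="p10"):
    """Collapse per-character confidences into a single line-level score."""
    scores = np.asarray(char_conf, dtype=float)
    if scores.size == 0:
        raise ValueError("line has no character confidences")
    if method == "min":
        return float(scores.min())
    if method == "p10":
        return float(np.percentile(scores, 10))  # 10th percentile, linear interpolation
    if method == "mean":
        return float(scores.mean())
    raise ValueError(f"unknown method: {method}")


# One bad glyph in an otherwise clean line.
chars = [0.99, 0.98, 0.20, 0.97]
print(line_score(chars, "min"))   # 0.2
print(line_score(chars, "p10"))
print(line_score(chars, "mean"))
```

The mean stays high despite the bad glyph, while the minimum and the 10th percentile pull the line toward the front of the review queue; the percentile is somewhat less sensitive to a single noisy character on long lines.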

Do HTR confidence scores behave differently?

Yes. CTC-based HTR models tend to be over-confident on unfamiliar hands and are poorly calibrated out of the box. A score of 0.9 from an HTR model on a new scribe is not the same as 0.9 from a print engine on clean type. Use HTR confidence strictly as a relative ranking within one model and one corpus; never use it to compare scribes or models, or to set an absolute publish/no-publish line.
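One way to enforce "relative ranking only" is to discard the raw values and keep just each line's rank within a single model-and-corpus batch. The helper below is a sketch of that idea; `rank_within_corpus` is a hypothetical name, and tied scores simply receive adjacent ranks in input order.

```python
import numpy as np


def rank_within_corpus(conf):
    """Map raw scores to ranks in [0, 1]: 0 = least confident line in this batch."""
    conf = np.asarray(conf, dtype=float)
    order = conf.argsort(kind="stable")
    ranks = np.empty(len(conf), dtype=float)
    ranks[order] = np.arange(len(conf))
    return ranks / max(len(conf) - 1, 1)


# Raw HTR scores that all look high still produce a usable review order.
print(rank_within_corpus([0.90, 0.99, 0.95]).tolist())  # [0.0, 1.0, 0.5]
```

Ranks computed this way are only meaningful inside the batch they came from, which is the point: they cannot be mistaken for a probability or compared across models.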

Should I store confidence scores, and how?

Yes. Persist them so you can re-triage without re-running OCR. ALTO XML carries word confidence in the WC attribute (0-1) and per-character confidence in CC, one digit from 0 to 9 per character; PAGE XML carries a conf attribute (0-1) on the TextEquiv element of words and glyphs.

```xml
<String CONTENT="parish" WC="0.41" CC="9 2 1 8 0 7"/>
```

Storing them lets you audit a batch months later, change your threshold, or filter the corpus for a text-mining run — all without touching the original images again.
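Reading the stored scores back is straightforward with the standard library. The sketch below pulls `CONTENT` and `WC` from every `String` element and ignores the XML namespace, since the namespace URI differs between ALTO versions; `word_confidences` is an illustrative helper and the sample document is a stripped-down fragment, not a complete valid ALTO file.

```python
import xml.etree.ElementTree as ET


def word_confidences(alto_xml: str):
    """Return (word, WC) pairs from an ALTO document; WC is None where absent."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        # Drop any "{namespace}" prefix so the match works across ALTO versions.
        if el.tag.rsplit("}", 1)[-1] != "String":
            continue
        wc = el.get("WC")
        words.append((el.get("CONTENT", ""), float(wc) if wc is not None else None))
    return words


# Minimal fragment for illustration only.
sample = (
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#"><Layout><Page>'
    '<PrintSpace><TextBlock><TextLine>'
    '<String CONTENT="parish" WC="0.41"/><String CONTENT="of"/>'
    '</TextLine></TextBlock></PrintSpace></Page></Layout></alto>'
)
print(word_confidences(sample))  # [('parish', 0.41), ('of', None)]
```

Returning `None` for a missing `WC` keeps "no score recorded" distinct from "score of zero", which matters when you later filter by threshold.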

Can low confidence ever be ignored safely?

Yes, when the cost of an error is low. A full-text search index tolerates noise: a slightly wrong word still helps fuzzy retrieval, so skipping review below a threshold is reasonable. But for a scholarly digital edition or quantitative analysis, low confidence is a hard stop — the downstream cost of silent errors is far higher than the review time.

Key Takeaways

Confidence is the engine's belief; accuracy is reality — use CER on ground truth for quality claims.
The killer use case is triage: rank lines by confidence ascending, review worst-first within budget.
Thresholds are engine- and corpus-specific — calibrate by sweeping against a labelled sample.
Aggregate character confidence to the line level for the most stable review routing.
HTR models are often over-confident and poorly calibrated; treat their scores as a relative ranking only.
Persist confidence in ALTO (WC/CC) or PAGE XML so you can re-triage and audit later.
Ignoring low confidence is fine for search indexing, never for scholarly editions or statistics.

Frequently Asked Questions

Are OCR confidence scores the same as accuracy?

No. Confidence is the engine's internal probability that a character or word is correct; accuracy is whether it actually is. They correlate, but a confident engine can still be confidently wrong, especially on out-of-domain scripts.

When are OCR confidence scores most useful?

They are most useful for triage and prioritised correction: ranking pages or lines so humans review the least-confident first. They are weakest as an absolute quality guarantee or as a substitute for measuring CER on ground truth.

What confidence threshold should I use to flag text for review?

There is no universal number, because thresholds are engine- and corpus-specific. Calibrate on a labelled sample: pick the threshold where flagged lines capture most real errors without flagging everything, often somewhere between 0.70 and 0.90.

Can I trust word-level or character-level confidence more?

Character-level confidence is finer-grained but noisier; word-level is smoother and better for routing whole lines to review. For most archival triage, aggregate character confidence to the line level.

Do HTR engines produce reliable confidence scores?

HTR confidence from CTC-based models is usable for ranking but tends to be poorly calibrated and over-confident on unfamiliar hands. Treat it as a relative signal within one model and corpus, not an absolute probability.

Should I store confidence scores in my output?

Yes — ALTO XML has a WC (word confidence) attribute and PAGE XML supports confidence too. Persisting them lets you re-triage, audit quality and filter downstream text mining without re-running OCR.

Can low confidence ever be safely ignored?

Yes, when the cost of an error is low — full-text search indexing tolerates some noise, so you may skip review below a threshold. For a scholarly edition or quantitative analysis, treat low confidence as a hard stop.
