Measure CER and WER for Your Models

To measure model quality, compute CER (character error rate) and WER (word error rate) by comparing recognised text against a held-out ground-truth reference using edit distance. CER is the edit distance divided by reference characters; WER is the same idea at word level. These two numbers let you compare engines, prove that a preprocessing change helped, and decide whether output is fit for search or publication. As a rough guide, under 5% CER is usable, under 2.5% is good, and under 1% is excellent.

What exactly are CER and WER?

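Both are based on Levenshtein edit distance: the minimum number of insertions, deletions and substitutions needed to turn the prediction into the reference. In the formulas below, S, D and I count substitutions, deletions and insertions, and N is the length of the reference in characters or words.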

CER = (S + D + I) / N_chars_reference
WER = (S + D + I) / N_words_reference
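To make the formulas concrete, here is a minimal pure-Python sketch of the calculation. The function names (edit_distance, cer, wer) are illustrative and not taken from any library, and splitting words on whitespace is an assumption about tokenisation. It is meant to show the arithmetic, not to replace a maintained package.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    # prev[j] is the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra item in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    # Characters are the units; divide by the reference length.
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    # Words are the units; whitespace splitting is an assumption of this sketch.
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "the quick brown fox"
hypothesis = "the qulck brown fox"
print(cer(reference, hypothesis))  # 1 edit over 19 reference characters
print(wer(reference, hypothesis))  # 1 edit over 4 reference words
```

The same edit_distance function serves both metrics because it works on any sequence: a string for CER, a list of words for WER.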

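WER is usually higher than CER because a single wrong character makes the entire word count as an error at word level. A model with 2% CER might show 8-10% WER, which is normal, not a contradiction. The ordering is typical rather than guaranteed: one word replaced by a long run of garbage costs a single word error but many character errors.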
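A back-of-envelope estimate shows where a figure like 8-10% comes from. The sketch below assumes that character errors are independent and that every word has the same length, neither of which holds exactly for real output, so treat the result as an order-of-magnitude check only. The function name approx_wer is hypothetical.

```python
def approx_wer(char_error_rate: float, word_length: int) -> float:
    """Rough WER estimate: a word is right only if all of its characters are right.

    Assumes independent character errors and a fixed word length.
    """
    return 1.0 - (1.0 - char_error_rate) ** word_length


for length in (4, 5, 8):
    print(length, round(approx_wer(0.02, length), 3))
```

With 2% CER, four- and five-letter words give roughly 7.8% and 9.6% WER under these assumptions, and longer words push the estimate higher.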

How do you compute them in practice?

Use a maintained library rather than hand-rolling edit distance. The Python jiwer package handles both metrics:

```python
import jiwer

reference = "the quick brown fox"
hypothesis = "the qulck brown fox"

print("CER:", jiwer.cer(reference, hypothesis))  # 1 of 19 characters wrong: 0.0526...
print("WER:", jiwer.wer(reference, hypothesis))  # 1 of 4 words wrong: 0.25
```

For a whole evaluation set, pass lists of strings (one per line or page) so the totals aggregate correctly rather than averaging per-line rates, which over-weights short lines.
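The difference between aggregating totals and averaging per-line rates can be seen in a small stdlib-only sketch. The helper names (corpus_cer, mean_line_cer) and the two sample lines are made up for illustration; the edit distance is hand-rolled here only so the example runs without extra dependencies.

```python
def edit_distance(ref, hyp) -> int:
    # Standard dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def corpus_cer(references, hypotheses) -> float:
    # Aggregate: total edits over total reference characters.
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


def mean_line_cer(references, hypotheses) -> float:
    # Average of per-line rates: every line counts equally, however short.
    rates = [edit_distance(r, h) / len(r) for r, h in zip(references, hypotheses)]
    return sum(rates) / len(rates)


references = ["a long line of clean text here", "ok"]
hypotheses = ["a long line of clean text here", "ak"]

print("aggregated:", corpus_cer(references, hypotheses))
print("per-line mean:", mean_line_cer(references, hypotheses))
```

One wrong character in a two-character line drives the per-line mean to 25%, while the aggregated rate stays close to 3%, which is the more faithful picture of this tiny corpus.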

What's a good error rate, and for what purpose?

The threshold depends entirely on use.

CER	Quality	Suitable for
> 10%	Poor	Triage only, heavy correction
5-10%	Marginal	Rough full-text search
2.5-5%	Usable	Searchable text, light editing
1-2.5%	Good	Most research uses
< 1%	Excellent	Near-publication quality

Full-text discovery tolerates higher error than a scholarly edition. A corpus at 5% CER is searchable enough for discovery, although some queries will miss words that contain errors; the same corpus is not ready to print as a critical text.
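If the bands are used in reports or dashboards, they can be encoded as a small lookup. This is only a sketch: the function name quality_band is hypothetical, and the handling of values that fall exactly on a boundary is an assumption chosen to match the "under 5%, under 2.5%, under 1%" wording.

```python
def quality_band(cer: float) -> str:
    """Map a CER given as a fraction (0.03 means 3%) to the band in the table above."""
    if cer < 0.01:
        return "Excellent"
    if cer < 0.025:
        return "Good"
    if cer < 0.05:
        return "Usable"
    if cer <= 0.10:  # assumption: exactly 10% still counts as Marginal
        return "Marginal"
    return "Poor"


for rate in (0.004, 0.02, 0.03, 0.08, 0.15):
    print(f"{rate:.1%} -> {quality_band(rate)}")
```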

Should you normalise before scoring?

This decision changes your numbers, so document it. Two valid stances:

Strict (raw): compare exactly as produced, including case, punctuation and the long-s. Honest, but it penalises differences that are not recognition errors, such as a transcription convention that modernises the long-s.
Normalised: lowercase, collapse whitespace, and optionally fold archaic glyphs before scoring — applied identically to prediction and reference.

The cardinal rule: whatever normalisation you choose, apply it to both sides and report it, or your CER is not comparable to anyone else's.
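One way to follow that rule is to route both strings through a single function before scoring. The sketch below is a minimal example under assumptions: the function name normalise and the fold_archaic flag are hypothetical, and the only archaic glyph it folds is the long-s. A real policy would list every folding it applies.

```python
import re


def normalise(text: str, fold_archaic: bool = True) -> str:
    """Lowercase, optionally fold the long-s, and collapse whitespace."""
    text = text.lower()
    if fold_archaic:
        text = text.replace("\u017f", "s")  # long-s (ſ) to plain s
    # Collapse any run of spaces, tabs or newlines into one space.
    return re.sub(r"\s+", " ", text).strip()


reference = "The  Quick\tſame Fox "
hypothesis = "the quick same fox"

# The same function is applied to both sides before any metric is computed.
print(normalise(reference) == normalise(hypothesis))
```

Keeping the policy in one function also makes it easy to report: the function body is the documentation of what was normalised.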

How do you build a trustworthy test set?

Your evaluation is only as good as the held-out data:

Hold it out. Never evaluate on pages the model trained on — that flatters the score.
Make it representative. Include your worst hands, faded pages and unusual layouts, not just the easy ones.
Size it sensibly. A few thousand words across varied pages; tiny sets give noisy rates that swing wildly between runs.
Freeze it. Use the same test set across experiments so changes in CER reflect the model, not the data.
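Holding out and freezing can be made mechanical by deriving the split from a stable page identifier instead of a random shuffle. The following is a sketch under assumptions: is_test_page, the page ID format and the 10% share are all hypothetical. It does nothing to guarantee representativeness, so hard pages may still need to be added to the test set by hand.

```python
import hashlib


def is_test_page(page_id: str, test_percent: int = 10) -> bool:
    """Assign a page to the test set based on a hash of its ID.

    The same ID always lands on the same side, so the split stays frozen
    even when new pages are added later.
    """
    digest = hashlib.sha256(page_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return bucket < test_percent


page_ids = [f"page-{i:04d}" for i in range(1000)]  # made-up identifiers
test_pages = [p for p in page_ids if is_test_page(p)]
train_pages = [p for p in page_ids if not is_test_page(p)]
print(len(test_pages), len(train_pages))
```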

How do you use these numbers to drive decisions?

Treat CER as the measured outcome in a simple experiment loop: baseline, change one thing, re-measure on the frozen test set, keep what lowers CER. This is how you objectively choose between Tesseract and Kraken, justify a preprocessing recipe, or prove that another 50 pages of ground truth were worth transcribing. Per-character confusion analysis then tells you where the remaining errors live so you can target them.
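A rough confusion count can be produced with the standard library alone. The sketch below uses difflib.SequenceMatcher, whose alignment is similar to but not identical with a Levenshtein alignment, and it only counts equal-length substitution runs, so the tallies are approximate. The function name substitution_counts and the sample pairs are illustrative.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitution_counts(pairs):
    """Count (reference_char, predicted_char) substitutions across line pairs."""
    counts = Counter()
    for reference, hypothesis in pairs:
        matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
        for tag, r0, r1, h0, h1 in matcher.get_opcodes():
            # Only equal-length replacements can be paired character by character.
            if tag == "replace" and (r1 - r0) == (h1 - h0):
                for ref_char, hyp_char in zip(reference[r0:r1], hypothesis[h0:h1]):
                    counts[(ref_char, hyp_char)] += 1
    return counts


pairs = [
    ("the quick brown fox", "the qulck brown fox"),
    ("quiet", "qulet"),
]
for (ref_char, hyp_char), n in substitution_counts(pairs).most_common(5):
    print(f"{ref_char!r} read as {hyp_char!r}: {n}")
```

A pair that dominates the list, such as i read as l here, points to a specific weakness that more targeted ground truth or preprocessing might address.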

Key Takeaways

CER measures character errors, WER measures word errors; WER is usually the higher of the two.
Both rely on Levenshtein edit distance over a held-out reference.
Use a library like jiwer; aggregate totals rather than averaging per-line rates.
Rough quality bands: under 5% usable, under 2.5% good, under 1% excellent.
Document and apply any normalisation identically to prediction and reference.
Freeze a representative, held-out test set so CER changes reflect the model alone.

Frequently Asked Questions

What is the difference between CER and WER?

CER (character error rate) measures errors per character; WER (word error rate) measures errors per word. WER is usually higher than CER because a single wrong character makes the whole word wrong.

What CER counts as a good HTR model?

Roughly: under 5% CER is usable, under 2.5% is good, and under 1% is excellent. The acceptable threshold depends on whether output is for full-text search or scholarly publication.

How is CER actually calculated?

It is the Levenshtein edit distance (insertions, deletions, substitutions) between predicted and reference text, divided by the number of characters in the reference.

Why is my WER so much higher than my CER?

Because WER scores whole words. One wrong character marks an entire word wrong, so even a low CER can produce a noticeably higher WER, especially when words are long or the errors are spread across many different words.

Should I normalise text before scoring?

Decide and document a policy. Comparing raw output to raw ground truth is strictest; light normalisation (whitespace, case) gives a fairer engine comparison but must be applied identically to both sides.

How big should my test set be?

Large enough to be representative of your hardest cases — often a few thousand words across varied pages. Tiny test sets give noisy, misleading error rates.
