To assess how OCR quality affects your analysis, measure error with proxy signals when you lack ground truth (in-dictionary rate, token length, non-alphabetic share), match an acceptable threshold to your method's sensitivity, and run every key result at two quality thresholds to show it isn't an artefact of noise. OCR error is not random — it hits long and rare words hardest — so it biases findings in predictable, fixable ways. This guide diagnoses the usual symptoms.
How do I measure OCR quality with no ground truth?
You rarely have transcriptions for a whole corpus, so use proxies that correlate strongly with true error:
```python
import re

def quality_proxies(text, vocab):
    """Return cheap OCR-quality proxies for one document."""
    # Alphabetic tokens only, lower-cased so they match a lower-case vocab.
    toks = re.findall(r"[A-Za-z]+", text.lower())
    if not toks:
        # No alphabetic tokens at all: report worst-case values.
        return {"in_dict": 0, "mean_len": 0, "nonalpha": 1}
    in_dict = sum(t in vocab for t in toks) / len(toks)
    mean_len = sum(len(t) for t in toks) / len(toks)
    # Share of characters that are neither letters nor whitespace.
    nonalpha = len(re.findall(r"[^A-Za-z\s]", text)) / max(len(text), 1)
    return {"in_dict": in_dict, "mean_len": mean_len, "nonalpha": nonalpha}
```

A document where only 55% of tokens are in-dictionary is almost certainly badly OCR'd. Validate the proxies once against a small hand-transcribed sample, then trust them at scale.
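One way to run that one-off validation is to compute the true character error rate (CER) on the hand-transcribed sample and check how strongly a proxy tracks it. The sketch below is only an illustration: the helper names (`levenshtein`, `cer`, `pearson`) are hypothetical, and Pearson correlation is an assumed choice of agreement measure, not something the guide prescribes.

```python
import math

def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ocr_text, truth):
    """Character error rate of OCR output against a hand transcription."""
    return levenshtein(ocr_text, truth) / max(len(truth), 1)

def pearson(xs, ys):
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)
```

For a useful proxy, `pearson(in_dict_values, cer_values)` over the sample should be clearly negative: as the in-dictionary rate falls, CER rises.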
What error rate is acceptable for my method?
Different analyses tolerate different noise. Match the threshold to the task:
| Method | Approx. CER tolerance | Why |
|---|---|---|
| Top word frequencies | up to ~10% | Common words survive errors |
| Topic modelling | ~5% | Errors create junk topics |
| Named-entity recognition | ~3-5% | Proper nouns are rare and fragile |
| Collocations / rare words | < 3% | Distinctive words hit hardest |
If your method sits at the bottom of this table, "the OCR is fine" from a frequency study does not transfer.
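If it helps to make the table executable, the ceilings can be kept as a small lookup and checked before each analysis. This is a sketch only: the dictionary keys and the `clean_enough` helper are hypothetical names, the NER entry takes the lower end of the ~3-5% range as a conservative assumption, and the figures are approximate, so results near a boundary should not be read as precise.

```python
# Approximate CER ceilings from the table above, expressed as fractions.
CER_TOLERANCE = {
    "word_frequencies": 0.10,
    "topic_modelling": 0.05,
    "ner": 0.03,           # assumption: lower end of the ~3-5% range
    "collocations": 0.03,
}

def clean_enough(method, estimated_cer):
    """True if the estimated CER is below the rough ceiling for the method."""
    return estimated_cer < CER_TOLERANCE[method]
```

A corpus with an estimated CER of 8% would pass for word frequencies but fail for every other row.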
Why did my topic model grow a junk topic?
Symptom: one topic is dominated by tlie, tbe, fhe, arid. Root cause: noisy documents cluster by their shared OCR errors rather than content. Two fixes:
- Filter — drop documents below your in-dictionary threshold before modelling.
- Stopword — add common OCR artefacts to the stoplist so they can't anchor a topic.
```python
# Frequent OCR misreadings of common function words.
ocr_junk = {"tlie", "tbe", "fhe", "arid", "ofthe", "tbat", "witb"}
# docs is a list of token lists, one per document.
docs = [[t for t in doc if t not in ocr_junk] for doc in docs]
```

Filtering is usually better, because the junk topic is a symptom: the affected documents are noisy throughout, not just in those tokens.
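The filtering route can be sketched in the same style. The function names and the default threshold of 0.7 are assumptions for illustration (0.7 is one of the example thresholds used in the reporting section); `docs` is again assumed to be a list of token lists.

```python
def in_dict_rate(tokens, vocab):
    """Share of tokens found in the vocabulary; 0.0 for an empty document."""
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)

def filter_by_quality(docs, vocab, threshold=0.7):
    """Keep documents at or above the in-dictionary threshold.

    Returns the kept documents and the number dropped, so the count
    can be reported alongside the threshold.
    """
    kept = [doc for doc in docs if in_dict_rate(doc, vocab) >= threshold]
    return kept, len(docs) - len(kept)
```

Returning the dropped count alongside the kept documents makes it easy to state how many documents the threshold removed.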
Does OCR error push results in a consistent direction?
Yes, and this is the crucial point for interpretation. OCR engines misread long, rare and archaic words far more often than short common ones. So a noisy corpus systematically under-counts distinctive vocabulary — exactly the words that carry the cultural signal. A "decline" in some specialised term may simply be rising OCR difficulty in older, lower-quality scans. Always check whether your trend correlates with scan quality over time.
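A rough version of that check: average the quality proxy and the term's rate per year, then see whether the two series move together. Everything here is an illustrative assumption, including the record layout `(year, in_dict_rate, term_rate)`, the function names, and the use of Pearson correlation.

```python
import math
from collections import defaultdict

def yearly_means(records):
    """records: iterable of (year, in_dict_rate, term_rate) per document.

    Returns a sorted list of (year, mean_quality, mean_term_rate).
    """
    by_year = defaultdict(list)
    for year, quality, rate in records:
        by_year[year].append((quality, rate))
    out = []
    for year in sorted(by_year):
        rows = by_year[year]
        mean_q = sum(q for q, _ in rows) / len(rows)
        mean_r = sum(r for _, r in rows) / len(rows)
        out.append((year, mean_q, mean_r))
    return out

def quality_trend_correlation(records):
    """Pearson correlation between yearly mean quality and yearly term rate."""
    means = yearly_means(records)
    xs = [q for _, q, _ in means]
    ys = [r for _, _, r in means]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)
```

A strong positive correlation does not prove the trend is an artefact, but it does mean scan quality has to be ruled out before the trend is interpreted.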
Should I correct OCR or filter it?
For analysis at scale, filter or down-weight by quality rather than post-correcting. Automated post-correction can help readability but introduces its own systematic errors (it "fixes" rare real words toward common ones), which is worse for analysis than honest gaps. Reserve correction for close-reading targets, not whole-corpus statistics.
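Down-weighting can be as simple as letting each document contribute to a count in proportion to its quality score. The sketch below assumes the in-dictionary rate itself is used as the weight, which is one possible scheme among many; `weighted_term_counts` is a hypothetical helper name.

```python
from collections import Counter

def weighted_term_counts(docs, qualities):
    """Sum term counts across documents, weighting each document by quality.

    docs: list of token lists; qualities: one weight in [0, 1] per document
    (assumed here to be the in-dictionary rate).
    """
    totals = Counter()
    for tokens, weight in zip(docs, qualities):
        for term, n in Counter(tokens).items():
            totals[term] += weight * n
    return totals
```

Unlike a hard filter, this keeps noisy documents in the corpus but limits how much they can move any statistic.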
How do I report this in a paper?
Make the sensitivity visible:
- Report the quality (CER or proxy) distribution across the corpus.
- State your threshold and how many documents it removed.
- Re-run one headline result at two thresholds (say, in-dictionary `>= 0.7` and `>= 0.85`). If the finding holds at both, it's robust; if it flips, the result was an OCR artefact and you've just saved yourself an embarrassing claim.
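The two-threshold check from the list above can be wrapped in a small helper that re-computes any statistic on the surviving documents. This is a minimal sketch: `run_at_thresholds` and `term_rate` are hypothetical names, and the relative frequency of a single term stands in for whatever your headline result actually is.

```python
def run_at_thresholds(docs, qualities, statistic, thresholds=(0.7, 0.85)):
    """Re-compute a statistic on the documents that pass each quality threshold.

    docs: list of token lists; qualities: in-dictionary rate per document;
    statistic: function taking a list of token lists and returning a number.
    """
    results = {}
    for threshold in thresholds:
        subset = [d for d, q in zip(docs, qualities) if q >= threshold]
        results[threshold] = {"n_docs": len(subset), "value": statistic(subset)}
    return results

def term_rate(term):
    """Build a statistic: relative frequency of `term` across all tokens."""
    def stat(docs):
        total = sum(len(d) for d in docs)
        hits = sum(d.count(term) for d in docs)
        return hits / total if total else float("nan")
    return stat
```

Calling `run_at_thresholds(docs, qualities, term_rate("whale"))` returns the value and the surviving document count at each threshold, which covers both the sensitivity check and the documents-dropped figure.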
Key Takeaways
- Without ground truth, estimate OCR quality from in-dictionary rate, token length and non-alphabetic share.
- Match your acceptable error threshold to the method — NER and collocations need far cleaner text than word frequencies.
- A junk topic of OCR fragments means noisy documents clustered by their errors; filter them out.
- OCR error is directional: it under-counts long, rare and distinctive words, biasing cultural signals.
- Check whether trends correlate with scan quality over time before declaring them real.
- Prefer filtering/down-weighting over automated post-correction for whole-corpus statistics.
- Report the quality distribution, threshold, documents dropped, and a two-threshold sensitivity check.
Frequently Asked Questions
How do I measure OCR quality without ground truth?
Use proxy signals: the proportion of in-dictionary tokens, mean token length, and the rate of non-alphabetic characters. They correlate well with true error rate and need no transcription.
What character error rate is 'good enough' for analysis?
It depends on the method. Word frequencies tolerate up to ~10% CER; topic modelling and NER degrade noticeably above ~5%; collocation and rare-word work need cleaner text still.
Why did my topic model produce a 'junk' topic?
A junk topic full of OCR fragments (tlie, tbe, fhe) means noisy documents clustered by their errors. Filter high-error documents or add OCR artefacts to the stopword list.
Does OCR error bias results in one direction?
Yes. Errors disproportionately hit long, rare and non-standard words, so noisy corpora systematically under-count exactly the distinctive vocabulary you often care about.
Should I correct OCR or just filter bad documents?
For analysis at scale, filtering or down-weighting by quality is usually cheaper and more honest than post-correction, which can introduce its own systematic errors.
How do I report OCR quality in a paper?
State the CER (or proxy) distribution, the threshold you applied, how many documents you dropped, and re-run a key result at two thresholds to show sensitivity.