Troubleshooting: Assess OCR quality impact on analysis

To assess how OCR quality affects your analysis, measure error with proxy signals when you lack ground truth (in-dictionary rate, token length, non-alphabetic share), match an acceptable threshold to your method's sensitivity, and run every key result at two quality thresholds to show it isn't an artefact of noise. OCR error is not random — it hits long and rare words hardest — so it biases findings in predictable, fixable ways. This guide diagnoses the usual symptoms.

How do I measure OCR quality with no ground truth?

You rarely have transcriptions for a whole corpus, so use proxies that correlate strongly with true error:

python

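import re

def quality_proxies(text, vocab):
    """Return rough OCR-quality proxies for one document (vocab: set of lowercase words)."""
    # Alphabetic tokens only; digits and punctuation are measured separately below
    toks = re.findall(r"[A-Za-z]+", text.lower())
    if not toks:
        # No recognisable words at all: report the worst case
        return {"in_dict": 0.0, "mean_len": 0.0, "nonalpha": 1.0}
    in_dict = sum(t in vocab for t in toks) / len(toks)   # share of dictionary words
    mean_len = sum(len(t) for t in toks) / len(toks)      # average token length
    # Share of characters that are neither letters nor whitespace
    nonalpha = len(re.findall(r"[^A-Za-z\s]", text)) / max(len(text), 1)
    return {"in_dict": in_dict, "mean_len": mean_len, "nonalpha": nonalpha}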

A document where only 55% of tokens are in-dictionary is almost certainly badly OCR'd. Validate the proxies once against a small hand-transcribed sample, then trust them at scale.
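The validation step can be sketched as follows: compute the character error rate (CER) on the hand-transcribed sample, then compare it with the proxy values for the same documents. This is a minimal sketch, not a reference implementation; the function names and the sample pair are illustrative assumptions, and CER is taken here as edit distance divided by the length of the transcription.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def cer(ocr_text: str, truth: str) -> float:
    """Character error rate: edit distance over ground-truth length."""
    return levenshtein(ocr_text, truth) / max(len(truth), 1)

# Hypothetical (OCR output, hand transcription) pair from the validation sample
sample = [("Tlie sbip sailed at dawn", "The ship sailed at dawn")]
for ocr_text, truth in sample:
    print(round(cer(ocr_text, truth), 3))
```

If documents with a high CER also show a low in-dictionary rate across the sample, the proxy is doing its job; if not, recalibrate before applying it to the whole corpus.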

What error rate is acceptable for my method?

Different analyses tolerate different noise. Match the threshold to the task:

Method	Approx. CER tolerance	Why
Top word frequencies	up to ~10%	Common words survive errors
Topic modelling	~5%	Errors create junk topics
Named-entity recognition	~3-5%	Proper nouns are rare and fragile
Collocations / rare words	< 3%	Distinctive words hit hardest

If your method sits near the bottom of this table, a verdict of "the OCR is fine" reached in a word-frequency study does not transfer to your analysis.

Why did my topic model grow a junk topic?

Symptom: one topic is dominated by tlie, tbe, fhe, arid. Root cause: noisy documents cluster by their shared OCR errors rather than content. Two fixes:

Filter — drop documents below your in-dictionary threshold before modelling.
Stopword — add common OCR artefacts to the stoplist so they can't anchor a topic.

python

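# Common OCR artefacts that should not be allowed to anchor a topic
ocr_junk = {"tlie", "tbe", "fhe", "arid", "ofthe", "tbat", "witb"}
# docs is assumed to be a list of token lists, one list per document
docs = [[t for t in doc if t not in ocr_junk] for doc in docs]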

Filtering is usually better, because the junk topic is a symptom — the affected documents are noisy throughout, not just in those tokens.
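The filter option might look like the sketch below. It assumes raw text strings, a small illustrative vocabulary, and an in-dictionary cut-off of 0.7; the helper names and the threshold are assumptions to adapt, not fixed recommendations.

```python
import re

def in_dict_rate(text, vocab):
    """Share of alphabetic tokens found in vocab (0.0 if there are no tokens)."""
    toks = re.findall(r"[A-Za-z]+", text.lower())
    return sum(t in vocab for t in toks) / len(toks) if toks else 0.0

def filter_by_quality(texts, vocab, min_in_dict=0.7):
    """Keep documents at or above the threshold; also return how many were dropped."""
    kept = [t for t in texts if in_dict_rate(t, vocab) >= min_in_dict]
    return kept, len(texts) - len(kept)

# Toy vocabulary and documents, for illustration only
vocab = {"the", "ship", "sailed", "at", "dawn", "from", "port"}
texts = [
    "The ship sailed at dawn",
    "Tlie sbip sailcd at dawn",
    "fhe arid tbe",
]
kept, dropped = filter_by_quality(texts, vocab, min_in_dict=0.7)
print(len(kept), "kept,", dropped, "dropped")
```

Returning the dropped count alongside the kept documents makes it easy to report how much of the corpus the threshold removed.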

Does OCR error push results in a consistent direction?

Yes, and this is the crucial point for interpretation. OCR engines misread long, rare and archaic words far more often than short common ones. So a noisy corpus systematically under-counts distinctive vocabulary — exactly the words that carry the cultural signal. A "decline" in some specialised term may simply be rising OCR difficulty in older, lower-quality scans. Always check whether your trend correlates with scan quality over time.
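One way to run that check is to correlate a per-period quality score with the per-period frequency of the term you are tracking. The sketch below uses a hand-written Pearson correlation and made-up per-decade numbers; the figures and variable names are hypothetical and exist only to show the shape of the test.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)

# Hypothetical summaries: (decade, mean in-dictionary rate, term hits per 10,000 tokens)
by_decade = [
    (1850, 0.62, 1.1),
    (1860, 0.68, 1.4),
    (1870, 0.75, 1.9),
    (1880, 0.81, 2.2),
    (1890, 0.88, 2.6),
]
quality = [q for _, q, _ in by_decade]
term_freq = [f for _, _, f in by_decade]
r = pearson(quality, term_freq)
print(f"correlation between scan quality and term frequency: {r:.2f}")
```

A correlation close to 1 or -1 does not prove the trend is an artefact, but it does mean the trend cannot be separated from scan quality without further checks, such as re-running the analysis on a quality-filtered subset.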

Should I correct OCR or filter it?

For analysis at scale, filter or down-weight by quality rather than post-correcting. Automated post-correction can help readability but introduces its own systematic errors (it "fixes" rare real words toward common ones), which is worse for analysis than honest gaps. Reserve correction for close-reading targets, not whole-corpus statistics.
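Down-weighting can be as simple as letting each document contribute to a statistic in proportion to its quality score. The sketch below computes a quality-weighted term rate; using the in-dictionary rate directly as the weight is an assumption made for illustration, and other weighting schemes are equally plausible.

```python
def weighted_term_rate(docs, term):
    """docs: list of (tokens, weight) pairs.

    Returns weighted occurrences of term divided by weighted token count,
    so low-quality documents pull the estimate less than clean ones.
    """
    hits = sum(weight * tokens.count(term) for tokens, weight in docs)
    total = sum(weight * len(tokens) for tokens, weight in docs)
    return hits / total if total else 0.0

# Toy documents: (tokens, in-dictionary rate used as the weight)
docs = [
    (["the", "whaler", "sailed", "north"], 1.0),
    (["tlie", "wbaler", "sailcd", "nortb"], 0.2),
]
print(weighted_term_rate(docs, "whaler"))
```

Unlike post-correction, this never alters the text itself: the noisy document still counts, just less.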

How do I report this in a paper?

Make the sensitivity visible:

Report the quality (CER or proxy) distribution across the corpus.
State your threshold and how many documents it removed.
Re-run one headline result at two thresholds (say, in-dictionary >= 0.7 and >= 0.85). If the finding holds at both, it's robust; if it flips, the result was an OCR artefact and you've just saved yourself an embarrassing claim.
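The two-threshold check above can be sketched as a small harness that re-computes one statistic on each filtered subset. The document records, the term "whaler", the split year, and the statistic itself are all hypothetical stand-ins for your own headline result.

```python
def sensitivity_check(docs, statistic, thresholds=(0.7, 0.85)):
    """Re-run statistic on the documents that pass each in-dictionary threshold."""
    results = {}
    for th in thresholds:
        kept = [d for d in docs if d["in_dict"] >= th]
        results[th] = {
            "n_kept": len(kept),
            "n_dropped": len(docs) - len(kept),
            "value": statistic(kept),
        }
    return results

def term_rate_change(docs, term="whaler", split_year=1870):
    """Illustrative statistic: change in term rate from the early to the late period."""
    def rate(subset):
        total = sum(len(d["tokens"]) for d in subset)
        hits = sum(d["tokens"].count(term) for d in subset)
        return hits / total if total else 0.0
    early = [d for d in docs if d["year"] < split_year]
    late = [d for d in docs if d["year"] >= split_year]
    return rate(late) - rate(early)

# Toy corpus with precomputed in-dictionary rates
docs = [
    {"year": 1850, "in_dict": 0.72, "tokens": ["the", "whaler", "sailed", "north"]},
    {"year": 1855, "in_dict": 0.90, "tokens": ["a", "whaler", "and", "crew"]},
    {"year": 1880, "in_dict": 0.88, "tokens": ["the", "port", "was", "quiet"]},
    {"year": 1885, "in_dict": 0.95, "tokens": ["ships", "left", "the", "harbour"]},
    {"year": 1890, "in_dict": 0.60, "tokens": ["tlie", "wbaler", "arid", "tbe"]},
]
results = sensitivity_check(docs, term_rate_change)
values = [r["value"] for r in results.values()]
# Treat the finding as robust only if the direction of change agrees at both thresholds
robust = all(v > 0 for v in values) or all(v < 0 for v in values)
print(results, "robust:", robust)
```

The per-threshold `n_dropped` values double as the "documents removed" figure for the write-up.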

Key Takeaways

Without ground truth, estimate OCR quality from in-dictionary rate, token length and non-alphabetic share.
Match your acceptable error threshold to the method — NER and collocations need far cleaner text than word frequencies.
A junk topic of OCR fragments means noisy documents clustered by their errors; filter them out.
OCR error is directional: it under-counts long, rare and distinctive words, biasing cultural signals.
Check whether trends correlate with scan quality over time before declaring them real.
Prefer filtering/down-weighting over automated post-correction for whole-corpus statistics.
Report the quality distribution, threshold, documents dropped, and a two-threshold sensitivity check.

Frequently Asked Questions

How do I measure OCR quality without ground truth?

Use proxy signals: the proportion of in-dictionary tokens, mean token length, and the rate of non-alphabetic characters. They correlate well with the true error rate and need no corpus-wide transcription, only a small hand-transcribed sample to validate them.

What character error rate is 'good enough' for analysis?

It depends on the method. Word frequencies tolerate up to ~10% CER; topic modelling and NER degrade noticeably above ~5%; collocation and rare-word work need cleaner text still.

Why did my topic model produce a 'junk' topic?

A junk topic full of OCR fragments (tlie, tbe, fhe) means noisy documents clustered by their errors. Filter high-error documents or add OCR artefacts to the stopword list.

Does OCR error bias results in one direction?

Yes. Errors disproportionately hit long, rare and non-standard words, so noisy corpora systematically under-count exactly the distinctive vocabulary you often care about.

Should I correct OCR or just filter bad documents?

For analysis at scale, filtering or down-weighting by quality is usually cheaper and more honest than post-correction, which can introduce its own systematic errors.

How do I report OCR quality in a paper?

State the CER (or proxy) distribution, the threshold you applied, how many documents you dropped, and re-run a key result at two thresholds to show sensitivity.
