To clean noisy corpus text, work in stages on copies: fix encoding, repair layout artefacts (hyphenated breaks, headers, page numbers), then handle OCR errors with measured, reviewed rules — and stop before you erase genuine historical variation. The discipline that matters most is keeping raw text read-only and writing every change as repeatable code, so you can always say exactly what you altered and why. Treat cleaning as a documented pipeline, never as ad-hoc edits.
Noise is anything that distorts counts without carrying meaning. The hard part of cleaning historical text is that one person's noise — an archaic spelling, a long-s — is another person's evidence. So the question is never "how clean can I make this?" but "what must I fix for this analysis without losing what matters?"
How do I diagnose the noise before fixing anything?
Measure before you edit. Sample about 20 random pages and tally what actually goes wrong; the snippet below does this for up to 40 random lines of a single file:
```python
import random
import re
from collections import Counter

# Read one raw document and split it into lines.
with open("raw/doc_0007.txt", encoding="utf-8") as f:
    lines = f.read().split("\n")

sample = random.sample(lines, min(40, len(lines)))  # up to 40 random lines
junk = Counter()
for ln in sample:
    if re.search(r"[^\W\d_]\d|\d[^\W\d_]", ln):  # digit touching a letter, e.g. "c1ear"
        junk["digit-in-word"] += 1
    if re.search(r"-\s*$", ln):  # line ends in a hyphen
        junk["hyphen-break"] += 1
    if re.match(r"^\s*\d+\s*$", ln):  # line holds nothing but a number
        junk["page-number"] += 1
print(junk)
```

This tells you which problems are frequent and worth automating versus rare ones better left alone. Cleaning effort should follow the diagnosis, not a generic checklist.
What is a safe order for cleaning steps?
Order matters because each step changes what the next one sees. A reliable sequence:
1. Encoding: force UTF-8, strip BOM and control characters.
2. Layout: rejoin hyphenated line breaks, drop running headers and bare page numbers.
3. Whitespace: collapse multiple spaces and blank lines.
4. OCR substitutions: apply a reviewed, conservative correction list.
5. Later, as a separate stage: case-folding, normalisation, tokenisation.
Steps 1–4 are mechanical and reversible in principle; step 5 is interpretive and belongs in the analysis pipeline, not the stored corpus.
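The encoding and whitespace stages have no code elsewhere in this guide, so here is a minimal sketch of both. The function names `fix_encoding` and `fix_whitespace` are illustrative assumptions, as is the choice to replace undecodable bytes with U+FFFD rather than raise an error.

```python
import re
import unicodedata

def fix_encoding(raw: bytes) -> str:
    """Stage 1: decode as UTF-8, drop the BOM and control characters."""
    # "utf-8-sig" strips a leading BOM if present; undecodable bytes become
    # U+FFFD so they stay visible instead of vanishing silently (assumption).
    text = raw.decode("utf-8-sig", errors="replace")
    # Keep newlines and tabs; drop every other character in category Cc.
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )

def fix_whitespace(text: str) -> str:
    """Stage 3: collapse runs of spaces/tabs and runs of blank lines."""
    text = re.sub(r"[ \t]+", " ", text)   # runs of spaces or tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)  # trim spaces around line breaks
    return re.sub(r"\n{3,}", "\n\n", text)
```

Because carriage returns are control characters, this sketch turns CRLF line endings into LF; files that use a bare CR as the line break would need that converted first.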
How do I fix layout artefacts in code?
The layout layer is where the biggest, safest gains live:
```python
import re

def fix_layout(text: str) -> str:
    # Rejoin a word split across lines, tolerating stray spaces at the break.
    # Note: this also merges genuine compounds that happen to break at a hyphen.
    text = re.sub(r"-[ \t]*\n[ \t]*(\w)", r"\1", text)
    # Blank out bare page numbers; [ \t] keeps each match on a single line.
    text = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.M)
    # Collapse runs of blank lines.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

Running headers (CHAPTER IV — 47) repeat on every page; detect them by finding short lines that recur with high frequency across the document, then remove only those.
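One way to sketch that header detection is to count short lines after masking their digits, so that the same header with different page numbers collapses to one pattern. The names `find_running_headers` and `drop_running_headers` and the thresholds `max_len=40` and `min_count=5` are assumptions to tune per corpus, and the detected set should be reviewed by eye before anything is removed.

```python
import re
from collections import Counter

def find_running_headers(text: str, max_len: int = 40, min_count: int = 5) -> set[str]:
    """Return digit-masked patterns of short lines that recur often."""
    counts = Counter()
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped and len(stripped) <= max_len:
            # Mask digits so headers differing only by page number match.
            counts[re.sub(r"\d+", "#", stripped)] += 1
    return {pattern for pattern, n in counts.items() if n >= min_count}

def drop_running_headers(text: str, headers: set[str]) -> str:
    """Remove only lines whose masked form is in the reviewed header set."""
    kept = [
        line for line in text.split("\n")
        if re.sub(r"\d+", "#", line.strip()) not in headers
    ]
    return "\n".join(kept)
```

Short lines that legitimately repeat, such as a refrain or a speaker label, will also pass the frequency test, which is why the review step matters.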
How should I handle OCR errors without overcorrecting?
Auto-correcting every token will invent words and hide real archaic spellings. Instead, build a small reviewed substitution map for the errors your diagnosis flagged as frequent and unambiguous:
| OCR reads | True text | Safe to auto-fix? |
|---|---|---|
| rn | m | Only in known words |
| 1 (one) | l (ell) | Context-dependent |
| tbe | the | Yes |
| arid | and | Usually |
| ſ (long-s) | s | Yes |
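```python
# Reviewed substitutions: frequent, unambiguous OCR misreads only.
SUBS = {"tbe": "the", "Tbe": "The", "arid": "and"}

def fix_ocr(tokens):
    # Replace a token only on an exact match; everything else passes through.
    return [SUBS.get(t, t) for t in tokens]
```

For pages where OCR confidence is low across the board, the honest fix is re-OCR or handwritten text recognition (HTR), not a longer substitution table.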
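The "only in known words" condition in the table's rn/m row can be made concrete with a vocabulary check. The sketch below is one possible approach, not a complete corrector: `fix_rn` and the tiny `VOCAB` set are hypothetical, and a real word list would come from a period-appropriate dictionary.

```python
def fix_rn(token: str, vocab: set[str]) -> str:
    """Replace 'rn' with 'm' only when that turns a non-word into a known word."""
    if "rn" not in token or token.lower() in vocab:
        return token  # nothing to fix, or already a known word such as "burn"
    candidate = token.replace("rn", "m")
    # Accept the correction only if the result is in the vocabulary.
    return candidate if candidate.lower() in vocab else token

# Hypothetical word list, for illustration only.
VOCAB = {"modern", "burn", "summer"}

print(fix_rn("surnrner", VOCAB))  # a misread of "summer"
print(fix_rn("burn", VOCAB))      # genuine "rn", left untouched
print(fix_rn("tavern", VOCAB))    # unknown word, left untouched
```

Tokens that are neither known words nor fixable by the rule pass through unchanged, so the rule can miss errors but should not invent words.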
How do I keep the cleaning auditable?
Reproducibility is what separates cleaning from tampering. Three rules:
- Raw text is read-only; pipelines write to a new `derived/` folder.
- Every rule lives in version-controlled code with a one-line comment of intent.
- Keep a `diff` sample: 10 before/after lines per rule, committed alongside the code.
```bash
diff <(head -200 raw/doc_0007.txt) <(head -200 derived/doc_0007.txt) | head
```

If a reviewer asks "what did you change?", the answer should be a script and a diff, not your memory.
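The per-rule before/after sample can also be produced in Python. This is a rough sketch that assumes each rule can be written as a function from one line to one line; `diff_sample` and `rule_tbe` are hypothetical names, and rules that span line breaks (such as hyphen rejoining) would need a different comparison.

```python
import re

def diff_sample(text: str, rule, limit: int = 10) -> list[tuple[str, str]]:
    """Apply one line-level rule and collect up to `limit` changed lines."""
    pairs = []
    for line in text.split("\n"):
        changed = rule(line)
        if changed != line:
            pairs.append((line, changed))  # (before, after)
            if len(pairs) == limit:
                break
    return pairs

# Hypothetical rule: the "tbe" -> "the" substitution, applied per line.
def rule_tbe(line: str) -> str:
    return re.sub(r"\btbe\b", "the", line)

for before, after in diff_sample("tbe cat sat\nno change here\nsaw tbe dog", rule_tbe):
    print(f"- {before}\n+ {after}")
```

Writing that output to a text file next to the rule gives reviewers the sample without having to rerun the pipeline.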
When is the corpus clean enough to stop?
Clean for the analysis, not for an abstract ideal. Re-run your headline measurement after each pass:
```python
# relative frequency of a target term, before vs after a cleaning pass
text = open("derived/doc_0007.txt", encoding="utf-8").read()
print(round(text.lower().count("liberty") / len(text.split()), 5))
```

When an extra cleaning step no longer moves that number, you have reached diminishing returns. Over-cleaning past that point only risks deleting real variation.
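To watch the measurement across several passes at once, the check can be wrapped in a small loop. The sketch below is an assumption about how one might organise it: `rel_freq` and `track_measurement` are hypothetical helpers, and `rel_freq` counts whole tokens rather than substrings, so its values can differ slightly from the one-liner above.

```python
import re

def rel_freq(text: str, term: str) -> float:
    """Share of word tokens that equal `term` (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    return tokens.count(term.lower()) / len(tokens) if tokens else 0.0

def track_measurement(text: str, passes, term: str):
    """Return [(pass_name, relative_frequency)] after each pass, in order."""
    history = [("raw", rel_freq(text, term))]
    for name, fn in passes:
        text = fn(text)  # each pass sees the output of the previous one
        history.append((name, rel_freq(text, term)))
    return history

# Illustrative passes: rejoin hyphen breaks, then collapse double spaces.
passes = [
    ("rejoin", lambda t: re.sub(r"-\n(\w)", r"\1", t)),
    ("spaces", lambda t: re.sub(r" {2,}", " ", t)),
]
for name, freq in track_measurement("tbe liberty of tbe press and lib-\nerty", passes, "liberty"):
    print(name, round(freq, 5))
```

A pass whose value matches the previous row is not moving the measurement; whether a small nonzero change still matters depends on the analysis.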
Key Takeaways
- Diagnose the noise on a sample before writing any cleaning rule.
- Clean in stages: encoding, layout, whitespace, then reviewed OCR fixes.
- Keep mechanical cleaning separate from interpretive normalisation.
- Never edit master files in place; write derivatives from versioned code.
- Use a small reviewed substitution map, not blanket auto-correction.
- Stop cleaning once it stops changing your key measurement.
Frequently Asked Questions
What counts as noise in a historical corpus?
Noise is anything that distorts counts without carrying meaning: OCR misreads like 'tbe', hyphenated line breaks, running headers, page numbers, control characters and inconsistent spelling. The goal is to remove distortion while preserving genuine historical variation.
Should I fix historical spelling during cleaning?
Keep cleaning and spelling normalisation as separate stages. Cleaning removes mechanical noise such as OCR errors and layout artefacts; normalising 'olde' to 'old' is an interpretive choice you should make consciously, log, and be able to switch off.
How do I clean without destroying evidence?
Never edit your master files in place. Keep raw text read-only, write every cleaning step as code that produces a new derivative, and store before-and-after samples so you can audit exactly what each rule changed.
What is the single most common cleaning mistake?
Lowercasing and stripping punctuation too early, before sentence segmentation or named-entity work. Those steps destroy signal you cannot recover, so push them as late in the pipeline as the analysis allows.
How do I deal with OCR errors specifically?
Measure them first by sampling, then target the frequent, unambiguous substitutions (such as 'tbe' for 'the') with a reviewed dictionary, and flag low-confidence pages for re-OCR rather than guessing. Riskier patterns such as 'rn' for 'm' should be applied only where the result is a known word. Do not attempt to auto-correct every token.
How do I know when the corpus is clean enough?
Stop when remaining noise no longer changes your results: re-run your key measurement after each cleaning pass and watch for the point where extra cleaning stops moving the numbers. Clean for the analysis, not for perfection.