Clean noisy corpus text: A Practical Guide

To clean noisy corpus text, work in stages on copies: fix encoding, repair layout artefacts (hyphenated breaks, headers, page numbers), then handle OCR errors with measured, reviewed rules — and stop before you erase genuine historical variation. The discipline that matters most is keeping raw text read-only and writing every change as repeatable code, so you can always say exactly what you altered and why. Treat cleaning as a documented pipeline, never as ad-hoc edits.

Noise is anything that distorts counts without carrying meaning. The hard part of cleaning historical text is that one person's noise — an archaic spelling, a long-s — is another person's evidence. So the question is never "how clean can I make this?" but "what must I fix for this analysis without losing what matters?"

How do I diagnose the noise before fixing anything?

Measure before you edit. Draw a random sample of lines (40 from one document in the snippet below) and tally what actually goes wrong:

python

import random, re
from collections import Counter

lines = open("raw/doc_0007.txt", encoding="utf-8").read().split("\n")
sample = random.sample(lines, min(40, len(lines)))
junk = Counter()
for ln in sample:
    if re.search(r"[^\W\d_]\d|\d[^\W\d_]", ln): junk["digit-in-word"] += 1  # digit touching a letter, e.g. "c1ear"; bare numbers are not counted
    if re.search(r"-\s*$", ln):                 junk["hyphen-break"] += 1
    if re.match(r"^\s*\d+\s*$", ln):            junk["page-number"] += 1
print(junk)

This tells you which problems are frequent and worth automating versus rare ones better left alone. Cleaning effort should follow the diagnosis, not a generic checklist.

What is a safe order for cleaning steps?

Order matters because each step changes what the next one sees. A reliable sequence:

1. Encoding: force UTF-8, strip BOM and control characters.
2. Layout: rejoin hyphenated line breaks, drop running headers and bare page numbers.
3. Whitespace: collapse multiple spaces and blank lines.
4. OCR substitutions: apply a reviewed, conservative correction list.
5. (Later, separate stage) Case-folding, normalisation, tokenisation.

Steps 1–4 are mechanical and can be regenerated from the raw files at any time; step 5 is interpretive and belongs in the analysis pipeline, not the stored corpus.
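The encoding and whitespace steps above can be sketched as two small functions. This is a minimal illustration, not a definitive implementation: the function names and the `cp1252` fallback are assumptions, and the right fallback depends on where your files came from.

```python
import re

# Control characters other than tab and newline (carriage returns are normalised first).
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes to text, dropping a leading BOM and control characters."""
    try:
        text = raw.decode("utf-8-sig")  # utf-8-sig removes a leading BOM if present
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 files are in a legacy Windows encoding.
        text = raw.decode(fallback, errors="replace")
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return _CONTROL.sub("", text)

def collapse_whitespace(text: str) -> str:
    """Strip trailing spaces, squeeze runs of spaces, and collapse blank-line runs."""
    text = re.sub(r"[ \t]+$", "", text, flags=re.M)
    text = re.sub(r"[ \t]{2,}", " ", text)
    return re.sub(r"\n{3,}", "\n\n", text)
```

Reading the raw file in binary mode (`open(path, "rb")`) keeps the decoding decision inside `clean_encoding`, where it is visible and versioned.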

How do I fix layout artefacts in code?

The layout layer is where the biggest, safest gains live:

python

import re

def fix_layout(text: str) -> str:
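    # Rejoin words split across lines, tolerating stray spaces around the break.
    # Caveat: a true compound broken at its hyphen ("well-" / "known") is merged too.
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    text = re.sub(r"^[ \t]*\d+[ \t]*$\n?", "", text, flags=re.M)  # drop bare page-number lines
    text = re.sub(r"\n{3,}", "\n\n", text)                        # collapse blank runs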
    return text

Running headers (CHAPTER IV — 47) repeat on every page; detect them by finding short lines that recur with high frequency across the document, then remove only those.
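One way to sketch that detection is shown here. Because the page number in a header like the one above changes from page to page, the sketch masks digits before counting. The function names and the `max_len` and `min_count` thresholds are illustrative assumptions; tune them per corpus and review the detected set before deleting anything.

```python
import re
from collections import Counter

def _mask(line: str) -> str:
    """Normalise a line so headers that differ only by page number compare equal."""
    return re.sub(r"\d+", "#", line.strip())

def find_running_headers(text: str, max_len: int = 40, min_count: int = 5) -> set:
    """Return masked short lines that recur at least min_count times."""
    counts = Counter()
    for line in text.splitlines():
        key = _mask(line)
        if key and len(key) <= max_len:
            counts[key] += 1
    return {key for key, n in counts.items() if n >= min_count}

def drop_running_headers(text: str, headers: set) -> str:
    """Remove only the lines whose masked form is in the reviewed header set."""
    kept = [ln for ln in text.splitlines() if _mask(ln) not in headers]
    return "\n".join(kept)
```

Short lines that legitimately repeat, such as a refrain in verse or a speaker label in a play, will also be flagged, which is why the header set should be inspected by hand before `drop_running_headers` is run.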

How should I handle OCR errors without overcorrecting?

Auto-correcting every token will invent words and hide real archaic spellings. Instead, build a small reviewed substitution map for the errors your diagnosis flagged as frequent and unambiguous:

OCR reads	True text	Safe to auto-fix?
`rn`	`m`	Only in known words
`1` (one)	`l` (ell)	Context-dependent
`tbe`	`the`	Yes
`arid`	`and`	Usually
`ſ` (long-s)	`s`	Yes, unless the analysis studies orthography

python

SUBS = {"tbe": "the", "arid": "and", "Tbe": "The"}
def fix_ocr(tokens):
    return [SUBS.get(t, t) for t in tokens]
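The "only in known words" row of the table needs a vocabulary check rather than a flat lookup. The following is a minimal sketch under assumptions: `fix_rn` is a hypothetical helper, and `vocab` stands for a lowercase, period-appropriate word list that you supply and review.

```python
import re

def fix_rn(token: str, vocab: set) -> str:
    """Replace one 'rn' with 'm' only if the token is unknown and the result is a known word."""
    if token.lower() in vocab:
        return token  # already a real word (e.g. "corner"), so leave it alone
    for match in re.finditer("rn", token):
        candidate = token[:match.start()] + "m" + token[match.end():]
        if candidate.lower() in vocab:
            return candidate
    return token  # no confident correction, so keep the original reading
```

Tokens that are unknown both before and after the substitution are left untouched, which keeps the rule from inventing words. Those leftovers are candidates for manual review.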

For pages where OCR confidence is low across the board, the honest fix is re-OCR or HTR, not a longer substitution table.

How do I keep the cleaning auditable?

Reproducibility is what separates cleaning from tampering. Three rules:

Raw text is read-only; pipelines write to a new derived/ folder.
Every rule lives in version-controlled code with a one-line comment of intent.
Keep a diff sample: 10 before/after lines per rule, committed alongside the code.

bash

diff <(head -200 raw/doc_0007.txt) <(head -200 derived/doc_0007.txt) | head

If a reviewer asks "what did you change?", the answer should be a script and a diff, not your memory.
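The per-rule diff sample from the third rule can be generated rather than assembled by hand. This sketch uses the standard-library `difflib`; the function name `diff_sample` and the fixed seed are assumptions chosen so the sample is repeatable.

```python
import difflib
import random

def diff_sample(before: str, after: str, n: int = 10, seed: int = 0) -> list:
    """Return up to n (old, new) line-group pairs that differ between two versions."""
    old_lines, new_lines = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed.append(("\n".join(old_lines[i1:i2]), "\n".join(new_lines[j1:j2])))
    random.Random(seed).shuffle(changed)  # seeded so the committed sample is reproducible
    return changed[:n]
```

Calling it with the text before and after a single rule gives pairs you can write to a small file and commit next to that rule. An empty string on the right-hand side means the rule deleted the line.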

When is the corpus clean enough to stop?

Clean for the analysis, not for an abstract ideal. Re-run your headline measurement after each pass:

python

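import re
# relative frequency of a target term in `text`, before vs after a cleaning pass
tokens = re.findall(r"\w+", text.lower())  # count whole tokens so numerator and denominator use the same units
print(round(tokens.count("liberty") / len(tokens), 5))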

When an extra cleaning step no longer moves that number, you have reached diminishing returns. Over-cleaning past that point only risks deleting real variation.
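The stopping rule can be made explicit with a tolerance. In this sketch, `rel_freq`, `has_converged` and the 1% default tolerance are assumptions for illustration; what counts as "no longer moves" depends on the effect size your analysis cares about.

```python
import re

def rel_freq(text: str, term: str) -> float:
    """Relative frequency of term among word tokens (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    return tokens.count(term.lower()) / len(tokens) if tokens else 0.0

def has_converged(before: str, after: str, term: str, tol: float = 0.01) -> bool:
    """True when a cleaning pass changes the term's relative frequency by less than tol (relative)."""
    old, new = rel_freq(before, term), rel_freq(after, term)
    if old == 0:
        return new == 0
    return abs(new - old) / old < tol
```

Checking several headline terms rather than one reduces the chance of stopping early because a single word happened to be unaffected by the last pass.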

Key Takeaways

Diagnose the noise on a sample before writing any cleaning rule.
Clean in stages: encoding, layout, whitespace, then reviewed OCR fixes.
Keep mechanical cleaning separate from interpretive normalisation.
Never edit master files in place; write derivatives from versioned code.
Use a small reviewed substitution map, not blanket auto-correction.
Stop cleaning once it stops changing your key measurement.

Frequently Asked Questions

What counts as noise in a historical corpus?

Noise is anything that distorts counts without carrying meaning: OCR misreads like 'tbe', hyphenated line breaks, running headers, page numbers, control characters and spelling inconsistencies introduced by OCR or transcription. The goal is to remove distortion while preserving genuine historical variation.

Should I fix historical spelling during cleaning?

Keep cleaning and spelling normalisation as separate stages. Cleaning removes mechanical noise such as OCR errors and layout artefacts; normalising 'olde' to 'old' is an interpretive choice you should make consciously, log, and be able to switch off.

How do I clean without destroying evidence?

Never edit your master files in place. Keep raw text read-only, write every cleaning step as code that produces a new derivative, and store before-and-after samples so you can audit exactly what each rule changed.

What is the single most common cleaning mistake?

Lowercasing and stripping punctuation too early, before sentence segmentation or named-entity work. Those steps destroy signal you cannot recover, so push them as late in the pipeline as the analysis allows.

How do I deal with OCR errors specifically?

Measure them first by sampling, then target the frequent, unambiguous substitutions (such as 'tbe' for 'the') with a reviewed dictionary, and flag low-confidence pages for re-OCR rather than guessing. Do not attempt to auto-correct every token.

How do I know when the corpus is clean enough?

Stop when remaining noise no longer changes your results: re-run your key measurement after each cleaning pass and watch for the point where extra cleaning stops moving the numbers. Clean for the analysis, not for perfection.
