Text Mining & Corpora

To clean noisy corpus text, work in stages on copies: fix encoding, repair layout artefacts (hyphenated breaks, headers, page numbers), then handle OCR errors with measured, reviewed rules — and stop before you erase genuine historical variation. The discipline that matters most is keeping raw text read-only and writing every change as repeatable code, so you can always say exactly what you altered and why. Treat cleaning as a documented pipeline, never as ad-hoc edits.

Noise is anything that distorts counts without carrying meaning. The hard part of cleaning historical text is that one person's noise — an archaic spelling, a long-s — is another person's evidence. So the question is never "how clean can I make this?" but "what must I fix for this analysis without losing what matters?"

How do I diagnose the noise before fixing anything?

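Measure before you edit. Draw a random sample and tally what actually goes wrong. The sketch below samples up to 40 lines from a single document; for a corpus-level picture, repeat it over about 20 randomly chosen pages: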

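python
import random, re
from collections import Counter
from pathlib import Path

lines = Path("raw/doc_0007.txt").read_text(encoding="utf-8").split("\n")
# call random.seed(<int>) first if the sample must be repeatable
sample = random.sample(lines, min(40, len(lines)))
junk = Counter()
for ln in sample:
    # a token mixing letters and digits (e.g. "c1ear"), not a plain number like 1847
    if re.search(r"\b(?=\w*\d)(?=\w*[^\W\d_])\w+\b", ln): junk["digit-in-word"] += 1
    if re.search(r"-\s*$", ln):      junk["hyphen-break"] += 1   # line ends in a hyphen
    if re.match(r"^\s*\d+\s*$", ln): junk["page-number"] += 1    # line holds only digits
print(junk)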

This tells you which problems are frequent and worth automating versus rare ones better left alone. Cleaning effort should follow the diagnosis, not a generic checklist.

What is a safe order for cleaning steps?

Order matters because each step changes what the next one sees. A reliable sequence:

  1. Encoding: decode everything to UTF-8, strip the BOM and control characters (keeping newlines and tabs).
  2. Layout: rejoin hyphenated line breaks, drop running headers and bare page numbers.
  3. Whitespace: collapse multiple spaces and blank lines.
  4. OCR substitutions: apply a reviewed, conservative correction list.
  5. Later, as a separate stage: case-folding, normalisation, tokenisation.

Steps 1–4 are mechanical and can be re-run from the read-only raw text at any time; step 5 is interpretive and belongs in the analysis pipeline, not the stored corpus.
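As a rough illustration of steps 1 and 3, here is a minimal sketch. The function names (fix_encoding, fix_whitespace, clean) are hypothetical, and it assumes the raw bytes are meant to be UTF-8; a corpus in a legacy encoding would need a different decode step.

```python
import re
import unicodedata

def fix_encoding(raw: bytes) -> str:
    """Step 1: decode to text, drop a leading BOM and stray control characters."""
    # "utf-8-sig" decodes UTF-8 and removes a leading BOM if one is present.
    # errors="replace" turns undecodable bytes into U+FFFD so they stay visible.
    text = raw.decode("utf-8-sig", errors="replace")
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    # Keep newline and tab; drop every other character in Unicode category Cc.
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )

def fix_whitespace(text: str) -> str:
    """Step 3: collapse runs of spaces/tabs and runs of blank lines."""
    text = re.sub(r"[ \t]+", " ", text)       # many spaces or tabs become one space
    text = re.sub(r" ?\n ?", "\n", text)      # trim spaces around line breaks
    return re.sub(r"\n{3,}", "\n\n", text)    # at most one blank line in a row

def clean(raw: bytes) -> str:
    """Run the mechanical steps in order; layout and OCR steps would sit in between."""
    text = fix_encoding(raw)
    return fix_whitespace(text)
```

Keeping each step as its own function makes the order explicit and lets you test or switch off one stage without touching the others.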

How do I fix layout artefacts in code?

The layout layer is where the biggest, safest gains live:

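python
import re

def fix_layout(text: str) -> str:
    # Rejoin a word split across lines. Note this also drops the hyphen of a
    # genuine compound (e.g. "well-known") that happened to break at line end.
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    text = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.M)  # bare page numbers
    text = re.sub(r"\n{3,}", "\n\n", text)                     # collapse blank runs
    return text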

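Running headers (for example, CHAPTER IV followed by the page number 47) repeat on every page, though the page number changes each time. Detect them by masking digits, then finding short lines that recur with high frequency across the document, and remove only those.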
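One way to sketch that detection: replace every run of digits with a placeholder so that headers differing only in page number compare equal, then count short lines. The function names and the thresholds (max_len, min_repeats) are assumptions to tune against your own corpus, and the candidate set should be reviewed by eye before anything is deleted.

```python
import re
from collections import Counter

def _mask(line: str) -> str:
    """Strip the line and replace each run of digits with '#'."""
    return re.sub(r"\d+", "#", line.strip())

def find_running_headers(text, max_len=40, min_repeats=5):
    """Return digit-masked short lines that recur at least min_repeats times."""
    counts = Counter(
        _mask(ln) for ln in text.split("\n")
        if 0 < len(ln.strip()) <= max_len   # ignore blank and long lines
    )
    return {key for key, n in counts.items() if n >= min_repeats}

def drop_running_headers(text, headers):
    """Remove only the lines whose masked form is in the reviewed header set."""
    return "\n".join(
        ln for ln in text.split("\n") if _mask(ln) not in headers
    )
```

Short lines of real text that repeat often (a refrain, a speaker label) will also show up as candidates, which is why the two functions are kept separate: inspect the output of the first before passing it to the second.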

How should I handle OCR errors without overcorrecting?

Auto-correcting every token will invent words and hide real archaic spellings. Instead, build a small reviewed substitution map for the errors your diagnosis flagged as frequent and unambiguous:

OCR reads      True text    Safe to auto-fix?
rn             m            Only in known words
1 (one)        l (ell)      Context-dependent
tbe            the          Yes
arid           and          Usually
ſ (long-s)     s            Yes, unless the long-s is itself evidence for your analysis

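python
# Only unambiguous non-words belong here. "arid" is a real word, so the
# arid -> and fix needs context review and is left out of the blanket map.
SUBS = {"tbe": "the", "Tbe": "The"}

def fix_ocr(tokens):
    """Replace tokens that exactly match a reviewed OCR error; leave the rest."""
    return [SUBS.get(t, t) for t in tokens]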

For pages where OCR confidence is low across the board, the honest fix is re-OCR or HTR, not a longer substitution table.
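For a substitution that is only safe in known words, such as rn read for m, one option is to guard the fix with a word list. The sketch below is illustrative: fix_rn_m is a hypothetical helper, and the lexicon is assumed to be a period-appropriate set of lowercase words that you supply.

```python
def fix_rn_m(token: str, lexicon: set) -> str:
    """Replace 'rn' with 'm' only when that turns a non-word into a known word."""
    if "rn" not in token or token.lower() in lexicon:
        return token  # nothing to fix, or already a real word (e.g. "burn")
    candidate = token.replace("rn", "m")
    # Accept the correction only if the result is attested in the lexicon.
    return candidate if candidate.lower() in lexicon else token

# Tiny made-up lexicon, for illustration only.
lexicon = {"time", "come", "burn", "modern"}
fixed = [fix_rn_m(t, lexicon) for t in ["tirne", "burn", "Corne", "xrnz"]]
```

Tokens that are neither words before nor after the substitution are left untouched, so they remain visible for manual review instead of being silently "corrected".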

How do I keep the cleaning auditable?

Reproducibility is what separates cleaning from tampering. Three rules:

  • Raw text is read-only; pipelines write to a new derived/ folder.
  • Every rule lives in version-controlled code with a one-line comment of intent.
  • Keep a diff sample: 10 before/after lines per rule, committed alongside the code.
bash
diff <(head -200 raw/doc_0007.txt) <(head -200 derived/doc_0007.txt) | head

If a reviewer asks "what did you change?", the answer should be a script and a diff, not your memory.
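The per-rule diff sample can also be produced from Python with the standard-library difflib module, which is convenient when the rule is applied in memory. This is a minimal sketch; diff_sample and its limit parameter are hypothetical names, not part of any existing pipeline.

```python
import difflib

def diff_sample(before: str, after: str, limit: int = 10) -> list:
    """Return up to `limit` changed lines, prefixed '-' (old) or '+' (new)."""
    lines = list(difflib.unified_diff(
        before.split("\n"), after.split("\n"),
        lineterm="", n=0,            # n=0: no unchanged context lines
    ))
    # The first two entries are the '---' / '+++' file headers; '@@' lines
    # are hunk markers. Keep only the actual removed and added lines.
    changed = [ln for ln in lines[2:] if ln.startswith(("+", "-"))]
    return changed[:limit]
```

Writing the returned lines to a small text file next to the rule's code gives a reviewer concrete before/after evidence for each rule.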

When is the corpus clean enough to stop?

Clean for the analysis, not for an abstract ideal. Re-run your headline measurement after each pass:

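python
import re
# relative frequency of a target term (whole tokens), before vs after a cleaning pass
tokens = re.findall(r"\w+", text.lower())  # text holds the corpus as one string
print(round(tokens.count("liberty") / max(len(tokens), 1), 5))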

When an extra cleaning step no longer moves that number, you have reached diminishing returns. Over-cleaning past that point only risks deleting real variation.
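The stopping rule can be made explicit by recording the measurement after each pass and checking the relative change. The sketch below is one possible formulation; first_stable_pass and the 1% tolerance are assumptions, and the right tolerance depends on how sensitive your analysis is.

```python
from typing import Optional

def first_stable_pass(values: list, tol: float = 0.01) -> Optional[int]:
    """Index of the first pass whose relative change from the previous is below tol."""
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        # Relative change; fall back to the absolute value if prev is zero.
        change = abs(cur - prev) / abs(prev) if prev else abs(cur)
        if change < tol:
            return i
    return None  # the measurement is still moving after every recorded pass

# Made-up measurements for illustration, not real results:
# raw text, after the layout pass, after the OCR pass.
history = [0.00120, 0.00150, 0.00151]
stable_at = first_stable_pass(history)  # -> 2
```

Here the layout pass moved the number by 25% and the OCR pass by under 1%, so a further pass would need a specific justification.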

Key Takeaways

  • Diagnose the noise on a sample before writing any cleaning rule.
  • Clean in stages: encoding, layout, whitespace, then reviewed OCR fixes.
  • Keep mechanical cleaning separate from interpretive normalisation.
  • Never edit master files in place; write derivatives from versioned code.
  • Use a small reviewed substitution map, not blanket auto-correction.
  • Stop cleaning once it stops changing your key measurement.

Frequently Asked Questions

What counts as noise in a historical corpus?

Noise is anything that distorts counts without carrying meaning: OCR misreads like 'tbe', hyphenated line breaks, running headers, page numbers and control characters. Inconsistent spelling is a borderline case, because it distorts counts but may also be genuine historical variation. The goal is to remove distortion while preserving that variation.

Should I fix historical spelling during cleaning?

Keep cleaning and spelling normalisation as separate stages. Cleaning removes mechanical noise such as OCR errors and layout artefacts; normalising 'olde' to 'old' is an interpretive choice you should make consciously, log, and be able to switch off.
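A minimal sketch of a normalisation step that is logged and can be switched off. The NORMALISE map, the normalise function and its parameters are hypothetical, and the entries are illustrative only; matching is exact and case-sensitive to keep the example short.

```python
# Interpretive spelling map, kept apart from the OCR substitution list.
NORMALISE = {"olde": "old", "shoppe": "shop"}

def normalise(tokens, enabled=True, log=None):
    """Apply spelling normalisation if enabled; record each change in log."""
    out = []
    for i, tok in enumerate(tokens):
        new = NORMALISE.get(tok, tok) if enabled else tok
        if new != tok and log is not None:
            log.append((i, tok, new))   # position, original, normalised
        out.append(new)
    return out

changes = []
normalised = normalise(["ye", "olde", "shoppe"], log=changes)
untouched = normalise(["ye", "olde", "shoppe"], enabled=False)
```

Running the analysis once with enabled=True and once with enabled=False shows how much of a result depends on the normalisation choice.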

How do I clean without destroying evidence?

Never edit your master files in place. Keep raw text read-only, write every cleaning step as code that produces a new derivative, and store before-and-after samples so you can audit exactly what each rule changed.

What is the single most common cleaning mistake?

Lowercasing and stripping punctuation too early, before sentence segmentation or named-entity work. Those steps destroy signal you cannot recover, so push them as late in the pipeline as the analysis allows.

How do I deal with OCR errors specifically?

Measure them first by sampling, then target the frequent, safe substitutions (such as 'tbe' for 'the') with a reviewed dictionary. Restrict ambiguous ones, such as 'rn' for 'm', to known words, and flag low-confidence pages for re-OCR rather than guessing. Do not attempt to auto-correct every token.

How do I know when the corpus is clean enough?

Stop when remaining noise no longer changes your results: re-run your key measurement after each cleaning pass and watch for the point where extra cleaning stops moving the numbers. Clean for the analysis, not for perfection.