Troubleshooting: Using regex on historical corpora

When regex behaves unexpectedly on historical corpora, the cause is almost always one of three things: an encoding mismatch so your bytes are not the characters you think, an ASCII-only character class that silently ignores accents and long-s, or catastrophic backtracking that hangs on long pages. Diagnose by isolating one failing line, printing its repr, and testing the pattern on that single string before blaming the whole pipeline.
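The isolate-and-inspect step can be sketched as below. This is a minimal illustration, not a fixed recipe: the `diagnose` helper and the sample line are hypothetical. Note that in Python 3 `repr()` leaves printable non-ASCII characters as they are, so the sketch uses `ascii()` to force escapes and make look-alike characters visible.

```python
import re
import unicodedata

def diagnose(line, pattern):
    """Report what one line really contains and whether the pattern matches it."""
    # List every non-ASCII character with its code point and Unicode name.
    non_ascii = [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in line
        if ord(ch) > 127
    ]
    return {
        "escaped": ascii(line),  # like repr(), but non-ASCII is escaped
        "non_ascii": non_ascii,
        "matched": re.search(pattern, line) is not None,
    }

# Hypothetical failing line: the literal pattern "same" never fires
# because the text contains a long-s, not an ASCII "s".
report = diagnose("the ſame café", r"same")
print(report["escaped"])
for entry in report["non_ascii"]:
    print(entry)
print("matched:", report["matched"])
```

Once the report shows an unexpected code point, the fix belongs in the pattern or in the encoding step, not in the rest of the pipeline.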

Why does my regex match nothing on a file that clearly contains the word?

This is usually an encoding problem. The file was saved as UTF-8 but opened as Latin-1, so café arrives as cafÃ© and your literal pattern never fires. (The reverse mistake, a Latin-1 file opened as UTF-8, typically raises UnicodeDecodeError instead of matching silently.) Print the raw bytes to confirm:

```python
with open("doc.txt", "rb") as f:
    # UTF-8 é prints as \xc3\xa9, Latin-1 é as \xe9; a leading \xef\xbb\xbf is a BOM
    print(f.read(80))
```

Re-open with the correct codec (encoding="utf-8-sig" strips a byte-order mark) and the matches return. Fix encoding first; every other regex fix depends on the characters being right.
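To see the effect of each codec side by side, here is a small sketch that writes a hypothetical UTF-8 file with a byte-order mark to a temporary directory and reads it back three ways. The file name and contents are invented for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "doc.txt"
    # BOM (EF BB BF) followed by "café au lait" encoded as UTF-8.
    path.write_bytes(b"\xef\xbb\xbfcaf\xc3\xa9 au lait")

    wrong = path.read_text(encoding="latin-1")    # mojibake
    plain = path.read_text(encoding="utf-8")      # correct letters, BOM kept
    right = path.read_text(encoding="utf-8-sig")  # correct letters, BOM stripped

print(ascii(wrong))
print(ascii(plain))
print(ascii(right))
print("café" in wrong, "café" in right)
```

The plain "utf-8" read decodes the letters correctly but leaves U+FEFF at the start of the string, which is enough to defeat a pattern anchored with ^ on the first line.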

Why does my character class miss long-s, accents and ligatures?

[a-z] is ASCII only. Historical text is full of ſ (long-s), æ, œ and accented vowels that fall outside that range. Use Unicode-aware matching with the third-party regex module, which supports \p{L} for "any letter":

```python
import regex
words = regex.findall(r"\p{L}+", text)   # catches ſæœ, é, ñ, etc.
```

If you must stay in the standard library, re already treats \w as Unicode for str input, but a hand-written [a-z] does not. Prefer \w or [^\W\d_] over manual ranges: \w also accepts digits and the underscore, while [^\W\d_] matches letters only.
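The difference between the three classes is easiest to see on one string. The sample text below is made up for illustration; only the standard-library re module is used.

```python
import re

# Hypothetical sample line with long-s, an accent, a ligature, digits and an underscore.
text = "Bleſſed café, anno 1642, œuvre_2"

ascii_only = re.findall(r"[a-z]+", text)        # breaks words at ſ, é, œ and capitals
word_chars = re.findall(r"\w+", text)           # Unicode letters, digits, underscore
letters_only = re.findall(r"[^\W\d_]+", text)   # Unicode letters only

print(ascii_only)    # ['le', 'ed', 'caf', 'anno', 'uvre']
print(word_chars)    # ['Bleſſed', 'café', 'anno', '1642', 'œuvre_2']
print(letters_only)  # ['Bleſſed', 'café', 'anno', 'œuvre']
```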

How do I handle words split by an OCR line-break hyphen?

Digitised print frequently breaks words across lines with a hyphen: beauti- at the end of one line, then ful at the start of the next. A word-level pattern sees two fragments and misses the whole word. Two fixes, in order of preference:

```python
import regex

text = "a beauti-\nful page"   # hypothetical sample

# 1. Pre-join words hyphenated across a line break before any other matching
text = regex.sub(r"(\p{L})-\s*\n\s*(\p{L})", r"\1\2", text)

# 2. Or match across the break inline
regex.findall(r"beauti-?\s*ful", text)
```

Pre-joining is cleaner because it fixes the corpus once, so downstream tokenisation and n-gram counts also benefit.
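A rough sketch of that downstream benefit, using only the standard library: `[^\W\d_]` stands in for `\p{L}`, and the `dehyphenate` and `word_counts` helpers and the sample page are hypothetical.

```python
import re
from collections import Counter

# Letter, hyphen, optional whitespace around a newline, letter.
LINE_BREAK_HYPHEN = re.compile(r"([^\W\d_])-\s*\n\s*([^\W\d_])")

def dehyphenate(text):
    """Join 'beauti-' + newline + 'ful' into 'beautiful'."""
    return LINE_BREAK_HYPHEN.sub(r"\1\2", text)

def word_counts(text):
    """Case-folded counts of letter-only tokens."""
    return Counter(w.lower() for w in re.findall(r"[^\W\d_]+", text))

page = "A beauti-\nful morning, how beautiful the well-\nknown vale."
before = word_counts(page)
after = word_counts(dehyphenate(page))

print(before["beautiful"], after["beautiful"])
print("wellknown" in after)
```

The second print exposes a limitation of this simple rule: a genuine compound such as well-known that happens to break at a line end loses its hyphen too. Checking joined forms against a word list is one possible guard, at the cost of extra bookkeeping.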

Why is my regex hanging on long documents?

A pattern that runs instantly on a sentence can hang for minutes on a full page. The culprit is catastrophic backtracking from patterns like (\w+\s*)+, where the engine tries exponentially many ways to split the same text. The table summarises the usual offenders and their fixes.

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Hangs on long input | nested quantifiers `(a+)+` | possessive `(a+)++` or atomic group `(?>a+)` |
| Slow with many alternations | overlapping branches | order branches, anchor with `^`/`\b` |
| Matches too much | greedy `.*` | lazy `.*?` or a specific class |
| Spans unwanted lines | DOTALL left on globally | scope `(?s:...)` to one group |

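Always benchmark a new pattern on your single longest file, not a toy string. In Python, possessive quantifiers and atomic groups require the third-party regex module or re from Python 3.11 onward; on older versions, rewrite the pattern instead.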
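The growth is easy to observe on deliberately small inputs. The sketch below times the (\w+\s*)+ pattern from above against a rewrite that is intended to accept the same strings but leaves only one way to split them. The rewrite, the `time_match` helper and the input sizes are illustrative assumptions; timings vary by machine, and sizes are kept small so the risky pattern still finishes.

```python
import re
import time

RISKY = re.compile(r"(\w+\s*)+$")           # nested quantifiers
SAFER = re.compile(r"\w+(?:\s+\w+)*\s*$")   # words separated by mandatory whitespace

def time_match(pattern, text):
    """Return (matched, seconds) for a single anchored match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

for n in (10, 14, 18):
    failing_input = "a" * n + "!"   # the trailing "!" forces the match to fail
    _, risky_seconds = time_match(RISKY, failing_input)
    _, safer_seconds = time_match(SAFER, failing_input)
    print(f"n={n:2d}  risky={risky_seconds:.5f}s  safer={safer_seconds:.5f}s")
```

Each extra character roughly doubles the work for the risky pattern on a failing input, which is why a page-length string can appear to hang while a sentence returns instantly.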

Why are my matches greedy when I wanted the shortest span?

.* is greedy and grabs as much as possible, so <.*> swallows everything between the first < and the last >. Make it lazy with .*?, or better, exclude the delimiter: <[^>]*>. Excluding the delimiter is faster and clearer than relying on laziness.
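A quick comparison of the three variants on a made-up snippet, using the standard-library re module:

```python
import re

snippet = "<b>bold</b> and <i>it</i>"

greedy = re.findall(r"<.*>", snippet)      # first "<" to last ">"
lazy = re.findall(r"<.*?>", snippet)       # shortest span, found by trial expansion
negated = re.findall(r"<[^>]*>", snippet)  # cannot run past a ">" at all

print(greedy)   # ['<b>bold</b> and <i>it</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```

The lazy and negated forms agree here; they diverge when the text between delimiters contains a newline, because the dot stops at a newline by default while [^>] does not.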

How do I keep regex fixes reproducible across a collection?

Store patterns in a versioned file with a comment explaining each one, rather than retyping them in notebooks. Run the same ordered list of substitutions over every document, log how many replacements each made, and spot-check the largest counts. A substitution that suddenly fires ten times more often on one volume is a signal that the volume differs, not that the regex is wrong.
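One way to sketch this workflow is an ordered rule list applied with re.subn, which returns the replacement count alongside the new text. The rule names, the rules themselves and the `clean` helper are hypothetical; in practice the list would be loaded from the versioned file.

```python
import re

# Hypothetical ordered rules: (name, pattern, replacement).
# Order matters: hyphen joining runs before whitespace collapsing.
RULES = [
    ("join_line_break_hyphens", r"([^\W\d_])-\s*\n\s*([^\W\d_])", r"\1\2"),
    ("normalise_long_s", "ſ", "s"),
    ("collapse_spaces", r"[ \t]{2,}", " "),
]

def clean(text):
    """Apply every rule in order and return (cleaned_text, counts_per_rule)."""
    counts = {}
    for name, pattern, replacement in RULES:
        text, n = re.subn(pattern, replacement, text)
        counts[name] = n
    return text, counts

cleaned, counts = clean("Bleſſed  be the beauti-\nful  vale")
print(cleaned)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Writing the per-document counts to a log makes the "ten times more often" outlier easy to spot with a simple sort.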

Key Takeaways

Confirm encoding first; print raw bytes when matches mysteriously fail.
Replace ASCII [a-z] ranges with Unicode-aware \p{L} or \w to catch long-s and accents.
Pre-join words hyphenated across line breaks so the whole pipeline benefits.
Diagnose hangs as catastrophic backtracking and fix with atomic groups, possessive quantifiers, or a more specific pattern.
Prefer excluding delimiters ([^>]*) over relying on lazy quantifiers.
Version your patterns and log replacement counts to keep cleaning reproducible.

Frequently Asked Questions

Why does my regex miss accented or long-s characters?

Because [a-z] only matches ASCII. Switch to Unicode-aware classes like \p{L} (Python regex module) or [^\W\d_] and make sure the file is read as UTF-8, not Latin-1.

Why is my regex catastrophically slow on long documents?

Nested or overlapping quantifiers like (\w+)+ cause catastrophic backtracking. Replace them with possessive quantifiers, atomic groups, or a more specific pattern, and test on your longest file first.

How do I match a word across an OCR line-break hyphen?

Match the hyphen plus optional whitespace and newline, for example beauti-\s*ful, or pre-process by joining lines that end in a hyphen before running word-level patterns.

Why does '.' not match my newlines?

By default the dot excludes newline characters. Enable DOTALL/single-line mode (the re.DOTALL flag or (?s) inline) when you need to span lines, but prefer explicit character classes for clarity.
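The options can be compared on a single two-line string. The `<note>` markup below is an invented example, not a format the corpus is assumed to use.

```python
import re

text = "<note>first line\nsecond line</note> tail"

default = re.search(r"<note>(.*)</note>", text)             # dot stops at "\n"
dotall = re.search(r"<note>(.*)</note>", text, re.DOTALL)   # dot spans lines everywhere
scoped = re.search(r"<note>(?s:(.*?))</note>.*", text)      # DOTALL inside one group only
explicit = re.search(r"<note>([^<]*)</note>", text)         # class matches "\n" without a flag

print(default)
print(repr(dotall.group(1)))
print(repr(scoped.group(1)))
print(repr(explicit.group(1)))
```

In the scoped variant the trailing .* stays in default mode, so it would still stop at a newline after the closing tag.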

Should I lowercase text before or after matching?

It depends on intent. Lowercase first when case is irrelevant to the match; keep original case and use a case-insensitive flag when you still need the surface form in your output.
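A small illustration of the trade-off, with a made-up sample string:

```python
import re

text = "LONDON, London and london"

# Lowercase first: simple, but the original spelling is gone from the results.
lowered_hits = re.findall(r"london", text.lower())

# Case-insensitive flag: same number of hits, surface forms preserved.
surface_hits = re.findall(r"london", text, re.IGNORECASE)

print(lowered_hits)  # ['london', 'london', 'london']
print(surface_hits)  # ['LONDON', 'London', 'london']
```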

Why do my capture groups return None on some matches?

Optional groups that did not participate in a match return None, not an empty string. Guard against it in code, or restructure the pattern so the group is always present.
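Both approaches can be sketched with a hypothetical year-and-optional-month pattern:

```python
import re

# Optional group: group(2) is None when the month is absent.
OPTIONAL = re.compile(r"(\d{4})(?:-(\d{2}))?")
# Restructured: group(2) always participates, matching "" when the month is absent.
ALWAYS = re.compile(r"(\d{4})-?(\d{2}|)")

m = OPTIONAL.match("1642")
month = m.group(2) or "unknown"   # guard against None in code

print(m.group(2))                                # None
print(month)                                     # unknown
print(repr(ALWAYS.match("1642").group(2)))       # ''
print(repr(ALWAYS.match("1642-08").group(2)))    # '08'
```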
