When regex behaves unexpectedly on historical corpora, the cause is almost always one of three things: an encoding mismatch so your bytes are not the characters you think, an ASCII-only character class that silently ignores accents and long-s, or catastrophic backtracking that hangs on long pages. Diagnose by isolating one failing line, printing its repr, and testing the pattern on that single string before blaming the whole pipeline.
Why does my regex match nothing on a file that clearly contains the word?
Nine times out of ten this is an encoding problem. The file was saved as UTF-8 but opened as Latin-1 (or vice versa), so café arrives as cafÃ© and your literal pattern never fires. Print the raw bytes to confirm:
```python
with open("doc.txt", "rb") as f:
    # In raw bytes, UTF-8 accents appear as pairs such as \xc3\xa9 (these decode
    # to stray Ã/Â under Latin-1); a leading \xef\xbb\xbf is a byte-order mark.
    print(f.read(80))
```

Re-open with the correct codec (encoding="utf-8-sig" strips a byte-order mark) and the matches return. Fix encoding first; every other regex fix depends on the characters being right.
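A small round trip makes the failure mode concrete. The sketch below works on an in-memory byte string instead of a real file; the sample phrase and the variable names are made up for illustration.

```python
import re

word = "caf\u00e9"  # "café", written with an escape so the source encoding cannot interfere

# Hypothetical sample: the phrase saved as UTF-8 with a leading byte-order mark.
raw = ("\ufeff" + word + " au lait").encode("utf-8")

wrong = raw.decode("latin-1")    # wrong codec: each byte becomes its own character
right = raw.decode("utf-8-sig")  # right codec, and the BOM is stripped

print(repr(wrong))  # mojibake, including a stray Ã
print(repr(right))

print(re.findall(word, wrong))  # the literal pattern finds nothing
print(re.findall(word, right))  # the literal pattern matches
```

If the mis-decoded string shows Ã followed by another symbol wherever an accent should be, the bytes are most likely UTF-8 and only the codec passed to open() needs to change.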
Why does my character class miss long-s, accents and ligatures?
[a-z] is ASCII only. Historical text is full of ſ (long-s), æ, œ and accented vowels that fall outside that range. Use Unicode-aware matching with the third-party regex module, which supports \p{L} for "any letter":
```python
import regex

words = regex.findall(r"\p{L}+", text)  # catches ſ, æ, œ, é, ñ, etc.
```

If you must stay in the standard library, re already treats \w as Unicode for str input, but a hand-written [a-z] does not. Prefer \w or [^\W\d_] over manual ranges.
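For the standard-library route, a side-by-side comparison shows what the manual range loses. This is a minimal sketch; the sample line is invented, and it assumes the text is already a correctly decoded str.

```python
import re

# Hypothetical sample line with a long-s, a ligature, an accent and a year.
text = "Wiſdom of Cæsar, née 1702"

ascii_words = re.findall(r"[a-zA-Z]+", text)     # manual ASCII range
word_chars = re.findall(r"\w+", text)            # Unicode letters, digits, underscore
letters_only = re.findall(r"[^\W\d_]+", text)    # Unicode letters only

print(ascii_words)   # words are chopped at every non-ASCII letter
print(word_chars)    # whole words, plus the year
print(letters_only)  # whole words, digits excluded
```

The double negative in [^\W\d_] reads as "a word character that is neither a digit nor an underscore", which is the closest built-in equivalent to \p{L}.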
How do I handle words split by an OCR line-break hyphen?
Digitised print frequently breaks words across lines with a hyphen: beauti- then ful on the next line. A word-level pattern misses both halves. Two fixes, in order of preference:
```python
import regex

# 1. Pre-join line-break hyphens before any other matching
text = regex.sub(r"(\p{L})-\s*\n\s*(\p{L})", r"\1\2", text)

# 2. Or match across the break inline
regex.findall(r"beauti-?\s*ful", text)
```

Pre-joining is cleaner because it fixes the corpus once, so downstream tokenisation and n-gram counts also benefit.
Why is my regex hanging on long documents?
A pattern that runs instantly on a sentence can hang for minutes on a full page. The culprit is catastrophic backtracking from patterns like (\w+\s*)+, where the engine tries exponentially many ways to split the same text. The table summarises the usual offenders and their fixes.
| Symptom | Likely cause | Fix |
|---|---|---|
| Hangs on long input | nested quantifiers (a+)+ | possessive (a+)++ or atomic group (?>a+) |
| Slow with many alternations | overlapping branches | order branches, anchor with ^/\b |
| Matches too much | greedy .* | lazy .*? or a specific class |
| Spans unwanted lines | DOTALL left on globally | scope (?s:...) to one group |
Always benchmark a new pattern on your single longest file, not a toy string.
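Besides possessive quantifiers and atomic groups (available in the regex module and, as far as I know, in the built-in re from Python 3.11 onward), a nested quantifier can often be rewritten so that each word can be split in only one way. The sketch below compares the two on a deliberately short failing input; the helper name timed_match and the probe string are illustrative, and timings will vary by machine.

```python
import re
import time

# Nested quantifier: \w+ inside (...)+ lets the engine split a run of letters
# in exponentially many ways once the overall match fails.
risky = re.compile(r"(\w+\s*)+$")

# Unambiguous rewrite: words must be separated by at least one whitespace
# character, so a run of letters can only be read as a single word.
safe = re.compile(r"\w+(?:\s+\w+)*\s*$")


def timed_match(pattern, s):
    """Return (matched, seconds) for one match attempt at the start of s."""
    start = time.perf_counter()
    found = pattern.match(s) is not None
    return found, time.perf_counter() - start


# Keep the failing probe short: the risky pattern roughly doubles its work
# for every extra letter before the "!".
probe = "a" * 18 + "!"
for name, pattern in [("risky", risky), ("safe", safe)]:
    found, seconds = timed_match(pattern, probe)
    print(f"{name}: matched={found} in {seconds:.4f}s")
```

Both patterns accept the same strings; only the amount of backtracking on a failed match differs, which is why the problem stays invisible on inputs that match.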
Why are my matches greedy when I wanted the shortest span?
.* is greedy and grabs as much as possible, so <.*> swallows everything between the first < and the last >. Make it lazy with .*?, or better, exclude the delimiter: <[^>]*>. Excluding the delimiter is faster and clearer than relying on laziness.
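The three variants are easy to compare on a single line. A minimal sketch, using an invented snippet of markup:

```python
import re

# Hypothetical sample with two pairs of tags on one line.
markup = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.*>", markup)       # first "<" to last ">"
lazy = re.findall(r"<.*?>", markup)        # shortest span from each "<"
excluded = re.findall(r"<[^>]*>", markup)  # cannot cross a ">" at all

print(greedy)
print(lazy)
print(excluded)
```

The lazy and delimiter-excluding patterns return the same tags here, but the second states the intent directly: the class simply cannot run past a closing bracket.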
How do I keep regex fixes reproducible across a collection?
Store patterns in a versioned file with a comment explaining each one, rather than retyping them in notebooks. Run the same ordered list of substitutions over every document, log how many replacements each made, and spot-check the largest counts. A substitution that suddenly fires ten times more often on one volume is a signal that the volume differs, not that the regex is wrong.
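One way to sketch that workflow is an ordered list of named rules applied with re.subn, which returns the replacement count alongside the new text. The rule names, the RULES list and the clean function are hypothetical; in practice the list would be loaded from the versioned file.

```python
import re

# Hypothetical ordered rule list: (name, pattern, replacement).
RULES = [
    ("join line-break hyphens", r"([^\W\d_])-\s*\n\s*([^\W\d_])", r"\1\2"),
    ("collapse runs of spaces", r"[ \t]{2,}", " "),
]


def clean(text, rules=RULES):
    """Apply every rule in order and record how many replacements each made."""
    counts = {}
    for name, pattern, replacement in rules:
        text, n = re.subn(pattern, replacement, text)
        counts[name] = n
    return text, counts


cleaned, counts = clean("beauti-\nful  old   book")
print(cleaned)
print(counts)
```

Writing the counts dictionary to a log per document gives the per-volume numbers to spot-check.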
Key Takeaways
- Confirm encoding first; print raw bytes when matches mysteriously fail.
- Replace ASCII [a-z] ranges with Unicode-aware \p{L} or \w to catch long-s and accents.
- Pre-join line-break hyphens so the whole pipeline benefits.
- Diagnose hangs as catastrophic backtracking and fix with atomic or possessive quantifiers.
- Prefer excluding delimiters ([^>]*) over relying on lazy quantifiers.
- Version your patterns and log replacement counts to keep cleaning reproducible.
Frequently Asked Questions
Why does my regex miss accented or long-s characters?
Because [a-z] only matches ASCII. Switch to Unicode-aware classes like \p{L} (Python regex module) or [^\W\d_] and make sure the file is read as UTF-8, not Latin-1.
Why is my regex catastrophically slow on long documents?
Nested or overlapping quantifiers like (\w+)+ cause catastrophic backtracking. Replace them with possessive quantifiers, atomic groups, or a more specific pattern, and test on your longest file first.
How do I match a word across an OCR line-break hyphen?
Match the hyphen plus optional whitespace and newline, for example beauti-\s*ful, or pre-process by joining lines that end in a hyphen before running word-level patterns.
Why does '.' not match my newlines?
By default the dot excludes newline characters. Enable DOTALL/single-line mode (the re.DOTALL flag or (?s) inline) when you need to span lines, but prefer explicit character classes for clarity.
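A short comparison of the default, the global flag and the scoped inline form, as a sketch on an invented multi-line string:

```python
import re

# Hypothetical page fragment spanning several lines.
page = "BEGIN\nline one\nline two\nEND"

default = re.search(r"BEGIN(.*)END", page)            # dot stops at newlines
dotall = re.search(r"BEGIN(.*)END", page, re.DOTALL)  # flag applies to the whole pattern
scoped = re.search(r"BEGIN(?s:(.*))END", page)        # DOTALL only inside the (?s:...) group

print(default)
print(repr(dotall.group(1)))
print(repr(scoped.group(1)))
```

The scoped form keeps any other dot in a longer pattern at its default behaviour, which helps when only one part of the pattern should cross lines.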
Should I lowercase text before or after matching?
It depends on intent. Lowercase first when case is irrelevant to the match; keep original case and use a case-insensitive flag when you still need the surface form in your output.
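The difference shows up in what comes back from the match. A minimal sketch with an invented sentence:

```python
import re

# Hypothetical sample with the same word in two casings.
text = "The Queen met the QUEEN's envoy."

lowered_first = re.findall(r"queen", text.lower())        # original casing is lost
flag_instead = re.findall(r"queen", text, re.IGNORECASE)  # original casing is kept

print(lowered_first)
print(flag_instead)
```

Both approaches find the same number of hits; only the flag version can still report how the word was printed.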
Why do my capture groups return None on some matches?
Optional groups that did not participate in a match return None, not an empty string. Guard against it in code, or restructure the pattern so the group is always present.
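A small sketch of both guards, using a hypothetical year-with-optional-month pattern:

```python
import re

# Hypothetical pattern: a four-digit year with an optional two-digit month.
date = re.compile(r"(\d{4})(?:-(\d{2}))?")

with_month = date.fullmatch("1850-03")
year_only = date.fullmatch("1850")

print(with_month.groups())           # both groups participated
print(year_only.groups())            # second group is None
print(year_only.groups(default=""))  # substitute a default for missing groups

# Explicit guard when the value is used further on.
month = year_only.group(2) or "unknown"
print(month)
```

Match.groups(default=...) is the least intrusive fix when empty strings are acceptable downstream; an explicit check is clearer when a missing group needs different handling.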