A regular expression ("regex") is a compact pattern that tells Python what text to find — say, "four digits that look like a year" or "a parish reference like P/123" — and Python's built-in re module then finds, counts or replaces every match across your source. For historians, regex is the fastest way to pull structured fragments out of unstructured transcriptions: dates, sums of money, catalogue references, place names in a fixed format. This guide starts from zero, uses one running example, and shows where regex helps and where it should never be used.
What does a regex actually look like?
A pattern is just text plus a few special symbols. The most useful building blocks:
| Pattern | Means | Matches |
|---|---|---|
| `\d` | any digit | `7` |
| `\d{4}` | exactly four digits | `1851` |
| `\w+` | one or more letters, digits or underscores | `Smith` |
| `[A-Z]` | one capital letter | `P` |
| `?` | the previous item is optional | `colou?r` matches `color` and `colour` |
| `\|` | or | `this\|that` matches `this` or `that` |
Combine them and you describe a shape of text rather than one exact string.
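To make the table concrete, here is a minimal sketch that combines three of the building blocks to find catalogue references shaped like P/123. The sample sentence and the names `note` and `REFERENCE` are invented for illustration, and the "letter, slash, digits" format is an assumption about how your references look.

```python
import re

# Hypothetical sample text; adjust the pattern to your own reference format.
note = "See P/123 and D/45 in the vestry minutes for 1851."

# [A-Z]  one capital letter
# /      a literal slash
# \d+    one or more digits
REFERENCE = r"[A-Z]/\d+"

references = re.findall(REFERENCE, note)
print(references)  # ['P/123', 'D/45']
```

The year 1851 is ignored because it has no capital letter and slash in front of it: the pattern describes the whole shape, not just the digits.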
How do you run a pattern in Python?
Import re, write your pattern as a raw string (r"..."), and pick a function. Here is the running example — a line from a transcribed account book:
```python
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

year = re.search(r"\d{4}", line)
print(year.group())  # 1851

money = re.findall(r"£\d+", line)
print(money)  # ['£3']
```

`re.search` returns the first match (or `None`); `re.findall` returns every match as a list. Those two functions already cover a large share of real work.
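Because `re.search` can return `None`, calling `.group()` on a line with no match raises an error. A minimal sketch of a guard, assuming you are looping over many entries; the second entry and the names `entries` and `years` are invented for illustration:

```python
import re

# Two hypothetical account-book entries; the second has no year in it.
entries = [
    "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d",
    "Recd. of Mr Brown for rent, no date given",
]

years = []
for entry in entries:
    found = re.search(r"\d{4}", entry)
    # Only call .group() when a match object came back.
    years.append(found.group() if found else None)

print(years)  # ['1851', None]
```

Keeping the `None` in the list, rather than skipping the entry, makes it easy to see later which lines need checking by hand.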
Why must you use raw strings?
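In a normal Python string, Python processes backslash escapes before the regex engine ever sees the pattern: `"\b"` becomes a backspace character rather than a word boundary, and unrecognised escapes such as `"\d"` trigger a warning in recent Python versions. A raw string, written `r"\d{4}"`, passes every backslash straight to the regex engine. Forgetting the `r` is one of the most common beginner failures: the pattern looks right but matches nothing.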
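A small sketch of the difference, using the word-boundary marker `\b`, which Python's own string syntax reads as a backspace character. The sample text and variable names are made up for the demonstration:

```python
import re

text = "the year 1851"

# Raw string: \b reaches the regex engine and means "word boundary".
raw_hits = re.findall(r"\b\d{4}\b", text)

# Normal string: Python turns "\b" into a single backspace character
# before the regex engine sees it. ("\\d" is doubled here only to keep
# Python from warning about an unrecognised escape.)
plain_hits = re.findall("\b\\d{4}\b", text)

print(raw_hits)    # ['1851']
print(plain_hits)  # []
print(len(r"\b"), len("\b"))  # 2 1
```

The second pattern is asking for a backspace, four digits, and another backspace, which is why it silently finds nothing.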
How do you capture just part of a match?
Wrap the bit you want in parentheses — that creates a capture group. Named groups make the result self-documenting:
```python
m = re.search(r"the (?P<day>\d{1,2}) (?P<month>\w+) (?P<year>\d{4})", line)
print(m.group("day"))    # 14
print(m.group("month"))  # March
print(m.group("year"))   # 1851
```

This turns a free-text date inside a sentence into three labelled fields you can store in a spreadsheet.
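One way to get from named groups to spreadsheet rows is `groupdict()`, which returns the named groups as a dictionary, combined with the standard `csv` module. This is a sketch under assumptions: the column names simply reuse the group names, and the output is written to an in-memory buffer (`buffer`) rather than a real file.

```python
import csv
import io
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"
DATE = re.compile(r"the (?P<day>\d{1,2}) (?P<month>\w+) (?P<year>\d{4})")

m = DATE.search(line)
# groupdict() maps each group name to the text it captured.
row = m.groupdict() if m else {}

# Write to an in-memory buffer; swap in open("dates.csv", "w", newline="")
# (a hypothetical filename) to produce a file you can open in a spreadsheet.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["day", "month", "year"], lineterminator="\n")
writer.writeheader()
if row:
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```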
How do you cope with OCR noise and spelling variation?
Historical text is irregular, so build tolerance into the pattern. Suppose a name appears as Jno., Jno, or John:
```python
pattern = r"J(no\.?|ohn)"
for hit in re.finditer(pattern, "Jno. Smith and John Brown and Jno Clark"):
    print(hit.group())
# Jno.
# John
# Jno
```

Add `re.IGNORECASE` to ignore capitalisation, and use character classes for letters OCR confuses (`[il1]` for the i/l/1 muddle). Accept that no pattern is perfect, and always review what it missed.
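A rough sketch of `re.IGNORECASE` and an `[il1]` character class working together, plus one simple way to review the misses. The OCR sample, the surname, and the looser "candidate" pattern are all assumptions chosen for illustration.

```python
import re

# Hypothetical OCR output with three garbled forms of one surname.
ocr_text = "Paid Smith; paid Smlth; paid SM1TH; paid Smyth"

# [il1] accepts any of the characters OCR tends to confuse in that position.
surname = re.compile(r"Sm[il1]th", re.IGNORECASE)
hits = surname.findall(ocr_text)
print(hits)  # ['Smith', 'Smlth', 'SM1TH']

# Review step: cast a wider net, then list what the strict pattern skipped.
candidates = re.findall(r"\bSm\w+", ocr_text, re.IGNORECASE)
missed = [word for word in candidates if not surname.fullmatch(word)]
print(missed)  # ['Smyth']
```

Whether "Smyth" is the same person is a historical judgement, not a regex question, which is why the misses are listed for review rather than matched automatically.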
When should you not reach for regex?
Regex is for patterns inside plain text. It is the wrong tool for navigating structure:
- Don't parse TEI/XML or HTML with regex; use `lxml` or BeautifulSoup, which understand nesting.
- Don't parse arbitrary dates with one giant pattern; combine a focused regex with a date library.
- Don't write a 200-character pattern; split it into named, commented pieces with `re.VERBOSE`.
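The last two points can be sketched together: a short `re.VERBOSE` pattern locates a date-shaped fragment, and the standard library's `datetime.strptime` decides whether it is a real date. This assumes English month names written in full and an English (or default "C") locale; the names `DATE` and `parsed` are illustrative.

```python
import re
from datetime import datetime

# re.VERBOSE lets the pattern span several lines with comments;
# whitespace inside the pattern is ignored, so spaces are written as \s+.
DATE = re.compile(
    r"""
    (?P<day>\d{1,2})        # day of the month, one or two digits
    \s+
    (?P<month>[A-Z][a-z]+)  # capitalised month name
    \s+
    (?P<year>\d{4})         # four-digit year
    """,
    re.VERBOSE,
)

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"
m = DATE.search(line)

parsed = None
if m:
    try:
        # The date library, not the regex, rejects things like "31 February".
        parsed = datetime.strptime(
            f"{m['day']} {m['month']} {m['year']}", "%d %B %Y"
        ).date()
    except ValueError:
        parsed = None

print(parsed)  # 1851-03-14
```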
A famous rule of thumb: if you are using regex to balance nested tags, you have the wrong tool.
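For comparison, here is what a parser-based approach can look like. To keep the sketch dependency-free it uses the standard library's `xml.etree.ElementTree` rather than `lxml` or BeautifulSoup, and the fragment is a simplified, namespace-free imitation of TEI invented for this example; real TEI files declare a namespace that you would need to include in the tag names.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical TEI-like fragment (no namespace declared).
tei_fragment = """
<div>
  <p>Paid to <persName>Jno. <surname>Smith</surname></persName> on
     <date when="1851-03-14">14 March 1851</date>.</p>
</div>
"""

root = ET.fromstring(tei_fragment)

# itertext() gathers text from nested child elements too,
# which is exactly the part a regex gets wrong.
names = ["".join(el.itertext()) for el in root.iter("persName")]
dates = [el.get("when") for el in root.iter("date")]

print(names)  # ['Jno. Smith']
print(dates)  # ['1851-03-14']
```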
How do you clean text with substitution?
re.sub replaces every match. A common chore is collapsing the runaway whitespace left by OCR:
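```python
import re

# illustrative sample; substitute your own OCR output
messy_text = "Paid  to\n Jno.   Smith [page 12] the sum"
# remove a recurring page-header artefact first, so the gap it leaves is collapsed too
clean = re.sub(r"\[page \d+\]", "", messy_text)
clean = re.sub(r"\s+", " ", clean).strip()
```

Build these substitutions up one at a time and check the result after each, rather than chaining ten replacements blind.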
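One possible way to check each step is `re.subn`, which works like `re.sub` but also returns how many replacements it made. The sample text, the step labels, and the `steps` list are assumptions for illustration, not a fixed recipe.

```python
import re

# Hypothetical OCR output with page markers and runaway whitespace.
messy_text = "[page 12]  Paid   to Jno.\n\nSmith [page 13] the   sum of £3"

# Each step: (label, pattern, replacement). Order matters: markers are
# removed first so the gaps they leave are collapsed by the second step.
steps = [
    ("drop page markers", r"\[page \d+\]", ""),
    ("collapse whitespace", r"\s+", " "),
]

clean = messy_text
report = []
for label, pattern, replacement in steps:
    clean, count = re.subn(pattern, replacement, clean)
    report.append((label, count))
    # Show the count and the intermediate text after every step.
    print(f"{label}: {count} replacement(s) -> {clean!r}")

clean = clean.strip()
print(clean)  # Paid to Jno. Smith the sum of £3
```

A count of zero, or a count far higher than expected, is usually the first sign that a pattern is not doing what you intended.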
Key Takeaways
- A regex is a pattern describing the shape of text to find, count, or replace.
- Start with four functions: `re.search`, `re.findall`, `re.sub`, `re.finditer`.
- Always write patterns as raw strings (`r"..."`) to avoid backslash chaos.
- Use parentheses, ideally named groups, to capture the part you want, like the year.
- Build tolerance for OCR noise and spelling variants, and review the misses.
- Never parse XML/HTML with regex; use a real parser for nested markup.
- Use `re.sub(r"\s+", " ", text)` to tame OCR whitespace.
Frequently Asked Questions
What is a regular expression in plain terms?
A regular expression is a small pattern that describes what text to look for — for example 'four digits in a row' to find a year. Python's re module then finds, counts, or replaces every piece of text matching that pattern.
Which Python functions do I actually need to start?
Just four: re.search to find the first match, re.findall to get all matches as a list, re.sub to replace, and re.finditer when you also want each match's position. Most source work is covered by these.
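As a brief illustration of the position information `re.finditer` gives you, each match object has `start()` and `end()` methods that can be used to slice the original text. The variable names here are illustrative.

```python
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

# Collect each number together with where it sits in the line.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", line)]

for text, start, end in positions:
    # Slicing with start/end gives back exactly the matched text.
    print(text, start, end, line[start:end] == text)
```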
Why should I use raw strings like r'...' for patterns?
A raw string stops Python from interpreting backslashes, so r'\d+' reaches the regex engine intact. Without it you fight two layers of escaping and patterns mysteriously fail.
Is regex the right tool for parsing TEI or HTML?
No. Use a proper XML or HTML parser like lxml or BeautifulSoup for structured markup. Regex is for patterns inside plain text — dates, references, currency — not for navigating nested tags.
How do I handle OCR errors and spelling variation in regex?
Build tolerance into the pattern: use character classes for letters that get confused, optional groups for variant spellings, and re.IGNORECASE. Accept that no pattern catches every messy variant, and review the misses.
How do I capture part of a match, like just the year?
Wrap the part you want in parentheses to make a capture group, then read it with match.group(1). Named groups, where you label a group with a name inside the parentheses, make the result far more readable.
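A short sketch of numbered groups on the running example, for comparison with named groups. Groups are numbered from 1 in the order their opening parentheses appear, and `group(0)` is the whole match.

```python
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

m = re.search(r"(\d{1,2}) (\w+) (\d{4})", line)
if m:
    print(m.group(0))  # 14 March 1851
    print(m.group(3))  # 1851
    print(m.groups())  # ('14', 'March', '1851')
```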