Beginner's Guide to Regex on Sources in Python

A regular expression ("regex") is a compact pattern that tells Python what text to find — say, "four digits that look like a year" or "a parish reference like P/123" — and Python's built-in re module then finds, counts or replaces every match across your source. For historians, regex is the fastest way to pull structured fragments out of unstructured transcriptions: dates, sums of money, catalogue references, place names in a fixed format. This guide starts from zero, uses one running example, and shows where regex helps and where it should never be used.

What does a regex actually look like?

A pattern is just text plus a few special symbols. The most useful building blocks:

Pattern	Means	Matches
`\d`	any digit	`7`
`\d{4}`	exactly four digits	`1851`
`\w+`	one or more word characters (letters, digits, underscore)	`Smith`
`[A-Z]`	one capital letter, A to Z	`P`
`?`	the previous item is optional	`Jno\.?` matches `Jno` and `Jno.`
`|`	or	`Jno|John` matches `Jno` or `John`

Combine them and you describe a shape of text rather than one exact string.
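As a small sketch of combining blocks, the pattern below joins `[A-Z]`, a literal slash and `\d+` to describe the shape of a reference such as P/123. The sample sentence and the exact reference format are invented for illustration, not taken from a real catalogue.

```python
import re

# Invented sample text; the "capital letter / number" format is an assumption.
text = "See P/123 and D/4567, but not p/9 or the year 1851."

# one capital letter, a literal slash, then one or more digits
reference_pattern = r"[A-Z]/\d+"

references = re.findall(reference_pattern, text)
print(references)  # ['P/123', 'D/4567']
```

The lowercase `p/9` and the bare year are skipped because they do not fit the shape, which is the point of describing a shape rather than a fixed string.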

How do you run a pattern in Python?

Import re, write your pattern as a raw string (r"..."), and pick a function. Here is the running example — a line from a transcribed account book:

```python
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

year = re.search(r"\d{4}", line)
print(year.group())          # 1851

money = re.findall(r"£\d+", line)
print(money)                 # ['£3']
```

re.search returns the first match (or None); re.findall returns every match as a list. Those two functions already cover a large share of real work.
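Because re.search can return None, calling .group() on a failed search raises an error. The sketch below shows one way to guard against that; `first_match` is a hypothetical helper written for this example, not part of the re module.

```python
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

def first_match(pattern, text):
    """Hypothetical helper: return the first matching text, or None."""
    found = re.search(pattern, text)
    return found.group() if found else None

shillings = first_match(r"\d+s", line)
guineas = first_match(r"\d+ guineas", line)

# findall never returns None: no matches simply gives an empty list
pounds_and_pence = re.findall(r"£\d+|\d+d", line)

print(shillings)         # 4s
print(guineas)           # None
print(pounds_and_pence)  # ['£3', '6d']
```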

Why must you use raw strings?

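In a normal Python string the backslash is Python's own escape character, so Python interprets some sequences before the regex engine ever sees them: "\b" becomes a backspace character instead of the regex word boundary, and "\n" becomes a newline. Unrecognised escapes such as "\d" happen to survive, but recent Python versions warn about them. A raw string, written r"\d{4}", passes every backslash straight to the regex engine. Forgetting the r is one of the most common beginner failures: the pattern looks right but matches nothing.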
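A short illustration of the difference, using the word-boundary symbol `\b` (not introduced above, chosen here because Python treats "\b" in a normal string as a backspace character). The sample text is invented.

```python
import re

text = "the year 1851"

# Normal string: each "\b" is a single backspace character.
plain = "\b1851\b"
# Raw string: each \b stays as backslash + b, a regex word boundary.
raw = r"\b1851\b"

print(len(plain), len(raw))          # 6 8
print(re.search(plain, text))        # None
print(re.search(raw, text).group())  # 1851
```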

How do you capture just part of a match?

Wrap the bit you want in parentheses — that creates a capture group. Named groups make the result self-documenting:

```python
# continues the running example: re and line are defined above
m = re.search(r"the (?P<day>\d{1,2}) (?P<month>\w+) (?P<year>\d{4})", line)
print(m.group("day"))    # 14
print(m.group("month"))  # March
print(m.group("year"))   # 1851
```

This turns a free-text date inside a sentence into three labelled fields you can store in a spreadsheet.
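One possible way to get from named groups to spreadsheet rows is sketched below, using the match object's groupdict() method and the standard csv module. The second account-book line is invented, and the CSV is written to an in-memory buffer rather than a file to keep the example self-contained.

```python
import csv
import io
import re

lines = [
    "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d",
    "Recd of Wm. Brown, the 2 April 1851, the sum of £1 0s 0d",  # invented
]
date_pattern = r"the (?P<day>\d{1,2}) (?P<month>\w+) (?P<year>\d{4})"

rows = []
for entry in lines:
    m = re.search(date_pattern, entry)
    if m:  # skip lines with no recognisable date
        rows.append(m.groupdict())

# Write the labelled fields as CSV, which any spreadsheet can open.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["day", "month", "year"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```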

How do you cope with OCR noise and spelling variation?

Historical text is irregular, so build tolerance into the pattern. Suppose a name appears as Jno., Jno, or John:

```python
import re

pattern = r"J(no\.?|ohn)"
for hit in re.finditer(pattern, "Jno. Smith and John Brown and Jno Clark"):
    print(hit.group())
# Jno.
# John
# Jno
```

Add re.IGNORECASE to ignore capitalisation, and use character classes for letters OCR confuses ([il1] for the i/l/1 muddle). Accept that no pattern is perfect — always review what it missed.
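A minimal sketch of both ideas together: the pattern below allows any of i, l or 1 in the four middle positions of "William" and ignores capitalisation. The sample text is invented, and the pattern is deliberately loose, so it would also accept nonsense such as "Wllllam".

```python
import re

# Invented sample with an OCR digit/letter muddle and an all-caps variant
text = "William Smith, Wi11iam Brown, WILLIAM CLARK and Walter Jones"

# W, then four characters that may each be i, l or 1, then "am"
pattern = r"W[il1]{4}am"

hits = [m.group() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]
print(hits)  # ['William', 'Wi11iam', 'WILLIAM']
```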

When should you not reach for regex?

Regex is for patterns inside plain text. It is the wrong tool for navigating structure:

Don't parse TEI/XML or HTML with regex — use lxml or BeautifulSoup, which understand nesting.
Don't parse arbitrary dates with one giant pattern — combine a focused regex with a date library.
Don't write a 200-character pattern — split it into named, commented pieces with re.VERBOSE.
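The last two points can be sketched together: a focused pattern, split into commented pieces with re.VERBOSE, finds the date, and the standard datetime module interprets it. This assumes English month names and an English (or default C) locale for the %B format code.

```python
import re
from datetime import datetime

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

date_pattern = re.compile(
    r"""
    (?P<day>\d{1,2})          # day of the month, one or two digits
    \s+                       # whitespace must be written explicitly here
    (?P<month>[A-Z][a-z]+)    # capitalised month name
    \s+
    (?P<year>\d{4})           # four-digit year
    """,
    re.VERBOSE,
)

m = date_pattern.search(line)
# Hand the matched text to a date library instead of validating it by regex.
parsed = datetime.strptime(
    f"{m['day']} {m['month']} {m['year']}", "%d %B %Y"
).date()
print(parsed)  # 1851-03-14
```

In verbose mode, spaces and `#` comments inside the pattern are ignored, which is why the gaps between day, month and year are spelled out as `\s+`.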

A famous rule of thumb: if you are using regex to balance nested tags, you have the wrong tool.
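For comparison, here is roughly what the parser route looks like. To keep the sketch runnable without installing anything it uses xml.etree.ElementTree from the standard library rather than lxml or BeautifulSoup; the TEI-like fragment is invented and omits the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Invented TEI-like fragment; real TEI files declare a namespace.
fragment = """
<div>
  <p>Paid to <persName>Jno. Smith</persName>, the
     <date when="1851-03-14">14 March 1851</date>.</p>
  <p>Paid to <persName>John <hi>Brown</hi></persName>.</p>
</div>
""".strip()

root = ET.fromstring(fragment)

# itertext() gathers text from nested children, which regex handles badly.
names = ["".join(el.itertext()) for el in root.iter("persName")]
dates = [el.get("when") for el in root.iter("date")]

print(names)  # ['Jno. Smith', 'John Brown']
print(dates)  # ['1851-03-14']
```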

How do you clean text with substitution?

re.sub replaces every match. A common chore is collapsing the runaway whitespace left by OCR:

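```python
import re

messy_text = "Paid  to Jno.   Smith [page 12]\n the sum   of £3"  # invented sample
# remove a recurring page-header artefact first, then collapse whitespace
clean = re.sub(r"\[page \d+\]", "", messy_text)
clean = re.sub(r"\s+", " ", clean).strip()
```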

Build these substitutions up one at a time and check the result after each, rather than chaining ten replacements blind.
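One way to follow that advice is to keep the substitutions in a list and report what each one did. The sketch below uses re.subn, which works like re.sub but also returns the number of replacements made; the sample text and step labels are invented.

```python
import re

# Invented sample of noisy OCR output
messy_text = "[page 12]  Paid   to Jno.\tSmith\n\n[page 13] the sum  of £3"

# (label, pattern, replacement), applied in order
steps = [
    ("drop page markers", r"\[page \d+\]", ""),
    ("collapse whitespace", r"\s+", " "),
]

text = messy_text
for label, pattern, replacement in steps:
    text, count = re.subn(pattern, replacement, text)
    # Inspect the intermediate result before trusting the next step.
    print(f"{label}: {count} replacement(s) -> {text!r}")

clean = text.strip()
print(clean)  # Paid to Jno. Smith the sum of £3
```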

Key Takeaways

A regex is a pattern describing the shape of text to find, count, or replace.
Start with four functions: re.search, re.findall, re.sub, re.finditer.
Always write patterns as raw strings (r"...") to avoid backslash chaos.
Use parentheses — ideally named groups — to capture the part you want, like the year.
Build tolerance for OCR noise and spelling variants, and review the misses.
Never parse XML/HTML with regex; use a real parser for nested markup.
Use re.sub(r"\s+", " ", text) to tame OCR whitespace.

Frequently Asked Questions

What is a regular expression in plain terms?

A regular expression is a small pattern that describes what text to look for — for example 'four digits in a row' to find a year. Python's re module then finds, counts, or replaces every piece of text matching that pattern.

Which Python functions do I actually need to start?

Just four: re.search to find the first match, re.findall to get all matches as a list, re.sub to replace, and re.finditer when you also want each match's position. Most source work is covered by these.
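A brief sketch of the position information re.finditer gives you: each match object has start() and end() methods, so you can record where in the line a hit occurred.

```python
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

# Collect (matched text, start offset, end offset) for every run of digits.
hits = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", line)]

for matched, start, end in hits:
    # Slicing the line with the offsets gives back the matched text.
    print(matched, start, end, line[start:end] == matched)
```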

Why should I use raw strings like r'...' for patterns?

A raw string stops Python from interpreting backslashes, so r'\d+' reaches the regex engine intact. Without it you fight two layers of escaping and patterns mysteriously fail.

Is regex the right tool for parsing TEI or HTML?

No. Use a proper XML or HTML parser like lxml or BeautifulSoup for structured markup. Regex is for patterns inside plain text — dates, references, currency — not for navigating nested tags.

How do I handle OCR errors and spelling variation in regex?

Build tolerance into the pattern: use character classes for letters that get confused, optional groups for variant spellings, and re.IGNORECASE. Accept that no pattern catches every messy variant, and review the misses.

How do I capture part of a match, like just the year?

Wrap the part you want in parentheses to make a capture group, then read it with match.group(1). Named groups, where you label a group with a name inside the parentheses, make the result far more readable.
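For illustration, the numbered form looks like this: group(0) is the whole match, and groups are numbered from 1 in the order their opening parentheses appear.

```python
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

# Two unnamed groups: a word, then a four-digit year
m = re.search(r"(\w+) (\d{4})", line)

print(m.group(0))  # March 1851
print(m.group(1))  # March
print(m.group(2))  # 1851
print(m.groups())  # ('March', '1851')
```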
