Python for Historians

A regular expression ("regex") is a compact pattern that tells Python what text to find — say, "four digits that look like a year" or "a parish reference like P/123" — and Python's built-in re module then finds, counts or replaces every match across your source. For historians, regex is the fastest way to pull structured fragments out of unstructured transcriptions: dates, sums of money, catalogue references, place names in a fixed format. This guide starts from zero, uses one running example, and shows where regex helps and where it should never be used.

What does a regex actually look like?

A pattern is just text plus a few special symbols. The most useful building blocks:

Pattern    Means                                        Example match
\d         any single digit                             7
\d{4}      exactly four digits                          1851
\w+        one or more letters, digits or underscores   Smith
[A-Z]      one capital letter                           P
?          the previous item is optional                Jno\.? matches Jno and Jno.
|          or                                           this|that matches this or that

Combine them and you describe a shape of text rather than one exact string.
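As a minimal sketch of combining building blocks, the pattern below joins [A-Z], a literal slash and \d+ to describe a catalogue reference shaped like P/123. The sample note and the variable names are invented for illustration; re.findall is explained in the next section.

```python
import re

# Hypothetical catalogue note; the references are invented for illustration.
note = "See P/123 and Q/7 for the vestry minutes of 1851."

# One capital letter, a literal slash, then one or more digits.
ref_pattern = r"[A-Z]/\d+"

refs = re.findall(ref_pattern, note)
print(refs)   # ['P/123', 'Q/7']
```

The capital S in "See" and the year 1851 are skipped because neither fits the whole shape: letter, slash, digits.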

How do you run a pattern in Python?

Import re, write your pattern as a raw string (r"..."), and pick a function. Here is the running example — a line from a transcribed account book:

python
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

year = re.search(r"\d{4}", line)
print(year.group())          # 1851

money = re.findall(r"£\d+", line)
print(money)                 # ['£3']

re.search returns the first match (or None); re.findall returns every match as a list. Those two functions already cover a large share of real work.
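Because re.search can return None, calling .group() on a line with no match raises an AttributeError. A small guard, sketched here with a hypothetical helper named first_year and an invented undated line, keeps a batch run from crashing on the first line that lacks a year.

```python
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"
undated = "Paid to Jno. Smith the sum of £3 4s 6d"  # hypothetical line with no year


def first_year(text):
    """Return the first run of four digits in text, or None if there is none."""
    match = re.search(r"\d{4}", text)
    if match is None:
        return None
    return match.group()


print(first_year(line))     # 1851
print(first_year(undated))  # None
```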

Why must you use raw strings?

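In a normal Python string, Python interprets backslash sequences before the regex engine ever sees them: "\b" becomes a backspace character instead of the regex word boundary, and "\1" becomes a control character instead of a back-reference. Sequences Python does not recognise, such as "\d", happen to survive, but recent Python versions emit a warning about them. A raw string, written r"\d{4}", passes every backslash straight to the regex engine. Forgetting the r is one of the most common beginner failures: the pattern looks right but matches nothing.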
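A short sketch of the difference, using \b, which the regex engine reads as a word boundary but a normal Python string reads as a backspace character. The variable names plain and raw are illustrative only.

```python
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

plain = "\bSmith\b"   # Python turns each \b into one backspace character
raw = r"\bSmith\b"    # the backslashes survive, so the regex engine sees \b

print(len(plain), len(raw))          # 7 9
print(re.search(plain, line))        # None
print(re.search(raw, line).group())  # Smith
```

The two patterns look identical on the page, which is why this mistake is hard to spot by eye.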

How do you capture just part of a match?

Wrap the bit you want in parentheses — that creates a capture group. Named groups make the result self-documenting:

python
m = re.search(r"the (?P<day>\d{1,2}) (?P<month>\w+) (?P<year>\d{4})", line)
print(m.group("day"))    # 14
print(m.group("month"))  # March
print(m.group("year"))   # 1851

This turns a free-text date inside a sentence into three labelled fields you can store in a spreadsheet.
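One possible way to get those labelled fields into a spreadsheet is to pair groupdict() with the standard csv module. This is a sketch under assumptions: it writes to an in-memory buffer, and the column order is a choice made here, not something the source prescribes.

```python
import csv
import io
import re

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"
date_pattern = r"the (?P<day>\d{1,2}) (?P<month>\w+) (?P<year>\d{4})"

m = re.search(date_pattern, line)
# groupdict() returns the named groups as a dictionary; fall back to {} on no match.
row = m.groupdict() if m else {}

# An in-memory buffer stands in for a real file opened with newline="".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["day", "month", "year"])
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Every captured value is a string, so convert day and year with int() if you need to sort or calculate with them.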

How do you cope with OCR noise and spelling variation?

Historical text is irregular, so build tolerance into the pattern. Suppose a name appears as Jno., Jno, or John:

python
pattern = r"J(no\.?|ohn)"
for hit in re.finditer(pattern, "Jno. Smith and John Brown and Jno Clark"):
    print(hit.group())
# Jno.
# John
# Jno

Add re.IGNORECASE to ignore capitalisation, and use character classes for letters OCR confuses ([il1] for the i/l/1 muddle). Accept that no pattern is perfect — always review what it missed.
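The two suggestions above can be sketched together. The OCR sample below is invented for illustration: one surname appears with inconsistent capitalisation and with i, l and 1 confused.

```python
import re

# Hypothetical OCR output: the same surname with case and i/l/1 confusion.
ocr_text = "Jno. Smlth, john SMITH and J. Sm1th"

# [il1] accepts any one of i, l or 1 in that position.
surname = r"sm[il1]th"

hits = re.findall(surname, ocr_text, flags=re.IGNORECASE)
print(hits)   # ['Smlth', 'SMITH', 'Sm1th']
```

A tolerant pattern also raises the risk of false positives, so it is worth reading through the hits as well as looking for misses.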

When should you not reach for regex?

Regex is for patterns inside plain text. It is the wrong tool for navigating structure:

  • Don't parse TEI/XML or HTML with regex — use lxml or BeautifulSoup, which understand nesting.
  • Don't parse arbitrary dates with one giant pattern — combine a focused regex with a date library.
  • Don't write a 200-character pattern — split it into named, commented pieces with re.VERBOSE.
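The second and third points can be sketched together: a short, commented re.VERBOSE pattern finds a candidate date, and datetime.strptime from the standard library checks that it is a real one. This assumes English month names and the default locale; the name date_re is illustrative.

```python
import re
from datetime import datetime

line = "Paid to Jno. Smith, the 14 March 1851, the sum of £3 4s 6d"

# With re.VERBOSE, whitespace and # comments inside the pattern are ignored,
# so each piece can sit on its own line with an explanation.
date_re = re.compile(
    r"""
    (?P<day>\d{1,2})        # day of the month, one or two digits
    \s+                     # whitespace between the parts
    (?P<month>[A-Z][a-z]+)  # capitalised month name
    \s+
    (?P<year>\d{4})         # four-digit year
    """,
    re.VERBOSE,
)

m = date_re.search(line)
if m:
    text_date = f"{m['day']} {m['month']} {m['year']}"
    # The regex only finds the candidate; strptime rejects impossible dates.
    parsed = datetime.strptime(text_date, "%d %B %Y").date()
    print(parsed.isoformat())   # 1851-03-14
```

strptime raises ValueError for something like "31 February 1851", which the regex alone would happily accept.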

A useful rule of thumb: if you are using regex to balance nested tags, you have the wrong tool.
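For comparison, here is what a parser does with nested markup. This sketch uses xml.etree.ElementTree from the standard library as a stand-in for lxml or BeautifulSoup, and the TEI-like fragment is invented for illustration. Real TEI files declare a namespace, which this simplified fragment omits.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment, invented for illustration (no namespace).
fragment = """
<p>Paid to <persName>Jno. <hi>Smith</hi></persName>,
the <date when="1851-03-14">14 March 1851</date>.</p>
""".strip()

root = ET.fromstring(fragment)

date_el = root.find("date")
print(date_el.get("when"))   # 1851-03-14
print(date_el.text)          # 14 March 1851

# itertext() walks nested children, so the <hi> inside <persName> is no obstacle.
name = "".join(root.find("persName").itertext())
print(name)                  # Jno. Smith
```

Once the parser has handed you the plain text of an element, regex is again a reasonable tool for patterns inside that text.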

How do you clean text with substitution?

re.sub replaces every match. A common chore is collapsing the runaway whitespace left by OCR:

python
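messy_text = "Paid  to Jno.\n Smith [page 12]  the sum"  # hypothetical OCR sample
# remove a recurring page-header artefact first, so it leaves no gap behind
clean = re.sub(r"\[page \d+\]", "", messy_text)
# then collapse runs of whitespace to a single space and trim the ends
clean = re.sub(r"\s+", " ", clean).strip()
print(clean)   # Paid to Jno. Smith the sum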

Build these substitutions up one at a time and check the result after each, rather than chaining ten replacements blind.
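One way to follow that advice is to keep the substitutions in a list and report what each one changed. This is a sketch, not a fixed recipe: the sample text and the step descriptions are invented, and re.subn is used because it returns the number of replacements alongside the new text.

```python
import re

# Hypothetical OCR output, invented for illustration.
messy_text = "Paid  to Jno.\n\n Smith [page 12]  the sum   of £3"

# Each step is (description, pattern, replacement); the order matters.
steps = [
    ("drop page markers", r"\[page \d+\]", ""),
    ("collapse whitespace", r"\s+", " "),
]

text = messy_text
for description, pattern, replacement in steps:
    # subn returns the new text and how many replacements were made.
    text, count = re.subn(pattern, replacement, text)
    print(f"{description}: {count} change(s) -> {text!r}")

clean = text.strip()
print(clean)
```

A step that reports zero changes, or far more than expected, is the cue to inspect its pattern before adding the next one.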

Key Takeaways

  • A regex is a pattern describing the shape of text to find, count, or replace.
  • Start with four functions: re.search, re.findall, re.sub, re.finditer.
  • Always write patterns as raw strings (r"...") to avoid backslash chaos.
  • Use parentheses — ideally named groups — to capture the part you want, like the year.
  • Build tolerance for OCR noise and spelling variants, and review the misses.
  • Never parse XML/HTML with regex; use a real parser for nested markup.
  • Use re.sub(r"\s+", " ", text) to tame OCR whitespace.

Frequently Asked Questions

What is a regular expression in plain terms?

A regular expression is a small pattern that describes what text to look for — for example 'four digits in a row' to find a year. Python's re module then finds, counts, or replaces every piece of text matching that pattern.

Which Python functions do I actually need to start?

Just four: re.search to find the first match, re.findall to get all matches as a list, re.sub to replace, and re.finditer when you also want each match's position. Most source work is covered by these.

Why should I use raw strings like r'...' for patterns?

A raw string stops Python from interpreting backslashes, so r'\d+' reaches the regex engine intact. Without it you fight two layers of escaping and patterns mysteriously fail.

Is regex the right tool for parsing TEI or HTML?

No. Use a proper XML or HTML parser like lxml or BeautifulSoup for structured markup. Regex is for patterns inside plain text — dates, references, currency — not for navigating nested tags.

How do I handle OCR errors and spelling variation in regex?

Build tolerance into the pattern: use character classes for letters that get confused, optional groups for variant spellings, and re.IGNORECASE. Accept that no pattern catches every messy variant, and review the misses.

How do I capture part of a match, like just the year?

Wrap the part you want in parentheses to make a capture group, then read it with match.group(1). Named groups, where you label a group with a name inside the parentheses, make the result far more readable.