To fuzzy match names in Python, use the rapidfuzz library: normalise each name, then score pairs with a chosen scorer such as token_sort_ratio, and accept matches above a threshold you calibrate on your own data. Fuzzy matching reconciles the messy reality of historical names — Jno. Smith / John Smith / J. Smyth — into linked records. But the algorithm is the easy part; the craft lies in normalising first, choosing the right scorer, blocking for speed, and reviewing borderline matches by hand so you never merge two different people silently.
Which library should you use?
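Use rapidfuzz: it is fast, permissively licensed, and the modern successor to the once-popular fuzzywuzzy. A first match takes only a few lines:

```python
from rapidfuzz import fuzz, utils

fuzz.ratio("John Smith", "Jon Smith")  # ~94.7
fuzz.ratio("Smith, John", "John Smith")  # low: same words, different order
# default_process lowercases and strips punctuation (not applied by default in rapidfuzz 3.x)
fuzz.token_sort_ratio("Smith, John", "John Smith", processor=utils.default_process)  # 100.0
```

Scores run 0–100. The leap from "Smith, John" vs "John Smith" scoring poorly on ratio but perfectly on token_sort_ratio shows why scorer choice matters.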
Why normalise before you match?
Normalisation is the single highest-leverage step — often more important than the scorer. Standardise everything you can before comparing:
```python
import re
import unicodedata

def normalise(name):
    name = name.lower().strip()
    name = unicodedata.normalize("NFKD", name)      # split accents from their letters
    name = name.encode("ascii", "ignore").decode()  # drop accent marks (and any other non-ASCII character)
    name = re.sub(r"\b(mr|mrs|dr|sir|esq|jnr|snr)\b\.?", "", name)  # titles and suffixes
    name = re.sub(r"[^\w\s]", " ", name)            # drop punctuation
    return re.sub(r"\s+", " ", name).strip()

normalise("Dr. José Smith-Jones, Esq.")  # 'jose smith jones'
```

Cleaning titles, punctuation and accents consistently means your scorer compares the names, not the noise around them.
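Stripping punctuation does not resolve historical abbreviations such as Jno., so they usually need an explicit lookup. A minimal sketch, assuming a small hand-built table applied after normalisation (the `ABBREVIATIONS` mapping and the `expand_abbreviations` helper are hypothetical names, and the entries are illustrative rather than complete):

```python
# Hypothetical lookup of abbreviated forenames; extend it from your own sources.
ABBREVIATIONS = {
    "jno": "john",
    "wm": "william",
    "thos": "thomas",
    "chas": "charles",
}

def expand_abbreviations(normalised_name):
    """Replace each whole token found in ABBREVIATIONS; leave other tokens alone."""
    tokens = normalised_name.split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

print(expand_abbreviations("jno smith"))    # john smith
print(expand_abbreviations("wm smithson"))  # william smithson
```

Matching whole tokens rather than substrings keeps a name like "wmson" from being rewritten by accident.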
Which scorer fits which problem?
| Scorer | Best for | Example where it wins |
|---|---|---|
| ratio | Whole-string typos | Smyth vs Smith |
| partial_ratio | One name inside another | John Smith in John Smith of Leeds |
| token_sort_ratio | Reordered words | Smith, John vs John Smith |
| token_set_ratio | Extra/missing words | John Smith vs John Henry Smith |
For person names with inconsistent ordering, token_sort_ratio is usually the safest default; switch to token_set_ratio when middle names appear and disappear.
How do you match against a reference list?
The common task is linking a messy column to an authority list. Use rapidfuzz.process.extractOne:
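```python
from rapidfuzz import process, fuzz

authority = ["John Smith", "Jane Smith", "John Smithson"]
best = process.extractOne(
    normalise("Jno Smith"),
    [normalise(a) for a in authority],
    scorer=fuzz.token_sort_ratio,
)
print(best)  # (match, score, index); the best score here is about 84.2
```

The returned index lets you map back to the original authority entry, so you keep the canonical spelling while matching the variant. Note that normalise does not expand abbreviations, so 'jno smith' scores about 84 against both 'john smith' and 'jane smith'; expand known abbreviations such as Jno before relying on the top hit.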
How do you keep it fast on thousands of names?
Comparing every name against every other is O(n²): 10,000 names means roughly 50 million unique pairs (100 million if you score the full matrix). Blocking slashes this by only comparing names that share a cheap key:

```python
from collections import defaultdict

blocks = defaultdict(list)
for n in names:
    key = normalise(n)[:1]  # first letter; or use a Soundex/metaphone code
    blocks[key].append(n)
# now only compare within each block
```

Better keys are phonetic codes (Soundex, Metaphone via the jellyfish library), which group Smith and Smyth together. For the scoring itself, rapidfuzz.process.cdist computes a full similarity matrix in optimised native code (it needs NumPy installed), far faster than a Python loop.
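To make phonetic blocking concrete, here is a sketch that blocks on a Soundex code and then generates candidate pairs only within each block. The `soundex` function is a hand-rolled version of American Soundex written for illustration (a maintained library such as jellyfish is the safer choice in practice), and taking the last token as the surname is an assumption about how the names are laid out.

```python
from collections import defaultdict
from itertools import combinations

# Letter-to-digit table for American Soundex.
_CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        _CODES[ch] = digit

def soundex(word):
    """Four-character Soundex code, e.g. 'Smith' and 'Smyth' both give 'S530'."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    code = word[0].upper()
    previous = _CODES.get(word[0], "")
    for ch in word[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with equal codes
            previous = digit
    return (code + "000")[:4]

names = ["john smith", "j smyth", "jane smithe", "mary jones", "m jonas"]

blocks = defaultdict(list)
for name in names:
    surname = name.split()[-1]  # assumption: the surname is the last token
    blocks[soundex(surname)].append(name)

# Only pairs inside the same block become candidates for scoring.
candidate_pairs = [pair for block in blocks.values() for pair in combinations(block, 2)]
print(len(candidate_pairs))  # 4, versus 10 pairs without blocking
```

The trade-off is that any true match split across two blocks is never scored at all, so the key should be looser than the final matching rule.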
How do you choose a threshold without corrupting your data?
There is no magic cutoff — calibrate it. Score all candidate pairs, sort by score, and review the band around your threshold by hand:
```python
matches = [
    (a, b, fuzz.token_sort_ratio(normalise(a), normalise(b)))
    for a, b in candidate_pairs
]
borderline = [m for m in matches if 80 <= m[2] < 92]  # eyeball these
```

A starting threshold of 85–90 is reasonable, but the cost of a false match (silently merging two real people) is usually far higher than a missed one, so err toward manual review of the grey zone.
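Calibration can be made explicit by hand-labelling a sample of scored pairs and checking what each candidate threshold would accept. The sketch below assumes such a sample already exists; the `labelled` data is invented purely for illustration, and `precision_recall` is a hypothetical helper.

```python
# Hypothetical hand-labelled sample: (score, is_same_person)
labelled = [
    (97.0, True), (94.0, True), (91.0, True), (90.0, False),
    (88.0, True), (86.0, False), (84.0, True), (81.0, False),
]

def precision_recall(labelled, threshold):
    """Precision and recall if every pair scoring >= threshold is accepted."""
    accepted = [same for score, same in labelled if score >= threshold]
    true_matches = sum(same for _, same in labelled)
    true_accepted = sum(accepted)
    # Convention: with nothing accepted (or nothing to find) report 1.0.
    precision = true_accepted / len(accepted) if accepted else 1.0
    recall = true_accepted / true_matches if true_matches else 1.0
    return precision, recall

for threshold in (80, 85, 90, 95):
    p, r = precision_recall(labelled, threshold)
    print(f"threshold {threshold}: precision {p:.2f}, recall {r:.2f}")
```

Because a false merge is the costlier error, it is reasonable to pick the threshold by the precision you can tolerate and send everything below it to manual review.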
How does this relate to record linkage?
Fuzzy name matching is one signal. Full record linkage weighs several fields — name, place, date of birth, occupation — and decides whether two records are the same entity, often with a probabilistic model (libraries like recordlinkage or splink). Treat the fuzzy score as a feature feeding that decision, not the decision itself.
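As a rough illustration of the name score being one feature among several, the sketch below combines it with two other field comparisons in a plain weighted sum. This is a deliberately simple stand-in, not the probabilistic models that recordlinkage or splink provide; the `WEIGHTS` values, the field names and the one-year birth tolerance are all assumptions.

```python
# Hypothetical weights; a probabilistic model would estimate these from data.
WEIGHTS = {"name": 0.5, "birth_year": 0.3, "place": 0.2}

def linkage_score(name_score, rec_a, rec_b):
    """Weighted sum of field agreements, each scaled to the range 0..1."""
    features = {
        "name": name_score / 100,  # fuzzy score arrives on a 0..100 scale
        "birth_year": 1.0 if abs(rec_a["birth_year"] - rec_b["birth_year"]) <= 1 else 0.0,
        "place": 1.0 if rec_a["place"] == rec_b["place"] else 0.0,
    }
    return sum(WEIGHTS[field] * value for field, value in features.items())

a = {"birth_year": 1841, "place": "leeds"}
b = {"birth_year": 1842, "place": "leeds"}
c = {"birth_year": 1870, "place": "york"}

print(linkage_score(90.0, a, b))  # same name score, corroborating fields
print(linkage_score(90.0, a, c))  # same name score, conflicting fields
```

The two calls use an identical name score yet produce very different totals, which is the point: the other fields decide whether a similar name is the same person.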
Key Takeaways
- Use rapidfuzz; it is fast, well-licensed, and replaces fuzzywuzzy.
- Normalise names first (accents, titles, punctuation, case); it often beats scorer choice.
- Pick the scorer to fit the problem; token_sort_ratio handles reordered names well.
- Use process.extractOne to match a variant against an authority list and keep the canonical spelling.
- Block on a phonetic or first-letter key to avoid O(n²) blow-up; use cdist for speed.
- Calibrate the threshold on your data and manually review the borderline band.
- Remember fuzzy matching is one ingredient of record linkage, not the whole task.
Frequently Asked Questions
What library should I use to fuzzy match names in Python?
Use rapidfuzz. It is a fast, MIT-licensed library offering Levenshtein-based ratios and several scorers, and it has effectively replaced the older fuzzywuzzy. Install it with pip install rapidfuzz.
What's the difference between ratio, partial_ratio and token_sort_ratio?
ratio compares whole strings; partial_ratio finds the best matching substring (good when one name is contained in another); token_sort_ratio sorts the words first, so 'Smith, John' and 'John Smith' score as equal once case and punctuation are normalised.
Should I normalise names before matching?
Yes — it is the highest-impact step. Lowercase, strip punctuation and titles, expand or remove abbreviations, and handle accents consistently. Good normalisation often matters more than the choice of scorer.
What similarity threshold should I use?
There is no universal number; calibrate it on your own data. A common starting point is 85-90 on rapidfuzz's 0-100 scale, but always review borderline matches by hand because a wrong merge corrupts your data silently.
How do I match thousands of names without it taking forever?
Use blocking (only compare names that share a cheap key such as the same first letter, Soundex code, or birth decade) and rapidfuzz's cdist for vectorised scoring. Blocking can cut the number of comparisons by orders of magnitude, and cdist makes the remaining ones fast.
Is fuzzy matching the same as record linkage?
No. Fuzzy matching scores string similarity; record linkage decides whether two records are the same entity using several fields and a model. Fuzzy matching is one ingredient of record linkage, not the whole task.