Fuzzy match names in Python: A Practical Guide

To fuzzy match names in Python, use the rapidfuzz library: normalise each name, then score pairs with a chosen scorer such as token_sort_ratio, and accept matches above a threshold you calibrate on your own data. Fuzzy matching reconciles the messy reality of historical names — Jno. Smith / John Smith / J. Smyth — into linked records. But the algorithm is the easy part; the craft lies in normalising first, choosing the right scorer, blocking for speed, and reviewing borderline matches by hand so you never merge two different people silently.

Which library should you use?

Use rapidfuzz — fast, permissively licensed, and the modern successor to the once-popular fuzzywuzzy. A first match takes three lines:

python

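from rapidfuzz import fuzz, utils

fuzz.ratio("John Smith", "Jon Smith")          # ≈ 94.7
# rapidfuzz 3.x scorers do no preprocessing by default, so strip the comma explicitly
fuzz.token_sort_ratio("Smith, John", "John Smith",
                      processor=utils.default_process)  # 100.0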

Scores run 0–100. The pair "Smith, John" / "John Smith" scores poorly on ratio, because the words appear in a different order, but perfectly on token_sort_ratio once the punctuation is stripped. That gap is why scorer choice matters.

Why normalise before you match?

Normalisation is the single highest-leverage step — often more important than the scorer. Standardise everything you can before comparing:

python

import re
import unicodedata

def normalise(name):
    name = name.lower().strip()
    name = unicodedata.normalize("NFKD", name)              # split accented letters into base + mark
    name = name.encode("ascii", "ignore").decode()          # drop the marks (and any other non-ASCII)
    name = re.sub(r"\b(mr|mrs|dr|sir|esq|jnr|snr)\b\.?", "", name)   # remove titles and suffixes
    name = re.sub(r"[^\w\s]", " ", name)                    # punctuation -> space
    return re.sub(r"\s+", " ", name).strip()                # collapse whitespace

normalise("Dr. José Smith-Jones, Esq.")   # 'jose smith jones'

Cleaning titles, punctuation and accents consistently means your scorer compares the names, not the noise around them.
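
Stripping punctuation turns Jno. into jno, which is still not john. A minimal sketch of abbreviation expansion follows, meant to run on names that have already been normalised. The ABBREVIATIONS table and the expand_abbreviations helper are hypothetical names introduced here for illustration; the entries are common historical forename abbreviations, and a real table should be built from the variants found in your own records.

```python
# Hypothetical lookup of historical forename abbreviations; extend it from
# the variants that actually occur in your own data.
ABBREVIATIONS = {
    "jno": "john",
    "wm": "william",
    "thos": "thomas",
    "chas": "charles",
    "geo": "george",
    "jas": "james",
}

def expand_abbreviations(normalised_name):
    """Replace whole-word abbreviations in an already normalised name."""
    tokens = normalised_name.split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

print(expand_abbreviations("jno smith"))      # john smith
print(expand_abbreviations("wm jones"))       # william jones
print(expand_abbreviations("geoffrey hill"))  # geoffrey hill (only whole tokens are replaced)
```

Matching on whole tokens keeps a name like Geoffrey from being mangled by the geo entry.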

Which scorer fits which problem?

Scorer	Best for	Example where it wins
`ratio`	Whole-string typos	`Smyth` vs `Smith`
`partial_ratio`	One name inside another	`John Smith` in `John Smith of Leeds`
`token_sort_ratio`	Reordered words	`Smith, John` vs `John Smith`
`token_set_ratio`	Extra/missing words	`John Smith` vs `John Henry Smith`

For person names with inconsistent ordering, token_sort_ratio is usually the safest default; switch to token_set_ratio when middle names appear and disappear.

How do you match against a reference list?

The common task is linking a messy column to an authority list. Use rapidfuzz.process.extractOne:

python

from rapidfuzz import process, fuzz

authority = ["John Smith", "Jane Smith", "John Smithson"]

best = process.extractOne(
    normalise("Jno Smith"),
    [normalise(a) for a in authority],
    scorer=fuzz.token_sort_ratio,
)
print(best)   # ('john smith', 84.2..., 0)  -> match, score, index
# 'jane smith' scores exactly the same here; extractOne returns the first of equal scores

The returned index lets you map back to the original authority entry, so you keep the canonical spelling while matching the variant.

How do you keep it fast on thousands of names?

Comparing every name against every other is O(n²): 10,000 names give about 50 million unique pairs, or 100 million if each pair is scored in both directions. Blocking slashes this by only comparing names that share a cheap key:

python

from collections import defaultdict

blocks = defaultdict(list)
for n in names:
    key = normalise(n)[:1]          # first letter; or use a Soundex/metaphone code
    blocks[key].append(n)

# now only compare within each block

Better keys are phonetic codes (Soundex, Metaphone via the jellyfish library), which group Smith and Smyth together. For the scoring itself, rapidfuzz.process.cdist computes a full similarity matrix in compiled code (it requires NumPy), far faster than a Python loop.
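
To make "only compare within each block" concrete, here is a small standard-library sketch that generates the candidate pairs and counts how many comparisons blocking saves. The block_key function (first letter of the last word, a rough surname initial) and the sample names are hypothetical; a phonetic code could be swapped in through the key argument.

```python
from collections import defaultdict
from itertools import combinations

def block_key(name):
    """Hypothetical cheap key: first letter of the last word (a rough surname initial)."""
    tokens = name.lower().split()
    return tokens[-1][:1] if tokens else ""

def candidate_pairs(names, key=block_key):
    """Yield unordered pairs of names that share a block key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[key(name)].append(name)
    for members in blocks.values():
        yield from combinations(members, 2)

names = ["John Smith", "Jon Smyth", "Jane Smithson", "Mary Jones", "M. Jones", "Ann Brown"]
pairs = list(candidate_pairs(names))
all_pairs = len(names) * (len(names) - 1) // 2

print(len(pairs), "candidate pairs instead of", all_pairs)   # 4 candidate pairs instead of 15
```

The saving comes at a price: two records whose keys differ are never compared, so a key that is too strict silently loses true matches.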

How do you choose a threshold without corrupting your data?

There is no magic cutoff — calibrate it. Score all candidate pairs, sort by score, and review the band around your threshold by hand:

python

matches = [(a, b, fuzz.token_sort_ratio(normalise(a), normalise(b)))
           for a, b in candidate_pairs]
borderline = [m for m in matches if 80 <= m[2] < 92]   # eyeball these

A starting threshold of 85–90 is reasonable, but the cost of a false match (silently merging two real people) is usually far higher than a missed one, so err toward manual review of the grey zone.
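
Once a sample of pairs has been reviewed by hand, the trade-off can be put into numbers. The sketch below computes precision (how many accepted pairs are real matches) and recall (how many real matches are accepted) at a few thresholds. The precision_recall helper is illustrative and the scores and labels in sample are made up; substitute your own reviewed pairs.

```python
def precision_recall(labelled, threshold):
    """labelled: list of (score, is_true_match) pairs from a hand-reviewed sample."""
    tp = sum(1 for score, truth in labelled if score >= threshold and truth)
    fp = sum(1 for score, truth in labelled if score >= threshold and not truth)
    fn = sum(1 for score, truth in labelled if score < threshold and truth)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Made-up scores and labels, purely for illustration.
sample = [(97, True), (93, True), (91, False), (88, True),
          (86, False), (82, True), (78, False)]

for threshold in (80, 85, 90, 95):
    p, r = precision_recall(sample, threshold)
    print(f"threshold {threshold}: precision {p:.2f}, recall {r:.2f}")
```

Since a false merge is usually the costlier error, one reasonable policy is to auto-accept only above a threshold where precision is very high on your sample and send everything below it to manual review.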

How does this relate to record linkage?

Fuzzy name matching is one signal. Full record linkage weighs several fields — name, place, date of birth, occupation — and decides whether two records are the same entity, often with a probabilistic model (libraries like recordlinkage or splink). Treat the fuzzy score as a feature feeding that decision, not the decision itself.

Key Takeaways

Use rapidfuzz; it is fast, permissively licensed, and replaces fuzzywuzzy.
Normalise names first — accents, titles, punctuation, case — it often beats scorer choice.
Pick the scorer to fit the problem; token_sort_ratio handles reordered names well.
Use process.extractOne to match a variant against an authority list and keep the canonical spelling.
Block on a phonetic or first-letter key to avoid O(n²) blow-up; use cdist for speed.
Calibrate the threshold on your data and manually review the borderline band.
Remember fuzzy matching is one ingredient of record linkage, not the whole task.

Frequently Asked Questions

What library should I use to fuzzy match names in Python?

Use rapidfuzz. It is a fast, MIT-licensed library offering Levenshtein-based ratios and several scorers, and it has effectively replaced the older fuzzywuzzy. Install it with pip install rapidfuzz.

What's the difference between ratio, partial_ratio and token_sort_ratio?

ratio compares whole strings; partial_ratio finds the best matching substring (good when one name is contained in another); token_sort_ratio sorts the words first, so 'Smith, John' and 'John Smith' score as equal once the comma is stripped.

Should I normalise names before matching?

Yes — it is the highest-impact step. Lowercase, strip punctuation and titles, expand or remove abbreviations, and handle accents consistently. Good normalisation often matters more than the choice of scorer.

What similarity threshold should I use?

There is no universal number; calibrate it on your own data. A common starting point is 85-90 on rapidfuzz's 0-100 scale, but always review borderline matches by hand because a wrong merge corrupts your data silently.

How do I match thousands of names without it taking forever?

Use blocking: only compare names that share a cheap key such as the same first letter, Soundex code, or birth decade. Then use rapidfuzz's cdist for vectorised scoring within each block. Blocking can cut the number of comparisons by orders of magnitude, at the cost of never comparing pairs that land in different blocks.

Is fuzzy matching the same as record linkage?

No. Fuzzy matching scores string similarity; record linkage decides whether two records are the same entity using several fields and a model. Fuzzy matching is one ingredient of record linkage, not the whole task.
