Skip to content
Python for Historians

When merging and reconciling CSV sources goes wrong, the cause is almost always one of four things: duplicate keys inflating the result, an inner join silently dropping rows, keys that look identical but differ in whitespace or type, or mismatched encodings. Diagnose by checking duplicates and using an outer join with an indicator, then fix the keys before you merge. Reconciliation rules come last, once the join itself is sound.

This troubleshooting guide takes each failure in turn, with the diagnostic and the fix.

Why does my merge explode to more rows than the inputs?

If a 5,000-row file merged with a 4,000-row file returns 60,000 rows, you have duplicate keys producing a cross-product. Every duplicate on the left multiplies against every match on the right. Diagnose first:

python
import pandas as pd
print(left.duplicated(subset="parish_id").sum())   # how many dup keys?
print(right.duplicated(subset="parish_id").sum())

Fix by deduplicating, or by joining on a composite key that is genuinely unique, such as parish plus year:

python
merged = left.merge(right, on=["parish_id", "year"], how="left")

Why are rows silently vanishing after a merge?

pandas defaults to an inner join, which keeps only keys present in both files. Records unique to one source disappear without warning. Make the loss visible:

python
merged = left.merge(right, on="parish_id", how="outer", indicator=True)
print(merged["_merge"].value_counts())
# left_only / right_only / both — now you can decide what to keep

The indicator=True flag adds a _merge column telling you the origin of every row. Use how="outer" while investigating, then narrow once you understand the overlap.

Why don't my keys match when they look identical?

This is the most maddening one: two cells print the same but will not join. The culprit is something invisible.

Hidden differenceSymptomFix
Trailing/leading spaces"York " vs "York".str.strip()
Case mismatch"york" vs "York".str.lower()
Number vs text key42 vs "42"cast both with .astype(str)
Non-breaking spacelooks like a space.str.replace(" ", " ")

Normalise both keys before merging:

python
for df in (left, right):
    df["parish_id"] = df["parish_id"].astype(str).str.strip().str.lower()

How do I merge sources that spell names differently?

Exact joins cannot bridge "Macdonald" and "McDonald". After normalising, fall back to fuzzy matching for the residue:

python
from rapidfuzz import process, fuzz

def best_match(name, choices, threshold=90):
    match, score, _ = process.extractOne(name, choices, scorer=fuzz.ratio)
    return match if score >= threshold else None

Always keep the score so you can audit weak matches. A blind fuzzy merge introduces silent errors that are worse than an honest non-match.

How do I reconcile conflicting values?

Once the join works, the same record may carry different values in each source. Keep both with clear suffixes, then apply an explicit rule:

python
merged = left.merge(right, on="parish_id", suffixes=("_reg", "_census"))
# prefer the register where it exists, else the census
merged["pop"] = merged["pop_reg"].fillna(merged["pop_census"])

State the reconciliation rule openly — prefer the more authoritative source, the later record, or flag the disagreement for manual review. Never let one source quietly overwrite another without a documented reason.

What encoding problems break a merge?

Garbled accents or a UnicodeDecodeError on load mean mismatched encodings. Open every file with an explicit encoding, and detect rather than guess for stubborn legacy files:

python
import pandas as pd
df = pd.read_csv("old_file.csv", encoding="utf-8")   # try this first
# if it fails, detect the encoding, then re-save as UTF-8 before merging

Convert everything to UTF-8 up front so the merge operates on consistent text.

Key Takeaways

  • Row explosions mean duplicate keys; check with duplicated() and use a composite or deduplicated key.
  • An inner join drops unmatched rows silently — use how="outer" with indicator=True to see the loss.
  • Keys that look identical often differ by whitespace, case, or type; normalise both before merging.
  • Bridge spelling variants with rapidfuzz, and always keep the match score for auditing.
  • Reconcile conflicts with explicit, documented rules — never let one source silently overwrite another.
  • Force a single encoding (UTF-8) across all files before merging to avoid garbled text.

Frequently Asked Questions

Why does my merge produce far more rows than either input?

You have duplicate keys on one or both sides, so pandas creates the cross-product of matches. Check for duplicates with df.duplicated(subset=key) before merging, and deduplicate or use a composite key.

Why are rows silently disappearing after a merge?

A default inner join drops any key that is not in both files. Use how="outer" with indicator=True to see which rows came from the left, right, or both, then decide what to keep.

How do I merge when the same name is spelled differently across sources?

Exact joins fail on spelling variants, so normalise first — trim whitespace, fix case, standardise punctuation — and fall back to fuzzy matching with rapidfuzz for names that still do not align exactly.

Why do my join keys not match even though they look identical?

Invisible differences are the usual cause: trailing spaces, mixed case, non-breaking spaces, or one side stored as a number and the other as text. Strip, lowercase, and cast both keys to the same type before merging.

How do I handle conflicting values for the same record?

Keep both columns with clear suffixes using the suffixes argument, then apply an explicit reconciliation rule — prefer the more authoritative source, the later record, or flag the conflict for manual review.

What encoding problems break CSV merges?

Mixed encodings cause garbled characters or load errors. Open each file with an explicit encoding="utf-8", and if a legacy file fails, detect its encoding rather than guessing, then convert it to UTF-8 before merging.