Troubleshooting: Analyse historical data with pandas

When pandas misbehaves on historical data, the cause is almost always the data, not the code: silent type coercion that eats leading zeros, the wrong text encoding, non-numeric values poisoning a column, or mismatched keys on a merge. The fastest diagnostic is to load everything as strings first (dtype=str), look at df.dtypes, df.head() and df.isna().sum(), and only then convert columns deliberately. This article walks through the handful of errors that account for most pandas pain in archival work and the fix for each.

Why did my catalogue references lose their leading zeros?

This is the classic. A reference like 00123 or 0042/B gets read as an integer, dropping the zeros that make it a valid identifier. pandas guesses column types on load, and "looks numeric" wins.

python

import pandas as pd

# fix: force the column to stay text
df = pd.read_csv("registers.csv", dtype={"reference": str})

# or, when in doubt, read everything as text and convert on purpose
df = pd.read_csv("registers.csv", dtype=str)

For messy archival tables, reading everything as text first and converting individual columns yourself is the safest default — you stay in control of every type.

Why do I get a UnicodeDecodeError?

Historical CSVs are frequently not UTF-8. Exports from old databases are often Windows-1252 or Latin-1, and an accented name like Lefèvre triggers a decode error.

python

# try in this order
df = pd.read_csv("baptisms.csv", encoding="utf-8")
df = pd.read_csv("baptisms.csv", encoding="cp1252")   # Windows export
df = pd.read_csv("baptisms.csv", encoding="latin-1")  # last resort, never errors

latin-1 never raises an error because every byte is valid in it — which is exactly why you should confirm the text looks right afterwards, not just that it loaded.

Why is my numeric column suddenly all NaN?

If a column you expect to be numbers comes back as text or full of NaN, a single non-numeric value is to blame: unknown, c.1850, [illegible], or a footnote marker. Coerce and inspect:

python

df["amount_num"] = pd.to_numeric(df["amount"], errors="coerce")

# which original values failed to convert?
bad = df.loc[df["amount_num"].isna() & df["amount"].notna(), "amount"]
print(bad.value_counts())

This surfaces the exact troublesome entries so you can decide whether to clean, flag, or exclude them rather than discarding the column.

What does SettingWithCopyWarning actually mean?

It warns that you may be modifying a temporary slice instead of the real DataFrame, so your change might silently vanish. The reliable fix is .loc:

python

# warns and may not stick
df[df["county"] == "Kent"]["region"] = "South East"

# correct
df.loc[df["county"] == "Kent", "region"] = "South East"

If you genuinely want a separate working copy, take one explicitly with df.copy().

How do you handle uncertain historical dates?

Never overwrite the source. Keep the original text and parse into a new column, coercing failures and flagging uncertainty:

python

df["date_parsed"] = pd.to_datetime(df["date_text"], errors="coerce")
df["date_uncertain"] = df["date_text"].str.contains(r"c\.|\?|circa", case=False, na=False)

This preserves circa and partial dates as historical evidence rather than throwing them away to satisfy a date parser.

Why does my merge produce duplicates or drop rows?

Merges go wrong on the join key. Three culprits dominate:

Symptom	Likely cause	Fix
Rows vanish	Trailing spaces / case mismatch	`.str.strip().str.lower()` both keys
Row count explodes	One-to-many relationship	Check `value_counts()` on each key
Unexpected NaN columns	Key present in one table only	Use `indicator=True` to inspect

python

for col in ["surname"]:
    df[col] = df[col].str.strip().str.lower()

merged = df.merge(parishes, on="surname", how="left", indicator=True)
print(merged["_merge"].value_counts())

The indicator=True column tells you exactly how many rows matched on both sides, left only, or right only — the single most useful merge debugging tool.

Key Takeaways

Read messy archival data with dtype=str first, then convert columns deliberately.
Force identifier columns to text to keep leading zeros and slashes intact.
Fix encoding errors with cp1252 or latin-1, then verify the text looks correct.
Use pd.to_numeric(..., errors='coerce') and inspect the resulting NaNs to find bad values.
Assign with .loc to silence and avoid SettingWithCopyWarning.
Preserve original date text and flag uncertainty instead of discarding circa dates.
Debug merges with normalised keys and indicator=True.

Frequently Asked Questions

Why does pandas turn my catalogue references into numbers and lose the leading zeros?

pandas infers column types and reads something like 00123 as the integer 123. Force the column to text by passing dtype={'ref': str} to read_csv, or read everything as strings with dtype=str and convert deliberately.

Why do I get a UnicodeDecodeError reading a historical CSV?

The file is not UTF-8 — older exports are often Windows-1252 or Latin-1. Pass encoding='latin-1' or encoding='cp1252' to read_csv, and ideally re-save the source as UTF-8 once you know it reads correctly.

What does SettingWithCopyWarning mean and how do I fix it?

It warns that you may be assigning to a temporary slice rather than the original DataFrame. Fix it by using .loc for assignment, e.g. df.loc[mask, 'col'] = value, or by taking an explicit .copy() when you mean to.

Why are my numeric columns full of NaN after loading?

A non-numeric value such as 'unknown', 'c.1850' or a stray footnote marker forced the column to text. Use pd.to_numeric(col, errors='coerce') and inspect which rows became NaN to find the offending entries.

How do I deal with missing or uncertain dates in pandas?

Keep an original text column untouched and parse into a separate datetime column with errors='coerce'. Add a boolean 'date_uncertain' flag rather than silently discarding circa or partial dates.

Why does merging two historical tables produce duplicate or missing rows?

Usually a key mismatch: trailing spaces, inconsistent casing, or a one-to-many relationship. Strip and normalise keys first, check value_counts on the key, and inspect the merge indicator to see which rows matched.

Why did my catalogue references lose their leading zeros? ​

Why do I get a UnicodeDecodeError? ​

Why is my numeric column suddenly all NaN? ​

What does SettingWithCopyWarning actually mean? ​

How do you handle uncertain historical dates? ​

Why does my merge produce duplicates or drop rows? ​

Key Takeaways ​

Frequently Asked Questions ​

Why does pandas turn my catalogue references into numbers and lose the leading zeros? ​

Why do I get a UnicodeDecodeError reading a historical CSV? ​

What does SettingWithCopyWarning mean and how do I fix it? ​

Why are my numeric columns full of NaN after loading? ​

How do I deal with missing or uncertain dates in pandas? ​

Why does merging two historical tables produce duplicate or missing rows? ​

Related reading ​

Why did my catalogue references lose their leading zeros?

Why do I get a UnicodeDecodeError?

Why is my numeric column suddenly all NaN?

What does SettingWithCopyWarning actually mean?

How do you handle uncertain historical dates?

Why does my merge produce duplicates or drop rows?

Key Takeaways

Frequently Asked Questions

Why does pandas turn my catalogue references into numbers and lose the leading zeros?

Why do I get a UnicodeDecodeError reading a historical CSV?

What does SettingWithCopyWarning mean and how do I fix it?

Why are my numeric columns full of NaN after loading?

How do I deal with missing or uncertain dates in pandas?

Why does merging two historical tables produce duplicate or missing rows?

Related reading