Troubleshooting: Handle missing historical data

When you hit missing historical data, the first move is diagnosis, not repair: work out why the value is absent before deciding how to fill it. Classify the gap as MCAR, MAR, or MNAR, distinguish true missingness from structural zeros and "not applicable", then choose listwise deletion, multiple imputation, or an explicit missing indicator accordingly. Reaching for mean imputation by reflex is a very common error, and it quietly biases everything downstream.

Why is the value missing in the first place?

Historical gaps are rarely random. A wage left blank because the labourer was unemployed (MNAR — the gap depends on the unobserved value) is a completely different problem from a folio lost to fire (closer to MCAR). Before any fix, ask: did the recorder omit this deliberately, was it lost, or was it never applicable? The answer dictates the method.

MCAR : gap unrelated to anything         -> deletion is unbiased (rare)
MAR  : gap depends on observed fields    -> imputation works
MNAR : gap depends on the missing value  -> needs modelling assumptions
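
To make the three mechanisms concrete, here is a minimal simulation sketch in base R. The data, the `skill` covariate, and the gap probabilities are all invented for illustration. It compares the mean of the surviving wages under each mechanism, then shows that under MAR a model using the observed covariate can recover the full-sample mean.

```r
# Simulated wages under three missingness mechanisms (all values are made up).
set.seed(1)
n     <- 10000
skill <- rnorm(n)                          # fully observed covariate
wage  <- 10 + 2 * skill + rnorm(n)         # true mean is close to 10

# Probability that a wage is missing under each mechanism
p_mcar <- rep(0.3, n)                      # unrelated to anything
p_mar  <- plogis(-1 - 1.5 * skill)         # depends on observed skill only
p_mnar <- plogis(-1 - 1.5 * (wage - 10))   # depends on the wage itself

# Mean of the wages that survive a given missingness pattern
observed_mean <- function(p) {
  missing <- runif(n) < p
  mean(wage[!missing])
}

results <- c(
  full = mean(wage),
  mcar = observed_mean(p_mcar),
  mar  = observed_mean(p_mar),
  mnar = observed_mean(p_mnar)
)
print(round(results, 2))

# Under MAR, conditioning on the observed covariate repairs the estimate:
# fit wage ~ skill on the observed rows, then predict for everyone.
obs <- runif(n) >= p_mar
fit <- lm(wage ~ skill, subset = obs)
mar_adjusted <- mean(predict(fit, newdata = data.frame(skill = skill)))
print(round(mar_adjusted, 2))
```

In this setup the MCAR mean should sit close to the full-sample mean, while the raw MAR and MNAR means drift upward because low earners are the ones who go missing. Only the MAR case can be corrected from observed fields alone.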

How do I diagnose the mechanism quickly?

Build a missingness map and test whether missingness on one variable correlates with observed values on others.

library(naniar)
vis_miss(census)                 # heatmap of gaps per column
mcar_test(census)                # Little's MCAR test
gg_miss_upset(census)            # which gaps co-occur

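If missingness in wage correlates with occupation, that is evidence against MCAR and is consistent with MAR, the friendliest case to fix. Observed data alone cannot rule out MNAR, so treat MAR as an assumption to defend rather than a test result.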
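
The correlation check can also be done in base R by regressing a missingness indicator on the observed fields. The sketch below uses a hypothetical `census_toy` data frame with invented columns, where wage gaps are made more likely for labourers by construction.

```r
# Hypothetical toy data standing in for `census`; column names are assumptions.
set.seed(2)
n <- 2000
census_toy <- data.frame(
  occupation = factor(sample(c("labourer", "clerk", "farmer"), n, replace = TRUE)),
  age        = round(runif(n, 15, 70))
)
census_toy$wage <- 20 + 0.1 * census_toy$age + rnorm(n)

# Make wage gaps more likely for labourers (depends on an observed field)
p_gap <- ifelse(census_toy$occupation == "labourer", 0.5, 0.1)
census_toy$wage[runif(n) < p_gap] <- NA

# Regress the missingness indicator on the observed fields
census_toy$wage_missing <- as.integer(is.na(census_toy$wage))
fit <- glm(wage_missing ~ occupation + age, data = census_toy, family = binomial)
print(round(summary(fit)$coefficients, 3))

# Share of missing wages within each occupation
gap_rate <- tapply(census_toy$wage_missing, census_toy$occupation, mean)
print(round(gap_rate, 2))
```

A clearly non-zero coefficient on an observed field argues against MCAR. It does not show that the gaps are unrelated to the missing wages themselves.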

What is the wrong fix that everyone reaches for?

Mean (or mode) imputation. It plugs every gap with the same number, collapsing variance and inventing certainty that is not there. After mean-imputing 30 percent of a wage column, your standard errors are too small and the column's correlations with other variables are pulled toward zero, which distorts regression coefficients. Never do it for analysis variables.

Method	Safe when	Cost	Verdict
Listwise deletion	MCAR only	Lost sample	Use sparingly
Mean / mode fill	Almost never	Biased variance	Avoid
Missing indicator	Categorical, MNAR	Extra category	Good for analysis
Multiple imputation	MAR	Compute, complexity	Best general fix
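
The damage done by mean imputation is easy to reproduce. The following base R sketch uses simulated data (the `age` and `wage` values are invented), removes 30 percent of wages completely at random, fills them with the observed mean, and compares spread, correlation, and slope before and after. Here the filled column is the regression outcome, which is the case where the slope shrinks.

```r
set.seed(3)
n    <- 5000
age  <- rnorm(n, 40, 10)
wage <- 5 + 0.5 * age + rnorm(n, sd = 5)

# Remove 30 percent of wages completely at random
wage_obs <- wage
wage_obs[sample(n, size = round(0.3 * n))] <- NA

# Mean imputation: every gap receives the same value
wage_filled <- ifelse(is.na(wage_obs), mean(wage_obs, na.rm = TRUE), wage_obs)

comparison <- c(
  sd_observed    = sd(wage_obs, na.rm = TRUE),
  sd_filled      = sd(wage_filled),
  cor_observed   = cor(age, wage_obs, use = "complete.obs"),
  cor_filled     = cor(age, wage_filled),
  slope_observed = coef(lm(wage_obs ~ age))[["age"]],
  slope_filled   = coef(lm(wage_filled ~ age))[["age"]]
)
print(round(comparison, 3))
```

Every "filled" figure should come out smaller than its "observed" counterpart, even though the gaps here are MCAR and the observed rows were unbiased to begin with.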

When should I use multiple imputation?

Use it when data are plausibly MAR and the loss is substantial. Multiple imputation fills the gaps several times from a model, runs your analysis on each completed dataset, and pools the results so the extra uncertainty shows up in wider, honest confidence intervals.

library(mice)
imp <- mice(census, m = 20, method = "pmm", seed = 42)
fit <- with(imp, lm(literate ~ wage + age + sex))
pool(fit)                        # Rubin's rules combine the 20 fits

Predictive mean matching (pmm) draws fills from real observed values, which keeps imputed wages plausible rather than inventing impossible figures.
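
For intuition about what pooling does, here is a minimal sketch of the core of Rubin's rules in base R. The helper `rubin_pool` and the five coefficient estimates are hypothetical, and the sketch omits the degrees-of-freedom adjustment that `pool()` also reports.

```r
# Pool m estimates and their squared standard errors with Rubin's rules.
rubin_pool <- function(estimates, variances) {
  m     <- length(estimates)
  q_bar <- mean(estimates)            # pooled point estimate
  w     <- mean(variances)            # within-imputation variance
  b     <- var(estimates)             # between-imputation variance
  total <- w + (1 + 1 / m) * b        # total variance
  c(estimate = q_bar, se = sqrt(total), within = w, between = b)
}

# Hypothetical coefficient estimates from m = 5 completed datasets
est <- c(0.42, 0.47, 0.39, 0.45, 0.44)
se  <- c(0.10, 0.11, 0.10, 0.12, 0.10)

pooled <- rubin_pool(est, se^2)
print(round(pooled, 4))
```

The pooled standard error is always at least as large as the average within-imputation standard error. The between-imputation term is where the uncertainty about the missing values enters.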

How do I separate "missing" from "zero" and "not applicable"?

This trips up household data constantly. "Children in household: 0" is a structural zero; a blank is unknown; "—" might mean not applicable. If you treat all three as one, your averages are wrong. Use distinct codes (NA, 0, not_applicable) and document them in a data dictionary so nothing gets silently averaged together.
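
One way to keep the three cases apart, sketched in base R with invented records, is to store the reason in a separate status column beside the numeric value. The column names and the dash convention are assumptions for this example.

```r
# Hypothetical transcription: "" is a blank, "\u2014" (a dash) is not applicable
raw_children <- c("2", "0", "", "\u2014", "3", "0", "")

# Record why each value is or is not present
status <- ifelse(raw_children == "", "missing",
          ifelse(raw_children == "\u2014", "not_applicable", "recorded"))

# Only recorded entries become numbers; structural zeros stay as 0
children <- rep(NA_real_, length(raw_children))
children[status == "recorded"] <- as.numeric(raw_children[status == "recorded"])

households <- data.frame(children, status)
print(table(households$status))

# Average over recorded values only, versus wrongly treating every gap as zero
mean_recorded <- mean(households$children, na.rm = TRUE)
mean_naive    <- mean(ifelse(is.na(households$children), 0, households$children))
print(c(recorded = mean_recorded, naive = mean_naive))
```

The recorded-only mean is 1.25 children, while collapsing blanks and dashes into zero drags it down to about 0.71.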

What about a missing-indicator approach for categoricals?

For categorical predictors where the absence itself is informative — say, occupation unrecorded for people on poor relief — add an explicit "Unknown" level rather than imputing. The model then estimates an effect for being unrecorded, which is often a finding in its own right.
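
A minimal base R sketch of the explicit level follows. The data are simulated and the variable names are assumptions. Literacy is made lower for people with unrecorded occupations by construction, so the model has something to find.

```r
set.seed(6)
n <- 600
occupation <- sample(c("labourer", "weaver", NA), n, replace = TRUE,
                     prob = c(0.4, 0.4, 0.2))

# By construction, people with no recorded occupation are literate less often
literate <- rbinom(n, 1, ifelse(is.na(occupation), 0.2, 0.6))

# Turn NA into an explicit factor level instead of imputing it
occ <- ifelse(is.na(occupation), "Unknown", occupation)
occ <- factor(occ, levels = c("labourer", "weaver", "Unknown"))

fit <- glm(literate ~ occ, family = binomial)
print(round(coef(fit), 2))
```

The `occUnknown` coefficient is the estimated effect of being unrecorded relative to the reference level, here `labourer`.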

How do I prove my handling didn't drive the result?

Run a sensitivity analysis: report the estimate under complete-case analysis and under imputation. If the substantive conclusion survives both, it is robust. If it flips, the missing data are doing the talking and you must say so plainly.
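
A side-by-side report might be sketched as follows. To stay self-contained in base R, this uses simulated data and a crude bootstrap-based multiple imputation as a stand-in for a dedicated package, so treat it as an illustration of the comparison rather than a recommended imputation routine. The `score` outcome and all parameter values are invented.

```r
set.seed(7)
n     <- 1000
age   <- rnorm(n, 40, 10)
wage  <- 5 + 0.5 * age + rnorm(n, sd = 5)
score <- 1 + 0.3 * wage + rnorm(n, sd = 3)        # hypothetical outcome
wage[runif(n) < plogis((age - 40) / 10)] <- NA    # gaps depend on observed age
d <- data.frame(score, age, wage)

# 1. Complete-case analysis: lm() drops incomplete rows by default
cc <- coef(summary(lm(score ~ wage + age, data = d)))["wage", 1:2]

# 2. Crude multiple imputation: bootstrap the observed rows, fit an
#    imputation model, and add residual noise to the predictions
m    <- 20
obs  <- d[!is.na(d$wage), ]
gaps <- is.na(d$wage)
fits <- t(sapply(seq_len(m), function(i) {
  boot      <- obs[sample(nrow(obs), replace = TRUE), ]
  imp_model <- lm(wage ~ age + score, data = boot)
  filled    <- d
  filled$wage[gaps] <- predict(imp_model, newdata = d[gaps, ]) +
    rnorm(sum(gaps), sd = summary(imp_model)$sigma)
  coef(summary(lm(score ~ wage + age, data = filled)))["wage", 1:2]
}))

# Pool the m fits with Rubin's rules
pooled_est <- mean(fits[, 1])
pooled_se  <- sqrt(mean(fits[, 2]^2) + (1 + 1 / m) * var(fits[, 1]))

report <- rbind(complete_case = cc, imputed = c(pooled_est, pooled_se))
colnames(report) <- c("estimate", "se")
print(round(report, 3))
```

If the two rows agree in sign and rough magnitude, the handling of gaps is probably not driving the conclusion. If they disagree, report both.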

Key Takeaways

Diagnose the mechanism (MCAR, MAR, MNAR) before choosing any fix.
Listwise deletion is only unbiased under MCAR, which is rare in historical sources.
Avoid mean imputation; it shrinks variance and fabricates precision.
Prefer multiple imputation under MAR and pool results with Rubin's rules.
Code structural zeros, missing, and not-applicable with distinct sentinels.
Use an explicit "Unknown" category when absence is itself informative.
Always run and report a sensitivity check across methods.

Frequently Asked Questions

What are the three types of missing data?

Missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The type determines which fixes are safe, and historical gaps are usually MAR or MNAR rather than MCAR.

Is it ever safe to just delete rows with gaps?

Listwise deletion is safe in general only under MCAR, which is rare in historical sources. If the gaps relate to who or what is missing, deletion can bias your estimates, and it always shrinks your sample.

When should I use multiple imputation?

Use multiple imputation when data are plausibly MAR and the missingness is substantial. It fills gaps several times to propagate uncertainty, unlike single-value fills that pretend the guess is certain.

How do I tell a structural zero from a missing value?

A structural zero means the quantity genuinely does not exist (no children in a childless household), while missing means it was not recorded. Code them with different sentinels so you never average them together.

Why is mean imputation a trap?

Replacing gaps with the column mean shrinks variance, distorts correlations, and fabricates false precision. It almost always biases later regression and should be avoided in favour of model-based methods.

How should I report missing data in a publication?

State the percentage missing per variable, the assumed mechanism, the method used, and a sensitivity check. Reviewers trust transparent handling far more than a silent, suspiciously complete table.
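
The per-variable percentages are a one-liner in base R. The small `census_toy` data frame below is invented for illustration.

```r
# Hypothetical extract; column names and values are assumptions
census_toy <- data.frame(
  wage       = c(12, NA, 9, NA, 15, 11, NA, 10),
  age        = c(34, 51, NA, 28, 45, 39, 60, 22),
  occupation = c("clerk", NA, "farmer", "labourer", "clerk", "farmer", NA, "clerk")
)

# Percentage missing per variable, and the number of fully observed rows
pct_missing <- round(100 * colMeans(is.na(census_toy)), 1)
n_complete  <- sum(complete.cases(census_toy))

print(pct_missing)
print(n_complete)
```

Here wage is 37.5 percent missing, age 12.5 percent, occupation 25 percent, and 4 of the 8 rows are complete.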
