When to Weight Historical Samples

Weight a historical sample only when two conditions both hold: its composition differs from the target population in ways tied to your outcome, and you have a credible external benchmark for the true population shares. If your sample already mirrors the population, or you have no trustworthy benchmark, weighting adds variance without removing bias and you should leave the data alone. The decision is a bias-variance trade, not a default step, and the rest of this guide gives you the signals to call it correctly.

What problem is weighting actually solving?

Weighting corrects compositional bias: your surviving records over- or under-represent some group relative to the historical population you want to describe. If urban parishes survive better than rural ones, and your outcome (literacy, wages, mortality) differs by setting, an unweighted average is skewed. Weighting rescales each record so the sample composition matches a known population, letting the under-represented groups count for more.
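A minimal sketch of that rescaling, using an invented ten-record archive and illustrative benchmark shares (the `toy` frame and the 0.35/0.65 split are assumptions for demonstration, not data from any real source):

```python
import pandas as pd

# Hypothetical toy archive: 8 urban and 2 rural records, literacy coded 0/1.
toy = pd.DataFrame({
    "region":   ["urban"] * 8 + ["rural"] * 2,
    "literate": [1, 1, 1, 1, 1, 1, 0, 0, 1, 0],
})

# Assumed benchmark shares for the historical population (illustrative only).
pop_share = pd.Series({"urban": 0.35, "rural": 0.65})

# The unweighted mean reflects the archive's composition (80% urban).
unweighted = toy["literate"].mean()

# Rescaled: each stratum mean counts in proportion to its population share.
stratum_means = toy.groupby("region")["literate"].mean()
reweighted = (stratum_means * pop_share).sum()   # aligned on region label

print(f"unweighted: {unweighted:.2f}, reweighted: {reweighted:.2f}")
```

Because urban records are both over-represented and more literate in this toy data, the unweighted figure overstates literacy; the reweighted figure pulls it back toward the rural majority.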

Crucially, weighting only touches bias you can measure. It cannot repair selection on variables you do not observe, which is the most dangerous kind in archival work.
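To see why, here is a sketch under invented numbers in which survival depends on literacy itself as well as on region. Every count and survival rate below is an assumption chosen for illustration. Post-stratifying on region moves the estimate in the right direction but leaves most of the bias in place, because the literacy-driven selection is invisible to the weights.

```python
import pandas as pd

# Hypothetical population of 1,000 people by region and literacy.
cells = pd.DataFrame({
    "region":    ["urban", "urban", "rural", "rural"],
    "literate":  [1, 0, 1, 0],
    "pop_n":     [200, 150, 200, 450],
    # Assumed survival rates: literate people leave more records everywhere.
    "p_survive": [0.40, 0.20, 0.20, 0.10],
})
cells["sample_n"] = cells["pop_n"] * cells["p_survive"]
cells["lit_sample_n"] = cells["sample_n"] * cells["literate"]

true_rate = (cells["pop_n"] * cells["literate"]).sum() / cells["pop_n"].sum()
unweighted = cells["lit_sample_n"].sum() / cells["sample_n"].sum()

# Post-stratify on region only, the one margin we can observe.
by_region = cells.groupby("region")[["pop_n", "sample_n", "lit_sample_n"]].sum()
pop_share = by_region["pop_n"] / by_region["pop_n"].sum()
stratum_rate = by_region["lit_sample_n"] / by_region["sample_n"]
poststratified = (stratum_rate * pop_share).sum()

print(f"true {true_rate:.2f}, unweighted {unweighted:.2f}, "
      f"post-stratified {poststratified:.2f}")
```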

When should I weight, and when should I not?

Situation	Weight?	Why
Sample matches population on key variables	No	nothing to correct, only adds variance
Known oversampling of a rare group	Yes	design weight = inverse selection probability
Survivorship skews a measurable margin (region, age)	Yes	post-stratify to a census benchmark
Bias is on an unobserved variable	No	weighting cannot reach it; document the limit instead
Pure descriptive count of the archive itself	No	report what survives, as-is
Generalising to a wider population	Maybe	only with a trustworthy benchmark

The recurring theme: weighting requires a benchmark you trust more than your sample. Without one, you are guessing at the very numbers that drive the correction.
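The design-weight row of the table can be sketched as follows. The records, the `selection_prob` values and the column names are hypothetical; the only thing taken from the text is that the weight is the inverse of the selection probability.

```python
import pandas as pd

# Hypothetical transcription plan: every record from a rare group was kept,
# but only one in ten records from the common group.
records = pd.DataFrame({
    "group": ["rare"] * 4 + ["common"] * 6,
    "wage":  [12.0, 14.0, 11.0, 13.0, 20.0, 22.0, 19.0, 21.0, 18.0, 20.0],
})
selection_prob = {"rare": 1.0, "common": 0.1}   # assumed known from the plan

# Design weight = 1 / probability of selection.
records["design_w"] = 1.0 / records["group"].map(selection_prob)

unweighted = records["wage"].mean()
weighted = (records["wage"] * records["design_w"]).sum() / records["design_w"].sum()

print(f"unweighted: {unweighted:.2f}, design-weighted: {weighted:.2f}")
```

Each common-group record stands in for ten people, so the weighted mean sits close to the common group's wage rather than halfway between the two groups.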

How do I compute and apply weights?

Post-stratification is the workhorse for historical data. For each stratum, the weight is the population share divided by the sample share:

python

import pandas as pd

# df is assumed to hold one row per surviving record, with a "region"
# stratum column and a 0/1 "literate" outcome.

# sample shares vs known population shares (e.g. from a census)
sample = df["region"].value_counts(normalize=True)
pop    = pd.Series({"urban": 0.35, "rural": 0.65})

weights = pop / sample            # per-stratum weight, aligned on region label
df["w"] = df["region"].map(weights)

# weighted mean of an outcome
wmean = (df["literate"] * df["w"]).sum() / df["w"].sum()

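Always normalise so weights average to 1. That keeps each weight interpretable (a weight of 2 means the record counts as two typical records) and keeps the sum of weights equal to your sample size. Weights built as population share over sample share already average to 1 when the benchmark shares sum to 1 and every stratum appears in the sample; if either fails, divide by the mean weight.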
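One way to package those steps is a small helper that also guards against the two easy mistakes: benchmark shares that do not sum to 1, and strata that appear on one side only. The function name `poststratify` and its arguments are hypothetical, a sketch rather than a library API.

```python
import pandas as pd

def poststratify(df, stratum_col, pop_share):
    """Per-record weights = population share / sample share, rescaled to mean 1."""
    pop_share = pd.Series(pop_share, dtype=float)
    if abs(pop_share.sum() - 1.0) > 1e-8:
        raise ValueError("population shares must sum to 1")
    sample_share = df[stratum_col].value_counts(normalize=True)
    mismatched = set(pop_share.index) ^ set(sample_share.index)
    if mismatched:
        raise ValueError(f"strata do not match benchmark: {sorted(mismatched)}")
    weights = df[stratum_col].map(pop_share / sample_share)
    return weights / weights.mean()   # no-op when the checks pass, kept as a guard

# Usage on an invented ten-record archive.
toy = pd.DataFrame({"region": ["urban"] * 8 + ["rural"] * 2})
toy["w"] = poststratify(toy, "region", {"urban": 0.35, "rural": 0.65})
print(toy.groupby("region")["w"].first())
```

Raising on a mismatch is a design choice: a stratum with no surviving records cannot be weighted up, and silently dropping it would hide exactly the limitation you need to document.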

How do I know if my weights are too extreme?

Extreme weights destabilise estimates: a few records dominate and your standard errors balloon. Two diagnostics:

Max/min ratio. If the largest weight is more than roughly 10 times the smallest, a tiny group is being stretched to carry the population.
Design effect. Kish's approximation is Deff = 1 + CV(w)^2, where CV is the coefficient of variation of the weights (standard deviation divided by mean). Your effective sample size is n / Deff. A Deff of 2 means you have halved your usable sample.

python

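# population (ddof=0) standard deviation, so n_eff equals (sum w)^2 / sum(w^2)
cv = df["w"].std(ddof=0) / df["w"].mean()
deff = 1 + cv**2          # Kish's approximate design effect
n_eff = len(df) / deff    # effective sample size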

If n_eff collapses, trim or cap the weights (Winsorise at a percentile) and report it.
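A sketch of capping, with invented weights. The helper names `effective_n` and `cap_weights` are hypothetical, and capping only the upper tail at a percentile is a simplification of full Winsorising; the percentile itself is a judgment call you should report.

```python
import numpy as np
import pandas as pd

def effective_n(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def cap_weights(w, upper_pct=95):
    """Cap weights at the given percentile, then rescale to mean 1."""
    w = pd.Series(w, dtype=float)
    capped = w.clip(upper=np.percentile(w, upper_pct))
    return capped / capped.mean()

# Hypothetical weights: eighteen modest records and two stretched heavily.
w = pd.Series([0.6] * 18 + [8.0, 12.0])
w_capped = cap_weights(w, upper_pct=90)

print(f"n_eff before: {effective_n(w):.1f}, after: {effective_n(w_capped):.1f}")
print(f"max/min before: {w.max() / w.min():.1f}, "
      f"after: {w_capped.max() / w_capped.min():.1f}")
```

Capping buys back effective sample size at the price of reintroducing some of the bias the weights were removing, which is why the cap belongs in your write-up.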

What does weighting cost you?

Variance. A weighted estimate is almost always less precise than an unweighted one of the same size, because unequal weights inflate the standard error by roughly the square root of the design effect. You are buying lower bias with higher variance. The trade is worthwhile only when the bias you remove is both real and larger than the variance you add, which is exactly why you need a credible benchmark before committing.
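The trade can be made concrete with mean squared error (bias squared plus variance). The sketch below is a textbook calculation under strong assumptions that are not from this guide: a 0/1 outcome, simple random sampling within each stratum, fixed stratum sample sizes, and benchmark shares that are exactly right. The function `mse_comparison` and all inputs are hypothetical.

```python
import numpy as np

def mse_comparison(n_s, p_s, pop_share):
    """Analytic MSE of the unweighted and post-stratified mean of a 0/1 outcome."""
    n_s, p_s, pop_share = (np.asarray(a, dtype=float) for a in (n_s, p_s, pop_share))
    n = n_s.sum()
    truth = (pop_share * p_s).sum()
    var_within = p_s * (1 - p_s)

    # Unweighted mean: biased toward over-represented strata, but low variance.
    bias_unw = ((n_s / n) * p_s).sum() - truth
    var_unw = (n_s * var_within).sum() / n ** 2

    # Post-stratified mean: unbiased under the stated assumptions, higher variance.
    var_w = (pop_share ** 2 * var_within / n_s).sum()

    return {"mse_unweighted": bias_unw ** 2 + var_unw, "mse_weighted": var_w}

# 80 urban and 20 rural records; population is 35% urban, 65% rural.
big_gap = mse_comparison([80, 20], [0.75, 0.50], [0.35, 0.65])
small_gap = mse_comparison([80, 20], [0.52, 0.50], [0.35, 0.65])
print("outcome differs by stratum:", big_gap)
print("outcome barely differs:    ", small_gap)
```

With a large literacy gap between strata, weighting wins on MSE; when the strata barely differ, the same weights only add variance and the unweighted estimate comes out ahead.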

Should I weight a descriptive count?

No. If you are simply reporting what survives in the archive, present it unweighted and label it as a description of the surviving records, not the historical population. Reserve weighting for the moment you explicitly generalise outward, and make that shift visible to the reader so they know which claim you are making.

A decision checklist

1. Are you describing the archive or generalising to a population? Only the latter may need weights.
2. Does sample composition differ from the population on outcome-related variables?
3. Do you have a benchmark you trust more than your sample?
4. After weighting, is n_eff still adequate and the max/min ratio reasonable?
5. Have you reported the weighting variables, source benchmark and design effect?

If any of 2-4 fails, do not weight; document the limitation instead.
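The first four checklist questions can be written down as a small function, which is sometimes useful for recording the decision alongside an analysis. This is a hypothetical sketch: `should_weight` is not a library function, `min_n_eff=100` is an arbitrary placeholder for "adequate", and `max_ratio=10` follows the rough rule of thumb given earlier.

```python
def should_weight(generalising, composition_differs, trusted_benchmark,
                  n_eff, max_min_ratio, min_n_eff=100, max_ratio=10.0):
    """Walk checklist questions 1 to 4 and return (decision, reason).

    min_n_eff is a hypothetical adequacy threshold; set it for your own study.
    """
    if not generalising:
        return False, "describing the archive: report unweighted"
    if not composition_differs:
        return False, "composition matches on outcome-related variables"
    if not trusted_benchmark:
        return False, "no benchmark trusted more than the sample"
    if n_eff < min_n_eff or max_min_ratio > max_ratio:
        return False, "weights too extreme: document the limitation"
    return True, "weight, and report variables, benchmark and design effect"

decision, reason = should_weight(
    generalising=True, composition_differs=True, trusted_benchmark=True,
    n_eff=250, max_min_ratio=4.0,
)
print(decision, "-", reason)
```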

Key Takeaways

Weight only when composition differs from the population in outcome-related ways and you have a trusted benchmark.
Weighting removes measurable bias only; it cannot fix selection on unobserved variables, such as whatever decided which records survived.
Post-stratification weight = population share / sample share, normalised to mean 1.
Watch the design effect and max/min ratio; extreme weights shred your effective sample size.
Weighting trades lower bias for higher variance, so it must clear that bar to be worthwhile.
Report descriptive counts unweighted; weight only when generalising, and say which you are doing.

Frequently Asked Questions

When should I weight a historical sample at all?

Weight when your sample's composition differs from the population you want to describe in ways correlated with your outcome, and you have a trustworthy external benchmark for the true population shares. If either condition fails, weighting adds noise without removing bias.

What is the difference between design weights and post-stratification weights?

Design weights correct for known, deliberate sampling decisions such as oversampling a rare group, and are simply the inverse of selection probability. Post-stratification weights adjust an already-collected sample to match external population margins like age or region; historical work relies far more on the latter.

How do I know if my weights are too extreme?

Check the design effect and the ratio of the largest to smallest weight; a max/min ratio above roughly 10, or a handful of records carrying most of the total weight, signals instability. Trim or cap extreme weights and report that you did.

Can weighting fix a biased historical source?

Only the biases you can measure against a benchmark; weighting cannot correct for unmeasured selection, such as which records survived a fire. Weighting on age will not fix survivorship bias in literacy unless the survival pattern that distorts literacy is itself captured by your weighting variables.

Do I need to weight for a descriptive count or just for inference?

For a raw descriptive count of what is in your archive, no, report it as-is. Weight only when you generalise from the surviving sample to a wider historical population, and say clearly which one you are doing.

What happens to my standard errors when I weight?

Weighting almost always increases standard errors via the design effect, so a weighted estimate is less precise than an unweighted one of the same size. You trade variance for reduced bias, which is only worthwhile when the bias you remove is real and larger than the variance you add.
