Troubleshooting: Cohort analysis in history

When cohort analysis in history goes wrong, the cause is almost always one of four things: migration leaking people out of your observation window, survivorship bias in the records that remain, ragged age boundaries between sources, or the unfixable age-period-cohort identification problem. The fix is rarely more data; it is making your assumptions explicit and bounding the error. This guide walks through each failure, how to diagnose it fast, and the correction that holds up under peer review.

Why are my cohort sizes shrinking faster than mortality alone explains?

A cohort that drops from 400 to 280 people across two censuses has not necessarily lost 120 to death. Out-migration and record loss subtract people too, and in-migration can mask further losses. Diagnose it by checking whether the place also gained migrants of the same age in the same period; a net-zero migration assumption is almost never true for towns. Fix it either by working with the most closed populations you can find (parishes with strict settlement laws, islands) or by explicitly modelling migration as a separate flow rather than folding it into your survival rate. Report a survival rate only after you have either bounded or modelled the migration component.
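A minimal sketch of bounding the migration component, in Python. The 400 and 280 are the counts from the example above; the migration ranges are hypothetical placeholders for whatever settlement records or similar sources let you estimate, and the function name is illustrative. It rests on the accounting identity end = start - deaths - out-migrants + in-migrants and, as a simplification, charges all deaths to the starting cohort.

```python
def survival_bounds(start, end, out_range, in_range):
    """Bound cohort survival when migration is only known as a range.

    Accounting identity: end = start - deaths - out_migrants + in_migrants.
    Simplifying assumption: all deaths are charged to the starting cohort.
    """
    out_lo, out_hi = out_range
    in_lo, in_hi = in_range
    # Fewest deaths: many leavers and few arrivals explain most of the shrinkage.
    deaths_min = max(0, start - end - out_hi + in_lo)
    # Most deaths: few leavers, and many arrivals hiding additional deaths.
    deaths_max = min(start, start - end - out_lo + in_hi)
    return 1 - deaths_max / start, 1 - deaths_min / start


naive = 280 / 400
# Hypothetical ranges: 20-60 people left, 0-30 people arrived.
low, high = survival_bounds(400, 280, out_range=(20, 60), in_range=(0, 30))
print(f"naive survival: {naive:.2f}")
print(f"bounded survival: {low:.3f} to {high:.3f}")
```

With these placeholder ranges the naive figure sits inside a fairly wide interval, which is the point: the width of the interval is the migration uncertainty you would otherwise be hiding.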

How do I separate age, period and cohort effects?

You cannot, fully, and pretending otherwise is the most common cohort error. Age, period and cohort are linearly dependent: period = age + cohort. Any one is determined by the other two, so a model has to constrain one of them with an outside assumption. Use an age-period-cohort (APC) framework but treat the identifying constraint as a choice you report, not a fact you discover.

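# APC fit with the apc package; results are reported as a range across constraints
library(apc)
# `data` must be an apc.data.list (see ?apc.data.list) with deaths as response and exposure as dose
m <- apc.fit.model(data, model.family = "poisson.dose.response", model.design = "APC")
# inspect sensitivity to the identifying constraint
apc.plot.fit(m)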

Run the model under two or three plausible constraints and present the spread. A single APC point estimate hides the ambiguity rather than resolving it.
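To see why no dataset can settle the question on its own, here is a small illustration in plain Python (chosen so it runs without extra packages). The effect values are hypothetical and the helper names are made up for this sketch. It shifts a linear trend of arbitrary slope between the age, period and cohort effects and shows that the fitted values do not change, because period = age + cohort.

```python
def predict(age_eff, per_eff, coh_eff, age, cohort):
    """Additive APC prediction; period is fixed by period = age + cohort."""
    period = age + cohort
    return age_eff[age] + per_eff[period] + coh_eff[cohort]


ages = range(0, 3)      # age index
cohorts = range(0, 3)   # cohort index
periods = range(0, 5)   # period index = age + cohort

# Hypothetical effects, chosen only for illustration.
age_eff = {a: 0.5 * a for a in ages}
per_eff = {p: -0.2 * p for p in periods}
coh_eff = {c: 0.1 * c * c for c in cohorts}


def tilt(d):
    """Move a linear trend of slope d between the three sets of effects."""
    return (
        {a: v + d * a for a, v in age_eff.items()},
        {p: v - d * p for p, v in per_eff.items()},
        {c: v + d * c for c, v in coh_eff.items()},
    )


base = [predict(age_eff, per_eff, coh_eff, a, c) for a in ages for c in cohorts]
for d in (-1.0, 0.3, 2.0):
    alt = [predict(*tilt(d), a, c) for a in ages for c in cohorts]
    same = all(abs(x - y) < 1e-9 for x, y in zip(base, alt))
    print(f"slope shift {d:+.1f}: identical fitted values = {same}")
```

Every slope gives a different story about ageing versus cohort change, yet all of them fit the data equally well. Only the outside constraint picks one.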

Why do my age bands not line up across sources?

One census uses 0-4, 5-9, 10-14; another uses 0-9, 10-19. Forcing the coarser bands onto the finer ones by splitting them evenly fabricates precision. The reliable fix is to aggregate up to the coarsest common band for any cross-source comparison.

Problem	Bad fix	Reliable fix
5-year vs 10-year bands	split each 10-year band in half	aggregate both to 10-year bands
open-ended "60+"	assume a cap	keep open, report separately
single-year vs banded	interpolate the bands into single years	band the single-year data

You lose granularity, but you stop inventing it, which matters more for credibility.
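A minimal sketch of aggregating upward, assuming bands are given as inclusive (low, high) age pairs. The counts and the function name are hypothetical. The function deliberately refuses to place a band that does not nest inside exactly one target band, so it can aggregate but never split.

```python
def aggregate_bands(counts, target_bands):
    """Sum fine age bands into coarser ones; refuse any band that would need splitting.

    counts: {(lo, hi): n} with inclusive ages, e.g. (0, 4).
    target_bands: list of inclusive (lo, hi) bands to aggregate into.
    """
    out = {band: 0 for band in target_bands}
    for (lo, hi), n in counts.items():
        homes = [b for b in target_bands if b[0] <= lo and hi <= b[1]]
        if len(homes) != 1:
            raise ValueError(f"band {lo}-{hi} does not nest in exactly one target band")
        out[homes[0]] += n
    return out


# Hypothetical counts, for illustration only.
census_a = {(0, 4): 120, (5, 9): 110, (10, 14): 95, (15, 19): 90}
census_b = {(0, 9): 240, (10, 19): 170}

common = list(census_b)  # the coarser scheme is the common one here
a_harmonised = aggregate_bands(census_a, common)
print(a_harmonised)
```

Calling the same function the other way round, with the ten-year counts and the five-year bands as the target, raises an error rather than inventing a split.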

What about survivorship bias in the surviving records?

People who died young or emigrated leave fewer records, so your traceable cohort is a biased remnant skewed toward survivors and stayers. Diagnose by comparing your linked sample's age distribution to an independent aggregate total for the same cohort. If your sample is older and more local than the aggregate, you have measurable bias. Bound it by reporting results as a range between "all losses were random" and "all losses were the most vulnerable," which brackets the truth.
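The bracketing described above can be sketched in a few lines of Python. The numbers and names are hypothetical: the outcome is any yes/no trait of interest (for example, surviving to a later census), and the count of lost members is assumed to come from the independent aggregate.

```python
def share_bounds(traced_with, traced_total, lost):
    """Bound a cohort share when `lost` members could not be traced.

    Upper bound: losses were random, so the traced share stands.
    Lower bound: every lost member was among the most vulnerable
    and did not have the outcome.
    """
    random_loss = traced_with / traced_total
    worst_case = traced_with / (traced_total + lost)
    return worst_case, random_loss


# Hypothetical numbers: 150 of 200 traced people have the outcome, and an
# independent aggregate implies 50 cohort members left no trace.
share_low, share_high = share_bounds(traced_with=150, traced_total=200, lost=50)
print(f"share with outcome: {share_low:.2f} to {share_high:.2f}")
```

The gap between the two figures grows with the number of untraceable people, so a narrow range is itself evidence that the linked sample is reasonably complete.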

How do I handle a famine, epidemic or war inside a cohort?

A single survival rate across a 1693-1697 dearth averages a normal regime with a catastrophic one and erases the event. Split the cohort at the disruption and analyse sub-periods separately. In any chart, annotate the break with a vertical rule and a label, so the reader sees the regime change rather than a misleadingly smooth decline.
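A short Python sketch of the difference between pooling and splitting. Only the 1693-1697 dates come from the text; the cohort counts are hypothetical, and for simplicity the sketch treats all shrinkage as mortality. It compares one annualised survival rate for the whole span with separate rates for the sub-periods before, during and after the dearth.

```python
def annual_survival(start, end, years):
    """Geometric mean annual survival rate between two counts."""
    return (end / start) ** (1 / years)


# Hypothetical cohort counts; only the 1693-1697 dearth dates come from the text.
counts = {1685: 500, 1693: 460, 1697: 340, 1705: 315}
years = sorted(counts)

pooled = annual_survival(counts[years[0]], counts[years[-1]], years[-1] - years[0])
print(f"pooled {years[0]}-{years[-1]}: {pooled:.4f}")

by_period = {}
for y0, y1 in zip(years, years[1:]):
    by_period[(y0, y1)] = annual_survival(counts[y0], counts[y1], y1 - y0)
    print(f"{y0}-{y1}: {by_period[(y0, y1)]:.4f}")
```

The pooled rate lands between the normal and the crisis rates and describes neither regime, which is the averaging problem in miniature.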

When is a synthetic cohort acceptable?

When you have cross-sectional age data but no longitudinal links, a synthetic cohort (reading survival diagonally across successive censuses) is a legitimate, standard tool. Its assumption is that conditions were stable across the periods you stitch together. State that assumption plainly and never build a synthetic cohort across a known shock, because the stability premise fails exactly where it matters most.
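The diagonal reading can be sketched as follows, assuming censuses spaced exactly one age-band width apart and counts keyed by each band's starting age. The counts and the function name are hypothetical. The result is labelled a ratio rather than a survival rate because it still contains migration.

```python
def diagonal_ratios(censuses, band_width):
    """Follow each birth cohort diagonally across censuses one band width apart.

    censuses: {year: {age_band_start: count}}.
    Returns {(year, age_start): size of the same cohort one census later / size now}.
    """
    ratios = {}
    years = sorted(censuses)
    for y0, y1 in zip(years, years[1:]):
        if y1 - y0 != band_width:
            raise ValueError("census interval must equal the age band width")
        for age, n in censuses[y0].items():
            later = censuses[y1].get(age + band_width)
            if later is not None and n > 0:
                ratios[(y0, age)] = later / n
    return ratios


# Hypothetical counts in ten-year bands, keyed by the band's starting age.
censuses = {
    1801: {0: 300, 10: 260, 20: 220},
    1811: {0: 310, 10: 270, 20: 230},
    1821: {0: 320, 10: 280, 20: 240},
}
ratios = diagonal_ratios(censuses, 10)
for (year, age), r in sorted(ratios.items()):
    print(f"aged {age}-{age + 9} in {year}: ratio to next census {r:.2f}")
```

A ratio above 1 is a direct sign of in-migration, and ratios that jump between adjacent census pairs suggest the stability assumption does not hold for that stretch.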

A fast triage routine

Plot raw cohort size by period; a kink usually means migration or a disruption, not mortality.
Cross-check totals against an independent aggregate to catch survivorship bias.
Harmonise age bands to the coarsest common width before comparing.
Run any APC model under multiple constraints and report the range.
Split cohorts at any documented shock.
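The first step of the routine can be approximated without a chart. This is a rough Python sketch, using hypothetical cohort sizes and an arbitrary tolerance, that flags any interval whose annual rate of change departs from the median rate. It only points at where to look; it cannot say whether the cause is migration, a disruption or a source problem.

```python
from statistics import median


def flag_kinks(sizes, tolerance=0.01):
    """Flag intervals whose annual rate of change departs from the median rate.

    sizes: {year: cohort_size}. tolerance is a hypothetical threshold on the
    difference in annual rates; tune it to the noise in your sources.
    """
    years = sorted(sizes)
    rates = {
        (y0, y1): (sizes[y1] / sizes[y0]) ** (1 / (y1 - y0)) - 1
        for y0, y1 in zip(years, years[1:])
    }
    typical = median(rates.values())
    return [span for span, r in rates.items() if abs(r - typical) > tolerance]


# Hypothetical cohort sizes at successive censuses.
sizes = {1801: 400, 1811: 372, 1821: 280, 1831: 262, 1841: 244}
print(flag_kinks(sizes))
```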

Key Takeaways

Shrinking cohorts reflect migration and record loss, not just mortality; bound that component before reporting survival.
Age, period and cohort are linearly dependent; report APC results as a range under stated constraints.
Harmonise ragged age bands upward to the coarsest common width rather than fabricating precision.
Survivorship bias makes traced cohorts unrepresentative; benchmark against independent aggregates.
Split cohorts at famines, epidemics and wars instead of averaging across regimes.
Synthetic cohorts are valid only where conditions were stable across the stitched periods.

Frequently Asked Questions

Why do my cohort sizes shrink unexpectedly between censuses?

Usually because of out-migration and record loss, not just mortality, so a naive cohort survival rate is contaminated by people leaving your observation window. Bound the effect by checking in-migration to the same place and treating closed populations (islands, parishes with strict settlement) as your cleanest cases.

How do I tell an age effect from a cohort effect?

You cannot fully separate age, period and cohort from a single dataset because they are linearly dependent; one must be constrained by external assumption. Use an age-period-cohort model with a documented identifying constraint, and report results as a sensitivity range, not a single point.

My cohorts have ragged age boundaries across sources. What do I do?

Harmonise to the coarsest common age band rather than inventing precision your sources lack. If one census uses five-year bands and another uses ten-year bands, aggregate up to ten-year bands for the comparison and note the loss of granularity.

Why are survivors and stayers overrepresented in my surviving records?

Survivorship bias: records of people who died young or migrated are lost at higher rates, so the people you can still trace are an unrepresentative remnant. Estimate the bias by comparing your traced sample to known aggregate totals for the same cohort where they exist.

How should I handle a cohort that straddles a major disruption like a famine or war?

Split the cohort at the disruption and analyse the sub-periods separately, because a single survival rate averages across radically different mortality regimes and hides the event you most want to see. Annotate the break explicitly in any chart.

Is a synthetic cohort acceptable when I lack true longitudinal data?

Yes, a synthetic cohort built from cross-sectional age data is standard when longitudinal records are missing, but it assumes conditions were stable across the periods you stitch together. State that assumption and avoid synthetic cohorts across known shocks.
