When to Run Regression on Historical Data

Run regression on historical data when you have a clearly defined outcome, two or more plausible explanatory variables, and enough clean observations to estimate them, typically 10 to 20 rows per predictor. Regression earns its place when you need to describe a pattern while adjusting for confounders, not when you simply want to show a trend that a chart conveys better. The hard part is honesty about what an observational coefficient can and cannot claim.

What question is regression actually good for?

Regression answers: holding the other measured factors constant, how does the outcome change with this variable? "Controlling for age and parish, were literate men paid more?" is a regression question. "Did wages rise over the century?" is usually a description question better served by a plotted series. Match the tool to the question before fitting anything.

How much and how good must my data be?

Three thresholds gate the decision:

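Sample size. Aim for 10 to 20 observations per predictor, so five predictors want roughly 50 to 100 rows at the very least. A categorical variable counts as one predictor for each level beyond the first.
Variation. A predictor with almost no variation (everyone in the same occupation) leaves nothing to compare, so its coefficient cannot be estimated with any precision.
Measurement quality. Random noise in a predictor tends to pull its coefficient toward zero; systematic bias can distort it in either direction. Garbage in, confident garbage out.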

If any of these fail, stop and improve the data or change the question.
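
The first two thresholds can be screened mechanically before any model is fitted. The sketch below is one possible check, not a standard routine: `check_regression_data` is a hypothetical helper, the 0.95 cut-off for "almost no variation" is an arbitrary assumption, and `census` is simulated stand-in data. Measurement quality cannot be checked this way and still needs source criticism.

```r
# Hypothetical helper: flag thin samples and near-constant predictors.
check_regression_data <- function(data, predictors, min_per_predictor = 10) {
  # Keep only rows with no missing values in the chosen predictors.
  keep <- complete.cases(data[, predictors, drop = FALSE])
  complete <- data[keep, , drop = FALSE]
  rows_per_predictor <- nrow(complete) / length(predictors)

  # Share of rows taken by the single most common value of each predictor;
  # a share near 1 means the variable barely varies.
  top_share <- vapply(predictors, function(p) {
    max(table(complete[[p]])) / nrow(complete)
  }, numeric(1))

  list(
    n_complete = nrow(complete),
    rows_per_predictor = rows_per_predictor,
    enough_rows = rows_per_predictor >= min_per_predictor,
    low_variation = names(top_share)[top_share > 0.95]  # assumed cut-off
  )
}

# Simulated stand-in for a census extract (not real data).
set.seed(1)
census <- data.frame(
  wage = round(rnorm(60, mean = 120, sd = 25)),
  age = sample(18:60, 60, replace = TRUE),
  occupation = c(rep("labourer", 59), "clerk")
)

result <- check_regression_data(census, c("wage", "age", "occupation"))
print(result)
```

Note that this helper counts columns, not estimated coefficients. If a predictor is a factor with many levels, divide by the number of coefficients instead.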

Which model family fits my outcome?

Choosing ordinary least squares for everything is a frequent error. The outcome's type dictates the family.

Outcome	Example	Model
Continuous	Wage in pence	Linear (OLS)
Binary	Literate yes/no	Logistic
Count	Children in household	Poisson / negative binomial
Proportion	Share of land owned	Beta / fractional logit
Time-to-event	Age at death	Cox proportional hazards / other survival models

# binary outcome -> logistic, not OLS
m <- glm(literate ~ wage + age + factor(parish),
         data = census, family = binomial())
summary(m)
exp(coef(m))   # odds ratios are easier to read
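
The count row of the table works the same way. The following is a minimal sketch on simulated data: the `households` data frame, its columns, and the coefficients used to generate it are all invented for illustration. The Pearson dispersion ratio is one common rough check on whether Poisson is adequate; values well above 1 suggest overdispersion, in which case a negative binomial model (for example `MASS::glm.nb`) may be the better fit.

```r
# Simulated household data (illustrative only, not real records).
set.seed(42)
n <- 200
households <- data.frame(
  age = sample(20:60, n, replace = TRUE),
  literate = rbinom(n, 1, 0.4)
)
households$children <- rpois(
  n,
  lambda = exp(0.2 + 0.02 * households$age - 0.3 * households$literate)
)

# count outcome -> Poisson, not OLS
m_count <- glm(children ~ age + literate,
               data = households, family = poisson())

# Pearson dispersion: sum of squared Pearson residuals over residual df.
# Near 1 is consistent with Poisson; well above 1 suggests overdispersion.
dispersion <- sum(residuals(m_count, type = "pearson")^2) /
  df.residual(m_count)

rate_ratios <- exp(coef(m_count))  # multiplicative change in expected count
print(dispersion)
print(rate_ratios)
```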

Can a coefficient prove causation?

Almost never from observational history alone. Your sources were not randomised; confounders and selection lurk everywhere. A positive wage-literacy coefficient might reflect that wealthier families both schooled their children and secured better jobs. Treat coefficients as associations unless you bring a design — difference-in-differences around a reform, an instrumental variable, regression discontinuity — plus a written causal argument.
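
To make "bring a design" concrete, here is a minimal difference-in-differences sketch on simulated data. Everything in it is an assumption for illustration: the `panel` data frame, the reform, and the built-in effect of 5 are invented. The interaction coefficient only has a causal reading if treated and untreated parishes would have followed parallel trends without the reform, which is exactly what the written causal argument has to defend.

```r
# Simulated two-group, two-period data (illustrative only).
set.seed(7)
n <- 400
panel <- data.frame(
  treated = rep(c(0, 1), each = n / 2),  # parish affected by the hypothetical reform
  post = rep(c(0, 1), times = n / 2)     # observation dated after the reform
)
true_effect <- 5  # effect built into the simulation
panel$wage <- 100 + 8 * panel$treated + 3 * panel$post +
  true_effect * panel$treated * panel$post + rnorm(n, sd = 4)

# The coefficient on treated:post is the difference-in-differences estimate.
did <- lm(wage ~ treated * post, data = panel)
est <- coef(did)["treated:post"]
ci <- confint(did)["treated:post", ]

# Same quantity from the four cell means:
# (treated after - treated before) - (control after - control before).
cell <- with(panel, tapply(wage, list(treated, post), mean))
manual <- (cell["1", "1"] - cell["1", "0"]) - (cell["0", "1"] - cell["0", "0"])

print(c(estimate = unname(est), ci, cell_means = manual))
```

Because the model is saturated, the regression estimate and the cell-mean calculation agree. The regression form earns its keep once you add controls or adjust the standard errors.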

What pitfalls bite historical regressions specifically?

Autocorrelation. Successive years in an annual series are correlated, which inflates significance; use heteroskedasticity- and autocorrelation-consistent (HAC) standard errors such as Newey-West.
Changing definitions. A "household" or an "occupation" redefined mid-series silently breaks comparability.
Survivorship. You model the records that survived, which over-represent the literate and the propertied.
Aggregation. Parish-level averages can show a relationship that vanishes for individuals — the ecological fallacy.
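
For the autocorrelation point, the sketch below computes Newey-West standard errors by hand in base R so the mechanics are visible. `newey_west_se` is a hypothetical helper written for this example, the series is simulated, and the lag of 4 is an assumed choice. In practice a packaged implementation such as `NeweyWest` in the sandwich package is the usual route, and its defaults (lag selection, small-sample adjustment) may give somewhat different numbers.

```r
# Hypothetical helper: Newey-West standard errors with Bartlett weights,
# without small-sample adjustment.
newey_west_se <- function(fit, lag) {
  X <- model.matrix(fit)
  e <- residuals(fit)
  n <- nrow(X)
  u <- X * e              # row t is x_t * e_t
  S <- crossprod(u)       # lag-0 (heteroskedasticity-only) term
  for (l in seq_len(lag)) {
    w <- 1 - l / (lag + 1)  # Bartlett weight shrinks with the lag
    G <- crossprod(u[(l + 1):n, , drop = FALSE],
                   u[1:(n - l), , drop = FALSE])
    S <- S + w * (G + t(G))
  }
  bread <- solve(crossprod(X))
  sqrt(diag(bread %*% S %*% bread))
}

# Simulated annual series with autocorrelated predictor and errors.
set.seed(3)
n <- 150
x <- as.numeric(arima.sim(list(ar = 0.7), n = n))
e <- as.numeric(arima.sim(list(ar = 0.7), n = n))
y <- 2 + 0.5 * x + e

fit <- lm(y ~ x)
naive_se <- sqrt(diag(vcov(fit)))
hac_se <- newey_west_se(fit, lag = 4)  # lag choice is an assumption
print(cbind(naive = naive_se, hac = hac_se))
```

Compare the two columns before trusting any significance star from an annual series: with positively autocorrelated data the HAC errors are typically the larger ones.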

When should I not run regression?

Skip it when the sample is tiny, the data are visibly biased toward one group, the variables are too aggregated, or a single cross-tab already answers the question cleanly. A two-by-two table that shows the pattern beats a fragile model that hides it behind a p-value. Description is a legitimate, often superior, endpoint.
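
A cross-tab of this kind takes a few lines. The counts below are made up purely to show the mechanics, and the `records` data frame and its column names are hypothetical.

```r
# Invented counts for illustration: 40 literate and 60 non-literate men.
records <- data.frame(
  literate = c(rep("yes", 40), rep("no", 60)),
  skilled = c(rep("yes", 28), rep("no", 12),   # literate men
              rep("yes", 18), rep("no", 42))   # non-literate men
)

tab <- table(literate = records$literate, skilled = records$skilled)
row_share <- prop.table(tab, margin = 1)  # share skilled within each literacy group

print(tab)
print(round(row_share, 2))
```

If the row shares already tell the story, the table can be the finding, reported with its counts so readers can judge how thin each cell is.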

How do I report it responsibly?

State the model family, sample size, what you controlled for, the standard-error treatment, and the limitations on causal interpretation. Show the coefficient with a confidence interval, not just a star. Run one robustness check — drop an influential decade, or swap a control — and report whether the result holds.
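
One way to assemble those numbers is sketched below, under stated assumptions: the `wages` data are simulated, `report` is a hypothetical helper, and dropping the 1840s is an arbitrary choice of robustness check. The confidence intervals here use ordinary standard errors; swap in your chosen standard-error treatment for a real series.

```r
# Simulated wage records across ten decades (illustrative only).
set.seed(11)
n <- 300
wages <- data.frame(
  decade = sample(seq(1800, 1890, by = 10), n, replace = TRUE),
  age = sample(18:60, n, replace = TRUE),
  literate = rbinom(n, 1, 0.5)
)
wages$wage <- 80 + 0.4 * wages$age + 6 * wages$literate + rnorm(n, sd = 10)

# Hypothetical helper: estimate, 95% confidence interval, and sample size
# for one term of a fitted model.
report <- function(fit, term) {
  ci <- confint(fit)[term, ]
  c(estimate = unname(coef(fit)[term]),
    lower = unname(ci[1]),
    upper = unname(ci[2]),
    n = nobs(fit))
}

full <- lm(wage ~ literate + age + factor(decade), data = wages)

# Robustness check: refit the same formula without one decade.
trimmed <- subset(wages, decade != 1840)
no_1840s <- lm(formula(full), data = trimmed)

results <- rbind(full = report(full, "literate"),
                 drop_1840s = report(no_1840s, "literate"))
print(round(results, 2))
```

Reporting both rows side by side lets the reader see whether the estimate and its interval move when the decade is removed.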

Key Takeaways

Use regression to adjust for confounders, not to redraw a trend a chart shows better.
Aim for 10 to 20 observations per predictor and ensure real variation.
Match the model family to the outcome: logistic for binary, Poisson for counts.
Observational coefficients are associations; causation needs a design.
Correct for autocorrelation in time series with HAC standard errors.
Watch survivorship, changing definitions, and the ecological fallacy.
Skip regression when data are too thin or a simple table answers the question.

Frequently Asked Questions

When is regression appropriate for historical data?

Regression fits when you have a measurable outcome, several plausible explanatory variables, and enough observations to estimate them. It is most useful for describing patterns and adjusting for confounders, less so for proving causation from observational records.

How many observations do I need?

A common rule of thumb is at least 10 to 20 observations per predictor for a stable linear model. With five predictors aim for 50 to 100 rows minimum, and more if effects are small or data are noisy.

Can regression prove causation in history?

Rarely on its own. Observational historical data carry confounding and selection, so a coefficient is an association. Causal claims need a design such as difference-in-differences or an instrument plus a defensible argument.

What if my outcome is a count or a yes/no?

Use the right model family: logistic regression for binary outcomes like literate or not, and Poisson or negative binomial for counts like number of children. Ordinary least squares is for continuous outcomes.

Why is autocorrelation a problem in time series?

Successive years are correlated, so ordinary standard errors are too small and significance is overstated. Use Newey-West standard errors or model the time structure explicitly.

When should I not run regression at all?

Skip it when your data are too few, too biased, or too aggregated to support it, or when a clear table or chart already answers the question. A misleading model is worse than honest description.
