Statistical significance tells you whether a pattern in your historical data is unlikely to be a fluke of chance, not whether it is large or historically meaningful. A significant result with a tiny effect can be unimportant, and an important effect can fail to reach significance in a small sample. So the honest reading of any historical finding pairs a significance test with an effect size and a confidence interval. This guide explains those ideas in plain language with a small worked example you can follow.
What is a p-value, in plain words?
Imagine you suspect literacy rates differed between two parishes. The null hypothesis is the boring assumption that there is no real difference. A p-value answers: if the null were true, how often would I see a gap at least this big purely by chance? A p-value of 0.03 means "about 3 times in 100." By convention, below 0.05 we call the result statistically significant, meaning surprising enough under the null to take seriously.
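The "how often by chance" question can be answered by brute force. The sketch below assumes two hypothetical parishes of 100 register entries each that share a single true literacy rate of 55%, and counts how often chance alone produces a gap of at least 14 points. The register size, the shared rate, the gap, and the seed are all illustrative assumptions, not archival figures.

```python
import random

rng = random.Random(1)        # fixed seed so the run is repeatable

N_PER_PARISH = 100            # assumed register size (hypothetical)
SHARED_RATE = 0.55            # assumed common literacy rate under the null
OBSERVED_GAP = 14             # gap in signers we want to judge
N_SIMULATIONS = 5000          # number of imagined "no real difference" worlds


def simulate_signers(n, rate):
    """Count signers among n entries when each signs with probability rate."""
    return sum(rng.random() < rate for _ in range(n))


extreme = 0
for _ in range(N_SIMULATIONS):
    a = simulate_signers(N_PER_PARISH, SHARED_RATE)
    b = simulate_signers(N_PER_PARISH, SHARED_RATE)
    if abs(a - b) >= OBSERVED_GAP:   # two-sided: either parish may lead
        extreme += 1

simulated_p = extreme / N_SIMULATIONS
print(f"Share of null worlds with a gap of 14+ points: {simulated_p:.3f}")
```

The printed share is a simulated p-value. It will not match a formula-based test to the last digit, because counts are whole numbers and the shared rate here is assumed rather than estimated, but it answers the same question.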
What a p-value is not: it is not the probability the null is true, and it is not a measure of how big or important the effect is. Those are the two misreadings that cause the most trouble.
A small worked example
Two parishes, signatures on marriage registers as a literacy proxy:
| parish | literate | total | rate |
|---|---|---|---|
| Ashby | 62 | 100 | 62% |
| Barton | 48 | 100 | 48% |
A two-proportion test asks whether a 14-point gap is surprising under "no real difference":
```python
from statsmodels.stats.proportion import proportions_ztest

counts = [62, 48]    # literate
nobs = [100, 100]    # totals
stat, p = proportions_ztest(counts, nobs)
print(round(p, 4))   # ~0.0466 -> just under 0.05
```

It scrapes under 0.05, so it is "significant", but with 100 people per parish the result is fragile. Change a handful of records and it flips. That fragility is the real story, and a p-value alone hides it.
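The fragility is easy to check directly. What follows is a standard-library sketch of the pooled two-proportion z-test, assumed here to be the same calculation the library call performs by default. It re-runs the test with Ashby's count nudged by a record or two; the alternative counts are hypothetical re-readings of the register, not real data.

```python
from math import erfc, sqrt


def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return erfc(abs(z) / sqrt(2))   # equals 2 * (1 - Phi(|z|))


# Recount Ashby with a few signatures read differently; Barton stays at 48.
fragility = {
    ashby: two_proportion_p(ashby, 100, 48, 100)
    for ashby in (60, 61, 62, 63)
}

for ashby, p_value in fragility.items():
    verdict = "significant" if p_value < 0.05 else "not significant"
    print(f"Ashby {ashby}/100 vs Barton 48/100: p = {p_value:.3f} ({verdict})")
```

Reclassifying a single ambiguous mark in Ashby (62 to 61) is enough to push the p-value back above 0.05, which is why the verdict alone should not carry the argument.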
Why should I report effect size and confidence intervals too?
The effect size here is the 14-percentage-point difference, and that is what a historian actually cares about. The confidence interval shows the range of differences compatible with the data. For this example the 95% interval runs roughly from 0 to 28 points: real, but anywhere from negligible to substantial. Reporting "14 points, 95% CI 0-28" is far more honest and informative than "p < 0.05," because it shows both the estimate and its uncertainty in the units of your question.
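The interval can be reproduced by hand. This sketch assumes the simple normal-approximation (Wald) interval for a difference of two proportions; the guide does not say which method produced its figure, and other methods would shift the endpoints slightly.

```python
from math import sqrt

ashby_literate, ashby_total = 62, 100
barton_literate, barton_total = 48, 100

p_ashby = ashby_literate / ashby_total
p_barton = barton_literate / barton_total
difference = p_ashby - p_barton

# Unpooled standard error: each parish contributes its own variance.
se = sqrt(
    p_ashby * (1 - p_ashby) / ashby_total
    + p_barton * (1 - p_barton) / barton_total
)

Z_95 = 1.96   # normal critical value for a 95% interval
ci_low = difference - Z_95 * se
ci_high = difference + Z_95 * se

print(f"Difference: {difference * 100:.0f} points")
print(f"95% CI: {ci_low * 100:.1f} to {ci_high * 100:.1f} points")
```

Under that assumption the lower bound sits a fraction of a point above zero, which is the same borderline story the p-value tells, but expressed in percentage points a reader can weigh.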
Does significance even apply to a complete census?
This is genuinely debated. A significance test imagines your data is a sample drawn from a larger population, and asks whether the pattern would generalise. If you hold the entire population (a complete parish register or a full census, say), there is no larger population to generalise to, so the classic test answers a question you may not be asking. Many historians in that case drop the p-value entirely and report effect sizes as descriptions of what demonstrably happened.
Why are small samples so treacherous?
Most historical samples are small, and small samples behave badly in two ways. They have wide confidence intervals, so genuine effects often fail to reach significance (a false negative). And among the results that do reach significance, effect sizes are systematically exaggerated, because only the larger random swings cross the threshold. The remedy is to lead with the confidence interval, which shows the whole plausible range, rather than a yes/no significance verdict.
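Both problems show up in a quick simulation. The sketch below assumes a real but modest 5-point gap (hypothetical true rates of 55% and 50%) and 100 entries per parish, repeats the "study" many times, and looks only at the runs that reach p < 0.05. All numbers, including the seed, are illustrative assumptions.

```python
import random
from math import erfc, sqrt

rng = random.Random(7)         # fixed seed so the run is repeatable

N = 100                        # entries per parish (assumed)
RATE_A, RATE_B = 0.55, 0.50    # assumed true rates: a real 5-point gap
TRUE_GAP = RATE_A - RATE_B
N_STUDIES = 3000               # imagined repeat studies


def draw(n, rate):
    """Count signers among n entries when each signs with probability rate."""
    return sum(rng.random() < rate for _ in range(n))


def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0             # degenerate case: no variation at all
    z = (x1 / n1 - x2 / n2) / se
    return erfc(abs(z) / sqrt(2))


significant_gaps = []
for _ in range(N_STUDIES):
    a, b = draw(N, RATE_A), draw(N, RATE_B)
    if two_proportion_p(a, N, b, N) < 0.05:
        significant_gaps.append(abs(a - b) / N)

share_significant = len(significant_gaps) / N_STUDIES
mean_significant_gap = sum(significant_gaps) / len(significant_gaps)

print(f"True gap: {TRUE_GAP * 100:.0f} points")
print(f"Studies reaching p < 0.05: {share_significant:.0%}")
print(f"Average gap among those: {mean_significant_gap * 100:.1f} points")
```

Most simulated studies miss the real effect, and the ones that cross the threshold report a gap well above the true 5 points, because only unusually large swings can be significant at this sample size.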
What about running lots of tests?
If you test twenty parish pairs, each at the 0.05 level, you expect about one false positive by chance even if nothing is real. This is the multiple comparisons problem. Either pre-register the handful of comparisons you actually care about, or apply a correction:
```python
from statsmodels.stats.multitest import multipletests

pvals = [0.047, 0.21, 0.003, 0.18, 0.34]
reject, p_adj, _, _ = multipletests(pvals, method="holm")
```

Holm or Benjamini-Hochberg adjustments keep your false-positive rate honest when you fish through many comparisons.
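To see what is happening under the hood, here is a standard-library sketch of two pieces of arithmetic: the false-positive count for twenty tests, and the Holm step-down adjustment written out by hand. The helper name `holm_adjust` is hypothetical, the "at least one" figure assumes the tests are independent, and the function is assumed (not verified here) to mirror what the library reports for `method="holm"`.

```python
ALPHA = 0.05
N_TESTS = 20

# Twenty tests at the 0.05 level when nothing is real.
expected_false_positives = N_TESTS * ALPHA
chance_of_at_least_one = 1 - (1 - ALPHA) ** N_TESTS   # assumes independent tests


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvals[idx])
        # Adjusted values may never decrease as the raw p-values grow.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


pvals = [0.047, 0.21, 0.003, 0.18, 0.34]
adjusted = holm_adjust(pvals)
still_significant = [p_adj < ALPHA for p_adj in adjusted]

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Chance of at least one: {chance_of_at_least_one:.0%}")
for raw, adj, keep in zip(pvals, adjusted, still_significant):
    print(f"raw p = {raw:.3f} -> Holm p = {adj:.3f} -> significant: {keep}")
```

After adjustment only the 0.003 result survives; the borderline 0.047 does not, which is the kind of finding most likely to be a product of fishing.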
How should I phrase a finding responsibly?
State the effect size in plain units, attach a confidence interval, and only then mention significance. "Literacy was 14 points higher in Ashby (95% CI 0-28; p = 0.047), a difference that is suggestive but imprecise given the small registers" tells the reader the size, the uncertainty and the caution all at once. That sentence is defensible; "the difference was significant" is not.
Key Takeaways
- Significance means "unlikely by chance," not "large" or "important."
- A p-value is not the probability the null is true, nor a measure of effect size.
- Always report effect size and a confidence interval alongside any p-value.
- On a complete population a significance test may answer the wrong question; describe effect sizes instead.
- Small samples give wide intervals and exaggerated significant effects; lead with the interval.
- Correct for multiple comparisons or pre-register the few tests you care about.
Frequently Asked Questions
What does statistical significance actually mean?
It means a result would be unlikely to arise by chance alone if there were really no effect, where "unlikely" is set by your threshold, usually a p-value below 0.05. It says nothing about whether the effect is large or historically important.
Does a low p-value mean my finding matters?
No. A p-value measures how surprising the data is under the no-effect assumption, not how big or meaningful the effect is. A trivial difference can be statistically significant in a huge sample, and a large, important difference can be non-significant in a small one.
Can I even use significance tests on a full population, like a complete census?
It is debated. If you have the entire population there is no wider population to generalise to, so a classic significance test answers a question you may not be asking. Many historians instead report effect sizes and treat any variation as a description, not an inference.
What is the difference between significance and effect size?
Significance asks whether an effect is distinguishable from zero; effect size asks how big it is. Always report both, because significance without effect size hides whether the finding is trivial or substantial.
Why do small historical samples make significance unreliable?
Small samples have wide confidence intervals, so they often fail to reach significance even when a real effect exists, and the significant results that do appear are prone to exaggerated effect sizes. Prefer reporting the confidence interval, which shows the full range of plausible values.
Should I correct for multiple comparisons?
Yes, if you run many tests, because each one has its own chance of a false positive and those chances accumulate. Use a correction like Holm or Benjamini-Hochberg, or simply pre-register the few comparisons you truly care about.