The biggest cultural analytics pitfall is mistaking the shape of your archive for the shape of the past — your corpus is filtered three times over by survival, by what was kept, and by what got digitised. The other recurring traps are systematic OCR bias, raw instead of relative counts, and over-claiming causation from correlation. Avoiding them is mostly about documenting what is missing and refusing to let a tidy chart hide its assumptions. None of this requires advanced statistics; it requires honesty about the data.
Cultural analytics applies computational methods to large cultural collections. Done well it reveals patterns invisible to close reading; done carelessly it produces confident charts that measure the archive's biases rather than history. This guide names the traps and shows a small worked example of catching one.
Why is my corpus not a neutral sample of the past?
Because three filters sit between the past and your files:
- Survival — fragile, cheap and everyday material is lost at far higher rates.
- Selection — archivists and collectors kept what they judged valuable.
- Digitisation — institutions digitise the popular, the fundable and the legible first.
Each filter favours the elite, the literate and the prestige object. A frequency chart drawn from such a corpus describes what survived and was scanned, not what people wrote. Name these gaps in every write-up.
Why does OCR quality bias my results?
OCR errors are not random scatter — they are structured. Gothic type, non-Latin scripts, tightly bound volumes and damaged pages fail systematically. So whole genres are under-represented in the machine-readable text, even when the page images exist.
```python
import string

# A cheap OCR-health check before any analysis:
def garbage_ratio(text):
    tokens = text.split()
    # Strip edge punctuation so "mill," is not junk; allow the words "a" and "I".
    cores = [t.strip(string.punctuation) for t in tokens]
    junk = sum(1 for c in cores
               if not c.isalpha() or (len(c) == 1 and c.lower() not in {"a", "i"}))
    return junk / max(len(tokens), 1)

# Flag documents above ~0.25 for manual review before counting.
```
A document with a 40% garbage ratio should not silently contribute to your word counts. Measure OCR health, segment your corpus by it, and check whether your finding survives in the clean subset.
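The "clean subset" check can be sketched in a few lines. This is a minimal illustration, not a definitive pipeline: the helper names (`term_rate_per_10k`, `compare_clean_subset`), the toy documents and the 0.25 threshold are all assumptions made for the example, and the junk heuristic is a punctuation-tolerant variant of the check above.

```python
import string

def garbage_ratio(text):
    """Share of tokens that look like OCR junk (heuristic; punctuation stripped first)."""
    cores = [t.strip(string.punctuation) for t in text.split()]
    junk = sum(
        1 for c in cores
        if not c.isalpha() or (len(c) == 1 and c.lower() not in {"a", "i"})
    )
    return junk / max(len(cores), 1)

def term_rate_per_10k(docs, term):
    """Occurrences of `term` per 10,000 tokens across a list of document strings."""
    hits = total = 0
    for text in docs:
        words = [t.strip(string.punctuation).lower() for t in text.split()]
        words = [w for w in words if w]  # drop tokens that were pure punctuation
        hits += words.count(term)
        total += len(words)
    return 10_000 * hits / max(total, 1)

def compare_clean_subset(docs, term, threshold=0.25):
    """Report the term rate for the whole corpus and for the low-garbage subset."""
    clean = [d for d in docs if garbage_ratio(d) <= threshold]
    return {
        "all_docs": len(docs),
        "clean_docs": len(clean),
        "rate_all": term_rate_per_10k(docs, term),
        "rate_clean": term_rate_per_10k(clean, term),
    }

# Hypothetical toy corpus: two readable documents and one badly OCR'd page.
docs = [
    "The machine was installed in the mill last spring.",
    "A letter about the harvest and the weather.",
    "Th3 m@chine w4s 1n tbe m1ll l4st spr1ng ~ ~ ;;",
]
report = compare_clean_subset(docs, "machine")
print(report)
```

If `rate_all` and `rate_clean` diverge sharply, the finding depends on OCR quality and should be reported for the clean subset, with the excluded share stated.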
What is survivorship bias, in plain terms?
Survivorship bias is drawing conclusions from the things that lasted while ignoring the things that did not. In cultural collections it means inferring "what people read" from surviving books, even though cheap broadsides and personal letters survived far less often than leather-bound volumes. The pattern you see is the pattern of preservation. The fix is not technical: it is acknowledging the gap and tempering your claims.
A small worked example: spotting a false trend
Suppose you count the word "machine" per decade and see a sharp rise in the 1850s. Before celebrating an industrial-revolution signal, check the denominators:
| Decade | machine count | Total tokens | Rate /10k |
|---|---|---|---|
| 1830s | 40 | 2,000,000 | 0.20 |
| 1840s | 55 | 2,400,000 | 0.23 |
| 1850s | 600 | 3,000,000 | 2.00 |
| 1860s | 90 | 9,000,000 | 0.10 |
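The 1850s "spike" comes from one over-digitised engineering journal, which is why it survives normalisation and only shows up when you check which sources contribute the hits. The 1860s "collapse" is just a much larger, more varied corpus. The raw counts told a story; the denominators and the source breakdown dissolved it. Always show the token totals.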
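The arithmetic behind the table, plus a simple source-concentration check, can be sketched as follows. The counts and token totals are copied from the table; the per-source breakdown for the 1850s is hypothetical (the example only says one journal dominates), and `top_source_share` is an illustrative helper, not a standard measure.

```python
# Counts of "machine" and total tokens per decade, copied from the table.
decades = {
    "1830s": (40, 2_000_000),
    "1840s": (55, 2_400_000),
    "1850s": (600, 3_000_000),
    "1860s": (90, 9_000_000),
}

def rate_per_10k(count, total_tokens):
    """Relative frequency: occurrences per 10,000 tokens."""
    return 10_000 * count / total_tokens

rates = {d: round(rate_per_10k(c, n), 2) for d, (c, n) in decades.items()}

def top_source_share(hits_by_source):
    """Fraction of all hits contributed by the single largest source."""
    total = sum(hits_by_source.values())
    return max(hits_by_source.values()) / total if total else 0.0

# HYPOTHETICAL breakdown of the 600 hits in the 1850s, for illustration only.
hits_1850s = {"engineering_journal": 540, "newspapers": 35, "novels": 25}
share = top_source_share(hits_1850s)

for decade, rate in rates.items():
    print(decade, rate)
print("largest source share in 1850s:", share)
```

Normalising reproduces the Rate /10k column but leaves the 1850s spike in place. It is the source share that exposes it: when one source supplies most of the hits, the "trend" is a property of that source. What share counts as too high is a judgment call to state in the write-up.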
How do I report findings without over-claiming?
- Show sample sizes behind every figure, not just the headline rate.
- Use correlational language unless you have a design that supports cause.
- State the corpus's known gaps up front, not in a buried footnote.
- Test whether the result survives a change in tokenisation or cleaning.
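The last check in the list can be as simple as recomputing the same figure under two tokenisers. A minimal sketch, assuming a hypothetical sample sentence and two illustrative tokenisers (`tokenise_whitespace`, `tokenise_letters`):

```python
import re

def tokenise_whitespace(text):
    """Naive tokeniser: lowercase and split on whitespace, punctuation left attached."""
    return text.lower().split()

def tokenise_letters(text):
    """Alternative tokeniser: lowercase runs of the letters a-z only."""
    return re.findall(r"[a-z]+", text.lower())

def rate_per_10k(tokens, term):
    """Occurrences of `term` per 10,000 tokens."""
    return 10_000 * tokens.count(term) / max(len(tokens), 1)

# Hypothetical sample text, for illustration only.
sample = "The machine stopped. A new machine, they said, would replace the old machine-shop."

for name, tokenise in [("whitespace", tokenise_whitespace), ("letters", tokenise_letters)]:
    tokens = tokenise(sample)
    print(name, len(tokens), tokens.count("machine"), round(rate_per_10k(tokens, "machine"), 1))
```

Here the whitespace tokeniser finds one "machine" and the letters-only tokeniser finds three, because "machine," and "machine-shop." are treated differently. If a headline result flips under a swap like this, report both versions.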
With large corpora, statistical significance is nearly automatic; argue from effect size and robustness instead.
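To see why significance comes almost for free at scale, here is a minimal sketch using a plain Pearson chi-square test on a 2x2 table (one degree of freedom, no continuity correction). The counts are hypothetical and the function name `chi_square_2x2` is an assumption for this example; a statistics library would normally do this job.

```python
import math

def chi_square_2x2(hits_a, total_a, hits_b, total_b):
    """Pearson chi-square and p-value for term vs. non-term counts in two corpora."""
    table = [[hits_a, total_a - hits_a], [hits_b, total_b - hits_b]]
    row_sums = [total_a, total_b]
    col_sums = [hits_a + hits_b, (total_a - hits_a) + (total_b - hits_b)]
    grand = total_a + total_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# HYPOTHETICAL counts: a word used 5% more often in corpus A than in corpus B.
hits_a, total_a = 10_500, 100_000_000
hits_b, total_b = 10_000, 100_000_000

rate_ratio = (hits_a / total_a) / (hits_b / total_b)
chi2, p_value = chi_square_2x2(hits_a, total_a, hits_b, total_b)
print(f"rate ratio = {rate_ratio:.2f}, chi2 = {chi2:.2f}, p = {p_value:.5f}")
```

With these numbers the p-value falls below 0.001 even though the rate ratio is only about 1.05. Whether a 5% difference matters historically is a question the test cannot answer, which is why the effect size belongs in the write-up alongside it.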
Key Takeaways
- Your corpus is filtered by survival, selection and digitisation — never treat it as neutral.
- OCR errors are systematic, biasing whole genres; measure OCR health before counting.
- Survivorship bias over-represents elite and prestige material.
- Relative frequency corrects for corpus size, not representativeness; they are separate issues.
- Always publish the denominators; a spike can be one over-digitised source.
- With big corpora, judge effect size and robustness, not statistical significance.
Frequently Asked Questions
What is the single most common cultural analytics mistake?
Mistaking the shape of your archive for the shape of the past. Your corpus is what survived, was kept, and got digitised — three filters that bias results. Treating it as a neutral sample of history is the root error.
Why is OCR quality a methodological problem, not just a nuisance?
Because OCR errors are systematic, not random: certain fonts, languages and damaged pages fail more often. That means whole categories of source are quietly under-counted, biasing every frequency and model built on the text.
What is survivorship bias in a corpus?
It is the distortion from analysing only what survived. Cheap pamphlets, working-class letters and ephemeral print survive far less often than prestige books, so a corpus over-represents the literate and the elite.
Does normalising frequencies fix sampling bias?
No. Relative frequency corrects for differing corpus sizes per period, but it cannot correct for what is missing or over-represented in the first place. Normalisation and representativeness are separate problems.
How do I report uncertainty honestly?
State your corpus's known gaps and biases, show sample sizes behind every figure, and avoid causal language the data cannot support. A chart without its denominators and caveats over-claims by default.
Is a statistically significant result a meaningful one?
Not necessarily. With large corpora almost any difference becomes statistically significant. Ask whether the effect size is large enough to matter historically, and whether it survives a change in your preprocessing choices.