Beginner's Guide to Corpora for Cultural Analytics

To sample a corpus for cultural analytics, decide what population you're trying to represent, choose a sampling method (random or stratified) that matches that population, and record metadata for every item so you can check representativeness later. Sampling is not a chore that comes after corpus-building — it is the foundation that makes your later numbers mean anything. This guide explains the ideas in plain language and works a small example end to end.

What is a corpus, really?

A corpus is a deliberately chosen collection of texts plus metadata. The word "deliberately" is the whole game. A pile of whatever PDFs you happened to download is a convenience sample; it can illustrate but cannot support a claim like "anxiety rose across the period". A corpus is defined by the population it represents and the rule by which items were selected.

Why sample instead of using "everything"?

Two reasons. First, you usually can't get everything — much is unscanned, paywalled or lost. Second, and more subtly, what survives is already a biased sample of what once existed. Wealthy authors are over-represented; ephemeral working-class print is under-represented. Sampling deliberately doesn't remove this bias, but it lets you describe and reason about it instead of hiding it.

Random versus stratified sampling

The two core methods:

Random sampling — every item has an equal chance of selection. Simple and unbiased, but rare groups may be missed entirely.
Stratified sampling — split the population into strata (decades, genres, regions), then sample within each. This guarantees coverage of small but important groups.

For most historical work, stratify by time, because corpora are usually lopsided toward later, better-preserved decades.

python

import pandas as pd

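catalogue = pd.read_csv("catalogue.csv")            # all available items
catalogue["decade"] = catalogue["year"] // 10 * 10  # e.g. 1743 -> 1740

# Shuffle reproducibly, then keep the first 40 rows of each decade.
# Decades with fewer than 40 items are kept whole. This form avoids
# groupby.apply, whose handling of the grouping column differs across
# pandas versions.
sample = (catalogue.sample(frac=1, random_state=1)
                   .groupby("decade")
                   .head(40)
                   .sort_values("decade"))
sample.to_csv("sample.csv", index=False)
print(sample["decade"].value_counts().sort_index())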

A small worked example

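Say you have a catalogue of 6,000 pamphlets, 1700-1799, but 70% fall after 1770. A pure random sample of 400 would be dominated by the late century: about 280 of the 400 items would be expected to come from after 1770, leaving roughly 120 spread across all the earlier decades. Stratifying by decade and taking 40 per decade gives 10 strata × 40 = 400 items with even temporal coverage, provided every decade has at least 40 items in the catalogue. Each point on a trend line across decades then rests on the same number of texts, so the decades can be compared on an equal footing.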

Method	Coverage of rare decades	Skew toward dense periods	Reproducible
Convenience	Accidental	Severe and unmeasured	No
Random	Not guaranteed	Proportional to the catalogue	With seed
Stratified by decade	Guaranteed	Controlled by design	With seed
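The worked example above can be simulated to see the contrast in numbers. This is a minimal sketch on a synthetic catalogue: the per-decade counts are invented so that 4,200 of the 6,000 items (70%) fall from 1770 onward, and the `id` values are placeholders, not real pamphlets.

```python
import pandas as pd

# Synthetic catalogue: 6,000 items, 70% of them in the last three decades.
# These per-decade counts are assumptions made up for illustration.
counts = {1700: 150, 1710: 200, 1720: 250, 1730: 250, 1740: 300,
          1750: 300, 1760: 350, 1770: 1200, 1780: 1400, 1790: 1600}

catalogue = pd.DataFrame(
    [{"id": f"p{decade}-{i:04d}", "decade": decade}
     for decade, n in counts.items() for i in range(n)]
)

# Simple random sample of 400: every item has the same chance of selection.
random_sample = catalogue.sample(n=400, random_state=1)

# Stratified sample: shuffle, then keep the first 40 items of each decade.
stratified_sample = (catalogue.sample(frac=1, random_state=1)
                              .groupby("decade")
                              .head(40))

# Side-by-side counts per decade for the catalogue and both samples.
comparison = pd.DataFrame({
    "catalogue": catalogue["decade"].value_counts(),
    "random": random_sample["decade"].value_counts(),
    "stratified": stratified_sample["decade"].value_counts(),
}).fillna(0).astype(int).sort_index()
print(comparison)

late_share = (random_sample["decade"] >= 1770).mean()
print(f"Share of the random sample from 1770 onward: {late_share:.0%}")
```

The random column should roughly track the catalogue's lopsided shape, while the stratified column shows 40 items in every decade. Changing `random_state` changes which items are drawn but not the stratified counts.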

What metadata do I need to record?

Record, for every item: a unique ID, date, source repository, language, and any variable you stratified on. Without date and source you cannot later check whether your sample mirrors the population. Store this as a flat CSV alongside the texts — your future self and reviewers will need it.
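A quick automated check can catch missing or duplicated metadata before analysis begins. The sketch below assumes the CSV uses the column names `id`, `date`, `source`, `language` and `decade`; those names, the `check_metadata` helper and the three sample rows are hypothetical and should be adapted to your own file.

```python
import csv
import io

# Assumed column names; rename to match your own metadata file.
REQUIRED = ["id", "date", "source", "language", "decade"]

def check_metadata(rows, required=REQUIRED):
    """Return a list of human-readable problems found in the metadata rows."""
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the CSV header
        for field in required:
            if not (row.get(field) or "").strip():
                problems.append(f"line {line_no}: missing {field}")
        item_id = (row.get("id") or "").strip()
        if item_id:
            if item_id in seen_ids:
                problems.append(f"line {line_no}: duplicate id {item_id}")
            seen_ids.add(item_id)
    return problems

# Tiny in-memory example with invented rows. For a real file, pass
# csv.DictReader(open(path, newline="")) instead.
example = io.StringIO(
    "id,date,source,language,decade\n"
    "p001,1704,Archive A,en,1700\n"
    "p002,,Archive B,en,1710\n"
    "p001,1721,Archive A,fr,1720\n"
)
problems = check_metadata(csv.DictReader(example))
for problem in problems:
    print(problem)
```

Here the second data row lacks a date and the third reuses an ID, so both are reported. An empty list means every required field is filled and every ID is unique; it says nothing about whether the values themselves are correct.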

How do I check my sample is representative?

Compare the sample's distribution to the catalogue's on a couple of known variables (decade, language). If your sample is 25% French but the catalogue is 5% French, you've over-sampled French and must either re-weight in analysis or document the skew. Representativeness is always relative to the available population, never to "the truth" — survival bias sits underneath everything.
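The comparison described above can be sketched in a few lines of pandas. The `compare_shares` helper and the toy data are assumptions made for illustration; the data is built to reproduce the 25% versus 5% French example, and the weight shown is the simple ratio of catalogue share to sample share.

```python
import pandas as pd

def compare_shares(catalogue, sample, column):
    """Tabulate each category's share in the catalogue and in the sample."""
    table = pd.DataFrame({
        "catalogue_share": catalogue[column].value_counts(normalize=True),
        "sample_share": sample[column].value_counts(normalize=True),
    }).fillna(0.0)
    table["difference"] = table["sample_share"] - table["catalogue_share"]
    # Weight that restores catalogue proportions in analysis.
    # Left as NaN where the sample contains no items of that category.
    nonzero_sample = table["sample_share"].where(table["sample_share"] > 0)
    table["weight"] = table["catalogue_share"] / nonzero_sample
    return table

# Invented data: the catalogue is 5% French, the sample is 25% French.
catalogue = pd.DataFrame({"language": ["fr"] * 50 + ["en"] * 950})
sample = pd.DataFrame({"language": ["fr"] * 25 + ["en"] * 75})

report = compare_shares(catalogue, sample, "language")
print(report.round(3))
```

In this toy case each French item gets a weight of 0.2 and each English item a weight of about 1.27, which pulls pooled statistics back toward the catalogue's proportions. Whether to re-weight or simply document the skew depends on the claim being made.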

Key Takeaways

A corpus is a deliberately selected collection plus metadata, not a pile of convenient files.
What survived is already biased; sampling lets you reason about that bias openly.
Stratify by time for historical corpora, which skew toward better-preserved later decades.
Use a fixed random seed so anyone can reproduce your exact sample.
Record ID, date, source and language for every item — date and source are non-negotiable.
Check the sample's distribution against the catalogue and document any skew.
Convenience samples are fine for exploration but never for prevalence or trend claims.

Frequently Asked Questions

What is a corpus in cultural analytics?

A corpus is a structured collection of texts (or images) chosen to represent some population you want to study — a genre, a period, an author group — together with metadata describing each item.

Why sample at all instead of using everything?

Often you can't get everything, and what survives is already a biased sample of what once existed. Deliberate sampling lets you reason about that bias instead of pretending it isn't there.

What's the difference between random and stratified sampling?

Random sampling gives every item an equal chance. Stratified sampling first splits the population into groups (e.g. by decade) and samples within each, guaranteeing coverage of rare strata.

How big should my sample be?

Big enough that each subgroup you'll compare has enough items for its estimates to be stable; a few dozen per cell is a rough floor. The sample needed to detect a difference depends on how large that difference is, so run a small pilot first.
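To get a feel for what "stable" means, it can help to look at how the uncertainty of a simple proportion shrinks as a cell grows. The sketch below uses the standard normal approximation for a proportion, which is an assumption added here and is only rough for small cells; `margin_of_error` is a hypothetical helper name.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n items."""
    return z * math.sqrt(p * (1 - p) / n)

# p = 0.5 is the worst case: it gives the widest margin for a given n.
for n in (10, 40, 160):
    print(f"n = {n:>3}: about ±{margin_of_error(n):.0%}")
```

With these inputs the margin is roughly 31 percentage points at 10 items, 15 at 40 and 8 at 160: quadrupling the cell size halves the margin. That is one reason a few dozen items per cell works as a floor and not as a target.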

What metadata do I need to record?

At minimum a unique ID, date, source/repository, language and any stratification variable. Without date and source you can't reason about representativeness later.

Is a convenience sample ever acceptable?

For exploration, yes, as long as you label it clearly. For any claim about prevalence or trend, a convenience sample undermines the conclusion.
