Design a balanced corpus: A Practical Guide

Design a balanced corpus by first naming the dimensions that matter to your research question - typically date, genre, region and author - then sampling deliberately across them so no single category dominates by accident of survival or digitisation. The core workflow is four steps: define the population, build a sampling frame, set target proportions per stratum, and fill those strata while logging the gap between target and actual. Balance is a design decision made before collection, not a fix applied afterwards.

What does a "balanced" corpus actually mean?

Balance means your corpus mirrors the population you want to describe along the axes relevant to your question. A corpus of eighteenth-century pamphlets that is 80 percent London imprints will tell you about London, not the country, no matter how large it is. The enemy is convenience sampling: grabbing whatever is already digitised, which over-represents popular, well-preserved, and easily-scanned material. Balance is what separates a corpus you can generalise from a pile of files.

How do I define the population and sampling frame?

Start by writing down the target population in one sentence, then enumerate the strata it divides into. A sampling frame is the explicit list of those cells:

text

Population: English-language sermons, 1640-1700
Strata:
  date     -> decades: 1640s, 1650s, 1660s, 1670s, 1680s, 1690s
  region   -> London, provincial England, Scotland
  author   -> conformist, nonconformist
Cells = 6 x 3 x 2 = 36 strata to fill

Now every document has a home, and any empty cell is visible rather than hidden.

Should strata be equal-sized or proportional?

This is the central design fork.

Design	Each stratum sized to	Use when
Balanced	equal counts	comparing categories fairly
Representative	real-world proportions	estimating overall prevalence

If you ask "did conformist and nonconformist sermons differ in vocabulary", equal sizes give each side a fair voice. If you ask "what did the average sermon look like", proportional sizing reflects what readers actually encountered. State which you chose and why.

How big should each stratum be?

Big enough that the numbers you care about stop wobbling as you add text. For word-frequency work that usually means tens of thousands of tokens per cell. A cheap pilot tells you when you have enough:

python

import pandas as pd

freqs = []
for n in [5_000, 10_000, 20_000, 40_000]:
    sample = corpus.sample(n=n, random_state=1)
    freqs.append(sample.text.str.contains("grace").mean())

print(pd.Series(freqs, index=[5_000, 10_000, 20_000, 40_000]))

When the proportion stabilises across sample sizes, the stratum is large enough; if it still jumps, keep adding.

How do I handle survivorship and digitisation bias?

You cannot un-burn the records that did not survive, but you can stop the survivors from speaking for everyone. Three practical moves: document the known gaps openly; stratify so an over-surviving category cannot dominate by sheer volume; and apply weights at analysis time if rebuilding is impractical. Crucially, frame every conclusion as applying to surviving, digitised material, because that is genuinely all the corpus contains.

Can I rebalance a corpus I already have?

Often, yes, and it is cheaper than rebuilding. Two options: sub-sample the over-represented strata down to the size of the smallest comparable cell, which discards data but simplifies analysis; or attach a weight to each document so under-represented strata count for more, which keeps all the text but pushes the complexity into your statistics. Weighting is reversible and loses nothing, so prefer it unless a tool you must use cannot handle weights.

How do I document the design so others trust it?

Publish a corpus composition statement: a table listing each stratum with its target and actual token count, plus a short note on what is missing and why. This single artefact lets a reviewer see at a glance whether your "decline over the 1680s" rests on a well-filled cell or on three surviving pamphlets. Transparency about composition is what turns a private dataset into a citable research instrument.

Key Takeaways

Decide balance during design by naming the dimensions that matter, not afterwards.
Build an explicit sampling frame so every document has a stratum and gaps are visible.
Choose equal-sized strata to compare categories, proportional sizes to estimate prevalence.
Pilot each stratum until key frequencies stop moving as you add text.
Counter survivorship bias by documenting gaps, stratifying, and weighting, not by pretending it is gone.
Publish a composition table of target versus actual counts so others can judge your claims.

Frequently Asked Questions

What does 'balanced' mean for a corpus?

A balanced corpus represents its target population proportionally across the dimensions that matter to your question, such as date, genre, region or author, rather than over-representing whatever happened to survive or be easiest to digitise.

Should every category have the same number of words?

Not necessarily. Equal sizes (a balanced design) suit comparison across categories, while sizes proportional to the real population (a representative design) suit estimating overall trends. Choose based on whether you are comparing groups or describing a whole.

How do I correct for survivorship bias?

Document what is missing, weight or stratify your sample so over-surviving categories do not dominate, and state explicitly that conclusions apply to surviving material. You cannot recover lost sources, but you can stop them from silently skewing results.

How big should each stratum be?

Large enough that frequency estimates are stable, which for word-level analysis often means tens of thousands of tokens per cell. Pilot with a small sample and check whether key frequencies move much as you add more text.

Can I rebalance an existing corpus instead of rebuilding it?

Yes, by sub-sampling over-represented strata down or by applying statistical weights at analysis time. Weighting is cheaper and reversible, while sub-sampling discards data but keeps later analysis simpler.

How do I document the design so others trust it?

Publish a sampling frame and a table of strata with target and actual counts, plus a note on known gaps. This 'corpus composition' statement lets reviewers judge what your results can and cannot support.

What does a "balanced" corpus actually mean? ​

How do I define the population and sampling frame? ​

Should strata be equal-sized or proportional? ​

How big should each stratum be? ​

How do I handle survivorship and digitisation bias? ​

Can I rebalance a corpus I already have? ​

How do I document the design so others trust it? ​

Key Takeaways ​

Frequently Asked Questions ​

What does 'balanced' mean for a corpus? ​

Should every category have the same number of words? ​

How do I correct for survivorship bias? ​

How big should each stratum be? ​

Can I rebalance an existing corpus instead of rebuilding it? ​

How do I document the design so others trust it? ​

Related reading ​

What does a "balanced" corpus actually mean?

How do I define the population and sampling frame?

Should strata be equal-sized or proportional?

How big should each stratum be?

How do I handle survivorship and digitisation bias?

Can I rebalance a corpus I already have?

How do I document the design so others trust it?

Key Takeaways

Frequently Asked Questions

What does 'balanced' mean for a corpus?

Should every category have the same number of words?

How do I correct for survivorship bias?

How big should each stratum be?

Can I rebalance an existing corpus instead of rebuilding it?

How do I document the design so others trust it?

Related reading