To visualise cultural trends, normalise your counts into relative frequencies, bin them by a sensible time unit, plot a line with an honest uncertainty band, and always show corpus density (documents per bin) alongside so readers can spot composition artefacts. The most common mistake, plotting raw counts, produces charts that track the survival of text, not culture. Here is a step-by-step walkthrough.
Step 1: Normalise before you plot anything
Raw word or document counts rise simply because more text survives from recent decades. Convert to a rate:
```python
import pandas as pd

df = pd.read_parquet("hits.parquet")         # one row per occurrence, with 'year'
totals = pd.read_csv("tokens_per_year.csv")  # columns: year, total_tokens

tokens = totals.set_index("year")["total_tokens"]
# Years with no hits should count as 0 rather than turn into NaN after division
counts = df.groupby("year").size().reindex(tokens.index, fill_value=0)
trend = (counts / tokens * 1_000_000).rename("per_million_tokens")
```
Now the y-axis is "occurrences per million tokens", which is comparable across periods of wildly different corpus size.
Step 2: Choose a time bin that matches your data
Yearly bins look precise but are noisy when sparse. If most years have under ~20 documents, bin by decade. The rule: each bin should hold enough documents that its estimate is stable. Show the binning choice in the caption.
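As a rough sketch of decade binning, the function below sums yearly hits and token totals per decade and only then computes the rate. The function name `bin_by_decade`, the two input series, and the toy numbers are assumptions made for illustration, not part of any particular corpus.

```python
import pandas as pd

def bin_by_decade(hits: pd.Series, tokens: pd.Series) -> pd.DataFrame:
    """Aggregate yearly hits and token totals into decades, then compute the rate.

    Both inputs are indexed by integer year (hypothetical inputs for this sketch).
    """
    # Align on year; a year with tokens but no hits counts as 0 hits
    yearly = pd.DataFrame({"hits": hits, "tokens": tokens}).fillna({"hits": 0})
    decade = (yearly.index // 10) * 10  # e.g. 1847 -> 1840
    binned = yearly.groupby(decade).sum()
    # Divide the summed counts instead of averaging yearly rates,
    # so a sparse year does not weigh as much as a dense one.
    binned["per_million_tokens"] = binned["hits"] / binned["tokens"] * 1_000_000
    return binned

# Toy numbers, invented for illustration only
hits = pd.Series({1841: 2, 1845: 1, 1852: 6})
tokens = pd.Series({1841: 100_000, 1845: 50_000, 1852: 200_000, 1858: 100_000})
binned = bin_by_decade(hits, tokens)
print(binned)
```

Summing before dividing is a design choice: it weights each decade by how much text it actually holds, which is usually what a per-million rate is meant to express.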
How do I keep smoothing honest?
A rolling mean tames jitter but can manufacture smooth "trends" from noise. The honest pattern is: smooth and reveal.
```python
ax = trend.rolling(window=3, center=True).mean().plot(label="3-bin rolling mean")
trend.plot(ax=ax, alpha=0.3, marker="o", linestyle="none", label="raw")
ax.legend()
```
State the window length in words. A reader who can see the raw points knows how much to trust the line.
Step 3: Plot corpus density next to the trend
This single habit prevents most false discoveries. A spike in 1847 is exciting until you see that a 900-page periodical joined the corpus that year. Add a second panel:
| Panel | Shows | Catches |
|---|---|---|
| Trend (per-million) | Apparent cultural change | The "finding" |
| Documents per bin | Corpus composition | Artefacts masquerading as findings |
If the trend spike coincides with a density spike, treat it as suspect until proven otherwise.
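One way to build the two panels is a pair of stacked matplotlib axes sharing the x-axis. This is a minimal sketch: the function name `trend_with_density`, the bar width (which assumes decade bins), and the toy series are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def trend_with_density(trend: pd.Series, docs_per_bin: pd.Series):
    """Draw the trend above a documents-per-bin panel on a shared x-axis."""
    fig, (ax_trend, ax_docs) = plt.subplots(
        2, 1, sharex=True, figsize=(8, 5),
        gridspec_kw={"height_ratios": [3, 1]},  # trend panel gets most of the height
    )
    ax_trend.plot(trend.index, trend.values, marker="o")
    ax_trend.set_ylabel("Occurrences per million tokens")
    # width=8 assumes decade bins; adjust for other bin sizes
    ax_docs.bar(docs_per_bin.index, docs_per_bin.values, width=8, color="grey")
    ax_docs.set_ylabel("Documents per bin")
    ax_docs.set_xlabel("Decade")
    return fig, ax_trend, ax_docs

# Toy series, invented for illustration: the 1840s spike in the trend
# lines up with a jump in the number of documents.
decades = list(range(1800, 1900, 10))
trend = pd.Series([12, 13, 11, 14, 30, 15, 14, 16, 15, 17], index=decades, dtype=float)
docs_per_bin = pd.Series([40, 42, 39, 45, 160, 47, 50, 52, 55, 58], index=decades)
fig, ax_trend, ax_docs = trend_with_density(trend, docs_per_bin)
```

Because the panels share an x-axis, a reader can drop a vertical line from any spike in the top panel straight down to the density bar for the same bin.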
Which chart type should I use?
- One or a few series → line chart, direct-labelled at the line ends rather than a legend.
- Many categories → small multiples (faceted lines on a shared axis), not one crowded chart.
- Composition over time → a stacked area only if absolute totals matter; for comparing individual trends, lines beat stacks every time.
Avoid dual y-axes — they let you imply correlations that aren't there.
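For the many-categories case, small multiples can be sketched with a grid of matplotlib axes that share both scales. The helper name `small_multiples`, the column names, and the numbers below are hypothetical.

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def small_multiples(trends: pd.DataFrame, ncols: int = 3):
    """One small panel per column of `trends`, all on shared x and y axes."""
    n = trends.shape[1]
    nrows = math.ceil(n / ncols)
    fig, axes = plt.subplots(
        nrows, ncols, sharex=True, sharey=True, squeeze=False,
        figsize=(3 * ncols, 2.2 * nrows),
    )
    panels = axes.ravel()
    for ax, name in zip(panels, trends.columns):
        ax.plot(trends.index, trends[name])
        ax.set_title(str(name), fontsize=9)
    # Hide any leftover panels in the last row
    for ax in panels[n:]:
        ax.set_visible(False)
    return fig, axes

# Toy per-million rates for four made-up terms, for illustration only
decades = list(range(1800, 1850, 10))
trends = pd.DataFrame(
    {
        "term_a": [5, 6, 8, 9, 12],
        "term_b": [20, 18, 17, 15, 11],
        "term_c": [3, 3, 4, 3, 4],
        "term_d": [1, 2, 4, 8, 16],
    },
    index=decades,
)
fig, axes = small_multiples(trends)
```

The shared y-axis keeps magnitudes comparable across panels; if the series differ by orders of magnitude, independent y-axes may be the better trade-off, at the cost of that comparability.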
Step 4: Show uncertainty
A bare line claims a precision your data rarely has. Bootstrap-resample documents within each bin and draw the 95% band:
```python
import numpy as np

def band(values, n=1000):
    boots = [np.mean(np.random.choice(values, len(values), replace=True))
             for _ in range(n)]
    return np.percentile(boots, [2.5, 97.5])
```
A trend whose band overlaps a flat line across the whole range is not a trend.
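To get a band for the whole series, the same percentile bootstrap can be run once per bin. The sketch below assumes a table with one row per document and hypothetical columns `bin` and `rate` (that document's own per-million rate); it summarises each bin by the mean of document rates, which is an assumption and not the same quantity as the pooled per-million rate.

```python
import numpy as np
import pandas as pd

def bootstrap_band(doc_rates: pd.DataFrame, n_boot: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Percentile-bootstrap 95% band of the mean document rate in each bin.

    `doc_rates` needs columns 'bin' and 'rate' (hypothetical names for this sketch).
    """
    rng = np.random.default_rng(seed)  # seeded so the band is reproducible
    rows = {}
    for bin_label, group in doc_rates.groupby("bin"):
        values = group["rate"].to_numpy()
        # Resample the bin's documents with replacement, n_boot times at once
        samples = rng.choice(values, size=(n_boot, len(values)), replace=True)
        lo, hi = np.percentile(samples.mean(axis=1), [2.5, 97.5])
        rows[bin_label] = {"mean": values.mean(), "lo": lo, "hi": hi}
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy document-level rates, invented for illustration only
doc_rates = pd.DataFrame({
    "bin": [1840] * 5 + [1850] * 5,
    "rate": [10, 12, 14, 16, 18, 20, 25, 30, 35, 40],
})
bands = bootstrap_band(doc_rates)
print(bands)
```

The resulting `lo` and `hi` columns can be drawn with matplotlib's `ax.fill_between(bands.index, bands["lo"], bands["hi"], alpha=0.2)` underneath the trend line.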
Step 5: Label like a publication, not a notebook
The title states the finding ("Mentions of 'machinery' per million words, 1700–1900"), the axes are labelled with units, the source and corpus size sit in a caption, and the colours are colour-blind safe. These finishing steps are what separate a credible figure from a screenshot.
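These finishing touches might look like the following in matplotlib. This is a sketch under assumptions: the helper name `label_figure`, the placeholder source string, the document count, and the plotted values are invented for illustration; the line colour is the blue from the Okabe-Ito colour-blind-safe palette.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

OKABE_ITO_BLUE = "#0072B2"  # one colour from the Okabe-Ito palette

def label_figure(ax, term, start, end, source, n_docs):
    """Add a title, axis labels with units, and a source caption to `ax`."""
    ax.set_title(f"Mentions of '{term}' per million words, {start}-{end}")
    ax.set_xlabel("Year")
    ax.set_ylabel("Occurrences per million words")
    # Caption in the bottom-left corner of the figure: source and corpus size
    ax.figure.text(
        0.01, 0.01, f"Source: {source} (n = {n_docs:,} documents)",
        fontsize=8, ha="left", va="bottom",
    )
    return ax

# Toy values and a placeholder source, invented for illustration only
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot([1700, 1750, 1800, 1850, 1900], [2, 3, 5, 9, 14], color=OKABE_ITO_BLUE)
label_figure(ax, "machinery", 1700, 1900, source="Example corpus", n_docs=1234)
```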
Key Takeaways
- Normalise to relative frequency (per million tokens) before plotting — raw counts track text survival.
- Bin by a unit dense enough for stable estimates; state the binning in the caption.
- Smooth and reveal: show the rolling mean and the raw points together.
- Always plot documents-per-bin beside the trend to catch corpus-composition artefacts.
- Use lines for few series, small multiples for many; avoid dual axes and comparison-by-stacked-area.
- Add a bootstrap confidence band so readers can judge the trend's reliability.
- Finish with publication-quality labels, units, source and accessible colour.
Frequently Asked Questions
Should I plot raw counts or relative frequencies?
Almost always relative frequencies (per million words, or as a share). Raw counts mostly track how much text survives from each period, not genuine cultural change.
How do I smooth a noisy trend without lying?
A rolling mean over a few time bins is fine, but always show the underlying points or a confidence band too, and state the window. Heavy smoothing hides the noise that tells readers how much to trust the line.
Why does my trend spike in one specific year?
Usually a corpus composition change — a big document or source joined that year — not a real cultural shift. Plot documents-per-bin alongside the trend to catch it.
Linear or log scale for cultural frequency data?
Use a log y-axis when you care about proportional change or when values span orders of magnitude; use linear for additive comparisons and when zero is meaningful.
What chart type suits trends over time?
A line chart for one or a few series; small multiples (faceted lines) when comparing many. Avoid stacked areas for comparison — they make individual trends hard to read.
How do I show uncertainty in a trend?
Bootstrap within each time bin and draw the confidence band, or show the raw scatter behind the line. A bare line implies a precision you usually don't have.