Wrangle historical data with the tidyverse: A Practical

Q: Which tidyverse package handles messy historical dates?

Use lubridate for dates and readr for reading typed columns. lubridate parses partial and inconsistent date strings, while readr's col_types argument stops it guessing a date column is a number.

Q: How do I split a single cell holding several values?

Use tidyr::separate_longer_delim() or separate_wider_delim() to break a cell like 'wife; daughter; servant' into separate rows or columns on a delimiter.

Q: How do I keep the original archival value while cleaning?

Add the cleaned value in a new column with mutate() rather than overwriting the source. Keeping name_raw alongside name_clean preserves provenance and lets you audit every transformation.

To wrangle historical data with the tidyverse, read your source with readr (forcing column types), reshape it into tidy form with tidyr (one observation per row), then clean and derive fields with dplyr verbs (filter, mutate, group_by, summarise) chained by the pipe. This pipeline turns a ragged spreadsheet of parish or census records into an analysable table while preserving the original values for provenance.

What does "tidy" mean for archival sources?

Tidy data follows three rules: one variable per column, one observation per row, one value per cell. Archival data rarely arrives this way. A transcribed household schedule might cram several people into one row, or jam "age 34, ag. lab." into a single cell. The job of wrangling is to reach tidy form without losing the messiness that is itself evidence.

Keep the raw value. Always add cleaned columns alongside originals:

library(tidyverse)
census |>
  mutate(occupation_clean = str_to_lower(str_trim(occupation)))
# occupation (raw) is untouched; occupation_clean is derived

How do you read a messy file without corruption?

Let readr read, but tell it the types so it does not guess wrongly:

census <- read_csv(
  here::here("data", "census-1851.csv"),
  col_types = cols(
    age = col_integer(),
    parish = col_character(),
    date = col_date(format = "%d %B %Y"),
    .default = col_character()
  ),
  na = c("", "NA", "—", "ditto")
)

Forcing .default = col_character() stops readr turning a folio reference like 007 into the number 7, which silently destroys leading zeros in catalogue identifiers.

How do you split multi-value cells?

A field holding "wife; daughter; servant" violates the one-value rule. Reshape it:

household |>
  separate_longer_delim(relations, delim = "; ")
# one relation per row, household id repeated

Use separate_wider_delim() instead when each piece is a distinct variable, such as splitting "Smith, John" into surname and forename.

Which dplyr verbs do most of the work?

Five verbs cover almost everything in historical wrangling:

Verb	Use in historical work
`filter()`	Keep rows, e.g. only baptisms after 1813
`mutate()`	Derive columns, e.g. standardise spelling
`select()`	Drop archival admin columns you do not need
`group_by()` + `summarise()`	Aggregate, e.g. burials per parish per decade
`arrange()`	Order, e.g. by year then surname

A worked aggregation:

parish_registers |>
  filter(event == "burial", year >= 1700, year <= 1799) |>
  mutate(decade = (year %/% 10) * 10) |>
  group_by(parish, decade) |>
  summarise(burials = n(), .groups = "drop") |>
  arrange(parish, decade)

How do you handle dates that are barely dates?

Historical dates are partial, dual-style, or written in full. Use lubridate defensively and flag what fails:

library(lubridate)
events |>
  mutate(
    date_parsed = dmy(date_raw),         # "14 March 1789" -> 1789-03-14
    parse_failed = is.na(date_parsed) & !is.na(date_raw)
  )

Reviewing the parse_failed rows is faster than scanning the whole column, and it surfaces the genuinely ambiguous entries (e.g. Old Style/New Style) for manual judgement.

How do you reshape between long and wide?

Counts by category often need flipping. To turn one-row-per-event into a parish-by-decade matrix:

counts |>
  pivot_wider(names_from = decade, values_from = burials, values_fill = 0)

And back again with pivot_longer() when a tool downstream expects tidy long format. Most charting and modelling in R prefers long; human-readable tables prefer wide.

Key Takeaways

Aim for tidy form: one variable per column, one observation per row.
Force column types with col_types so identifiers and dates are not mangled.
Split multi-value cells with separate_longer_delim() / separate_wider_delim().
Derive cleaned columns with mutate(); never overwrite the raw archival value.
Parse dates with lubridate and explicitly flag the rows that fail.
Pivot between long and wide as the downstream tool requires.
Chain verbs with the pipe so each transformation is auditable.

Frequently Asked Questions

What does it mean to make historical data tidy?

Tidy data has one variable per column, one observation per row, and one value per cell. For a census this means each person is a row and fields like name, age and occupation are columns, with no merged or multi-value cells.

Which tidyverse package handles messy historical dates?

Use lubridate for dates and readr for reading typed columns. lubridate parses partial and inconsistent date strings, while readr's col_types argument stops it guessing a date column is a number.

How do I split a single cell holding several values?

Use tidyr::separate_longer_delim() or separate_wider_delim() to break a cell like "wife; daughter; servant" into separate rows or columns on a delimiter.

Should I use the native pipe or magrittr's pipe?

Either works. R 4.1 and later ship a native pipe written as the base operator; magrittr's pipe is older and slightly more flexible. For new humanities projects the native pipe is fine and has no dependency.

How do I keep the original archival value while cleaning?

Add the cleaned value in a new column with mutate() rather than overwriting the source. Keeping name_raw alongside name_clean preserves provenance and lets you audit every transformation.

What does "tidy" mean for archival sources? ​

How do you read a messy file without corruption? ​

How do you split multi-value cells? ​

Which dplyr verbs do most of the work? ​

How do you handle dates that are barely dates? ​

How do you reshape between long and wide? ​

Key Takeaways ​

Frequently Asked Questions ​

What does it mean to make historical data tidy? ​

Which tidyverse package handles messy historical dates? ​

How do I split a single cell holding several values? ​

Should I use the native pipe or magrittr's pipe? ​

How do I keep the original archival value while cleaning? ​

Related reading ​