Skip to content
R for the Humanities

To clean messy data in R, profile it first to see what is actually wrong, then fix problems in a fixed order: trim and standardise text, parse dates and numbers, reconcile inconsistent categories, flag (not delete) missing values, and validate against logic checks. Do all of it in a script of piped dplyr steps so the cleaning is reproducible and every change is documented. The script, not hand-editing, is what makes the result trustworthy.

Step 1: profile before you touch anything

Cleaning blind wastes time on the wrong problems. Get a picture first:

r
library(tidyverse)
library(skimr)

records <- read_csv("data/baptisms.csv")
skim(records)          # ranges, missingness, types per column
records |> count(parish, sort = TRUE)   # spot category chaos

skim() shows you the spread of ages, how much is missing, and which columns read as the wrong type. Note the problems before fixing them.

Step 2: clean up text fields

Whitespace and case inconsistency are the cheapest wins. Standardise into new columns:

r
records <- records |>
  mutate(
    surname_clean = surname |> str_squish() |> str_to_title(),
    parish_clean  = parish  |> str_squish() |> str_to_lower()
  )

str_squish() removes leading, trailing and repeated internal spaces in one call, which fixes most transcription whitespace at a stroke.

Step 3: how do you reconcile inconsistent spellings?

Historical names vary wildly: Smyth, Smithe, Smith. Obvious cases yield to a lookup; the rest need fuzzy clustering:

r
# install.packages("refinr")
library(refinr)
records <- records |>
  mutate(surname_keyed = key_collision_merge(surname_clean),
         surname_keyed = n_gram_merge(surname_keyed))

refinr brings OpenRefine's clustering algorithms into R. Always keep surname (raw) beside surname_keyed so a reviewer can audit every merge.

Step 4: parse dates and numbers properly

Numbers read as text and dates read as numbers are classic faults. Coerce explicitly and flag failures:

r
library(lubridate)
records <- records |>
  mutate(
    age = parse_number(age),          # "34 yrs" -> 34
    date = dmy(date_raw),             # "3 May 1789" -> 1789-05-03
    date_failed = is.na(date) & !is.na(date_raw)
  )

Inspecting date_failed is faster than re-reading the whole column and isolates the genuinely ambiguous entries.

Step 5: handle missing values as evidence

Do not delete rows just because a field is blank. In archival data, absence is informative. Make it explicit and decide per analysis:

r
records |>
  summarise(across(everything(), ~ sum(is.na(.x))))  # count NAs per column

A blank occupation for an infant is expected; a blank surname for an adult may be a transcription gap worth chasing. Globally dropping NA rows silently shrinks and biases your dataset.

Step 6: validate with logic checks

Catch the impossible before it reaches analysis. Range and cross-field rules surface transcription slips:

r
suspect <- records |>
  filter(age > 110 | age < 0 |
         (!is.na(birth) & !is.na(death) & death < birth))
suspect

Filter to the failing rows, inspect them against the original source, and correct what is clearly an error. The table below summarises the common faults and their fix:

ProblemSymptomFix
Whitespace/case" Smith ", "SMITH"str_squish(), str_to_title()
Spelling variantsSmyth, Smitherefinr clustering
Wrong typeage as textparse_number()
Bad datesmixed formatslubridate, flag failures
Impossible valuesage 250range and logic checks

Step 7: keep it reproducible

Never edit cells by hand. Chain every step so the whole pipeline reruns on updated data:

r
clean <- records |>
  mutate(across(where(is.character), str_squish)) |>
  mutate(date = dmy(date_raw), age = parse_number(age)) |>
  filter(age <= 110 | is.na(age))
write_csv(clean, "data-clean/baptisms-clean.csv")

The script is your audit trail: it documents exactly what changed and lets anyone reproduce the cleaned file from the raw source.

Key Takeaways

  • Profile first with skim(); never clean blind.
  • Standardise text with str_squish() and case functions into new columns.
  • Cluster spelling variants with refinr; keep the raw value alongside.
  • Coerce numbers and dates explicitly and flag the parse failures.
  • Treat missing values as evidence; flag rather than delete rows.
  • Validate with range and cross-field logic checks against the source.
  • Script every step in piped dplyr for a reproducible audit trail.

Frequently Asked Questions

What should I do before changing any value in R?

Profile the data first with skimr::skim() or summary() to see ranges, missingness and odd values. Cleaning blind risks fixing the wrong thing; profiling tells you what is actually broken before you touch it.

How do I clean inconsistent spellings of the same name?

Standardise with string functions for obvious cases, then group near-duplicates with stringdist or the refinr package, which brings OpenRefine-style clustering into R. Keep the raw spelling in a separate column.

Should I delete rows with missing values?

Usually not. In historical sources missingness is itself evidence. Flag missing values explicitly and decide per analysis whether to exclude them, rather than dropping rows globally and silently shrinking your dataset.

How do I catch impossible historical values?

Write range and logic checks, for example an age over 110 or a death date before a birth date. Filter to the failing rows and inspect them; many are transcription slips you can correct against the source.

How do I make my cleaning steps reproducible?

Put every transformation in a script as piped dplyr steps rather than editing cells by hand. The script documents exactly what changed, can be rerun on updated data, and serves as an audit trail.