Appearance
To clean messy data in R, profile it first to see what is actually wrong, then fix problems in a fixed order: trim and standardise text, parse dates and numbers, reconcile inconsistent categories, flag (not delete) missing values, and validate against logic checks. Do all of it in a script of piped dplyr steps so the cleaning is reproducible and every change is documented. The script, not hand-editing, is what makes the result trustworthy.
Step 1: profile before you touch anything
Cleaning blind wastes time on the wrong problems. Get a picture first:
r
library(tidyverse)
library(skimr)
records <- read_csv("data/baptisms.csv")
skim(records) # ranges, missingness, types per column
records |> count(parish, sort = TRUE) # spot category chaosskim() shows you the spread of ages, how much is missing, and which columns read as the wrong type. Note the problems before fixing them.
Step 2: clean up text fields
Whitespace and case inconsistency are the cheapest wins. Standardise into new columns:
r
records <- records |>
mutate(
surname_clean = surname |> str_squish() |> str_to_title(),
parish_clean = parish |> str_squish() |> str_to_lower()
)str_squish() removes leading, trailing and repeated internal spaces in one call, which fixes most transcription whitespace at a stroke.
Step 3: how do you reconcile inconsistent spellings?
Historical names vary wildly: Smyth, Smithe, Smith. Obvious cases yield to a lookup; the rest need fuzzy clustering:
r
# install.packages("refinr")
library(refinr)
records <- records |>
mutate(surname_keyed = key_collision_merge(surname_clean),
surname_keyed = n_gram_merge(surname_keyed))refinr brings OpenRefine's clustering algorithms into R. Always keep surname (raw) beside surname_keyed so a reviewer can audit every merge.
Step 4: parse dates and numbers properly
Numbers read as text and dates read as numbers are classic faults. Coerce explicitly and flag failures:
r
library(lubridate)
records <- records |>
mutate(
age = parse_number(age), # "34 yrs" -> 34
date = dmy(date_raw), # "3 May 1789" -> 1789-05-03
date_failed = is.na(date) & !is.na(date_raw)
)Inspecting date_failed is faster than re-reading the whole column and isolates the genuinely ambiguous entries.
Step 5: handle missing values as evidence
Do not delete rows just because a field is blank. In archival data, absence is informative. Make it explicit and decide per analysis:
r
records |>
summarise(across(everything(), ~ sum(is.na(.x)))) # count NAs per columnA blank occupation for an infant is expected; a blank surname for an adult may be a transcription gap worth chasing. Globally dropping NA rows silently shrinks and biases your dataset.
Step 6: validate with logic checks
Catch the impossible before it reaches analysis. Range and cross-field rules surface transcription slips:
r
suspect <- records |>
filter(age > 110 | age < 0 |
(!is.na(birth) & !is.na(death) & death < birth))
suspectFilter to the failing rows, inspect them against the original source, and correct what is clearly an error. The table below summarises the common faults and their fix:
| Problem | Symptom | Fix |
|---|---|---|
| Whitespace/case | " Smith ", "SMITH" | str_squish(), str_to_title() |
| Spelling variants | Smyth, Smithe | refinr clustering |
| Wrong type | age as text | parse_number() |
| Bad dates | mixed formats | lubridate, flag failures |
| Impossible values | age 250 | range and logic checks |
Step 7: keep it reproducible
Never edit cells by hand. Chain every step so the whole pipeline reruns on updated data:
r
clean <- records |>
mutate(across(where(is.character), str_squish)) |>
mutate(date = dmy(date_raw), age = parse_number(age)) |>
filter(age <= 110 | is.na(age))
write_csv(clean, "data-clean/baptisms-clean.csv")The script is your audit trail: it documents exactly what changed and lets anyone reproduce the cleaned file from the raw source.
Key Takeaways
- Profile first with
skim(); never clean blind. - Standardise text with
str_squish()and case functions into new columns. - Cluster spelling variants with
refinr; keep the raw value alongside. - Coerce numbers and dates explicitly and flag the parse failures.
- Treat missing values as evidence; flag rather than delete rows.
- Validate with range and cross-field logic checks against the source.
- Script every step in piped
dplyrfor a reproducible audit trail.
Frequently Asked Questions
What should I do before changing any value in R?
Profile the data first with skimr::skim() or summary() to see ranges, missingness and odd values. Cleaning blind risks fixing the wrong thing; profiling tells you what is actually broken before you touch it.
How do I clean inconsistent spellings of the same name?
Standardise with string functions for obvious cases, then group near-duplicates with stringdist or the refinr package, which brings OpenRefine-style clustering into R. Keep the raw spelling in a separate column.
Should I delete rows with missing values?
Usually not. In historical sources missingness is itself evidence. Flag missing values explicitly and decide per analysis whether to exclude them, rather than dropping rows globally and silently shrinking your dataset.
How do I catch impossible historical values?
Write range and logic checks, for example an age over 110 or a death date before a birth date. Filter to the failing rows and inspect them; many are transcription slips you can correct against the source.
How do I make my cleaning steps reproducible?
Put every transformation in a script as piped dplyr steps rather than editing cells by hand. The script documents exactly what changed, can be rerun on updated data, and serves as an audit trail.