Best Practices to Import messy historical files in R

Q: Why do my accented characters turn into gibberish on import?

The file is not UTF-8 and readr assumed it was. Detect the encoding with readr::guess_encoding() and pass it explicitly, for example read_csv(file, locale = locale(encoding = 'latin1')). Legacy archival exports are frequently Windows-1252 or Latin-1, not UTF-8.

Q: How do I stop readr from guessing the wrong column type?

readr guesses types from the first 1000 rows, so a numeric column with a stray '[illegible]' later gets coerced badly. Set col_types explicitly with cols(), import the troublesome column as character, then clean and convert it deliberately.

Q: What is the best way to import a spreadsheet with multiple header rows?

Read it with no column names using skip to drop the junk rows, or read with col_names = FALSE and assign names yourself. The readxl package handles .xlsx; specify the sheet and a range like 'A4:G500' to bypass merged title rows.

Q: How do I import a whole folder of inconsistent CSVs at once?

Map over the files with purrr::map() and readr, binding them with a filename column so you can trace each row back to its source. Add type harmonisation in the same step so one odd file does not corrupt the combined table.

To import messy historical files in R reliably, declare the encoding, fix the column types explicitly, and keep every transformation in a script rather than editing by hand. The core toolkit is readr for delimited text, readxl for spreadsheets, and purrr for batches; the core discipline is to treat import as a documented stage that produces a clean, type-correct table plus a record of what you changed. This checklist keeps results consistent across an entire collection, not just one lucky file.

Why does my import look broken before I even start cleaning?

Three failures dominate, and all happen at read time: wrong encoding (mangled accents), wrong column types (numbers read as text or text coerced to NA), and structural junk (title rows, merged cells, footnotes). Solve them in that order. An import that silently swallows a value is worse than one that errors, because you may never notice the loss.

library(readr)
guess_encoding("registre_1788.csv")   # -> tibble of likely encodings + confidence

Run guess_encoding() first on any file older than your own workflow. Archival exports are routinely Windows-1252 or Latin-1, and assuming UTF-8 turns "Évêque" into "Ã‰vêque".

How do I declare encoding and column types explicitly?

Never let readr guess on data you intend to publish. State both the encoding and the types up front:

records <- read_csv(
  "registre_1788.csv",
  locale = locale(encoding = "Windows-1252"),
  col_types = cols(
    id        = col_character(),
    surname   = col_character(),
    amount    = col_character(),   # "12 shillings" — keep as text for now
    year      = col_integer(),
    .default  = col_character()
  )
)

Importing amount as character is deliberate: a column that mixes "12 shillings", "—" and "[illegible]" must arrive intact so you can parse it on your terms. readr infers types from only the first 1000 rows by default, so a stray non-numeric value on row 1500 would otherwise corrupt the whole column.

How do I get past multiple header rows and merged cells?

Spreadsheets from archives often carry a title banner, a sub-heading, and only then the real header. With readxl, target the data directly:

library(readxl)
df <- read_excel("census_returns.xlsx",
                 sheet = "1841",
                 range = "A4:H2300",   # skip merged title rows
                 col_names = TRUE)

Specifying an explicit range or skip is more robust than hoping the file is clean. For CSVs, read_csv(file, skip = 3) does the equivalent. Record the skip value in a comment — it is a fact about that specific file you will need again.

What does a defensible batch import look like?

When you have a folder of inconsistent files, import them together so problems surface once and provenance is preserved. Keep a source_file column on every row.

library(purrr); library(dplyr)

files <- list.files("data/parishes", pattern = "\\.csv$", full.names = TRUE)

combined <- map(files, \(f) {
  read_csv(f, col_types = cols(.default = col_character())) |>
    mutate(source_file = basename(f))
}) |>
  list_rbind()   # one odd file no longer derails the whole bind

Reading everything as character first, binding, then converting types once is the safest order: it stops a single rogue file from forcing a type mismatch across the batch.

How do I decide between readr, readxl and a manual fix?

File situation	Best tool	Note
Delimited text (CSV/TSV)	`readr::read_delim()`	Set `delim`, `locale`, `col_types`
Excel `.xlsx`/`.xls`	`readxl`	Specify `sheet` and `range`
Fixed-width ledgers	`readr::read_fwf()`	Define column widths once
Reconciliation, clustering	OpenRefine, then export	Scripted recipe, then re-import
Folder of mixed files	`purrr::map()` + `readr`	Add a `source_file` column

The rule that survives contact with real collections: if a fix cannot be re-run on the next batch, it does not count as done.

What belongs on the import checklist?

Run this before declaring an import complete:

Encoding detected with guess_encoding() and set explicitly in locale().
col_types declared; ambiguous columns read as character, not guessed.
Header/footer junk handled with skip or range, recorded in a comment.
Every row carries a source_file (and ideally a row number) for provenance.
A post-import check: nrow(), problems(), and summary() reviewed.
No value silently became NA — confirm with colSums(is.na(df)).
The whole import lives in a script that runs unattended on a new batch.

Key Takeaways

Import is a documented stage: encoding, types, structure — in that order.
Always run guess_encoding() on legacy files; do not assume UTF-8.
Declare col_types and read ambiguous columns as character to avoid silent NA.
Use readxl ranges to skip merged title rows in spreadsheets.
Batch-import with purrr::map() and a source_file column for provenance.
Fix files in a script, never by hand in Excel, so it re-runs on new data.
Check problems() and colSums(is.na()) before trusting the result.

Frequently Asked Questions

Why do my accented characters turn into gibberish on import?

The file is not UTF-8 and readr assumed it was. Detect the encoding with readr::guess_encoding() and pass it explicitly, for example read_csv(file, locale = locale(encoding = "latin1")). Legacy archival exports are frequently Windows-1252 or Latin-1, not UTF-8.

How do I stop readr from guessing the wrong column type?

readr guesses types from the first 1000 rows, so a numeric column with a stray "[illegible]" later gets coerced badly. Set col_types explicitly with cols(), import the troublesome column as character, then clean and convert it deliberately.

What is the best way to import a spreadsheet with multiple header rows?

Read it with no column names using skip to drop the junk rows, or read with col_names = FALSE and assign names yourself. The readxl package handles .xlsx; specify the sheet and a range like "A4:G500" to bypass merged title rows.

How should I handle a column that mixes numbers and text like '12 shillings'?

Import it as character so nothing is lost, then extract the parts with stringr or tidyr::separate(). Never let R silently coerce it, which would turn "12 shillings" into NA and destroy the value.

Should I fix messy files by hand in Excel or in R?

Fix them in R (or OpenRefine) so every change is scripted and repeatable across the whole collection. Hand-editing in Excel is unauditable, error-prone, and impossible to re-run when a new batch of files arrives.

How do I import a whole folder of inconsistent CSVs at once?

Map over the files with purrr::map() and readr, binding them with a filename column so you can trace each row back to its source. Add type harmonisation in the same step so one odd file does not corrupt the combined table.

Why does my import look broken before I even start cleaning? ​

How do I declare encoding and column types explicitly? ​

How do I get past multiple header rows and merged cells? ​

What does a defensible batch import look like? ​

How do I decide between readr, readxl and a manual fix? ​

What belongs on the import checklist? ​

Key Takeaways ​

Frequently Asked Questions ​

Why do my accented characters turn into gibberish on import? ​

How do I stop readr from guessing the wrong column type? ​

What is the best way to import a spreadsheet with multiple header rows? ​

How should I handle a column that mixes numbers and text like '12 shillings'? ​

Should I fix messy files by hand in Excel or in R? ​

How do I import a whole folder of inconsistent CSVs at once? ​

Related reading ​