Appearance
To import messy historical files in R reliably, declare the encoding, fix the column types explicitly, and keep every transformation in a script rather than editing by hand. The core toolkit is readr for delimited text, readxl for spreadsheets, and purrr for batches; the core discipline is to treat import as a documented stage that produces a clean, type-correct table plus a record of what you changed. This checklist keeps results consistent across an entire collection, not just one lucky file.
Why does my import look broken before I even start cleaning?
Three failures dominate, and all happen at read time: wrong encoding (mangled accents), wrong column types (numbers read as text or text coerced to NA), and structural junk (title rows, merged cells, footnotes). Solve them in that order. An import that silently swallows a value is worse than one that errors, because you may never notice the loss.
r
library(readr)
guess_encoding("registre_1788.csv") # -> tibble of likely encodings + confidenceRun guess_encoding() first on any file older than your own workflow. Archival exports are routinely Windows-1252 or Latin-1, and assuming UTF-8 turns "Évêque" into "Évêque".
How do I declare encoding and column types explicitly?
Never let readr guess on data you intend to publish. State both the encoding and the types up front:
r
records <- read_csv(
"registre_1788.csv",
locale = locale(encoding = "Windows-1252"),
col_types = cols(
id = col_character(),
surname = col_character(),
amount = col_character(), # "12 shillings" — keep as text for now
year = col_integer(),
.default = col_character()
)
)Importing amount as character is deliberate: a column that mixes "12 shillings", "—" and "[illegible]" must arrive intact so you can parse it on your terms. readr infers types from only the first 1000 rows by default, so a stray non-numeric value on row 1500 would otherwise corrupt the whole column.
How do I get past multiple header rows and merged cells?
Spreadsheets from archives often carry a title banner, a sub-heading, and only then the real header. With readxl, target the data directly:
r
library(readxl)
df <- read_excel("census_returns.xlsx",
sheet = "1841",
range = "A4:H2300", # skip merged title rows
col_names = TRUE)Specifying an explicit range or skip is more robust than hoping the file is clean. For CSVs, read_csv(file, skip = 3) does the equivalent. Record the skip value in a comment — it is a fact about that specific file you will need again.
What does a defensible batch import look like?
When you have a folder of inconsistent files, import them together so problems surface once and provenance is preserved. Keep a source_file column on every row.
r
library(purrr); library(dplyr)
files <- list.files("data/parishes", pattern = "\\.csv$", full.names = TRUE)
combined <- map(files, \(f) {
read_csv(f, col_types = cols(.default = col_character())) |>
mutate(source_file = basename(f))
}) |>
list_rbind() # one odd file no longer derails the whole bindReading everything as character first, binding, then converting types once is the safest order: it stops a single rogue file from forcing a type mismatch across the batch.
How do I decide between readr, readxl and a manual fix?
| File situation | Best tool | Note |
|---|---|---|
| Delimited text (CSV/TSV) | readr::read_delim() | Set delim, locale, col_types |
Excel .xlsx/.xls | readxl | Specify sheet and range |
| Fixed-width ledgers | readr::read_fwf() | Define column widths once |
| Reconciliation, clustering | OpenRefine, then export | Scripted recipe, then re-import |
| Folder of mixed files | purrr::map() + readr | Add a source_file column |
The rule that survives contact with real collections: if a fix cannot be re-run on the next batch, it does not count as done.
What belongs on the import checklist?
Run this before declaring an import complete:
- Encoding detected with
guess_encoding()and set explicitly inlocale(). col_typesdeclared; ambiguous columns read as character, not guessed.- Header/footer junk handled with
skiporrange, recorded in a comment. - Every row carries a
source_file(and ideally a row number) for provenance. - A post-import check:
nrow(),problems(), andsummary()reviewed. - No value silently became
NA— confirm withcolSums(is.na(df)). - The whole import lives in a script that runs unattended on a new batch.
Key Takeaways
- Import is a documented stage: encoding, types, structure — in that order.
- Always run
guess_encoding()on legacy files; do not assume UTF-8. - Declare
col_typesand read ambiguous columns as character to avoid silentNA. - Use
readxlranges to skip merged title rows in spreadsheets. - Batch-import with
purrr::map()and asource_filecolumn for provenance. - Fix files in a script, never by hand in Excel, so it re-runs on new data.
- Check
problems()andcolSums(is.na())before trusting the result.
Frequently Asked Questions
Why do my accented characters turn into gibberish on import?
The file is not UTF-8 and readr assumed it was. Detect the encoding with readr::guess_encoding() and pass it explicitly, for example read_csv(file, locale = locale(encoding = "latin1")). Legacy archival exports are frequently Windows-1252 or Latin-1, not UTF-8.
How do I stop readr from guessing the wrong column type?
readr guesses types from the first 1000 rows, so a numeric column with a stray "[illegible]" later gets coerced badly. Set col_types explicitly with cols(), import the troublesome column as character, then clean and convert it deliberately.
What is the best way to import a spreadsheet with multiple header rows?
Read it with no column names using skip to drop the junk rows, or read with col_names = FALSE and assign names yourself. The readxl package handles .xlsx; specify the sheet and a range like "A4:G500" to bypass merged title rows.
How should I handle a column that mixes numbers and text like '12 shillings'?
Import it as character so nothing is lost, then extract the parts with stringr or tidyr::separate(). Never let R silently coerce it, which would turn "12 shillings" into NA and destroy the value.
Should I fix messy files by hand in Excel or in R?
Fix them in R (or OpenRefine) so every change is scripted and repeatable across the whole collection. Hand-editing in Excel is unauditable, error-prone, and impossible to re-run when a new batch of files arrives.
How do I import a whole folder of inconsistent CSVs at once?
Map over the files with purrr::map() and readr, binding them with a filename column so you can trace each row back to its source. Add type harmonisation in the same step so one odd file does not corrupt the combined table.