Appearance
Joining historical tables in R means lining up two datasets by a shared key — a person ID, a parish code, a record number — so information from both sits in one table. The everyday tool is dplyr's family of *_join() functions, and for beginners the two you need first are left_join() (keep everything on the left, add matches) and inner_join() (keep only rows that match in both). This guide builds the idea from scratch with a tiny worked example.
What is a join, in plain language?
Imagine two index cards. One lists people in a 1851 census with an id. The other lists baptisms, each tagged with the same id. A join walks down the first list and, for each person, looks up their baptism on the second list and writes it alongside. The id is the key that tells R which rows describe the same person.
r
library(dplyr)
people <- tibble(
id = c(1, 2, 3),
name = c("Mary Hale", "John Webb", "Ann Frost")
)
baptisms <- tibble(
id = c(1, 3, 4),
parish = c("St Giles", "St Mary", "St Anne")
)Person 2 has no baptism; baptism 4 has no person. How a join handles those gaps is the whole story.
How do I keep everyone from my main table?
Use left_join(). It keeps every row of the left table and fills NA where the right table has no match.
r
left_join(people, baptisms, by = "id")
#> id name parish
#> 1 Mary Hale St Giles
#> 2 John Webb NA <- no baptism found
#> 3 Ann Frost St MaryJohn Webb stays, with a missing parish. That NA is information: it tells you a record is absent, which in historical research is often exactly what you want to find.
How do I keep only people who appear in both?
Use inner_join(). It drops anyone without a match on either side.
r
inner_join(people, baptisms, by = "id")
#> id name parish
#> 1 Mary Hale St Giles
#> 3 Ann Frost St Mary| Join type | Keeps | Use when |
|---|---|---|
left_join | All left rows | Your first table is the master list |
inner_join | Only matched rows | You want records present in both sources |
full_join | All rows from both | You want the union of two sources |
anti_join | Left rows with no match | You want to study the non-matches |
Why did my joined table grow extra rows?
Because the key is not unique. If baptisms had two rows with id = 1, Mary Hale would appear twice after the join — one row per match. Check first:
r
baptisms |> count(id) |> filter(n > 1) # any duplicates?This "row explosion" is the single most common beginner bug. Make each key unique, or decide deliberately that one-to-many duplication is what you want.
How do I find the rows that did not match?
Non-matches are often the historically interesting cases — people who vanish from one source. anti_join() returns exactly them:
r
anti_join(people, baptisms, by = "id") # John Webb: in census, no baptismA small end-to-end recipe
- Confirm both tables share a clean key (
count()to check uniqueness). - Name the key explicitly with
by = "id". - Choose
left_join(keep master) orinner_join(keep both). - Check the row count did not unexpectedly balloon.
- Inspect non-matches with
anti_join()before moving on.
Key Takeaways
- A join lines up two tables by a shared key that identifies the same record.
left_join()keeps every left row;inner_join()keeps only matched rows.- Always set
by = "..."explicitly so dplyr never guesses the wrong column. - Duplicate keys cause row explosion — check with
count()before joining. NAafter a left join means a record is missing, which is often the finding.anti_join()surfaces the unmatched rows you most want to investigate.- Name spelling variation needs normalising or a separate fuzzy step, not the join.
Frequently Asked Questions
What is the difference between left_join and inner_join?
left_join keeps every row from the first table and adds matches from the second, leaving NA where there is no match. inner_join keeps only rows that match in both — useful when you want to drop people who appear in just one source.
Why did my join create more rows than I started with?
Your key is not unique on at least one side, so each left row matches several right rows. This row explosion is the commonest join bug; check with count(table, key) and resolve duplicates before joining.
How do I join on names when spelling varies?
An exact join will miss "Smith" vs "Smyth". Normalise the key first (lowercase, strip punctuation) for easy cases, and for the rest use a separate fuzzy-matching step rather than expecting join() to be approximate.
What is a key, in plain terms?
A key is the column (or columns) that identifies which rows in two tables refer to the same thing — a person ID, a parish code, a record number. Joins line tables up by matching keys.
Should I use by = explicitly?
Yes. Always name the join column with by = "id" rather than letting dplyr guess from shared names. Explicit keys prevent accidental matches on columns that happen to share a name like "date" or "notes".
How do I see which rows failed to match?
After a left_join, filter for NA in a column that only the right table supplies, or use anti_join(left, right) to get exactly the unmatched left rows. Always inspect non-matches in historical data — they are often the interesting cases.