Beginner's Guide to Historical tables in R

Q: What is the difference between left_join and inner_join?

left_join keeps every row from the first table and adds matches from the second, leaving NA where there is no match. inner_join keeps only rows that match in both — useful when you want to drop people who appear in just one source.

Q: How do I join on names when spelling varies?

An exact join will miss 'Smith' vs 'Smyth'. Normalise the key first (lowercase, strip punctuation) for easy cases, and for the rest use a separate fuzzy-matching step rather than expecting join() to be approximate.

Q: Should I use by = explicitly?

Yes. Always name the join column with by = 'id' rather than letting dplyr guess from shared names. Explicit keys prevent accidental matches on columns that happen to share a name like 'date' or 'notes'.

Q: How do I see which rows failed to match?

After a left_join, filter for NA in a column that only the right table supplies, or use anti_join(left, right) to get exactly the unmatched left rows. Always inspect non-matches in historical data — they are often the interesting cases.

Joining historical tables in R means lining up two datasets by a shared key — a person ID, a parish code, a record number — so information from both sits in one table. The everyday tool is dplyr's family of *_join() functions, and for beginners the two you need first are left_join() (keep everything on the left, add matches) and inner_join() (keep only rows that match in both). This guide builds the idea from scratch with a tiny worked example.

What is a join, in plain language?

Imagine two index cards. One lists people in a 1851 census with an id. The other lists baptisms, each tagged with the same id. A join walks down the first list and, for each person, looks up their baptism on the second list and writes it alongside. The id is the key that tells R which rows describe the same person.

library(dplyr)

people <- tibble(
  id   = c(1, 2, 3),
  name = c("Mary Hale", "John Webb", "Ann Frost")
)

baptisms <- tibble(
  id     = c(1, 3, 4),
  parish = c("St Giles", "St Mary", "St Anne")
)

Person 2 has no baptism; baptism 4 has no person. How a join handles those gaps is the whole story.

How do I keep everyone from my main table?

Use left_join(). It keeps every row of the left table and fills NA where the right table has no match.

left_join(people, baptisms, by = "id")
#> id  name        parish
#> 1   Mary Hale   St Giles
#> 2   John Webb   NA          <- no baptism found
#> 3   Ann Frost   St Mary

John Webb stays, with a missing parish. That NA is information: it tells you a record is absent, which in historical research is often exactly what you want to find.

How do I keep only people who appear in both?

Use inner_join(). It drops anyone without a match on either side.

inner_join(people, baptisms, by = "id")
#> id  name        parish
#> 1   Mary Hale   St Giles
#> 3   Ann Frost   St Mary

Join type	Keeps	Use when
`left_join`	All left rows	Your first table is the master list
`inner_join`	Only matched rows	You want records present in both sources
`full_join`	All rows from both	You want the union of two sources
`anti_join`	Left rows with no match	You want to study the non-matches

Why did my joined table grow extra rows?

Because the key is not unique. If baptisms had two rows with id = 1, Mary Hale would appear twice after the join — one row per match. Check first:

baptisms |> count(id) |> filter(n > 1)   # any duplicates?

This "row explosion" is the single most common beginner bug. Make each key unique, or decide deliberately that one-to-many duplication is what you want.

How do I find the rows that did not match?

Non-matches are often the historically interesting cases — people who vanish from one source. anti_join() returns exactly them:

anti_join(people, baptisms, by = "id")   # John Webb: in census, no baptism

A small end-to-end recipe

Confirm both tables share a clean key (count() to check uniqueness).
Name the key explicitly with by = "id".
Choose left_join (keep master) or inner_join (keep both).
Check the row count did not unexpectedly balloon.
Inspect non-matches with anti_join() before moving on.

Key Takeaways

A join lines up two tables by a shared key that identifies the same record.
left_join() keeps every left row; inner_join() keeps only matched rows.
Always set by = "..." explicitly so dplyr never guesses the wrong column.
Duplicate keys cause row explosion — check with count() before joining.
NA after a left join means a record is missing, which is often the finding.
anti_join() surfaces the unmatched rows you most want to investigate.
Name spelling variation needs normalising or a separate fuzzy step, not the join.

Frequently Asked Questions

What is the difference between left_join and inner_join?

left_join keeps every row from the first table and adds matches from the second, leaving NA where there is no match. inner_join keeps only rows that match in both — useful when you want to drop people who appear in just one source.

Why did my join create more rows than I started with?

Your key is not unique on at least one side, so each left row matches several right rows. This row explosion is the commonest join bug; check with count(table, key) and resolve duplicates before joining.

How do I join on names when spelling varies?

An exact join will miss "Smith" vs "Smyth". Normalise the key first (lowercase, strip punctuation) for easy cases, and for the rest use a separate fuzzy-matching step rather than expecting join() to be approximate.

What is a key, in plain terms?

A key is the column (or columns) that identifies which rows in two tables refer to the same thing — a person ID, a parish code, a record number. Joins line tables up by matching keys.

Should I use by = explicitly?

Yes. Always name the join column with by = "id" rather than letting dplyr guess from shared names. Explicit keys prevent accidental matches on columns that happen to share a name like "date" or "notes".

How do I see which rows failed to match?

After a left_join, filter for NA in a column that only the right table supplies, or use anti_join(left, right) to get exactly the unmatched left rows. Always inspect non-matches in historical data — they are often the interesting cases.

What is a join, in plain language? ​

How do I keep everyone from my main table? ​

How do I keep only people who appear in both? ​

Why did my joined table grow extra rows? ​

How do I find the rows that did not match? ​

A small end-to-end recipe ​

Key Takeaways ​

Frequently Asked Questions ​

What is the difference between left_join and inner_join? ​

Why did my join create more rows than I started with? ​

How do I join on names when spelling varies? ​

What is a key, in plain terms? ​

Should I use by = explicitly? ​

How do I see which rows failed to match? ​

Related reading ​