How to structure tidy data for the humanities

Structuring tidy data for the humanities means reshaping your transcribed sources so that each variable is a column, each observation is a row, and each kind of thing gets its own table. The trick for historical material is to do this without destroying the source: keep a verbatim transcription, then derive a clean, machine-readable analytical table from it, recording uncertainty in dedicated columns rather than mangling values. Tidy data is the shape that lets you sort, filter, join and analyse in tools like pandas, R or OpenRefine without endless reshaping.

What are the rules of tidy data?

Tidy data, as defined by Hadley Wickham, follows three rules:

Each variable is a column.
Each observation is a row.
Each type of observational unit is its own table.

For a baptism register that means one row per baptism, with columns for date, child, father, mother, parish — not one row per family with children crammed into a single cell.

What does untidy humanities data look like?

The classic offenders are easy to spot once you know them.

Untidy: multiple values in one cell
  godparents
  "John Smith; Mary Brown; Anne Doe"

Untidy: several variables buried in one free-text "notes" cell
  "baptised 3 May, father a weaver, mother deceased"

Untidy: a table mixing two units (people AND events)

Each of these blocks computation. You cannot count godparents, filter by occupation, or join people to events while the data is shaped like prose.
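As a rough illustration of getting variables out of a notes blob, the sketch below pulls three columns from free text with regular expressions. The table, the column names (`date_verbatim`, `father_occupation`, `mother_deceased`) and the patterns are all assumptions made for this example; real registers phrase things in many more ways, so treat the regexes as a starting point rather than a parser.

```python
import pandas as pd

# Hypothetical notes column; the patterns below only cover the phrasings shown here
notes = pd.DataFrame({
    "baptism_id": [1, 2],
    "notes": [
        "baptised 3 May, father a weaver, mother deceased",
        "baptised 9 June, father a mason",
    ],
})

# Each variable gets its own column; the original notes column is kept as the verbatim record
tidy = notes.assign(
    date_verbatim=notes["notes"].str.extract(r"baptised ([^,]+)", expand=False),
    father_occupation=notes["notes"].str.extract(r"father an? (\w+)", expand=False),
    mother_deceased=notes["notes"].str.contains("mother deceased"),
)

# Filtering by occupation is now a one-liner
weavers = tidy[tidy["father_occupation"] == "weaver"]
print(weavers[["baptism_id", "date_verbatim", "father_occupation"]])
```

Rows whose wording does not match a pattern come back empty for that column, which is a useful prompt to check them by hand.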

How do I reshape a source into tidy form, step by step?

Work from a verbatim transcription toward an analytical table.

1. Transcribe faithfully first — keep the source's own words.
2. Identify the observational unit (here: one baptism event).
3. List the variables you will analyse: date, child, father...
4. One row per unit; split multi-value cells into a related table.
5. Normalise machine-readable columns; keep verbatim originals.
6. Document every transformation in the data dictionary.

The godparents example becomes a separate godparents table, one row per godparent, linked back by a baptism_id. That is rule three in action: people and events are different units.
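To make rule three concrete, here is a minimal sketch of splitting a table that mixes events and people into two tables linked by `baptism_id`. The data and the extra columns (`date_verbatim`, `parish`) are invented for illustration.

```python
import pandas as pd

# Hypothetical mixed table: event facts are repeated on every person row
mixed = pd.DataFrame({
    "baptism_id": [1, 1, 2],
    "date_verbatim": ["3 May", "3 May", "9 June"],
    "parish": ["St Mary", "St Mary", "St Mary"],
    "godparent": ["John Smith", "Mary Brown", "Anne Doe"],
})

# Events table: one row per baptism
baptisms = (mixed[["baptism_id", "date_verbatim", "parish"]]
            .drop_duplicates()
            .reset_index(drop=True))

# People table: one row per godparent, linked back by baptism_id
godparents = mixed[["baptism_id", "godparent"]].reset_index(drop=True)

# Joining on the key restores the combined view whenever it is needed;
# validate raises an error if baptism_id is not unique in the events table
rejoined = godparents.merge(baptisms, on="baptism_id", how="left",
                            validate="many_to_one")
print(rejoined)
```

Storing the event facts once means a corrected date or parish only has to be changed in one place.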

How should I handle dates, names and uncertainty?

These three are where humanities data diverges from textbook tidy data.

Issue	Untidy habit	Tidy fix
Dates	`"3 May"` jammed in one column	`date_norm` (EDTF) + `date_verbatim`
Names	inconsistent spellings overwritten	`name_std` + `name_verbatim`
Uncertainty	`"1801?"`	clean value + `certainty` flag column
Illegible	guessed silently	value + `[illegible]` marker, documented

The principle is to never overwrite the source. Keep a verbatim column and a normalised column side by side so analysis runs on the clean one while the faithful record survives.
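A minimal sketch of the side-by-side pattern for an uncertain year follows. The column names (`year_verbatim`, `year_norm`, `certainty`), the flag vocabulary and the helper `split_year` are assumptions for this example, and it handles only four-digit years with an optional trailing question mark.

```python
import re
import pandas as pd

# Hypothetical verbatim values as transcribed from the source
records = pd.DataFrame({"year_verbatim": ["1801?", "1803", "[illegible]"]})

def split_year(raw):
    """Return (year, certainty) parsed from a verbatim year string."""
    text = raw.strip()
    match = re.fullmatch(r"(\d{4})(\?)?", text)
    if match is None:
        # Nothing machine-readable: leave the value empty and record why
        flag = "illegible" if text == "[illegible]" else "unparsed"
        return (pd.NA, flag)
    flag = "uncertain" if match.group(2) else "certain"
    return (int(match.group(1)), flag)

parsed = [split_year(value) for value in records["year_verbatim"]]

# Clean value and certainty flag live in their own columns;
# year_verbatim is never modified
records["year_norm"] = pd.array([year for year, _ in parsed], dtype="Int64")
records["certainty"] = [flag for _, flag in parsed]
print(records)
```

The nullable `Int64` dtype keeps the years as integers even when some rows have no usable value.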

A small worked transformation

Here is the same data before and after, in pandas:

python

import pandas as pd

# Untidy: godparents mashed into one cell
df = pd.read_csv("baptisms_raw.csv")

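# Split into a tidy related table, one godparent per row
gp = (df.assign(godparent=df["godparents"].str.split(";"))
        .explode("godparent")
        .loc[:, ["baptism_id", "godparent"]]
        .reset_index(drop=True))  # explode repeats the original row index
# Trim whitespace, then drop empty entries (blank cells, trailing semicolons)
gp["godparent"] = gp["godparent"].str.strip()
gp = gp[gp["godparent"].notna() & (gp["godparent"] != "")]
gp.to_csv("godparents.csv", index=False)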

Now godparents.csv is tidy: countable, filterable, joinable.
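To show what countable, filterable and joinable mean in practice, the sketch below uses small in-memory stand-ins for the two tables so it runs without any files. The names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the baptisms table and godparents.csv
baptisms = pd.DataFrame({
    "baptism_id": [1, 2, 3],
    "child": ["Jane Hale", "William Hale", "Anne Smith"],
})
gp = pd.DataFrame({
    "baptism_id": [1, 1, 1, 2],
    "godparent": ["John Smith", "Mary Brown", "Anne Doe", "John Smith"],
})

# Countable: godparents per baptism, including baptisms with none recorded
counts = (gp.groupby("baptism_id").size()
            .reindex(baptisms["baptism_id"], fill_value=0)
            .rename("n_godparents"))

# Filterable: every baptism at which John Smith stood as godparent
john = gp[gp["godparent"] == "John Smith"]

# Joinable: attach the child's name to each godparent row
joined = gp.merge(baptisms, on="baptism_id", how="left")

print(counts)
print(joined)
```

Note that matching on the name string treats every "John Smith" as the same person, which is only safe once names have been disambiguated.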

When does tidy data not fit?

Be honest about the limits. Deeply narrative sources — a diary, a letter, a chronicle — do not reduce to rows and columns without violence. The right move is to extract the analysable layer (mentions of people, dates, places) into tidy tables while preserving the full text elsewhere, perhaps as TEI. Tidy data is a target for the structured extraction, not a demand that all humanities material be flattened.
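One way to picture the extraction layer is a mentions table that points back into the untouched text. The sketch below finds four-digit years in a made-up letter and records each with its character offset. The letter, the identifier `letter_001` and the column names are assumptions, and a real project would more likely rely on TEI markup or manual annotation than on a regular expression.

```python
import re
import pandas as pd

# Hypothetical letter text, preserved in full and never edited
letters = {
    "letter_001": "Dear brother, we left Leeds in 1843 and reached York by 1845.",
}

rows = []
for letter_id, text in letters.items():
    # Record each four-digit year together with where it occurs in the text
    for match in re.finditer(r"\b1[0-9]{3}\b", text):
        rows.append({
            "letter_id": letter_id,
            "mention_type": "year",
            "value_verbatim": match.group(0),
            "char_start": match.start(),
        })

# Tidy extraction: one row per mention, linked to its source by letter_id
mentions = pd.DataFrame(rows)
print(mentions)
```

Because each row carries an offset, any mention can be traced back to its exact place in the full text.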

Pitfalls to avoid

Overwriting the source spelling of names and dates — keep verbatim columns.
Cramming multiple values into one cell instead of a related table.
Encoding uncertainty by mangling the value rather than flagging it separately.
Mixing two observational units (people and events) in one table.
Tidying without documenting the transformation in a data dictionary.
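The last pitfall can be partly guarded against in code. The sketch below keeps a small data dictionary as a table and reports any column that has no entry. The dictionary layout and the helper `undocumented_columns` are hypothetical conventions chosen for this example, not a standard.

```python
import pandas as pd

# Hypothetical tidy table to be checked
godparents = pd.DataFrame({
    "baptism_id": [1, 1],
    "godparent": ["John Smith", "Mary Brown"],
})

# Hypothetical data dictionary: one row per column, with the transformation behind it
dictionary = pd.DataFrame([
    {"table": "godparents", "column": "baptism_id",
     "derived_from": "baptisms_raw.baptism_id",
     "transformation": "copied unchanged as the linking key"},
    {"table": "godparents", "column": "godparent",
     "derived_from": "baptisms_raw.godparents",
     "transformation": "split on ';', whitespace trimmed, one name per row"},
])

def undocumented_columns(table_name, table, dictionary):
    """Return columns present in the table but missing from the dictionary."""
    documented = set(dictionary.loc[dictionary["table"] == table_name, "column"])
    return sorted(set(table.columns) - documented)

missing = undocumented_columns("godparents", godparents, dictionary)
print(missing)
```

Running a check like this before each export makes an undocumented derived column visible straight away.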

Key Takeaways

Tidy data = one variable per column, one observation per row, one unit per table.
Always derive the tidy table from a faithful verbatim transcription; never lose the source.
Store normalised and verbatim versions of dates and names side by side.
Record uncertainty in a dedicated certainty column, not by corrupting the value.
Split multi-value cells into related tables linked by an ID.
Narrative sources need tidy extractions, not wholesale flattening.

Frequently Asked Questions

What are the three rules of tidy data?

Each variable forms a column, each observation forms a row, and each type of observational unit forms its own table. Hadley Wickham formalised these rules, and they apply directly to historical tabular data.

Do humanities sources fit the tidy data model?

Tabular and serial sources like registers, censuses and account books fit well. Narrative or deeply nested sources fit less neatly, so tidy data is a target for the analysable extraction, not a demand that you flatten everything you read.

Should I record uncertainty in tidy data?

Yes, but in a dedicated column rather than by mangling the value. Keep the best-estimate value clean and machine-readable, and put flags like uncertain, inferred or illegible in a separate certainty column.

How do I handle dates in historical tidy data?

Store a normalised machine-readable date in its own column, ideally ISO 8601 for complete dates or EDTF for partial and uncertain ones, and keep the original verbatim date string in another. Never overwrite the source spelling of a date.

What is the difference between tidy data and clean data?

Clean data is free of errors and inconsistencies; tidy data is a specific structural shape where variables are columns and observations are rows. Data can be clean but untidy, or tidy but still contain errors.

Can I keep the original messy transcription as well?

Yes, and you should. Keep a verbatim transcription as the faithful record and derive the tidy analytical table from it, documenting the transformation so the link between source and structure stays traceable.
