Research Data Curation

A data dictionary is the documentation that defines every field in your dataset — its name, meaning, type, allowed values, units and missing-data convention — so anyone can interpret a column or coded value without guessing. To write one, build a table with a row per field and columns for name, description, type, allowed values and notes, then fill it in as the dataset takes shape. For historians the highest-value entries are the coded values and the conventions for dates, names and uncertainty, because those are the first things memory loses.

Why does a dataset need a data dictionary?

Because columns are silent. A field called status holding 1, 2, 3 is meaningless to a colleague, a peer reviewer, or you in eighteen months. The data dictionary is where 1 = baptised, 2 = received into the church, 3 = uncertain is written down, along with how you decided. Without it, a dataset is unauditable and effectively unreusable, however clean the numbers look.

What should every data dictionary entry contain?

Use a fixed set of columns so the dictionary is itself tidy and, ideally, machine-readable.

| field_name | description | type | allowed_values | missing | notes |
| --- | --- | --- | --- | --- | --- |
| baptism_id | Unique record ID | integer | 1–4012 | none | primary key |
| date_norm | Normalised date | EDTF | YYYY-MM-DD | (empty) | from date_verbatim |
| date_verbatim | Date as written | string | free text | [illegible] | source spelling kept |
| status | Rite recorded | integer | 1,2,3 | -1 | 1=baptised,2=received,3=uncertain |
| father_occ | Father's occupation | string | free text | (empty) | not normalised |

Notice the dictionary itself records the missing-data convention per field — a detail that silently breaks analyses when undocumented.
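As a minimal sketch of why the per-field convention matters, the snippet below reads the `missing` column from a dictionary and uses it to decide whether a cell counts as missing. The inline CSV is a hypothetical excerpt modelled on the table above, and the helper names `missing_markers` and `is_missing` are illustrative, not part of any library.

```python
import csv
import io

# Hypothetical excerpt of data_dictionary.csv, modelled on the table above.
DICTIONARY_CSV = """field_name,type,missing
baptism_id,integer,none
date_norm,EDTF,
date_verbatim,string,[illegible]
status,integer,-1
father_occ,string,
"""

def missing_markers(dictionary_text):
    """Map each field name to the string that means 'missing' in that field."""
    markers = {}
    for row in csv.DictReader(io.StringIO(dictionary_text)):
        # Assumption: the literal word 'none' means the field is never missing.
        markers[row["field_name"]] = None if row["missing"] == "none" else row["missing"]
    return markers

def is_missing(field, value, markers):
    """True if value is the documented missing marker for this field."""
    marker = markers[field]
    return marker is not None and value == marker

markers = missing_markers(DICTIONARY_CSV)
print(is_missing("status", "-1", markers))      # True: -1 is the documented marker
print(is_missing("father_occ", "", markers))    # True: empty means missing here
print(is_missing("baptism_id", "", markers))    # False: this field has no missing marker
```

Because the marker is looked up per field, `-1` is treated as missing in status without being treated as missing anywhere else.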

How do I document dates, names and uncertainty?

These three kinds of information cause the most downstream confusion in historical data, so be explicit:

  • Dates: state the format (EDTF / ISO 8601), and that a verbatim column preserves the original. Document how regnal years or partial dates were converted.
  • Names: note whether the column is standardised or verbatim, and which authority (if any) you reconciled against.
  • Uncertainty: define every certainty flag. inferred, uncertain and illegible must each have a written meaning.

For example, a written definition for one certainty code might read: "status = 3 (uncertain) was assigned where the register entry was damaged or ambiguously worded; the verbatim text is preserved in notes for re-judgement."
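Once the conventions are written down they can also be checked. The sketch below assumes that date_norm permits only the simplest EDTF / ISO 8601 calendar forms (YYYY, YYYY-MM, YYYY-MM-DD) and that empty means missing; the function names and the flag `probable` are hypothetical, chosen only to illustrate the check.

```python
import re

# Assumption: date_norm allows only YYYY, YYYY-MM or YYYY-MM-DD.
DATE_NORM = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")

def bad_dates(values):
    """Return values that are neither empty nor in the documented format.

    This checks shape only: an impossible date such as 1787-02-31 still passes.
    """
    return [v for v in values if v != "" and not DATE_NORM.fullmatch(v)]

def undefined_flags(flags_in_data, defined_flags):
    """Return certainty flags used in the data but never defined in the dictionary."""
    return sorted(set(flags_in_data) - set(defined_flags))

print(bad_dates(["1787-03-14", "1787-03", "", "14/03/1787"]))
# ['14/03/1787']
print(undefined_flags(["inferred", "probable"], ["inferred", "uncertain", "illegible"]))
# ['probable']
```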

What format should I write it in?

Two good choices, often used together:

```text
data_dictionary.md   -> human-readable, renders in the repo
data_dictionary.csv  -> machine-readable, drives validation
```
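Maintaining both files by hand invites them to diverge. One option, sketched here under the assumption that the CSV is the authoritative copy, is to generate the Markdown table from it. The function name `dictionary_to_markdown` is illustrative.

```python
import csv
import io

def dictionary_to_markdown(csv_text):
    """Render dictionary CSV text as a pipe-delimited Markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        # Escape pipes so a cell value cannot break the table layout.
        cells = [cell.replace("|", "\\|") for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)

sample = "field_name,description,type\nstatus,Rite recorded,integer\n"
print(dictionary_to_markdown(sample))
```

In practice the returned string would be written to data_dictionary.md whenever the CSV changes.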

The CSV form lets you validate the dataset against its own dictionary. For example, with pandas:

```python
import pandas as pd

# Load the dictionary and the dataset it describes.
dd = pd.read_csv("data_dictionary.csv")
df = pd.read_csv("baptisms.csv")

# Check every documented field actually exists, and vice versa.
documented = set(dd["field_name"])
present = set(df.columns)
print("Undocumented columns:", sorted(present - documented))
print("Missing columns:", sorted(documented - present))
```

When both results come back empty, the data and its documentation agree: a small but powerful curation check.
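The same idea extends from column names to cell values. The sketch below handles only fields whose allowed_values cell is a comma-separated code list, and it treats the documented missing marker as legal. Ranges such as 1–4012 and free text would need their own handling. The inline data and the function names are hypothetical.

```python
import csv
import io

# Hypothetical dictionary excerpt containing only enumerated fields.
DICTIONARY_CSV = '''field_name,allowed_values,missing
status,"1,2,3",-1
'''

# Hypothetical data; the last row carries an undocumented code.
DATA_CSV = '''baptism_id,status
1,1
2,-1
3,4
'''

def enumerated_fields(dictionary_text):
    """Return {field: set of permitted strings}, including the missing marker."""
    allowed = {}
    for row in csv.DictReader(io.StringIO(dictionary_text)):
        codes = {code.strip() for code in row["allowed_values"].split(",")}
        allowed[row["field_name"]] = codes | {row["missing"]}
    return allowed

def violations(data_text, allowed):
    """Return (file line number, field, value) for each undocumented value."""
    found = []
    # start=2 because line 1 of the file is the header row.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(data_text)), start=2):
        for field, permitted in allowed.items():
            if row[field] not in permitted:
                found.append((line_no, field, row[field]))
    return found

print(violations(DATA_CSV, enumerated_fields(DICTIONARY_CSV)))
# [(4, 'status', '4')]
```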

How do I keep the dictionary in sync with the data?

The dictionary rots the moment the schema changes and the documentation does not. Three habits prevent drift:

  • Edit the dictionary in the same commit that changes the schema.
  • Run the sync check above in a pre-deposit script.
  • Record schema changes in the dataset CHANGELOG, referencing the field affected.
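A pre-deposit script needs a pass or fail result rather than printed sets. The following is a minimal sketch using only the standard library; `sync_problems` and `exit_status` are illustrative names, and the column `godparent` is a hypothetical example of an undocumented field.

```python
import csv
import io

def sync_problems(dictionary_file, data_file):
    """Compare documented field names with the data file's header row."""
    documented = {row["field_name"] for row in csv.DictReader(dictionary_file)}
    present = set(next(csv.reader(data_file)))
    problems = [f"undocumented column: {name}" for name in sorted(present - documented)]
    problems += [f"documented but absent: {name}" for name in sorted(documented - present)]
    return problems

def exit_status(problems):
    """Non-zero when anything is out of sync, so a calling script can stop."""
    return 1 if problems else 0

# In-memory stand-ins for data_dictionary.csv and baptisms.csv.
dd = io.StringIO("field_name,description\nbaptism_id,Unique record ID\nstatus,Rite recorded\n")
data = io.StringIO("baptism_id,status,godparent\n1,1,Anne\n")

problems = sync_problems(dd, data)
print(problems)               # ['undocumented column: godparent']
print(exit_status(problems))  # 1
```

In a real script the two arguments would be open file handles, and the result of `exit_status` would be passed to `sys.exit` so that a failed check halts the deposit.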

A step-by-step workflow

```text
1. Freeze the column list once the data shape is stable.
2. For each column, write name, description, type.
3. Add allowed values / ranges and units.
4. Define the missing-data marker per column.
5. Define every coded value and certainty flag in full.
6. Note provenance or caveats in the notes column.
7. Validate data against the dictionary; commit both together.
```
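Steps 1 and 2 can be given a head start by generating an empty dictionary from the frozen column list. This is only a sketch: the column set mirrors the example table earlier in this guide, and `dictionary_skeleton` is a hypothetical helper name.

```python
import csv
import io

# Dictionary columns, following the example table in this guide.
DICTIONARY_COLUMNS = ["field_name", "description", "type",
                      "allowed_values", "missing", "notes"]

def dictionary_skeleton(data_header):
    """Return CSV text with one row per data column and every other cell blank."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=DICTIONARY_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for name in data_header:
        # Keys not supplied are written as empty cells, to be filled in by hand.
        writer.writerow({"field_name": name})
    return out.getvalue()

print(dictionary_skeleton(["baptism_id", "date_norm", "status"]))
```

The blank cells are the point: the script guarantees that no column is forgotten, while the definitions still have to be written by someone who knows the source.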

What does a bad data dictionary look like?

  • Listing column names with no definitions ("self-explanatory" — it never is).
  • Documenting some codes but not all, leaving silent gaps.
  • No missing-data convention, so blanks, NA and -1 mix uncontrolled.
  • A dictionary that describes an older schema than the shipped data.
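The third failure mode can be detected mechanically. The sketch below flags columns in which more than one candidate missing marker appears. The marker list is an assumption to adapt, and a hit is only a prompt to investigate, since -1 can be a legitimate value in some numeric columns.

```python
import csv
import io
from collections import Counter

# Assumption: common stand-ins for "missing"; extend for your own data.
SUSPECT_MARKERS = {"", "NA", "N/A", "-1", "?"}

def mixed_missing(data_text):
    """Return {column: {marker: count}} for columns using more than one marker."""
    seen = {}
    for row in csv.DictReader(io.StringIO(data_text)):
        for field, value in row.items():
            cell = (value or "").strip()
            if cell in SUSPECT_MARKERS:
                seen.setdefault(field, Counter())[cell] += 1
    return {field: dict(counts) for field, counts in seen.items() if len(counts) > 1}

# Hypothetical data in which father_occ mixes three conventions.
sample = "baptism_id,father_occ\n1,weaver\n2,\n3,NA\n4,-1\n"
print(mixed_missing(sample))
```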

Key Takeaways

  • A data dictionary defines each field: name, meaning, type, allowed values, units, missing-data convention.
  • It is where coded values like 1 = baptised become interpretable — never skip those.
  • Keep it as a table (Markdown for humans, CSV for machine validation).
  • Document dates, names and uncertainty conventions explicitly; these confuse reusers most.
  • Validate the dataset against its dictionary as a curation check before deposit.
  • Update the dictionary in the same commit as any schema change to prevent drift.

Frequently Asked Questions

What is a data dictionary?

A data dictionary is documentation that defines every field in a dataset: its name, meaning, data type, allowed values, units, and how missing data is recorded. It tells a reuser exactly what each column and code means.

How is a data dictionary different from a README?

A README orients the reader to the whole dataset and how files relate, while a data dictionary documents the internals of a tabular file column by column. Most well-curated datasets include both.

What columns should a data dictionary table have?

At minimum: field name, description, data type, allowed values or value range, units where relevant, and the missing-data convention. Add a notes column for provenance or coding caveats specific to that field.

What format should a data dictionary be in?

A Markdown or CSV table is the most portable choice. CSV has the advantage of being machine-readable and lets you validate the data against it, while Markdown reads cleanly for humans in a repository.

Do I need a data dictionary for a small dataset?

Yes, even small datasets benefit, because coded values and the meaning of ambiguous columns are forgotten within months. A five-column table can be documented in minutes and saves hours of future confusion.

How do I document coded or categorical values?

List every code and its meaning explicitly. A column holding 1, 2 and 3 is meaningless until the dictionary records that 1 = baptised, 2 = received, 3 = uncertain, alongside how you decided.