A data dictionary is the documentation that defines every field in your dataset — its name, meaning, type, allowed values, units and missing-data convention — so anyone can interpret a column or coded value without guessing. To write one, build a table with a row per field and columns for name, description, type, allowed values and notes, then fill it in as the dataset takes shape. For historians the highest-value entries are the coded values and the conventions for dates, names and uncertainty, because those are the first things memory loses.
Why does a dataset need a data dictionary?
Because columns are silent. A field called `status` holding 1, 2, 3 is meaningless to a colleague, a peer reviewer, or you in eighteen months. The data dictionary is where `1 = baptised`, `2 = received into the church`, `3 = uncertain` is written down, along with how you decided. Without it, a dataset is unauditable and effectively unreusable, however clean the numbers look.
What should every data dictionary entry contain?
Use a fixed set of columns so the dictionary is itself tidy and, ideally, machine-readable.
| field_name | description | type | allowed_values | missing | notes |
|---|---|---|---|---|---|
| baptism_id | Unique record ID | integer | 1–4012 | none | primary key |
| date_norm | Normalised date | EDTF | YYYY-MM-DD | `` (empty) | from date_verbatim |
| date_verbatim | Date as written | string | free text | [illegible] | source spelling kept |
| status | Rite recorded | integer | 1,2,3 | -1 | 1=baptised,2=received,3=uncertain |
| father_occ | Father's occupation | string | free text | `` (empty) | not normalised |
Notice the dictionary itself records the missing-data convention per field — a detail that silently breaks analyses when undocumented.
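As a rough illustration of why the per-field convention matters, the sketch below reads each field's missing marker from the dictionary and applies it while loading the data. The inlined CSV excerpts, the function names and the choice to store the "empty" marker as an empty cell are assumptions made for the example, not part of the dataset described above.

```python
import csv
import io

# Hypothetical excerpts of the dictionary and the dataset, inlined so the sketch runs as-is.
# Assumption: an "empty" missing marker is stored as an empty cell in the dictionary CSV.
DICTIONARY_CSV = """field_name,missing
date_norm,
date_verbatim,[illegible]
status,-1
"""

DATA_CSV = """date_norm,date_verbatim,status
1742-03-14,14 March 1742,1
,[illegible],-1
1743-01-02,2 Jan 1743,3
"""

def load_missing_markers(dictionary_file):
    """Map each field name to the marker its dictionary row declares as missing."""
    return {row["field_name"]: row["missing"] for row in csv.DictReader(dictionary_file)}

def read_with_missing(data_file, markers):
    """Read rows, replacing each field's documented missing marker with None."""
    rows = []
    for row in csv.DictReader(data_file):
        rows.append({
            field: None if value == markers.get(field) else value
            for field, value in row.items()
        })
    return rows

markers = load_missing_markers(io.StringIO(DICTIONARY_CSV))
rows = read_with_missing(io.StringIO(DATA_CSV), markers)
print(rows[1])  # every field in the second record is missing
```

Because the marker is looked up per field, `-1` is treated as missing only in `status`; the same string in another column would be left alone.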
How do I document dates, names and uncertainty?
These three kinds of information cause the most downstream confusion in historical data, so be explicit:
- Dates: state the format (EDTF / ISO 8601), and that a verbatim column preserves the original. Document how regnal years or partial dates were converted.
- Names: note whether the column is standardised or verbatim, and which authority (if any) you reconciled against.
- Uncertainty: define every certainty flag. `inferred`, `uncertain` and `illegible` must each have a written meaning.

A written definition might read: `status = 3` (uncertain) was assigned where the register entry was damaged or ambiguously worded; the verbatim text is preserved in `notes` for re-judgement.
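A documented date format can also be checked mechanically. The sketch below assumes `date_norm` holds either the empty missing marker or a date at year, month or day precision (`YYYY`, `YYYY-MM`, `YYYY-MM-DD`). That is only a small subset of EDTF, and the function name is hypothetical.

```python
import re
from datetime import date

# Accepts YYYY, YYYY-MM or YYYY-MM-DD only: a deliberately small subset of EDTF.
DATE_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")

def is_valid_date_norm(value):
    """Return True for the empty missing marker or a real date at year, month or day precision."""
    if value == "":
        return True
    match = DATE_PATTERN.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Fill an unknown month or day with 1 so the calendar check still applies.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True

for candidate in ["1742-03-14", "1742-03", "1742", "1742-02-30", "14 March 1742"]:
    print(candidate, is_valid_date_norm(candidate))
```

One caveat worth recording in the dictionary: Python's `date` uses the proleptic Gregorian calendar, so a Julian-calendar date such as 1700-02-29 would be rejected by this check. Registers kept under the Julian calendar may need a different rule.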
What format should I write it in?
Two good choices, often used together:
```text
data_dictionary.md  -> human-readable, renders in the repo
data_dictionary.csv -> machine-readable, drives validation
```

The CSV form lets you validate the dataset against its own dictionary. For example, with pandas:

```python
import pandas as pd

dd = pd.read_csv("data_dictionary.csv")
df = pd.read_csv("baptisms.csv")

# Check every documented field actually exists, and vice versa
documented = set(dd["field_name"])
present = set(df.columns)
print("Undocumented columns:", present - documented)
print("Missing columns:", documented - present)
```

Two empty sets here mean the data and its documentation agree, a small but powerful curation check.
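The same idea extends from column names to coded values. The sketch below is one possible approach using only the standard library. It assumes that `allowed_values` for a coded field is a comma-separated list and that the missing marker is also acceptable in the data. The inlined CSV excerpts and function names are hypothetical.

```python
import csv
import io

# Hypothetical excerpts, inlined so the sketch runs without files on disk.
DICTIONARY_CSV = '''field_name,type,allowed_values,missing
status,integer,"1,2,3",-1
'''

DATA_CSV = '''baptism_id,status
1,1
2,-1
3,4
4,2
'''

def allowed_codes(dictionary_file, field):
    """Return the values a field may hold: its listed codes plus its missing marker."""
    for row in csv.DictReader(dictionary_file):
        if row["field_name"] == field:
            codes = {code.strip() for code in row["allowed_values"].split(",")}
            return codes | {row["missing"]}
    raise KeyError(f"{field} is not documented in the dictionary")

def find_invalid(data_file, field, allowed):
    """List (row_number, value) pairs that are not allowed; row 1 is the first data row."""
    return [
        (number, row[field])
        for number, row in enumerate(csv.DictReader(data_file), start=1)
        if row[field] not in allowed
    ]

allowed = allowed_codes(io.StringIO(DICTIONARY_CSV), "status")
invalid = find_invalid(io.StringIO(DATA_CSV), "status", allowed)
print(invalid)  # the third data row holds an undocumented code
```

Values are compared as strings on purpose, so a stray `4` or a mistyped ` 1` is reported rather than silently coerced.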
How do I keep the dictionary in sync with the data?
The dictionary rots the moment the schema changes and the documentation does not. Three habits prevent drift:
- Edit the dictionary in the same commit that changes the schema.
- Run the sync check above in a pre-deposit script.
- Record schema changes in the dataset CHANGELOG, referencing the field affected.
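A pre-deposit script can be as small as a function that reports mismatches and returns a shell-style exit code. The sketch below is one way to arrange that with the standard library. The function names and the inlined sample data are assumptions; a real script would open `data_dictionary.csv` and the dataset file, then pass the result of `main` to `sys.exit`.

```python
import csv
import io

def sync_problems(dictionary_file, data_file):
    """Return one message per mismatch between documented and present columns."""
    documented = {row["field_name"] for row in csv.DictReader(dictionary_file)}
    present = set(next(csv.reader(data_file)))  # header row of the dataset
    problems = [f"undocumented column: {name}" for name in sorted(present - documented)]
    problems += [f"documented but absent: {name}" for name in sorted(documented - present)]
    return problems

def main(dictionary_file, data_file):
    """Print any problems and return an exit code (0 means dictionary and data agree)."""
    problems = sync_problems(dictionary_file, data_file)
    for message in problems:
        print(message)
    return 1 if problems else 0

# Hypothetical inputs: the dataset has gained a column the dictionary does not describe.
exit_code = main(
    io.StringIO("field_name\nbaptism_id\nstatus\n"),
    io.StringIO("baptism_id,status,godparent\n1,1,Anne\n"),
)
print("exit code:", exit_code)
```

Returning a non-zero code lets a commit hook or deposit checklist stop automatically when the two files have drifted apart.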
A step-by-step workflow
```text
1. Freeze the column list once the data shape is stable.
2. For each column, write name, description, type.
3. Add allowed values / ranges and units.
4. Define the missing-data marker per column.
5. Define every coded value and certainty flag in full.
6. Note provenance or caveats in the notes column.
7. Validate data against the dictionary; commit both together.
```

What does a bad data dictionary look like?
- Listing column names with no definitions ("self-explanatory" — it never is).
- Documenting some codes but not all, leaving silent gaps.
- No missing-data convention, so blanks, `NA` and `-1` mix uncontrolled.
- A dictionary that describes an older schema than the shipped data.
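Mixed missing markers are easy to detect before they cause trouble. The sketch below counts candidate markers in one column's raw values and flags the column when more than one is in use. The marker list and the sample column are assumptions for illustration; extend the list to match your own data.

```python
from collections import Counter

# Candidate markers to look for; this list is an assumption, not a standard.
CANDIDATE_MARKERS = {"", "NA", "N/A", "-1", "?", "null"}

def missing_marker_counts(values):
    """Count how often each candidate missing marker appears among raw string values."""
    stripped = (value.strip() for value in values)
    return Counter(value for value in stripped if value in CANDIDATE_MARKERS)

# Hypothetical raw values from a status column.
status_column = ["1", "2", "", "NA", "-1", "3", "NA"]
counts = missing_marker_counts(status_column)

if len(counts) > 1:
    print("Mixed missing markers:", dict(counts))
```

A column that reports more than one marker needs either a cleanup pass or a dictionary entry explaining what each marker means.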
Key Takeaways
- A data dictionary defines each field: name, meaning, type, allowed values, units, missing convention.
- It is where coded values like `1 = baptised` become interpretable; never skip those.
- Keep it as a table (Markdown for humans, CSV for machine validation).
- Document dates, names and uncertainty conventions explicitly; these confuse reusers most.
- Validate the dataset against its dictionary as a curation check before deposit.
- Update the dictionary in the same commit as any schema change to prevent drift.
Frequently Asked Questions
What is a data dictionary?
A data dictionary is documentation that defines every field in a dataset: its name, meaning, data type, allowed values, units, and how missing data is recorded. It tells a reuser exactly what each column and code means.
How is a data dictionary different from a README?
A README orients the reader to the whole dataset and how files relate, while a data dictionary documents the internals of a tabular file column by column. Most well-curated datasets include both.
What columns should a data dictionary table have?
At minimum: field name, description, data type, allowed values or value range, units where relevant, and the missing-data convention. Add a notes column for provenance or coding caveats specific to that field.
What format should a data dictionary be in?
A Markdown or CSV table is the most portable choice. CSV has the advantage of being machine-readable and lets you validate the data against it, while Markdown reads cleanly for humans in a repository.
Do I need a data dictionary for a small dataset?
Yes, even small datasets benefit, because coded values and the meaning of ambiguous columns are forgotten within months. A five-column table can be documented in minutes and saves hours of future confusion.
How do I document coded or categorical values?
List every code and its meaning explicitly. A column holding 1, 2 and 3 is meaningless until the dictionary records that 1 = baptised, 2 = received, 3 = uncertain, alongside how you decided.
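Once the codes are written down in a regular form, they can be turned back into labels by a few lines of code. The sketch below assumes the code list is stored as `code=meaning` pairs separated by commas, as in the `notes` cell of the example table. The function name is hypothetical.

```python
def parse_code_list(notes):
    """Turn a string such as '1=baptised,2=received' into a {code: meaning} dict."""
    codes = {}
    for pair in notes.split(","):
        code, _, meaning = pair.partition("=")
        codes[code.strip()] = meaning.strip()
    return codes

status_codes = parse_code_list("1=baptised,2=received,3=uncertain")

# Decode a few raw values; anything not in the dictionary is made visible, not guessed.
raw_values = ["1", "3", "2", "7"]
labels = [status_codes.get(value, "(undefined)") for value in raw_values]
print(labels)
```

Showing `(undefined)` for an unknown code is a deliberate choice here: it surfaces the gaps in the dictionary instead of hiding them.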