Best Practices for Documenting a Dataset with a README

A dataset README is a single plain-text file that orients any future user to your data before they open it: what the dataset is, who made it, what every file contains, how it was produced, and how it may be reused. Best practice is to write it in Markdown, place it at the top level of the dataset, and cover seven sections — title and creators, summary, file manifest, methods and provenance, variable documentation, licence, and contact. Write it as you build the dataset, not after, so nothing is lost to memory.

Why does a dataset need a README at all?

Filenames lie. Six months from now, final_v3_clean.csv tells neither you nor a stranger what is inside, where it came from, or what status = 2 means. The README is the contract that makes a dataset self-describing and therefore reusable, citable and defensible under scrutiny. It is also the first thing a repository reviewer and a data-paper referee will read.

What sections must a good README contain?

Use a fixed skeleton so every dataset you produce looks the same. A widely adopted model is the README template from Cornell's Research Data Management Service Group; the skeleton below is a shortened Markdown variant in the same spirit.

```markdown
# Title of the dataset
Creators (with ORCIDs), affiliation, date, version, DOI

## Summary
One paragraph: what the data is, period, place, why it exists.

## Files in this dataset
- baptisms.csv: one row per baptism, 1813–1837 (4,012 rows)
- data_dictionary.md: column definitions and coded values
- sources.csv: archive references for each register

## Methods & provenance
How sources were selected, transcribed, transformed.

## Variables
Column definitions and coded values: see data_dictionary.md.

## Licence
CC BY 4.0: https://creativecommons.org/licenses/by/4.0/

## Contact
Name and email address of the person responsible for the dataset.
```

How detailed should the file manifest be?

The manifest is the section reusers depend on most, so give every file a one-line description and, for data files, a row count or size. If files relate to each other (a lookup table joined to a main table), say so and name the join key. A reader should never have to open a file to learn what it is.
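
Row counts are tedious to keep up to date by hand, so it can help to draft the manifest lines with a script and then write the descriptions yourself. The following is only a rough sketch: the function names `csv_rows` and `draft_manifest` are made up for illustration, and it assumes every CSV has exactly one header row and no line breaks inside quoted fields.

```shell
#!/bin/sh
# Sketch: draft one manifest line per CSV file in a dataset folder.
# Assumption: one record per line and a single header row. CSVs with
# quoted line breaks inside fields need a real CSV parser instead.

# csv_rows FILE -> number of data rows (lines minus the header)
csv_rows() {
  # awk counts a final line even when it lacks a trailing newline
  lines=$(awk 'END { print NR }' "$1")
  if [ "$lines" -gt 0 ]; then
    echo $((lines - 1))
  else
    echo 0
  fi
}

# draft_manifest DIR -> one Markdown list item per CSV in DIR
draft_manifest() {
  for f in "$1"/*.csv; do
    [ -e "$f" ] || continue   # no CSV files: the glob stays unexpanded
    printf '%s\n' "- $(basename "$f"): TODO describe (rows: $(csv_rows "$f"))"
  done
}
```

Running `draft_manifest path/to/dataset` prints list items you can paste under the files heading; replace each `TODO describe` with a real one-line description and name any join keys by hand.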

How do I document provenance and methods?

Provenance is what separates a credible humanities dataset from an anonymous spreadsheet. Record the chain from source to value:

- The exact archive, collection and reference (e.g. "TNA, PROB 11").
- The capture date and method (manual transcription, Transkribus model X).
- Every transformation: normalisation, deduplication, coding decisions.
- Known gaps, damaged originals, and illegible passages.

> Dates were normalised to ISO 8601. Where the register gave only a regnal year, the converted year is flagged in the date_certainty column. Three folios were water-damaged and are marked [illegible].

That single note pre-empts the question every reuser would otherwise email you.
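
A transformation is easier to describe in the README when the rule itself is written down as code. As an illustration only, here is a minimal sketch of a date normaliser that emits an ISO 8601 value together with a certainty flag. The function name `normalise_date`, the day/month/year input format, and the flag values `exact`, `year_only` and `unparsed` are all assumptions, not part of any real dataset; regnal-year conversion is deliberately left out.

```shell
#!/bin/sh
# Sketch: normalise a transcribed date and flag how certain it is.
# Assumptions (hypothetical): sources write full dates as D/M/YYYY and
# partial dates as a bare four-digit year. Day and month ranges are
# not validated here.

# normalise_date RAW -> "iso_date,date_certainty"
normalise_date() {
  printf '%s\n' "$1" | awk -F/ '
    # Full date: reorder to YYYY-MM-DD and zero-pad
    NF == 3 && $0 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
      printf "%04d-%02d-%02d,exact\n", $3, $2, $1
      next
    }
    # Year only: keep the reduced-precision value and say so
    /^[0-9][0-9][0-9][0-9]$/ { print $0 ",year_only"; next }
    # Anything else (e.g. "[illegible]") gets no date, only a flag
    { print ",unparsed" }'
}
```

Whatever rule you actually apply, the README should state it in one or two sentences and list the flag values, so a reuser can tell a transcribed date from an inferred one.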

Where should the README live and what should it be called?

Conventions exist for a reason — follow them so tools and humans find the file automatically.

| Decision | Best practice |
| --- | --- |
| Filename | `README.md` (or `README.txt`) |
| Location | Top level of the dataset folder/archive |
| Encoding | UTF-8 |
| Multiple datasets | One README per dataset, plus an optional top-level overview |
| Versioning | State the dataset version and DOI in the README header |
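
The filename, location and encoding rows of the table can be checked mechanically. A minimal sketch, assuming `iconv` is installed; the function names `find_readme`, `is_utf8` and `check_readme` are hypothetical helpers, not an existing tool.

```shell
#!/bin/sh
# Sketch: verify the README exists at the top level and is valid UTF-8.
# Assumption: iconv is available on the system.

# find_readme DIR -> path of README.md or README.txt directly inside DIR
find_readme() {
  for name in README.md README.txt; do
    if [ -f "$1/$name" ]; then
      printf '%s\n' "$1/$name"
      return 0
    fi
  done
  return 1
}

# is_utf8 FILE -> succeeds only if FILE decodes cleanly as UTF-8
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# check_readme DIR -> prints one status line, returns non-zero on failure
check_readme() {
  readme=$(find_readme "$1") || { echo "MISSING: README.md or README.txt in $1"; return 1; }
  is_utf8 "$readme" || { echo "NOT UTF-8: $readme"; return 1; }
  echo "OK: $readme"
}
```

Call `check_readme path/to/dataset` before packaging the archive. It does not check that the version and DOI are present in the header; that still needs a human eye.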

How do I keep README quality consistent across a project?

Inconsistency creeps in when each person writes ad hoc. Three habits fix it:

- Adopt one template (Cornell's or your own) and store it in the project repo.
- Add a README check to your deposit checklist so nothing ships undocumented.
- Lint it: even a simple script can verify required headings exist.

```bash
# Crude completeness check before deposit.
# The ^ anchor means only headings at the start of a line count.
for h in "## Summary" "## Files" "## Methods" "## Licence" "## Contact"; do
  grep -q "^$h" README.md || echo "MISSING: $h"
done
```
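
A heading check says nothing about whether the manifest is complete. One possible companion check, sketched here under simple assumptions, lists every top-level file whose name never appears in the README. The function name `unlisted_files` is hypothetical, and the match is a plain substring search, so a file called `a.csv` would wrongly pass if only `data.csv` were listed.

```shell
#!/bin/sh
# Sketch: report dataset files that the README never mentions.
# Assumptions: the README is DIR/README.md and only top-level files
# are checked (subfolders are ignored).

# unlisted_files DIR -> prints the name of each undocumented file
unlisted_files() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue                  # skip folders and empty globs
    name=$(basename "$f")
    [ "$name" = "README.md" ] && continue    # the README need not list itself
    # -F: treat the filename as a literal string, not a regex
    grep -qF -- "$name" "$1/README.md" || printf '%s\n' "$name"
  done
}
```

An empty result from `unlisted_files path/to/dataset` means every file is at least named somewhere in the README; whether the description is any good is still a job for the deposit checklist.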

A pre-deposit README checklist

[ ] Title, creators with ORCIDs, version, DOI
[ ] One-paragraph summary with period and place
[ ] File manifest with a line per file and row counts
[ ] Methods and full provenance chain
[ ] Pointer to the data dictionary for coded values
[ ] Explicit licence with a link
[ ] Contact and citation guidance
[ ] Plain-text format, UTF-8, at the top level

Key Takeaways

- A README makes a dataset self-describing; filenames never do that on their own.
- Write it in plain Markdown at the top level, in UTF-8.
- The file manifest is the most-used section: give every file a description and row count.
- Provenance and transformation notes pre-empt the questions reusers would email you.
- State the licence in the README even if a separate LICENSE file exists.
- A fixed template plus a deposit checklist keeps quality consistent across a project.

Frequently Asked Questions

What format should a dataset README be in?

Plain-text Markdown (README.md) or .txt is the safest choice because it opens in any text editor without special software and renders on most repositories. Avoid Word or PDF for the README itself, since those add a dependency to read your documentation.

What is the one section a README must never omit?

A file manifest listing every file with a one-line description. Reusers open the README to learn what each file is; without a manifest they are left guessing from filenames alone.

How is a README different from a data dictionary?

A README orients the reader to the whole dataset: what it is, who made it, how files relate. A data dictionary documents the columns and coded values inside a tabular file. Large datasets often need both.

Should the README include the licence?

Yes, state the licence explicitly and link to its full text, even if a separate LICENSE file exists. The README is where reusers look first, and without a stated licence others have no clear permission to reuse the data.

How do I document provenance in a README?

Record where each source came from, the archive reference, the date of capture, and every transformation applied. The aim is that someone could trace any value back to its origin.

Can I reuse one README template across all my datasets?

Yes, a consistent template improves quality and saves time. Cornell's guide to writing "readme" style metadata provides a widely used skeleton you can adapt to your project's conventions.
