Best Practices for Running a Reproducibility Checklist

To run a reproducibility checklist, you verify a fixed set of conditions — open data (or a documented reason it isn't), versioned code, a pinned environment, a single command that rebuilds every result, and outputs that match the published ones — and you confirm them on a machine that has never seen the project. The strongest single check is that clean-machine rebuild: if a stranger follows your README and one command regenerates your headline figures, you are reproducible; if not, the checklist tells you precisely which condition failed. Treat the list as a graded scale, not a moral pass/fail.

What does a working digital humanities (DH) reproducibility checklist contain?

Keep it short enough to actually run. These items, in order, cover the lifecycle:

#	Condition	How you verify it
1	Data available or access documented	DOI/link present; restrictions stated
2	Code under version control	Tagged release in Git
3	Environment pinned	Lock file with hashes committed
4	One command rebuilds outputs	`make all` or a run script
5	Outputs match published ones	Diff against stored figures/tables
6	Provenance recorded	Transformation log / checksums
7	Citation and licence present	`CITATION.cff`, licence file
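The rows of the table that are evidenced by a file can be checked mechanically. The following is a minimal sketch, not a standard tool: the function name `check_repo` and the file names it looks for (`requirements.txt`, `Makefile`, `LICENSE`) are assumptions to adapt to the project's layout. Items that need human judgment, such as data access, are left out.

```python
from pathlib import Path

# Hypothetical mapping from checklist item to the file that evidences it.
# The file names are illustrative assumptions; adjust them to your repository.
REQUIRED_FILES = {
    "Environment pinned": "requirements.txt",
    "One command rebuilds outputs": "Makefile",
    "Citation present": "CITATION.cff",
    "Licence present": "LICENSE",
}


def check_repo(root):
    """Return {checklist item: True/False} based on file presence only."""
    root = Path(root)
    return {item: (root / name).is_file() for item, name in REQUIRED_FILES.items()}
```

Calling `check_repo(".")` from the repository root returns a dictionary that can be printed as a PASS/FAIL list. File presence is only a first filter: a `requirements.txt` without hashes would still pass this check.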

How do I actually run the clean-machine test?

Do not trust your own laptop — it carries hidden state. Use a fresh container so only what you committed is available:

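```bash
# start from a fresh git clone so untracked local files are not mounted
docker run --rm -it -v "$PWD":/p -w /p python:3.11-slim bash
# inside, follow the README verbatim:
apt-get update && apt-get install -y --no-install-recommends make   # the slim image ships without make
pip install --require-hashes -r requirements.txt
make all
```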

Then compare. If your pipeline writes figures/ and tables/, store reference copies and diff:

```bash
# tables from a deterministic pipeline should match exactly; sorting ignores row order
# commit the expected outputs alongside the code
diff <(sort tables/summary.csv) <(sort expected/summary.csv) && echo "PASS"
```

Any line you had to improvise, or any output that differs, is a failing checklist item with a clear fix.
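Where exact text equality is too strict, for example when floating-point results differ in the last digits between platforms, a tolerance-based comparison is an alternative to `diff`. This is a sketch under assumptions: the function name `tables_match`, the default tolerance, and the choice to compare rows in file order are illustrative and not part of the workflow described above.

```python
import csv
import math


def _cells_equal(a, b, rel_tol):
    """Compare two cells numerically when both parse as floats, else as text."""
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol)
    except ValueError:
        return a == b


def tables_match(actual_path, expected_path, rel_tol=1e-9):
    """Return True when both CSV files have the same shape and matching cells."""
    with open(actual_path, newline="") as fa, open(expected_path, newline="") as fe:
        actual = list(csv.reader(fa))
        expected = list(csv.reader(fe))
    if len(actual) != len(expected):
        return False
    for row_a, row_e in zip(actual, expected):
        if len(row_a) != len(row_e):
            return False
        if not all(_cells_equal(a, e, rel_tol) for a, e in zip(row_a, row_e)):
            return False
    return True
```

A call such as `tables_match("tables/summary.csv", "expected/summary.csv")` would replace the `diff` line. The tolerance is a judgment call: too loose and a real change in results slips through, so it is worth recording the chosen value next to the expected outputs.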

Why does reproducibility differ for humanities data?

Because perfect openness is not always possible. Archival material may be in copyright, personal data may be restricted, and a licensed corpus cannot be redistributed. The checklist handles this honestly: item 1 is satisfied by documenting the restriction and providing a synthetic or sample subset so the method still reproduces even when the full data cannot. Reproducibility of the pipeline and reproducibility of the data are separate goals — aim for the highest level each constraint allows.
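One way to produce the shareable sample mentioned above is a seeded random draw, so that everyone who regenerates the subset gets the same documents. The sketch below is illustrative: the name `sample_ids`, the seed value, and the idea that documents are addressed by string identifiers are assumptions. Whether a sample may be shared at all still depends on the licence or data-protection terms.

```python
import random


def sample_ids(doc_ids, k, seed=20240101):
    """Return k identifiers chosen reproducibly from doc_ids.

    Sorting first makes the result independent of input order; the fixed
    seed makes it independent of when or where the function runs.
    """
    ordered = sorted(doc_ids)
    if k > len(ordered):
        raise ValueError("sample size exceeds corpus size")
    rng = random.Random(seed)  # private generator, global random state untouched
    return sorted(rng.sample(ordered, k))
```

The draw is repeatable for a given Python version; the standard library does not promise identical `sample` results across versions, which is one more reason to pin the interpreter along with the packages.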

How do I grade the result instead of pass/fail?

Borrow the bronze/silver/gold idea:

```text
Bronze: code + data archived with a DOI, README present
Silver: pinned environment + one-command rebuild that runs clean
Gold:   automated CI rebuild on a sample + outputs verified against expected
```

This gives a project something to aim at and lets reviewers see where it stands, rather than a binary that discourages partial-but-real progress.
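The scale can also be computed from checklist results. In this sketch the function name `grade` and the keys of the input dictionary are hypothetical labels for the conditions listed above, and each level requires the levels below it, which is one reading of the scale and not a stated rule.

```python
# Hypothetical keys naming the conditions each level adds.
LEVELS = [
    ("Bronze", ["archived_with_doi", "readme_present"]),
    ("Silver", ["environment_pinned", "one_command_rebuild"]),
    ("Gold", ["ci_rebuild_on_sample", "outputs_verified"]),
]


def grade(results):
    """Return the highest level whose conditions, and all lower ones, pass."""
    achieved = "Ungraded"
    for name, conditions in LEVELS:
        if not all(results.get(c, False) for c in conditions):
            break  # a failed level blocks every level above it
        achieved = name
    return achieved
```

For example, a project with a DOI and a README but no lock file would be graded Bronze, and the first failing condition shows what to fix to reach Silver.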

Can I make the checklist run itself?

Yes, and you should for anything long-lived. A continuous integration job converts the manual checklist into an automatic guard:

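```yaml
on: [push, pull_request]
jobs:
  reproduce:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"   # same interpreter as the clean-machine test
      - run: pip install --require-hashes -r requirements.txt
      - run: make smoke          # fast subset of the pipeline
      - run: python verify_outputs.py   # compares to expected/
```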

Now every commit that breaks reproducibility fails the build on the spot, instead of surprising you years later when a reviewer tries to re-run the work.
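The workflow calls `verify_outputs.py` without showing it. One possible shape for such a script, assuming outputs are compared byte-for-byte against files of the same relative path under `expected/`, is sketched below. The directory names, the helper names, and the use of SHA-256 are assumptions; byte comparison suits CSV or text outputs, while rendered figures often need a looser comparison.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected_dir, actual_dir):
    """Return relative paths that are missing or differ under actual_dir."""
    expected_dir, actual_dir = Path(expected_dir), Path(actual_dir)
    problems = []
    for ref in sorted(p for p in expected_dir.rglob("*") if p.is_file()):
        rel = ref.relative_to(expected_dir)
        out = actual_dir / rel
        if not out.is_file() or sha256_of(out) != sha256_of(ref):
            problems.append(rel.as_posix())
    return problems


def main(expected_dir="expected", actual_dir="tables"):
    """Print each mismatch and return a process exit code."""
    problems = find_mismatches(expected_dir, actual_dir)
    for rel in problems:
        print(f"MISMATCH {rel}")
    return 1 if problems else 0  # non-zero signals failure to the caller
```

Ending the script with `raise SystemExit(main())` passes the return value to the shell, so a mismatch fails the CI step. The same digests can double as the checksum record for the provenance item in the checklist.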

Key Takeaways

A reproducibility checklist is a short fixed list that converts a hope into a verified yes/no for each condition.
The clean-machine rebuild is the strongest single test — run it in a fresh container, not your laptop.
Cover data, version control, pinned environment, one-command rebuild, output match, provenance, citation/licence.
For restricted humanities data, document the restriction and ship a sample so the method still reproduces.
Grade with a bronze/silver/gold scale rather than a discouraging pass/fail binary.
Automate the checklist in CI so reproducibility regressions fail the build immediately.

Frequently Asked Questions

What is a reproducibility checklist?

It is a short, fixed list of conditions a project must meet before you call a result reproducible — data available, code versioned, environment pinned, one command to rebuild outputs, and results that match. It turns 'I think it reproduces' into a verified yes or no.

What is the single strongest test of reproducibility?

The clean-machine test: on a computer that has never seen the project, follow the README and run one command; if it produces the same headline figures and tables, it reproduces. Everything else supports passing that test.

What is the difference between reproducible and replicable?

Reproducible means the same data and code give the same result. Replicable means a new study with new data reaches a compatible conclusion. A checklist targets reproducibility, which is the part fully in your control.

How often should I run the checklist?

At least at submission and before any public release or archive deposit, and ideally automatically on every change via continuous integration so regressions surface immediately.

Do I need every item to pass for the work to count?

No. Treat the checklist as a graded scale, not pass/fail. Some sensitive archival data can't be shared openly; document the restriction and aim for the highest achievable level rather than abandoning the effort.

Can I automate a reproducibility check?

Yes. A continuous integration job that builds the pinned environment, runs the pipeline on a sample and compares outputs to stored expected values catches most regressions without manual effort.
