How to curate Jupyter notebooks for reuse

To curate a Jupyter notebook for reuse, restart the kernel and run every cell top to bottom so the saved state is genuinely reproducible, pin the exact environment with a requirements.txt or environment.yml, use relative paths to data, and deposit the .ipynb alongside a rendered HTML export and an extracted .py script. Most "it won't run" failures come from just two causes — out-of-order execution and unpinned dependencies — so fixing those gets you most of the way.

Why won't a shared notebook just run?

A notebook's saved state reflects the order in which you happened to run cells, not the order in which they appear. If cell 8 depends on a variable that was defined in a cell you have since deleted or edited, or that is only defined further down in cell 12, the file looks fine in your live session and breaks on a clean run. Always validate with a full restart:

bash

jupyter nbconvert --to notebook --execute --inplace analysis.ipynb

If that command errors, your notebook is not reproducible yet — fix it before anything else.

Step 1: enforce linear execution

Clear the execution counters and re-run from scratch. Inside Jupyter: Kernel → Restart & Run All. On the command line, the --execute flag above does the same and is scriptable in CI. A notebook whose cell numbers read [1], [2], [3]… in order is a quick visual signal that it ran cleanly.
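
The visual check can be automated. The sketch below assumes the standard nbformat 4 layout (a top-level "cells" list whose code cells carry an "execution_count") and reads the file as plain JSON. The function names and the sample notebook are made up for illustration.

```python
import json

def execution_counts(notebook_json: str) -> list:
    """Return the execution_count of every code cell, in document order."""
    nb = json.loads(notebook_json)
    return [
        cell.get("execution_count")
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    ]

def ran_linearly(notebook_json: str) -> bool:
    """True if code cells are numbered 1, 2, 3, ... with no gaps and none unexecuted."""
    counts = execution_counts(notebook_json)
    return counts == list(range(1, len(counts) + 1))

# Hypothetical notebook whose two code cells were run out of order.
example = json.dumps({
    "cells": [
        {"cell_type": "code", "execution_count": 5, "source": ["x = 1"], "outputs": []},
        {"cell_type": "markdown", "source": ["notes"]},
        {"cell_type": "code", "execution_count": 2, "source": ["print(x)"], "outputs": []},
    ]
})
print(ran_linearly(example))
```

A code cell that was never executed has a null count and also fails the check, so treat a failure as a prompt to re-run rather than as proof of a bug.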

Step 2: pin the environment

A notebook without a pinned environment is a guess about the future. Capture exact versions:

bash

# pip-based projects
pip freeze > requirements.txt

# conda-based projects
conda env export --no-builds > environment.yml

Record the Python version explicitly too — a notebook written for 3.9 may break on 3.12. For bullet-proof reuse, add a Dockerfile or a Binder postBuild.
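
To confirm that an environment file is actually pinned, a small check like the one below can help. It is a sketch under stated assumptions: it only understands simple requirements.txt lines, treats "name==version" as the only acceptable form, and uses hypothetical helper names. Direct-URL lines that pip freeze sometimes emits would be reported as unpinned.

```python
import re
import sys

# A line counts as pinned only if it uses an exact "==" version specifier.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^\s;]+")

def unpinned_requirements(text: str) -> list:
    """Return requirement lines that lack an exact == pin."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose

def python_version_string() -> str:
    """Version of the interpreter running this code, for recording in a README."""
    return "{}.{}.{}".format(*sys.version_info[:3])

# Made-up requirements content: one pinned line, two loose ones.
sample = "numpy==1.26.4\npandas>=2.0\n# a comment\nmatplotlib\n"
print(unpinned_requirements(sample))
print(python_version_string())
```

Run it from the same kernel as the notebook so the recorded Python version matches the one the analysis actually used.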

Step 3: fix the data paths

Hard-coded paths like C:\Users\elara\project\data.csv are the third great reuse killer. Reference data relatively from the notebook's location:

python

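from pathlib import Path
import pandas as pd

# __file__ is undefined in a notebook kernel, so fall back to a path relative to the working directory
DATA = Path(__file__).resolve().parent / "data" if "__file__" in globals() else Path("data")
df = pd.read_csv(DATA / "sources.csv")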

Bundle small sample data; for large datasets, reference a repository DOI in the README rather than embedding gigabytes.
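
Before depositing, you can scan the notebook for leftover absolute paths. The following is a heuristic sketch, not a complete detector: the pattern only looks for Windows drive letters and the /home/ and /Users/ roots, and both the function name and the sample notebook are illustrative assumptions.

```python
import json
import re

# Heuristic: a Windows drive letter (C:\ or C:/) or a Unix path under /home/ or /Users/.
# The lookbehinds avoid matching inside URLs such as https://example.org/home/.
ABSOLUTE = re.compile(
    r"(?<![A-Za-z])[A-Za-z]:[\\/][^\s'\"]*|(?<![\w/])/(?:home|Users)/[^\s'\"]*"
)

def find_absolute_paths(notebook_json: str) -> list:
    """Return (code_cell_index, matched_path) pairs for suspicious paths in code cells."""
    nb = json.loads(notebook_json)
    code_cells = [c for c in nb.get("cells", []) if c.get("cell_type") == "code"]
    hits = []
    for index, cell in enumerate(code_cells):
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be one string or a list of lines
            source = "".join(source)
        hits.extend((index, m.group(0)) for m in ABSOLUTE.finditer(source))
    return hits

# Hypothetical notebook: the first code cell hard-codes a path, the second does not.
example = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["df = pd.read_csv(r'C:\\Users\\elara\\project\\data.csv')"]},
        {"cell_type": "code", "source": "df2 = pd.read_csv('data/sources.csv')"},
        {"cell_type": "markdown", "source": "See /home/elara/notes for details"},
    ]
})
print(find_absolute_paths(example))
```

Only code cells are scanned here. Extend the pattern if your team uses other roots, such as mounted network drives.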

Should I strip output before depositing?

It depends on the copy. Keep a clean source for version control and a rich copy for the archive:

Copy	Outputs	Purpose
Git source `.ipynb`	Stripped	Small diffs, no data leakage
Archived deposit `.ipynb`	Kept	Reusers see expected results
HTML export	Kept	Appearance preserved, opens anywhere
`.py` script	n/a	Most preservation-stable form

Strip outputs for Git with nbstripout:

bash

pip install nbstripout
nbstripout --install   # registers a git filter
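
To make "stripped" concrete, here is a minimal sketch of what clearing outputs means at the file level. It is not nbstripout's implementation, which handles more cases such as metadata. It assumes the nbformat 4 JSON layout and uses a made-up one-cell notebook.

```python
import json

def strip_outputs(notebook_json: str) -> str:
    """Return notebook JSON with code-cell outputs and execution counts cleared."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)

# Hypothetical notebook with a single executed cell.
example = json.dumps({
    "cells": [{
        "cell_type": "code",
        "execution_count": 3,
        "source": ["print('hi')"],
        "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
    }],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})
stripped = json.loads(strip_outputs(example))
print(stripped["cells"][0]["outputs"], stripped["cells"][0]["execution_count"])
```

The source of each cell is untouched, which is why the stripped copy produces small diffs and why the archived copy, with outputs kept, remains the record of expected results.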

What formats should I actually deposit?

Deposit a bundle, not a lone .ipynb, because the notebook format is JSON that future tools may not render. Export the durable companions:

bash

jupyter nbconvert --to html analysis.ipynb       # human-readable, opens anywhere
jupytext --to py:percent analysis.ipynb          # clean, diffable script

So the archive holds: the executed .ipynb, an analysis.html, an analysis.py, the environment file, a README, and the data (or its DOI).

How do I let others run it with zero setup?

Add a Binder configuration: place requirements.txt (or environment.yml) in the repository root and a mybinder.org badge in the README. A reuser clicks it and gets a live, browser-based kernel with your exact environment — no install. For heavier or private workflows, a Dockerfile with the pinned environment gives the same guarantee locally.
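
The badge itself is a line of Markdown that links to a launch URL. The sketch below builds one for a GitHub-hosted repository. The URL layout (v2/gh/owner/repo/ref with an optional labpath parameter) follows the pattern mybinder.org generates at the time of writing, so treat it as an assumption and confirm it against the form on the site. The owner and repository names are placeholders.

```python
from urllib.parse import quote

def binder_badge(owner: str, repo: str, ref: str = "HEAD", notebook: str = "") -> str:
    """Build Markdown for a mybinder.org launch badge for a GitHub repository."""
    url = f"https://mybinder.org/v2/gh/{owner}/{repo}/{quote(ref, safe='')}"
    if notebook:
        # labpath opens the given file in JupyterLab once the session starts
        url += "?labpath=" + quote(notebook, safe="")
    return f"[![Binder](https://mybinder.org/badge_logo.svg)]({url})"

# Placeholder owner and repository names.
print(binder_badge("example-org", "example-repo", notebook="analysis.ipynb"))
```

Pointing the ref at a tag or commit instead of HEAD means the badge keeps launching the deposited version even after the repository moves on.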

A pre-deposit checklist

text

[ ] Restart & Run All passes from a clean kernel
[ ] requirements.txt / environment.yml present and pinned
[ ] Python version recorded
[ ] No absolute paths; data referenced relatively or by DOI
[ ] README explains run order and inputs
[ ] HTML export + .py script included
[ ] Licence stated for code and data
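
The file-presence items in this checklist can be checked mechanically. The sketch below is one possible approach with assumed conventions: the notebook stem is "analysis", and the licence is looked for as a file whose name starts with LICENSE or LICENCE. Items such as a passing Restart & Run All still need the checks described earlier.

```python
from pathlib import Path
import tempfile

def missing_from_bundle(folder: Path, stem: str = "analysis") -> list:
    """List checklist items that fail when judged by file names alone."""
    names = {p.name for p in folder.iterdir() if p.is_file()}
    problems = []
    for required in (f"{stem}.ipynb", f"{stem}.html", f"{stem}.py"):
        if required not in names:
            problems.append(f"missing {required}")
    if not names & {"requirements.txt", "environment.yml"}:
        problems.append("missing requirements.txt or environment.yml")
    if not any(n.lower().startswith("readme") for n in names):
        problems.append("missing README")
    if not any(n.lower().startswith(("license", "licence")) for n in names):
        problems.append("missing licence file")
    return problems

# Demonstration on a throwaway folder that lacks the .py script and a licence.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    for name in ("analysis.ipynb", "analysis.html", "requirements.txt", "README.md"):
        (bundle / name).write_text("")
    report = missing_from_bundle(bundle)
print(report)
```

An empty list means the bundle has the expected files. It says nothing about whether their contents are correct.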

Key Takeaways

Validate reproducibility with Restart & Run All (or nbconvert --execute) before anything else.
Pin the environment with requirements.txt or environment.yml, including the Python version.
Replace absolute paths with relative ones; reference large data by DOI.
Strip outputs in Git but keep them in the archived copy.
Deposit a bundle: executed .ipynb, HTML export, .py script, environment file, README.
Add Binder or a Dockerfile so others run it with no setup.
Out-of-order execution and unpinned dependencies cause most failures.

Frequently Asked Questions

Why do shared Jupyter notebooks so often fail to run?

The two most common causes are non-linear execution, where cells were run out of order so the saved state is irreproducible, and missing pinned dependencies. Both are fixed by restarting and running top to bottom and by exporting an environment file.

Should I strip output before depositing a notebook?

Usually yes for code repositories, because outputs bloat diffs and can leak data, but keep outputs in the archived deposit copy so reusers see expected results. A clean source plus a rendered HTML export gives both.

How do I pin dependencies for a notebook?

Export an environment.yml with conda or a requirements.txt with pip freeze, pinning exact versions. For full reproducibility, capture the Python version too and consider a Dockerfile or Binder configuration.

What is the best file format to archive a notebook in?

Deposit the .ipynb plus a rendered HTML or PDF export and a plain .py script extracted with jupytext or nbconvert. The .ipynb is the working object, the HTML preserves appearance, and the script is the most preservation-stable.

Should data live inside the notebook or alongside it?

Alongside it, in a data folder referenced by relative paths, never hard-coded absolute paths. Small sample data can be bundled; large data should be referenced by DOI so the notebook stays portable.

How do I make a notebook run on someone else's machine without setup?

Add a Binder configuration or a Dockerfile so the environment builds automatically. Binder turns a public repository into a runnable session in the browser with no local install.
