Research Data Curation

To curate a Jupyter notebook for reuse, restart the kernel and run every cell top to bottom so the saved state is genuinely reproducible, pin the exact environment with a requirements.txt or environment.yml, use relative paths to data, and deposit the .ipynb alongside a rendered HTML export and an extracted .py script. Most "it won't run" failures come from just two causes — out-of-order execution and unpinned dependencies — so fixing those gets you most of the way.

Why won't a shared notebook just run?

A notebook's saved state reflects whatever order you happened to run the cells in, not the order in which they appear. If cell 8 depends on a variable defined in a cell you have since deleted, or in cell 12 further down, the file looks fine and breaks on a clean run. Always validate with a full restart:

bash
jupyter nbconvert --to notebook --execute --inplace analysis.ipynb

If that command errors, your notebook is not reproducible yet — fix it before anything else.

Step 1: enforce linear execution

Clear the execution counters and re-run from scratch. Inside Jupyter: Kernel → Restart & Run All (in JupyterLab the menu item reads Restart Kernel and Run All Cells). On the command line, the --execute flag above does the same and is scriptable in CI. Execution counters that read [1], [2], [3]… in order are a quick visual signal that the notebook ran cleanly.
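The visual check on cell numbers can also be automated. The following is a minimal sketch, assuming the notebook is in the nbformat 4 JSON layout (a top-level "cells" list whose code cells carry an "execution_count"); the function names load_notebook and ran_linearly are illustrative, not part of any Jupyter tool.

```python
import json
from pathlib import Path


def load_notebook(path):
    """Read an .ipynb file as plain JSON (no Jupyter packages required)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def ran_linearly(notebook: dict) -> bool:
    """True if the non-empty code cells are numbered 1, 2, 3, ... in document order.

    Empty code cells are ignored because they are not executed or numbered.
    """
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
        and "".join(cell.get("source", "")).strip()
    ]
    return counts == list(range(1, len(counts) + 1))
```

A call such as ran_linearly(load_notebook("analysis.ipynb")) returning False suggests the saved file was not produced by a clean top-to-bottom run. Passing this check does not prove reproducibility, so it complements rather than replaces a full re-execution.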

Step 2: pin the environment

A notebook without a pinned environment is a guess about the future. Capture exact versions:

bash
# pip-based projects
pip freeze > requirements.txt

# conda-based projects
conda env export --no-builds > environment.yml

Record the Python version explicitly too — a notebook written for 3.9 may break on 3.12. For bullet-proof reuse, add a Dockerfile or a Binder postBuild.
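One way to record the Python version next to the package list is to capture both from inside the running interpreter. This is a sketch using only the standard library; the function name environment_snapshot and the shape of the returned dictionary are assumptions made for illustration, not a standard format.

```python
import json
import platform
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect the interpreter version and the installed distribution versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that record no name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print as JSON so the result can be saved alongside the notebook.
    print(json.dumps(environment_snapshot(), indent=2))
```

Running this in the same kernel as the notebook records what was actually used, which can differ from what an environment file claims. It supplements requirements.txt or environment.yml and does not replace them, because installers cannot consume this JSON directly.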

Step 3: fix the data paths

Hard-coded paths like C:\Users\elara\project\data.csv are the third great reuse killer. Reference data relatively from the notebook's location:

python
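from pathlib import Path
import pandas as pd

# __file__ is undefined in a notebook kernel, so fall back to a path relative to the working directory.
DATA = Path(__file__).resolve().parent / "data" if "__file__" in globals() else Path("data")
df = pd.read_csv(DATA / "sources.csv")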

Bundle small sample data; for large datasets, reference a repository DOI in the README rather than embedding gigabytes.
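Hard-coded paths are easy to miss in a long notebook, so a rough scan can help before depositing. The sketch below assumes the nbformat 4 JSON layout and uses a deliberately simple heuristic: a quoted string that starts with a Windows drive letter, /home/, /Users/, or ~/. The name find_absolute_paths is hypothetical, and the pattern will miss other absolute locations such as /mnt/ or /data/.

```python
import re

# Heuristic: a quote followed by a drive letter, a Unix user directory, or ~/.
ABSOLUTE_PATH = re.compile(r"""["'](?:[A-Za-z]:[\\/]|/(?:home|Users)/|~/)""")


def find_absolute_paths(notebook: dict) -> list:
    """Return (cell_index, line) pairs for code lines that look like absolute paths."""
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        # "source" may be a single string or a list of line strings.
        source = "".join(cell.get("source", ""))
        for line in source.splitlines():
            if ABSOLUTE_PATH.search(line):
                hits.append((index, line.strip()))
    return hits
```

Each hit reports the cell's position in the file, so the offending line can be found and rewritten relative to the notebook's folder.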

Should I strip output before depositing?

It depends on the copy. Keep a clean source for version control and a rich copy for the archive:

| Copy | Outputs | Purpose |
| --- | --- | --- |
| Git source .ipynb | Stripped | Small diffs, no data leakage |
| Archived deposit .ipynb | Kept | Reusers see expected results |
| HTML export | Kept | Appearance preserved, opens anywhere |
| .py script | n/a | Most preservation-stable form |

Strip outputs for Git with nbstripout:

bash
pip install nbstripout
nbstripout --install   # registers a git filter
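To make concrete what stripping removes, here is a simplified illustration of the idea. It is not nbstripout's implementation, which handles more cases; it assumes the nbformat 4 JSON layout, and the function name strip_outputs is chosen for this example only.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with code-cell outputs and counters cleared."""
    stripped = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped
```

Because the function returns a copy, the same in-memory notebook can yield both the stripped version for Git and the output-rich version for the archive.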

What formats should I actually deposit?

Deposit a bundle, not a lone .ipynb, because the notebook format is JSON that future tools may not render. Export the durable companions:

bash
jupyter nbconvert --to html analysis.ipynb       # human-readable, opens anywhere
jupytext --to py:percent analysis.ipynb          # clean, diffable script

So the archive holds: the executed .ipynb, an analysis.html, an analysis.py, the environment file, a README, and the data (or its DOI).

How do I let others run it with zero setup?

Add a Binder configuration: place requirements.txt (or environment.yml) in the repository root and a mybinder.org badge in the README. A reuser clicks it and gets a live, browser-based kernel with your exact environment — no install. For heavier or private workflows, a Dockerfile with the pinned environment gives the same guarantee locally.
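The badge itself is a short piece of Markdown that links to a mybinder.org launch URL. As a sketch, assuming the repository is public on GitHub and the README is written in Markdown, it could be generated like this; the helper binder_badge and the owner and repository names are placeholders.

```python
def binder_badge(owner: str, repo: str, ref: str = "HEAD") -> str:
    """Build Markdown for a mybinder.org launch badge for a public GitHub repo."""
    launch_url = f"https://mybinder.org/v2/gh/{owner}/{repo}/{ref}"
    return f"[![Binder](https://mybinder.org/badge_logo.svg)]({launch_url})"


# "example-lab" and "notebook-deposit" are placeholders, not a real repository.
print(binder_badge("example-lab", "notebook-deposit"))
```

Pointing ref at a tag or commit instead of HEAD ties the launched session to the deposited version rather than to whatever the default branch contains later.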

A pre-deposit checklist

text
[ ] Restart & Run All passes from a clean kernel
[ ] requirements.txt / environment.yml present and pinned
[ ] Python version recorded
[ ] No absolute paths; data referenced relatively or by DOI
[ ] README explains run order and inputs
[ ] HTML export + .py script included
[ ] Licence stated for code and data
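The file-presence items on this checklist can be checked mechanically. The sketch below assumes the deposit is a single folder and that the notebook is named analysis.ipynb as in the commands above; the function name, the accepted README and licence file names, and the item labels are all assumptions for illustration. It cannot verify the judgement items, such as whether the README actually explains the run order.

```python
from pathlib import Path


def missing_deposit_files(deposit_dir, stem="analysis"):
    """Return the checklist items whose expected files are absent from deposit_dir."""
    root = Path(deposit_dir)
    checks = {
        "executed notebook": [f"{stem}.ipynb"],
        "environment file": ["requirements.txt", "environment.yml"],
        "README": ["README.md", "README.txt", "README"],
        "HTML export": [f"{stem}.html"],
        ".py script": [f"{stem}.py"],
        "licence": ["LICENSE", "LICENSE.txt", "LICENSE.md", "LICENCE"],
    }
    # An item passes if any one of its candidate file names exists.
    return [
        item
        for item, candidates in checks.items()
        if not any((root / name).is_file() for name in candidates)
    ]
```

An empty list means every expected file is present; anything else names what still needs to be added before depositing.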

Key Takeaways

  • Validate reproducibility with Restart & Run All (or nbconvert --execute) before anything else.
  • Pin the environment with requirements.txt or environment.yml, including the Python version.
  • Replace absolute paths with relative ones; reference large data by DOI.
  • Strip outputs in Git but keep them in the archived copy.
  • Deposit a bundle: executed .ipynb, HTML export, .py script, environment file, README.
  • Add Binder or a Dockerfile so others run it with no setup.
  • Out-of-order execution and unpinned dependencies cause most failures.

Frequently Asked Questions

Why do shared Jupyter notebooks so often fail to run?

The two most common causes are non-linear execution, where cells were run out of order so the saved state is irreproducible, and missing pinned dependencies. The first is fixed by restarting the kernel and running top to bottom, the second by exporting a pinned environment file.

Should I strip output before depositing a notebook?

Usually yes for code repositories, because outputs bloat diffs and can leak data, but keep outputs in the archived deposit copy so reusers see expected results. A clean source plus a rendered HTML export gives both.

How do I pin dependencies for a notebook?

Export an environment.yml with conda or a requirements.txt with pip freeze, pinning exact versions. For full reproducibility, capture the Python version too and consider a Dockerfile or Binder configuration.

What is the best file format to archive a notebook in?

Deposit the .ipynb plus a rendered HTML or PDF export and a plain .py script extracted with jupytext or nbconvert. The .ipynb is the working object, the HTML preserves appearance, and the script is the most preservation-stable.

Should data live inside the notebook or alongside it?

Alongside it, in a data folder referenced by relative paths, never hard-coded absolute paths. Small sample data can be bundled; large data should be referenced by DOI so the notebook stays portable.

How do I make a notebook run on someone else's machine without setup?

Add a Binder configuration or a Dockerfile so the environment builds automatically. Binder turns a public repository into a runnable session in the browser with no local install.