Troubleshooting: Use Docker for reproducible DH

When a Dockerised digital humanities (DH) pipeline breaks, the cause is nearly always one of four things: an unpinned image or package version that drifted, data baked into the image instead of mounted at run time, a Dockerfile whose layer order defeats the build cache, or a file-permission mismatch between container and host. Fix these at the root (pin versions, mount data as volumes, order your Dockerfile for caching, and align user IDs) and your container becomes the dependable, reproducible environment it was supposed to be.

This is a troubleshooting guide, not an introduction: it assumes you have a Dockerfile that misbehaves and want the real cause fast.

Why does it work on my machine but break for a collaborator?

The symptom is a build that worked last year but now fails or produces different output, typically when a collaborator rebuilds the image from scratch. The root cause is almost always a floating tag or an unversioned install: FROM python:3 or pip install pandas resolves to whatever is newest at build time. Freeze everything:

dockerfile

# Pin the base image to a specific patch version and, ideally, a digest
FROM python:3.11.9-slim@sha256:1c8b...

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

With a requirements.txt that pins exact versions (pandas==2.2.1), including the transitive dependencies, rebuilds stay consistent for years, not weeks. The digest after @sha256: is the strongest guarantee: a tag can be re-pushed to point at a new image, whereas a digest identifies the exact base image bytes.
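To catch floating versions before they reach a build, a small check over requirements.txt can help. The following is a minimal sketch, not a full parser of pip's requirements format: the function name find_unpinned is hypothetical, it treats only name==version lines as pinned, and it skips blank lines, comments, and pip option lines that start with a dash.

```python
import re

# "name==version", optionally with extras such as name[extra]==version.
# Assumption: anything else (>=, ~=, bare names) counts as unpinned.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s;]+")


def find_unpinned(requirements_text):
    """Return the requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```

Calling find_unpinned on the contents of requirements.txt and failing the build when the list is non-empty is one way to keep drift out of the image.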

Why can't the container find my source files?

You run the container and your script reports a missing CSV that plainly exists. The mistake is expecting host files to be inside the image. They are not. Containers see only what you copy in or mount at run time. Mount your data as a volume:

bash

docker run --rm \
  -v "$(pwd)/data:/app/data" \
  my-dh-pipeline python clean.py

Never COPY gigabytes of TIFFs into the image — it makes the image huge and ties data to code. Keep data on the host and mount it; the image stays small and the same image runs against different datasets.
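A script can make a forgotten mount obvious instead of failing later with a bare file-not-found error. The sketch below assumes the /app/data mount point used in the command above; the helper name check_data_dir is hypothetical and is not part of the original clean.py.

```python
from pathlib import Path


def check_data_dir(path):
    """Return a hint if the data directory looks unmounted, else None."""
    data = Path(path)
    if not data.is_dir():
        return f"{path} does not exist; was it mounted with -v host_dir:{path}?"
    if not any(data.iterdir()):
        return f"{path} is empty; check the host side of the -v mount."
    return None
```

A pipeline script could call check_data_dir("/app/data") as its first step and exit with the returned hint when it is not None.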

Why are my output files owned by root?

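After a run, files the container wrote are owned by root and you cannot edit them. On Linux this is a UID mismatch: the container process runs as root (UID 0) by default, so anything it writes into a mounted host directory is owned by root on the host. Run the container as your own user: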

bash

docker run --rm -u "$(id -u):$(id -g)" \
  -v "$(pwd)/data:/app/data" \
  my-dh-pipeline python export.py

Or create a non-root user in the Dockerfile and switch to it with USER; its UID needs to match your host UID for ownership to line up. Either way, outputs land with sensible ownership and your editor can touch them.
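If the pipeline is launched from Python rather than from a shell, the same command can be assembled as an argument list. This is a minimal sketch that mirrors the docker run invocation above; the function name docker_run_args is hypothetical, the container path /app/data is taken from the examples in this guide, and os.getuid and os.getgid are available on Unix-like hosts only.

```python
import os
from pathlib import Path


def docker_run_args(image, command, host_data, uid=None, gid=None):
    """Build a 'docker run' argument list that runs as the calling user."""
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    host = Path(host_data).resolve()  # bind mounts need an absolute host path
    return [
        "docker", "run", "--rm",
        "-u", f"{uid}:{gid}",        # run as this user, not root
        "-v", f"{host}:/app/data",   # mount host data into the container
        image, *command,
    ]
```

Passing the result to subprocess.run with check=True would execute it; building a list rather than a single string sidesteps shell quoting problems with paths that contain spaces.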

Why is every build so slow?

If trivial code edits trigger a full reinstall of every dependency, you have ordered the Dockerfile against the cache. Docker caches layers top to bottom and invalidates everything below a changed line. Put the slow, rarely-changing steps first:

dockerfile

# Dependencies change rarely, so install them first
COPY requirements.txt .
RUN pip install -r requirements.txt
# Code changes constantly, so copy it last
COPY . .

Now editing a script only re-runs the final cheap layer. A 6-minute rebuild drops to seconds.
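The ordering rule can be checked mechanically. The heuristic below is a rough sketch, not a Dockerfile parser: the function name code_copied_before_install is hypothetical, it only recognises a literal COPY . . and a RUN line containing pip install, and it ignores line continuations and COPY flags.

```python
def code_copied_before_install(dockerfile_text):
    """Return True if 'COPY . .' appears before the first 'pip install'."""
    copied_everything = False
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        upper = line.upper()
        if upper.startswith("COPY") and line.split()[1:3] == [".", "."]:
            copied_everything = True
        elif upper.startswith("RUN") and "pip install" in line:
            # The first install step decides: was all the code already copied?
            return copied_everything
    return False
```

A True result suggests that every code edit will invalidate the dependency layer and trigger a full reinstall.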

How do I read the error and find the real cause?

Work the symptoms methodically:

Symptom	Likely cause	Fix
`ModuleNotFoundError` despite install	wrong interpreter / venv shadowing	check the `FROM` and `PATH`, drop stray venvs
Output differs run to run	unpinned version	pin base image and packages
`No such file or directory` for data	not mounted	add `-v host:container`
Files owned by root	UID mismatch	run with `-u`, or add a `USER`
Rebuild reinstalls everything	cache-busting layer order	copy deps before code

What is the minimal reproducible setup worth committing?

Commit the Dockerfile, the pinned requirements.txt, and a tiny Makefile or run.sh that documents the exact build and run commands. Do not commit the built image — it is a large binary that belongs in a registry, not in Git. With those three small text files in the repository, anyone can rebuild your exact environment from scratch.

Key Takeaways

Most Docker reproducibility failures trace to unpinned versions — pin the base image (ideally by digest) and every package.
Mount data as a volume at run time; never bake large source media into the image.
Fix root-owned outputs by running with -u "$(id -u):$(id -g)" or a non-root USER.
Order the Dockerfile so dependencies install before code is copied, preserving the layer cache.
Diagnose by symptom: missing module, drifting output, missing data, root ownership and slow builds each have a distinct root cause.
Commit the Dockerfile and pinned requirements, not the built image.

Frequently Asked Questions

Why does my container work today but break next month?

Almost always an unpinned tag. Base images and apt or pip packages drift when you use latest or unversioned installs. Pin exact versions and ideally a digest to freeze the build.

Why can the container not see my manuscript files?

Your data lives on the host, not in the image. Mount it with a volume flag such as -v at run time; data baked into the image is the wrong pattern and bloats it.

Should I commit my image or my Dockerfile?

Commit the Dockerfile. It is small, readable text that rebuilds the image. The built image is a large binary that does not belong in Git.

Why is my Docker build painfully slow every time?

You are probably busting the layer cache by copying everything before installing dependencies. Copy and install the requirements file first, then copy your code.

Do I need a GPU container for OCR or NLP models?

Only if you run heavy neural models locally. Many DH pipelines are CPU-bound; add GPU support and the NVIDIA runtime only when a model genuinely needs it.
