Reproducible Humanities Research

Organise your folder structure as soon as a project has stable inputs, more than one contributor, or a pipeline that will run more than once — and keep it minimal before that point. A deliberate structure pays for itself by making outputs rebuildable, paths predictable, and the project legible to collaborators and your future self. But imposing an elaborate hierarchy during early one-off exploration is premature optimisation that only adds friction.

This article is about that decision: when structure earns its cost, when it does not, and which principle holds regardless of project type.

When does a folder structure actually pay off?

Structure becomes worth the effort when one of three signals appears. First, your inputs stabilise — once you are no longer just poking at three files, a place for raw sources matters. Second, someone else joins — the moment a second person needs to find things, shared conventions stop being optional. Third, the pipeline repeats — if a script will run more than once, predictable paths prevent it breaking every time a file moves. Before any of those, a flat folder is genuinely fine.

What is the one principle that always holds?

Whatever the project type, separate raw inputs from processed outputs, and make raw read-only:

project/
├── data/
│   ├── raw/         # original, untouched — never edited by hand
│   └── processed/   # everything your scripts generate
├── scripts/
├── outputs/         # figures, tables, exports
└── README.md

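Raw is sacred. You never edit it; you only read from it and write derivatives elsewhere. This single discipline means you can delete processed/ and outputs/ and rebuild them from raw/ at any time, which is the heart of reproducibility. On Unix you can even enforce it with chmod -R a-w data/raw, which removes write permission for every user. This guards against accidental edits rather than deliberate ones, since the owner can always restore the permission.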
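The same discipline can be sketched in Python, for projects where the pipeline is already a set of .py scripts. This is a minimal sketch, assuming the layout shown above and a Unix-like system; the function names lock_raw and reset_derived are hypothetical, not part of any library.

```python
import shutil
import stat
from pathlib import Path

# Write permission for owner, group, and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def lock_raw(raw_dir: Path) -> None:
    """Clear the write bits on raw_dir and everything beneath it.

    Roughly what `chmod -R a-w data/raw` does (hypothetical helper).
    """
    # Collect the paths first, then change modes, so the walk is finished
    # before any permissions change.
    for path in [raw_dir, *raw_dir.rglob("*")]:
        path.chmod(path.stat().st_mode & ~WRITE_BITS)


def reset_derived(root: Path) -> None:
    """Delete and recreate the derived folders; data/raw is never touched."""
    for derived in (root / "data" / "processed", root / "outputs"):
        shutil.rmtree(derived, ignore_errors=True)
        derived.mkdir(parents=True)
```

Running reset_derived and then the numbered scripts is a quick test of whether the project really is rebuildable from raw/ alone.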

When is structure premature?

If you are exploring a handful of letters to see whether a research question is viable, a deep folder tree is overhead with no payoff. You will spend more time deciding where files go than analysing them, and you will likely reorganise once the real shape emerges anyway. During genuine exploration, keep everything flat and visible, and graduate to structure when the project commits to a direction.

How does the right structure differ by project type?

There is no universal layout because projects have different centres of gravity:

| Project type | Centre of gravity | Structure leans toward |
|---|---|---|
| Dataset / quantitative | data/raw and data/processed | strict raw/processed split, data dictionary |
| Scholarly edition | transcriptions/, tei/ | per-source folders, validation outputs |
| Tool / code-heavy | src/, tests/ | conventional software layout, packaging |
| Mixed DH | all of the above | shallow top level, clear domain folders |

Borrow conventions from your project's type rather than forcing a generic template onto it.

Should I reorganise a messy existing project?

Restructuring has real costs: it breaks hard-coded paths in scripts, churns the Git history, and confuses collaborators mid-stride. Reorganise only when the mess is actively costing you — you keep losing files, new contributors cannot navigate it, or scripts break because nothing is where expected. A near-finished solo project that works rarely justifies the disruption. When you do restructure, do it in one commit, update every path, and rerun the pipeline to confirm nothing broke.
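Before or after a restructure, it can help to list every place a script still mentions an old location. The following is a rough sketch, assuming the scripts are .py files and that you supply the old path fragments yourself; find_stale_paths is a hypothetical helper, and it does plain substring matching, so it will miss paths that are assembled piece by piece.

```python
from pathlib import Path


def find_stale_paths(scripts_dir: Path, old_paths: list[str]) -> list[tuple[str, int, str]]:
    """Return (script name, line number, old path) for each hard-coded hit."""
    hits = []
    for script in sorted(scripts_dir.rglob("*.py")):
        lines = script.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            for old in old_paths:
                if old in line:
                    hits.append((script.name, number, old))
    return hits
```

An empty result is not proof that nothing will break, which is why rerunning the whole pipeline remains the real check.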

What naming conventions reduce friction later?

Use lowercase, hyphenated names with no spaces and, where order matters, zero-padded numeric or ISO-date prefixes:

01-import.py   02-clean.py   03-analyse.py
1841-census-raw.csv   2025-04-report.md

Names like these sort correctly, are easy to script against, and behave the same across operating systems. Spaces and inconsistent case are the small daily tax you avoid by deciding once.
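One way to make the convention checkable is a small pattern test. This is an illustrative sketch only: the regular expression is one possible reading of "lowercase, hyphenated, no spaces", and is_tidy_name and step_name are hypothetical names.

```python
import re

# Lowercase letters and digits, hyphen-separated, with optional lowercase
# extensions. An assumed reading of the convention, not a standard.
NAME_PATTERN = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+)*")


def is_tidy_name(filename: str) -> bool:
    """True if the whole filename fits the pattern above."""
    return NAME_PATTERN.fullmatch(filename) is not None


def step_name(step: int, label: str, suffix: str = ".py") -> str:
    """Build a zero-padded script name such as 03-analyse.py."""
    return f"{step:02d}-{label}{suffix}"


# Why padding matters: plain string sorting compares character by character.
unpadded = sorted(["10-export.py", "2-clean.py"])   # "10-..." sorts first
padded = sorted(["10-export.py", "02-clean.py"])    # "02-..." sorts first
```

Conventional names such as README.md would fail this check, so a real project would probably want a short list of exceptions.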

Key Takeaways

  • Adopt a structure when inputs stabilise, a collaborator joins, or a pipeline repeats — not before.
  • The universal rule is to separate raw inputs from processed outputs and keep raw read-only.
  • A read-only raw folder makes the whole project rebuildable from source, which is the core of reproducibility.
  • Elaborate structure during early exploration is premature and adds friction without payoff.
  • Match the layout to your project's type — dataset, edition or tool — rather than a generic template.
  • Restructure existing projects only when the mess is actively costing time or blocking people.

Frequently Asked Questions

Is there a single correct folder structure?

No. The right structure depends on whether your project is data-heavy, edition-heavy or code-heavy. The principle that holds everywhere is separating raw inputs from processed outputs.

When is imposing a structure premature?

During pure exploration of a few files, an elaborate structure adds friction without payoff. Adopt a structure once the project has stable inputs or more than one contributor.

Why separate raw data from processed data?

So you can always rebuild outputs from untouched sources. A read-only raw folder protects your originals and makes the whole pipeline rerunnable from scratch.

Should I restructure an existing messy project?

Only if the mess is actively costing you time or blocking collaborators. A working solo project near completion rarely justifies the disruption of a big reorganisation.

How do folder conventions help reproducibility?

Predictable paths let scripts and collaborators find files without guessing, and a clean raw-to-processed flow makes the route from source to result auditable and rerunnable.
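As a small illustration of predictable paths, every script can derive its locations from a single project root instead of spelling them out. This is a minimal sketch, assuming the data/raw, data/processed, and outputs folders described earlier; project_paths is a hypothetical helper.

```python
from pathlib import Path


def project_paths(root: Path) -> dict[str, Path]:
    """Return the standard locations, all derived from one project root."""
    root = Path(root)
    return {
        "raw": root / "data" / "raw",
        "processed": root / "data" / "processed",
        "outputs": root / "outputs",
    }


# Inside a file under scripts/, the root could be found relative to the
# script itself (an assumption about where the script lives):
# PATHS = project_paths(Path(__file__).resolve().parent.parent)
```

If a folder ever moves, the change is made in this one place rather than in every script.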