How to Scope a digital humanities project

To scope a digital humanities project, define a bounded research question, a concrete dataset slice you can process end to end in days rather than years, and a short, explicit list of deliverables. The discipline is subtraction: a good scope says no to most of the archive so you can finish a useful, documented result. Everything else is negotiable; those three things are not.

What does "scope" actually mean in a DH project?

Scope is the agreed boundary around three variables: the sources you will use, the questions you will answer, and the outputs you will produce. In practice DH projects fail not because the question is wrong but because the boundary is missing — the corpus quietly grows, the method becomes more ambitious, and the deadline arrives with no shippable artefact.

A tight scope statement reads like a contract you can check against later:

We will transcribe and geocode the 1881–1891 parish burial registers for three Devon parishes (~420 pages), publish a CSV plus a TEI sample, and answer whether burial clustering shifted after the 1885 cholera scare.

Notice what is excluded: every other parish, every other decade, and any claim the data cannot support.

How do I turn a vague idea into a bounded question?

Run a quick data survey before committing. Pull a random sample and inventory it:

bash

# sample 40 items from a folder of scans to estimate condition and effort
ls scans/ | shuf -n 40 > sample.txt
wc -l sample.txt   # confirm count

For each sampled item, record: legibility (1–5), structure (free text vs tabular), and language. If a third of your sample is illegible or in an unexpected hand, that changes your method — better to know on day two than month six.

How do I size the dataset realistically?

Estimate by timing, not guessing. Process five representative items by hand and record minutes per item, then project.

Task	Minutes per item	500 items	5,000 items
Manual transcription (1 page)	12	~100 hrs	~1,000 hrs
Metadata entry (1 record)	4	~33 hrs	~333 hrs
HTR review and correction	5	~42 hrs	~417 hrs
Geocoding a place name	2	~17 hrs	~167 hrs

If the projection exceeds your available person-hours, you do not have a resourcing problem — you have a scoping problem. Cut the corpus, not the documentation.

What should the deliverables list contain?

Name three artefacts at minimum: a dataset (with a stated format like CSV, TEI-XML or GeoJSON), an access point (a repository DOI or a static site), and a documentation artefact (README, data dictionary, methods note). A scope that says "we will analyse the letters" with no named output is a wish, not a plan.

How do I write the one-page scope document?

Use three columns and keep it to a single page so non-technical partners can read it in two minutes:

markdown

## In scope
- 1881–1891 burial registers, 3 named parishes
- Transcription + place-name geocoding
- CSV + 20-page TEI sample + README

## Out of scope
- Marriage and baptism registers
- OCR of printed indexes
- A custom web map (use existing GeoJSON viewers)

## Assumptions
- Scans already exist at 300 dpi
- One researcher, 0.4 FTE, 6 months

Revisit this document at every milestone. When a new request arrives ("can we also add the marriage registers?"), it goes into a backlog, not the current scope.

What are the classic DH scoping pitfalls?

Boil-the-ocean corpus selection — the whole archive instead of a slice.
Tool-first thinking — choosing a platform before knowing the data shape.
No defined "done" — analysis with no output format named.
Ignoring the long tail — cleaning and documentation are usually half the effort and rarely scoped.
Solo heroics — assuming one person will sustain a tool indefinitely.

Key Takeaways

Scope is a boundary around sources, questions and outputs — name all three.
Pick a slice you can process end to end in days; coverage comes later.
Estimate effort by timing real items, not by intuition.
Every scope must name a dataset format, an access point and a documentation artefact.
Keep a one-page in/out/assumptions document and check every milestone against it.
New requests go to a backlog, never silently into the current scope.

Frequently Asked Questions

What is the single most common scoping mistake in DH?

Treating the whole archive as the dataset. Define a bounded subset by date, collection or document type first, then expand only after you have shipped a working slice.

How big should a first DH dataset be?

Small enough to process end to end in a week. For transcription that is roughly 200 to 500 pages; for tabular data, a few thousand rows. The goal is a complete pipeline, not coverage.

How do I scope a project with an unknown corpus size?

Take a 30 to 50 item random sample, time how long each processing step takes, then multiply by your estimated total. A rough count beats no count.

Should the research question come before or after the data survey?

Iterate between them. Start with a tentative question, run a small data survey, then sharpen the question to what the sources can actually support.

What deliverables should every DH scope name explicitly?

A dataset format, an access mechanism (repository or site), and a documentation artefact such as a README or data dictionary. If a deliverable is not named, it will not be funded or finished.

How do I write a scope so non-technical stakeholders understand it?

Use a one-page scope with three lists: in scope, out of scope, and assumptions. Avoid tool names; describe outcomes the funder or partner can recognise.

What does "scope" actually mean in a DH project? ​

How do I turn a vague idea into a bounded question? ​

How do I size the dataset realistically? ​

What should the deliverables list contain? ​

How do I write the one-page scope document? ​

What are the classic DH scoping pitfalls? ​

Key Takeaways ​

Frequently Asked Questions ​

What is the single most common scoping mistake in DH? ​

How big should a first DH dataset be? ​

How do I scope a project with an unknown corpus size? ​

Should the research question come before or after the data survey? ​

What deliverables should every DH scope name explicitly? ​

How do I write a scope so non-technical stakeholders understand it? ​

Related reading ​