How to Design a crowdsourced transcription project

To design a crowdsourced transcription project, start from the data you actually need to extract, not from the images you happen to have. Decide on three things first: the exact output format, the smallest meaningful task, and how you will reconcile disagreement. Everything else — platform, guidelines, recruitment — follows from those decisions. Skipping this scoping step is the single most common reason projects stall at 30% completion.

What output do you actually need?

Write a one-line specification of your final dataset before touching a platform. "A CSV with columns surname, forename, baptism_date, parish" is a usable target; "transcriptions of these registers" is not. The output spec dictates whether you collect structured fields or free text, how you index, and what counts as done.

A useful exercise is to hand-transcribe ten representative pages yourself and put the result in your target format. You will discover edge cases — struck-through entries, marginalia, illegible dates — that must be handled in your guidelines. Budget a full day for this; it saves weeks later.

How should you size a single task?

The task is the unit a volunteer completes in one sitting. Keep it to 2-5 minutes. Long pages should be subdivided:

text

Dense parish register page  →  one task per ~10 entries
Multi-column ledger         →  one task per column block
Prose letter (2 pages)      →  one task per page side

Smaller tasks raise completion rates and make consensus cheaper, because you compare short strings rather than whole pages. The trade-off is context loss: a volunteer transcribing a single line cannot see the heading. Provide a zoomable full-page view alongside the marked region to mitigate this.

How many independent passes per page?

Plan for three independent transcriptions of each task. Three lets a consensus algorithm break ties; two cannot. The cost is real: at three passes, transcribing 5,000 pages means 15,000 completed tasks.

Passes	Tie-breaking	Relative effort	Best for
1	None	1.0x	Low-stakes tagging
2	No	2.0x	Clean typescript
3	Yes	3.0x	Handwritten records
4+	Marginal gain	4.0x+	Damaged or critical sources

Which platform fits your sources?

Match the platform to the document type, not to popularity. Tabular records with discrete fields suit Zooniverse's drawing-and-marking workflows or FromThePage's structured tables. Continuous prose — diaries, correspondence — suits FromThePage's page-by-page free-text editor, which keeps the transcript next to the image and supports light TEI markup.

Build-your-own is almost never worth it. The hidden costs are volunteer accounts, moderation tools, export pipelines and accessibility — all solved problems on hosted platforms.

How do you write guidelines that survive contact with volunteers?

Keep the rulebook to one page of must-follow rules plus a gallery of worked examples. State the four decisions volunteers ask about most: how to mark illegible text, whether to expand abbreviations, how to handle dates, and what to do with deletions. Show a screenshot of a real page next to its correct transcription for each rule. Vague guidelines produce inconsistent data that no amount of reconciliation can fix.

How do you pilot before launching?

Run a closed pilot with 5-10 volunteers on 50 pages. Watch where they hesitate, read the discussion forum, and time how long tasks take. You are looking for two signals: tasks consistently over five minutes (subdivide them) and the same question asked repeatedly (your guidelines have a gap). Fix both before opening the project, because launch attention is a one-time resource you cannot easily reclaim.

Key Takeaways

Specify the final dataset format first; the spec drives every later decision.
Hand-transcribe ten pages yourself to surface edge cases early.
Size tasks to 2-5 minutes and subdivide dense pages.
Default to three independent passes for handwritten material.
Use a hosted platform; building bespoke tooling rarely pays off.
Keep guidelines to one page plus worked examples, and pilot with 5-10 people first.

Frequently Asked Questions

How many volunteers do I need to start?

You need far fewer than you think. A core of 10-30 active transcribers typically produces 80% of completed work; design for a small, loyal cohort rather than a viral crowd.

Should I split images into one field per page or full free-text?

Use structured field-by-field marking for tabular records (registers, censuses, ledgers) and full free-text for prose letters and diaries. Mixing the two on one workflow confuses volunteers and lowers completion rates.

How many independent transcriptions per page should I collect?

Three is the practical default for consensus reconciliation. Two cannot break ties, and beyond four the cost rises faster than accuracy improves on most handwritten material.

How long should a single task take a volunteer?

Aim for 2-5 minutes per task. Tasks longer than about eight minutes sharply increase abandonment, so subdivide dense pages into regions or lines.

Do I need a platform or can I build my own?

For almost all projects use an existing platform such as Zooniverse or FromThePage. Building bespoke tooling costs months of developer time that is better spent on documentation and volunteer support.

What output do you actually need? ​

How should you size a single task? ​

How many independent passes per page? ​

Which platform fits your sources? ​

How do you write guidelines that survive contact with volunteers? ​

How do you pilot before launching? ​

Key Takeaways ​

Frequently Asked Questions ​

How many volunteers do I need to start? ​

Should I split images into one field per page or full free-text? ​

How many independent transcriptions per page should I collect? ​

How long should a single task take a volunteer? ​

Do I need a platform or can I build my own? ​

Related reading ​