Troubleshooting: Plan a digitisation workflow

When a digitisation workflow goes wrong, the symptom is almost never the cause. Projects slip, files vanish and budgets overrun because one stage was planned in isolation and the rest were bolted on later. The fix is to design the whole chain — preparation, capture, processing, metadata, QC, storage and delivery — before the first image is shot, then validate it with a small pilot. This troubleshooting guide works through the failures that actually recur and how to trace each back to its root cause.

Why does the project keep slipping its schedule?

The most common scheduling failure is treating imaging speed as the whole story. Capture is fast; the stages around it are not. Conservation assessment, foliation, target shots, metadata entry and QC frequently consume more time than the shutter clicks, and they are invisible in a "pages per day" estimate.

Diagnose it by timing every stage of a pilot batch separately:

```text
Stage                 Objects   Total min   Min/object
preparation              60        420         7.0
capture                  60        180         3.0
RAW processing           60        150         2.5
metadata capture         60        390         6.5
QC + remediation         60        240         4.0
ingest/storage           60         60         1.0
```

If preparation and metadata dwarf capture — as above — your deadline must be driven by those stages, not the scanner's rated speed.
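The per-stage timings above can be turned into a rough staffing projection. The following is a minimal sketch, not a definitive planning tool: the collection size of 5,000 objects and the 420 productive minutes per person-day are illustrative assumptions, and the function names are hypothetical. Only the stage timings come from the pilot table.

```bash
#!/usr/bin/env bash
# Sketch: project person-days per stage from pilot timings.
# ASSUMPTIONS: 5000 objects and 420 productive minutes per person-day are
# illustrative placeholders; only the stage timings come from the pilot table.

# Reads "stage,objects,total_minutes" lines on stdin.
project_schedule() {
  local total_objects=$1 minutes_per_day=$2
  awk -F, -v n="$total_objects" -v day="$minutes_per_day" '
    {
      per_obj = $3 / $2                 # minutes per object for this stage
      sum += per_obj
      printf "%-18s %5.1f min/object %7.1f person-days\n", $1, per_obj, per_obj * n / day
    }
    END {
      printf "%-18s %5.1f min/object %7.1f person-days\n", "all stages", sum, sum * n / day
    }'
}

# Pilot batch from the table above.
pilot_timings() {
  cat <<'EOF'
preparation,60,420
capture,60,180
RAW processing,60,150
metadata capture,60,390
QC + remediation,60,240
ingest/storage,60,60
EOF
}

pilot_timings | project_schedule 5000 420
```

In the pilot above, capture is 3.0 of 24.0 minutes per object, one eighth of the total, so a deadline based on capture speed alone would understate the work roughly eightfold.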

Why are files getting lost or mis-named between stages?

This is a handoff problem. Manual drag-and-drop, ad-hoc renaming and "I'll sort the folder later" guarantee orphaned and duplicated files. The root cause is the absence of a single persistent identifier flowing through the pipeline.

Fix it by minting the identifier at capture and deriving everything from it:

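```bash
# Identifier ms_042 -> zero-padded master filenames at capture time
shopt -s nullglob            # an empty capture/ directory yields no iterations
mkdir -p masters
id=ms_042
n=1
for raw in capture/*.cr3; do # glob order is lexical: camera filenames must sort in shot order
  printf -v seq "%04d" "$n"
  mv -n -- "$raw" "masters/${id}_${seq}.cr3"   # -n: never overwrite an existing master
  n=$((n+1))
done
```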

Then move files between stages with logged, scripted transfers (for example rsync plus a checksum manifest) rather than dragging them in a file manager such as the Finder, so every file movement leaves a trail.
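One way to sketch such a logged transfer is shown below. It is an illustration under stated assumptions rather than a recommended tool: the function name `transfer_stage`, the `MANIFEST.sha256` filename and the log format are all hypothetical, `sha256sum` is assumed to be the GNU coreutils version, and the script falls back to `cp` when `rsync` is not installed. It copies and verifies but deliberately does not delete the source, so removal stays a separate, conscious step.

```bash
#!/usr/bin/env bash
# Sketch: checksum manifest + logged copy between two stage directories.
# ASSUMPTIONS: function name, MANIFEST.sha256 and the log format are hypothetical;
# sha256sum is the GNU coreutils tool.

transfer_stage() {
  local src=$1 dst=$2 log=$3 manifest
  manifest=$(mktemp) || return 1
  mkdir -p "$dst" || return 1

  # 1. Checksum every file in the source stage before anything moves.
  ( cd "$src" && find . -type f ! -name MANIFEST.sha256 -exec sha256sum {} + ) > "$manifest" || return 1

  # 2. Copy, preferring rsync when it is installed.
  if command -v rsync >/dev/null 2>&1; then
    rsync -a "$src"/ "$dst"/ || return 1
  else
    cp -Rp "$src"/. "$dst"/ || return 1
  fi

  # 3. Verify the destination against the manifest and log the outcome.
  if ( cd "$dst" && sha256sum --quiet -c "$manifest" ); then
    printf '%s OK %s -> %s (%s files)\n' "$(date -u +%FT%TZ)" "$src" "$dst" "$(wc -l < "$manifest")" >> "$log"
    mv "$manifest" "$dst/MANIFEST.sha256"   # keep the manifest with the files
  else
    printf '%s FAILED %s -> %s\n' "$(date -u +%FT%TZ)" "$src" "$dst" >> "$log"
    rm -f "$manifest"
    return 1
  fi
}

# Example call (directory names are placeholders):
#   transfer_stage masters processing transfer.log
```

An empty source directory is reported as a failure here, because an empty manifest cannot be verified; that is a design choice of this sketch, on the view that a stage handing over nothing is worth a second look.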

What is the most common planning mistake of all?

Designing capture first and treating metadata, QC and storage as afterthoughts. Each of those stages imposes requirements backwards onto capture: your storage budget caps bit depth and resolution; your delivery platform dictates derivative formats; your metadata schema dictates what you must record at the cradle. Plan downstream-first and the capture spec falls out naturally.

How big should the pilot be, and what should it prove?

Run 50-100 representative objects all the way through every stage — not just imaging. A pilot that size is big enough to expose throughput, handling and naming problems, and small enough to throw away and redo if the plan needs to change. The pilot should produce three things: realistic per-stage timings, a list of objects that broke the standard path, and at least one finished file that passes QC and ingests cleanly.

Why do storage costs explode partway through?

Because master file size was guessed, not measured. A common trap:

| Item | Underestimate | Realistic |
| --- | --- | --- |
| Master TIFF per object | 40 MB | 120-250 MB |
| Derivatives | ignored | +20-30% |
| Backup copies | one | two (3-2-1) |
| Effective storage per object | 40 MB | ~600-900 MB |

Multiply the realistic per-object figure by the full object count and the 3-2-1 backup footprint before you commit a budget. Sizing for a single copy is the classic mid-project blowup.
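The multiplication can be sketched as a small helper. This is a rough illustration only: the collection size of 5,000 objects is a hypothetical figure, the function name is invented for the example, and "copies" is taken to mean the primary plus two backups, which is one reading of the 3-2-1 row in the table.

```bash
#!/usr/bin/env bash
# Sketch: per-object and whole-collection storage footprint.
# ASSUMPTIONS: 5000 objects is a hypothetical collection size; "copies" counts the
# primary plus two backups, one reading of the 3-2-1 row. 1 TB = 1,000,000 MB.

storage_footprint() {
  local master_mb=$1 derivative_pct=$2 copies=$3 objects=$4
  awk -v m="$master_mb" -v d="$derivative_pct" -v c="$copies" -v n="$objects" 'BEGIN {
    per_object = m * (1 + d / 100) * c      # MB per object across all copies
    printf "%.0f MB/object, %.1f TB total\n", per_object, per_object * n / 1e6
  }'
}

storage_footprint 40  0  1 5000   # the underestimate: bare master, single copy
storage_footprint 120 20 3 5000   # low end of the realistic column
storage_footprint 250 30 3 5000   # high end of the realistic column
```

Under these assumptions the realistic rows come out at 432 and 975 MB per object, which brackets the table's rounded ~600-900 MB figure and is ten to twenty-four times the 40 MB guess.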

How do I plan when object condition varies wildly?

Do not force fragile and stable material down one path. Triage objects into condition tiers up front. Stable, uniform items run the fast standard path; fragile, bound or oversized items route to a slower path with conservation sign-off, specialist cradles and a separate time estimate. Mixing them is why an "average" throughput figure never holds — the fragile tail dominates the schedule.
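The triage rule can be expressed as a simple routing step. The sketch below assumes a condition survey exported as CSV with hypothetical columns (`id`, `condition`, `bound`, `oversized`) and hypothetical values; the real survey form will differ, and the two path names are placeholders.

```bash
#!/usr/bin/env bash
# Sketch: route each object to a handling path from its condition survey.
# ASSUMPTIONS: the CSV columns (id, condition, bound, oversized) and their
# values are hypothetical; adapt them to the survey form actually in use.

route_object() {
  local condition=$1 bound=$2 oversized=$3
  if [ "$condition" = fragile ] || [ "$bound" = yes ] || [ "$oversized" = yes ]; then
    echo specialist   # conservation sign-off, specialist cradle, own time estimate
  else
    echo standard     # fast path for stable, uniform material
  fi
}

# Reads "id,condition,bound,oversized" lines on stdin, prints "id,path".
triage() {
  local id condition bound oversized
  while IFS=, read -r id condition bound oversized; do
    printf '%s,%s\n' "$id" "$(route_object "$condition" "$bound" "$oversized")"
  done
}

triage <<'EOF'
ms_042,stable,no,no
ms_043,fragile,no,no
ms_044,stable,yes,no
ms_045,stable,no,yes
EOF
```

Counting the objects on each path, and timing each path separately in the pilot, gives two throughput figures instead of one misleading average.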

How do I make the plan resilient once it is running?

Build in checkpoints, not just a final inspection. After the pilot, freeze the spec in a one-page document covering identifiers, formats, resolution, metadata fields, QC rules and storage targets. Review actual throughput against the pilot estimate weekly; if reality diverges by more than ~15%, stop and re-diagnose rather than pushing through and absorbing the error across thousands of objects.
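The weekly review can be reduced to a one-line check. This is a minimal sketch: the function name, the objects-per-week unit and the sample figures are hypothetical, and the 15% default simply mirrors the rule of thumb above. Running faster than the estimate is flagged too, since that can mean a stage such as QC is being skipped.

```bash
#!/usr/bin/env bash
# Sketch: flag a week whose throughput drifts too far from the pilot estimate.
# ASSUMPTIONS: objects-per-week as the unit and the sample figures are hypothetical;
# the 15% default mirrors the rule of thumb above.

throughput_check() {
  local estimate=$1 actual=$2 tolerance_pct=${3:-15}
  awk -v e="$estimate" -v a="$actual" -v t="$tolerance_pct" 'BEGIN {
    drift = (a - e) / e * 100                     # signed divergence in percent
    status = (drift > t || drift < -t) ? "RE-DIAGNOSE" : "ok"
    printf "%s drift=%+.1f%%\n", status, drift
    exit (status == "ok" ? 0 : 1)                 # non-zero exit when out of tolerance
  }'
}

throughput_check 400 370                                    # within tolerance
throughput_check 400 320 || echo "stop and re-time each stage"
```

Because the function exits non-zero when the drift is out of tolerance, it can gate a scheduled report or a batch-release script.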

Key Takeaways

Plan every stage before capture; downstream requirements dictate the capture spec.
Time each stage of a pilot separately — preparation and metadata usually outweigh imaging.
Mint one persistent identifier at capture and derive all filenames and transfers from it.
Run a 50-100 object pilot end to end before scaling.
Size storage for realistic master sizes plus derivatives and the full 3-2-1 backup footprint.
Triage objects by condition and route fragile material through a separate path.
Re-diagnose when actual throughput diverges from the pilot by more than ~15%.

Frequently Asked Questions

Why does my digitisation project keep falling behind schedule?

The usual root cause is unmeasured work around capture: conservation checks, foliation and metadata capture often take longer than imaging itself, so time each stage of a pilot batch separately before committing to a deadline.

What is the single most common workflow planning mistake?

Planning capture in isolation and bolting on metadata, QC and storage afterwards; every downstream stage should be designed before the first image is shot, because retrofitting them is far more expensive.

How do I stop files getting lost or mis-named between stages?

Assign a persistent identifier at the point of capture, derive every filename from it, and move files between stages with logged, scripted transfers rather than manual drag-and-drop.

How big should a pilot batch be before scaling up?

Run 50-100 representative objects through every stage end to end; that is large enough to expose throughput and handling problems but small enough to redo if the plan needs changing.

Why do storage costs blow up partway through a project?

Master TIFFs are often underestimated; multiply expected object count by realistic per-file size, add derivatives and at least two backup copies, and size storage for the full 3-2-1 footprint up front.

How do I plan a workflow when object condition varies wildly?

Triage objects into condition tiers, route fragile material through a separate slower path with conservation sign-off, and keep the standard fast path for stable, uniform material.
