To design an ingest workflow, model it as a pipeline that turns a messy Submission Information Package (SIP) into a trustworthy, fully described Archival Information Package (AIP) by passing it through fixed gates: quarantine and virus scan, fixity, format identification, validation, normalisation, metadata enrichment, packaging and storage. Automate the deterministic steps and reserve human judgement for appraisal and exceptions. Generate fixity early and re-check it after every change, so you can always prove integrity.
What is ingest meant to achieve?
Ingest is the OAIS function where you accept content and commit to preserving it. It is the single most important quality gate in the whole system: errors caught here are cheap, while errors that slip past become permanent features of your archive. The goal is not just to copy files in, but to know exactly what you have, prove it is intact, and describe it well enough to find and use decades from now.
What are the stages of a good ingest pipeline?
A robust default pipeline has eight stages. You can implement them in Archivematica, or assemble them yourself from open tools.
| # | Stage | Default tool | What it produces |
|---|---|---|---|
| 1 | Quarantine + virus scan | ClamAV | Clean, isolated files |
| 2 | Generate fixity | sha256deep | Baseline manifest |
| 3 | Format identification | Siegfried / DROID | PUIDs per file |
| 4 | Validation | JHOVE | Well-formed/valid report |
| 5 | Normalisation | Format policies | Preservation masters |
| 6 | Metadata enrichment | Your CMS / PREMIS | Descriptive + technical metadata |
| 7 | Packaging | BagIt / METS | The AIP |
| 8 | Storage + register | Object store + DB | Stored, indexed AIP |
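If you assemble the stages yourself, the "fixed gates" idea can be sketched as a small runner that calls one function per stage and refuses to continue past a failure. This is only an illustration: `run_pipeline` and the stage function names in the usage note are hypothetical, and each stage function is assumed to wrap one of the tools in the table and return non-zero on failure.

```bash
#!/usr/bin/env bash
# run_pipeline SIP_DIR STAGE...
# Calls each stage function with SIP_DIR, in the order given, and stops at the
# first stage that returns non-zero, so a failed gate is never skipped.
run_pipeline() {
  local sip=$1 stage
  shift
  for stage in "$@"; do
    if "$stage" "$sip"; then
      echo "OK:$stage"
    else
      echo "FAILED:$stage"
      return 1
    fi
  done
}
```

A call might look like `run_pipeline /ingest/sip-042 quarantine_scan baseline_fixity identify_formats validate_files`, where each name is a shell function you define yourself. Passing the stages as arguments keeps the order visible in one place.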
How do I sequence the steps without breaking integrity?
The golden rule: fixity before transformation, fixity after transformation. Capture a baseline checksum the instant files arrive, and re-verify after every step that touches bytes, so any unexpected change is attributable.
```bash
# 1. Quarantine + scan. --move relocates INFECTED files (assumed path), and
#    clamscan exits 0 only when nothing was found, so only a clean SIP moves on.
clamscan -r --move=/quarantine/infected /ingest/sip-042 \
  && mv /ingest/sip-042 /quarantine/clean/sip-042
# 2. Baseline fixity
mkdir -p /work/sip-042
sha256deep -r /quarantine/clean/sip-042 > /work/sip-042/baseline.sha256
# 3. Identify formats (Siegfried emits PRONOM PUIDs as JSON)
sf -json /quarantine/clean/sip-042 > /work/sip-042/formats.json
# 4. Validate a flagged TIFF
jhove -m TIFF-hul -h xml master_0001.tif > validation_0001.xml
```

If step 4 reports a TIFF as "not well-formed," route it to a human queue rather than letting it become an AIP. That exception handling is what separates a real workflow from a script.
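The "fixity after transformation" half of the golden rule can be sketched as a pair of helpers that write a manifest and re-check it later. This is a minimal sketch under stated assumptions: it uses GNU coreutils `sha256sum` (swapped in for `sha256deep` because its `--check` mode is widely available), and `make_manifest` and `verify_manifest` are hypothetical names. Paths are stored relative to the SIP directory so the manifest stays valid if the directory is moved. Keep the manifest file outside the directory it describes.

```bash
#!/usr/bin/env bash
# make_manifest DIR MANIFEST
# Hash every file under DIR, recording paths relative to DIR, in a stable order.
make_manifest() {
  local dir=$1 manifest=$2
  ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest"
}

# verify_manifest DIR MANIFEST
# Succeeds only if every file listed in MANIFEST is present and unchanged.
verify_manifest() {
  local dir=$1 manifest
  manifest=$(realpath "$2") || return 2
  ( cd "$dir" && sha256sum --check --quiet --strict "$manifest" )
}
```

Calling `verify_manifest` after each step that touches bytes tells you which step introduced a change. Note the limit of this sketch: it detects altered or missing files, but not files that were added after the baseline was taken.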
Where does human judgement belong?
Automate the boring, deterministic work; keep people for decisions a machine cannot defend:
- Appraisal — should this even be kept? Tools cannot judge enduring value.
- Sensitivity review — personal data, embargoes, cultural restrictions.
- Exception handling — anything the validators flag (bad formats, password-protected files, broken packages).
- Descriptive cataloguing — context the producer didn't supply.
Design the pipeline so these land in a clearly marked review queue, not buried in logs.
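One way to make the review queue concrete is a helper that moves the item out of the automated path and records why. The sketch below is an assumption-laden illustration, not a feature of any tool named in this article: `send_to_review`, the `REVIEW_QUEUE` location and the `QUEUE.tsv` index are all hypothetical, and the four accepted categories simply mirror the list above.

```bash
#!/usr/bin/env bash
# Assumed queue location; override by setting REVIEW_QUEUE before calling.
REVIEW_QUEUE=${REVIEW_QUEUE:-/work/review-queue}

# send_to_review ITEM CATEGORY NOTE
# Moves ITEM into the review queue and appends a line to QUEUE.tsv saying
# when it arrived, which kind of human decision it needs, and why.
send_to_review() {
  local item=$1 category=$2 note=$3 name
  case "$category" in
    appraisal|sensitivity|exception|cataloguing) ;;
    *) echo "unknown review category: $category" >&2; return 2 ;;
  esac
  name=$(basename "$item")
  mkdir -p "$REVIEW_QUEUE" || return 1
  mv "$item" "$REVIEW_QUEUE/$name" || return 1
  printf '%s\t%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$name" "$category" "$note" \
    >> "$REVIEW_QUEUE/QUEUE.tsv"
}
```

For example, `send_to_review /quarantine/clean/sip-042 exception "TIFF not well-formed"` takes the package out of the pipeline and leaves a one-line explanation that a reviewer can read without digging through logs.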
How should I package the AIP?
Package the AIP so that it is self-describing and portable: it should make sense even when pulled out of your system with no database behind it. A common choice is BagIt (a simple directory layout with checksum manifests), often combined with METS for structure and PREMIS for preservation metadata. A minimal BagIt AIP:
```text
aip-042/
  bagit.txt                  <- required BagIt declaration
  bag-info.txt
  manifest-sha256.txt        <- fixity for every file
  data/
    objects/master_0001.tif
    metadata/mets.xml        <- structure + descriptive
    metadata/premis.xml      <- events, agents, fixity history
```

Anyone receiving this bag can run `bagit.py --validate aip-042/` and confirm integrity without your software.
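To show how little machinery a bag needs, here is a hand-rolled sketch that builds and checks a bare-bones bag using only GNU coreutils. It is not a replacement for `bagit.py`: `make_minimal_bag` and `check_bag` are hypothetical helpers, and the sketch skips `bag-info.txt`, tag manifests and the check that no unlisted files sit in `data/`.

```bash
#!/usr/bin/env bash
# make_minimal_bag SRC BAG
# Copies the contents of SRC into BAG/data, writes the bagit.txt declaration,
# and writes a SHA-256 payload manifest with paths relative to the bag root.
make_minimal_bag() {
  local src=$1 bag=$2
  mkdir -p "$bag/data" || return 1
  cp -a "$src/." "$bag/data/" || return 1
  printf 'BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n' > "$bag/bagit.txt"
  ( cd "$bag" && find data -type f -print0 | sort -z | xargs -0 -r sha256sum ) \
    > "$bag/manifest-sha256.txt"
}

# check_bag BAG
# Succeeds only if the declaration exists and every listed payload file matches.
check_bag() {
  local bag=$1
  [ -f "$bag/bagit.txt" ] || return 1
  ( cd "$bag" && sha256sum --check --quiet --strict manifest-sha256.txt )
}
```

Because the manifest is plain text in a widely understood format, a recipient with nothing but a checksum utility can still confirm the payload is intact.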
What pitfalls trip up first-time designs?
- No quarantine. Scanning after files reach trusted storage is too late.
- Single checksum at the end. You then cannot tell whether corruption predated ingest.
- Normalising blindly. Keep the original alongside any normalised master; never discard the bit-stream you received.
- No event log. If you cannot answer "what did we do to this file, when, with which tool?", you have no provenance.
- Over-automation. Forcing appraisal into a rule set quietly lets junk accumulate.
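The event-log pitfall is cheap to avoid. As a rough sketch, an append-only, tab-separated log already answers "what did we do to this file, when, with which tool?". The helper names `log_event` and `events_for` and the five-column layout are assumptions made for illustration; a production system would more likely record these as PREMIS events.

```bash
#!/usr/bin/env bash
# log_event LOG OBJECT EVENT TOOL OUTCOME
# Appends one line: UTC timestamp, object, event type, tool, outcome.
log_event() {
  local log=$1 object=$2 event=$3 tool=$4 outcome=$5
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$object" "$event" "$tool" "$outcome" >> "$log"
}

# events_for LOG OBJECT
# Prints every recorded event for OBJECT in the order it was logged.
events_for() {
  awk -F'\t' -v obj="$2" '$2 == obj' "$1"
}
```

For instance, `log_event /work/sip-042/events.tsv master_0001.tif validation jhove failure` records the failed check, and `events_for /work/sip-042/events.tsv master_0001.tif` later replays that file's history.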
How do I test the workflow before going live?
Run a representative SIP — including deliberately broken files — end to end and confirm each gate behaves: the virus scanner quarantines, the validator flags the bad TIFF, fixity catches a hand-corrupted file, and the AIP validates. A workflow you have only run on perfect inputs is untested.
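The "hand-corrupted file" check can be scripted so it runs before every release of the workflow. The following is a minimal, self-contained sketch that assumes GNU coreutils; `fixity_gate_smoke_test` is a hypothetical name, and it exercises only the fixity gate, not the scanner or validator.

```bash
#!/usr/bin/env bash
# fixity_gate_smoke_test
# Returns 0 if a one-byte corruption is detected, 1 if it goes unnoticed,
# and 2 if the test setup itself is broken.
fixity_gate_smoke_test() {
  local tmp status=0
  tmp=$(mktemp -d) || return 2
  mkdir -p "$tmp/sip"
  printf 'known good content\n' > "$tmp/sip/master_0001.txt"
  ( cd "$tmp/sip" && sha256sum master_0001.txt ) > "$tmp/baseline.sha256"

  # Sanity check: the untouched file must pass before we corrupt anything.
  ( cd "$tmp/sip" && sha256sum --check --quiet "$tmp/baseline.sha256" ) \
    >/dev/null 2>&1 || status=2

  # Overwrite the first byte in place, leaving the file size unchanged.
  printf 'X' | dd of="$tmp/sip/master_0001.txt" bs=1 seek=0 conv=notrunc 2>/dev/null

  # The gate works only if verification now FAILS.
  if ( cd "$tmp/sip" && sha256sum --check --quiet "$tmp/baseline.sha256" ) \
    >/dev/null 2>&1; then
    status=1
  fi
  rm -rf "$tmp"
  return "$status"
}
```

Keeping the file size unchanged matters here: a gate that only compares sizes or timestamps would pass this file, whereas a checksum comparison should not.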
Key Takeaways
- Ingest transforms a SIP into a trustworthy, described AIP — your most important quality gate.
- Use the eight-stage default: quarantine, fixity, identify, validate, normalise, describe, package, store.
- Capture fixity early and re-verify after every transformation to keep integrity provable.
- Automate deterministic steps; reserve humans for appraisal, sensitivity and exceptions.
- Always keep originals alongside normalised masters; never discard the received bit-stream.
- Package as self-describing BagIt/METS/PREMIS so the AIP survives outside your system.
- Test with deliberately broken inputs — an untested gate is no gate.
Frequently Asked Questions
What is an ingest workflow in digital preservation?
Ingest is the OAIS process of accepting a Submission Information Package (SIP), validating and enriching it, and turning it into a preservable Archival Information Package (AIP). It is the gatekeeper step where most quality problems are caught or created.
What is the difference between a SIP and an AIP?
A SIP is what the producer hands you — files plus whatever metadata they supplied. An AIP is the curated, fixity-stamped, fully described package you commit to long-term storage. Ingest is the transformation from one to the other.
Should ingest be automated or manual?
Both. Automate the deterministic, high-volume steps — virus scan, fixity, format identification, packaging — and keep human review for appraisal, sensitivity checks and exceptions the tools flag.
What tools run an ingest workflow?
Archivematica is the best-known open-source micro-services pipeline; others combine BagIt, DROID/Siegfried, ClamAV, JHOVE and a script. The principle matters more than the product: validate, identify, normalise, describe, package, store.
When should I generate fixity during ingest?
As early as possible — ideally at the producer's side or the moment files land — and again after every transformation, so you can prove nothing changed except where you intended.
What is a quarantine step and why use it?
Quarantine holds incoming files in isolation while they are virus-scanned and checked before they touch your trusted store, preventing malware or malformed packages from contaminating the archive.