Getting started with born-digital archiving means building a repeatable pipeline that carries you from physical media or a file transfer to a stable, documented, backed-up package. The core sequence is fixed: capture a forensic copy first, verify it with a checksum, identify the file formats, appraise and arrange what matters, then wrap the result in a self-describing container. Everything else is detail. This guide walks that arc with concrete tools and sensible defaults so your very first accession is done correctly rather than re-done later.
What does "born-digital" actually mean for your workflow?
Born-digital records were authored in a digital environment and have no analogue master: emails, CAD drawings, databases, phone photographs, drafts in a word processor. Because the bits are the record, your task is to keep those bits intact and interpretable for decades. That reframes the work away from "copy the files" and toward preserving each file alongside its metadata, structure and enough context to render it again in future.
What should you do before you open a single file?
Stop and image the source first. The moment you double-click a folder on the original drive, the operating system updates last-accessed times and may write thumbnails or recovery data, destroying evidence and provenance. Connect the media through a hardware write blocker, then create a forensic image. On a Linux or BitCurator workstation, Guymager gives you a guided capture; the command-line equivalent is:
```bash
# Capture an Expert Witness Format (.E01) image of the write-blocked device.
# ewfacquire hashes the data as it reads and stores the result in the image.
ewfacquire -C "ACC-2024-017" \
  -E "JonesArchive-floppy-03" \
  -f encase6 \
  -c best \
  /dev/sdb
```

The .E01 format embeds the hash and case metadata inside the image, which is why archivists prefer it over a raw dd dump for accessioned material.
How do you verify nothing changed?
Fixity is the heartbeat of preservation. Generate a manifest of cryptographic hashes immediately after capture and you can prove integrity at every later step.
```bash
# Build a recursive SHA-256 manifest immediately after capture
hashdeep -c sha256 -r ./ACC-2024-017 > ACC-2024-017.sha256
# Audit against the manifest months later
hashdeep -a -k ACC-2024-017.sha256 -r ./ACC-2024-017
```

Store the manifest separately from the data so a single corrupted folder cannot quietly take its own checksums down with it.
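Where hashdeep is not installed, the same manifest-then-audit idea can be approximated with GNU coreutils. The following is a minimal sketch, assuming GNU `find`, `sort`, `sha256sum` and `realpath` are available; the helper names `make_manifest` and `audit_manifest` are hypothetical and not part of any tool named in this guide.

```shell
#!/usr/bin/env bash
# Sketch: coreutils-only fixity check (assumes GNU find, sort, sha256sum, realpath).
# make_manifest and audit_manifest are hypothetical helper names.

# Write a SHA-256 manifest with paths relative to the accession directory.
# Keep the manifest outside that directory so it is not hashed itself.
make_manifest() {
  local dir="$1" manifest="$2"
  ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest"
}

# Re-hash every listed file; returns non-zero if any file changed or is missing.
audit_manifest() {
  local dir="$1" manifest
  manifest="$(realpath "$2")"
  ( cd "$dir" && sha256sum --check --quiet "$manifest" )
}
```

A call such as `make_manifest ./ACC-2024-017 ACC-2024-017.coreutils.sha256` followed later by `audit_manifest ./ACC-2024-017 ACC-2024-017.coreutils.sha256` mirrors the two hashdeep steps. Unlike a hashdeep audit, this sketch only checks files already listed in the manifest, so it will not flag files added to the folder afterwards.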
How do you know what formats you are holding?
You cannot preserve what you cannot name. Run a format identification pass with DROID (from The National Archives, UK) or Siegfried, which match files against the PRONOM registry by signature rather than trusting the extension.
```bash
sf -csv ./ACC-2024-017 > formats.csv
```

The output tells you which formats are at risk, which lack a confident match, and where extensions lie. This single report drives your later migration and access decisions.
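To turn that report into a quick overview, you can count files per PRONOM identifier. This is a rough sketch, assuming the CSV header names a column `id` and that no filename contains a comma, since a naive comma split would miscount such rows; `summarise_formats` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count files per PRONOM identifier in a Siegfried CSV report.
# Assumes a header column named "id" and no commas inside filenames.
# summarise_formats is a hypothetical helper name.
summarise_formats() {
  awk -F, '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "id") col = i; next }
    col     { count[$col]++ }
    END     { for (id in count) print count[id], id }
  ' "$1" | sort -k1,1nr -k2,2
}
```

Running `summarise_formats formats.csv` prints one line per identifier, most frequent first. Rows that Siegfried could not match should stand out as their own group, which makes the files needing manual review easy to spot.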
A minimal starter toolkit
| Task | Free tool | Output |
|---|---|---|
| Disk imaging | Guymager / ewfacquire | .E01 forensic image |
| Fixity | hashdeep / md5deep | SHA-256 manifest |
| Format ID | Siegfried / DROID | CSV of PRONOM IDs |
| Sensitive-data scan | bulk_extractor | Feature reports |
| Packaging | BagIt (bagit.py) | Validated bag |
Tools for all five tasks ship inside the BitCurator environment, so you can install one virtual machine and have the whole chain ready.
How do you package the result so it survives?
Wrap the appraised material in a BagIt bag. The format adds a payload manifest of checksums, a Payload-Oxum tag recording the payload's byte and file counts, and an optional bag-info.txt metadata file, and any BagIt-aware tool can validate it years later.
```bash
# Convert the directory into a bag in place, then validate it
bagit.py --sha256 --contact-name "Elara Reed" \
  --source-organization "Digital Relics" ./ACC-2024-017
bagit.py --validate ./ACC-2024-017
```

Then apply a 3-2-1 backup rule: three copies, two media types, one off-site. A bag on a single drive is not preserved; it is merely waiting to fail.
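Before trusting a backup copy, a cheap sanity check is to recompute the bag's byte-and-file count and compare it with the Payload-Oxum value recorded in bag-info.txt. The sketch below assumes GNU `find` (for `-printf`) and a standard bag layout with a `data/` directory; `payload_oxum` is a hypothetical helper name, and this check complements full validation rather than replacing it.

```shell
#!/usr/bin/env bash
# Sketch: recompute a bag's Payload-Oxum ("<total bytes>.<file count>")
# from its data/ directory. Assumes GNU find; payload_oxum is hypothetical.
payload_oxum() {
  find "$1/data" -type f -printf '%s\n' |
    awk '{ bytes += $1 } END { printf "%.0f.%d\n", bytes, NR }'
}
```

For example, compare the output of `payload_oxum ./ACC-2024-017` with the line shown by `grep '^Payload-Oxum:' ./ACC-2024-017/bag-info.txt` on each copy. A mismatch means files were lost or truncated in transit; a match is only a quick screen, so still run the full checksum validation on a schedule.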
Common pitfalls to avoid
- Browsing the original media before imaging, which silently rewrites metadata.
- Trusting file extensions instead of signature-based identification.
- Capturing data but never recording who did it, when and with which tool version.
- Treating one copy on one disk as a finished archive.
- Deferring sensitive-data review until after access copies are published.
Key Takeaways
- Image first, work from the copy; never touch the original directly.
- Capture fixity hashes at acquisition and re-verify them on a schedule.
- Identify formats by signature with Siegfried or DROID, not by extension.
- Budget two to three times the raw size for images, derivatives and backups.
- Record full provenance: tool, version, operator, date and hash.
- Package in BagIt and follow 3-2-1 before you call anything preserved.
Frequently Asked Questions
What is the difference between born-digital and digitised material?
Born-digital records were created in digital form and have no analogue original, such as emails, spreadsheets or word-processor files. Digitised material is a scan or photograph of a physical object, so its preservation priorities differ.
What is the single most important first step?
Make a write-blocked, bit-level copy of the source media before you open or browse anything. Touching files on the original updates timestamps and can overwrite deleted data you may need to keep.
Do I need expensive forensic software to begin?
No. A modest starter stack of free tools, such as Guymager, DROID, hashdeep and BagIt, covers acquisition, identification, fixity and packaging for most small collections.
How much storage should I budget for?
Plan for roughly two to three times the size of the original data once you account for a forensic image, normalised derivatives and at least two backup copies under a 3-2-1 strategy.
What metadata do I capture at the start?
Record the accession identifier, source media type and serial, capture date, tool and version, the hash algorithm and value, and who performed the work. This provenance trail is what makes the archive defensible later.
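One low-effort way to capture those fields is a small plain-text record written at acquisition time. This is only a sketch: the field names and the `write_provenance` helper are hypothetical, it assumes GNU `sha256sum` and `date`, and the operator defaults to the logged-in user.

```shell
#!/usr/bin/env bash
# Sketch: write a plain-text provenance record for a captured image.
# Field names and write_provenance are hypothetical; adapt to your schema.
write_provenance() {
  local accession="$1" media="$2" tool="$3" image="$4" out="$5"
  {
    printf 'Accession-ID: %s\n' "$accession"
    printf 'Source-Media: %s\n' "$media"
    printf 'Capture-Date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'Tool: %s\n' "$tool"
    printf 'Hash-Algorithm: sha256\n'
    printf 'Hash-Value: %s\n' "$(sha256sum "$image" | cut -d' ' -f1)"
    printf 'Operator: %s\n' "${USER:-unknown}"
  } > "$out"
}
```

Call it with the accession identifier, a description of the media including its serial, the tool name and version string, the path to the image, and the output path. Keep the resulting file with the manifest rather than only inside the data folder.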
Should I keep deleted or hidden files?
Capture them in the disk image by default, then decide during appraisal what to retain. You cannot recover what you discard, but you can always restrict or delete later.