When a BagIt bag fails, the cause is almost always a mismatch between the payload and the manifests: a file was edited, added, removed, or a stray system file appeared after bagging. BagIt (RFC 8493) wraps your content in a data/ directory alongside checksum manifests and tag files, so any drift between what is on disk and what the manifest records surfaces immediately at validation. Fix it by finding which file changed, then regenerating the manifests rather than hand-editing them.
What is in a bag, and why does that matter for debugging?
A minimal bag looks like this:
```text
my-bag/
  bagit.txt               # version + encoding declaration
  bag-info.txt            # metadata + Payload-Oxum
  manifest-sha256.txt     # checksum of every payload file
  tagmanifest-sha256.txt  # checksum of the tag files themselves
  data/
    document.pdf
    images/scan001.tif
```
Validation compares each file under data/ to manifest-sha256.txt, and the tag files to tagmanifest-sha256.txt. Knowing this, most errors decode instantly: a payload error points at data/, a tag error points at the metadata files.
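The comparison described above can be sketched with the standard library alone. This is a minimal illustration, not how any particular validator is implemented: the function names (`sha256_of`, `read_manifest`, `compare_payload`) are hypothetical, and it assumes a SHA-256 payload manifest with one `checksum path` pair per line and no percent-encoded characters in the paths.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large payload files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_manifest(manifest: Path) -> dict:
    """Map each listed path (e.g. 'data/document.pdf') to its recorded checksum."""
    entries = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, rel_path = line.split(maxsplit=1)
            entries[rel_path] = checksum.lower()
    return entries


def compare_payload(bag_dir: str) -> dict:
    """Report payload files that are missing, unexpected, or changed."""
    bag = Path(bag_dir)
    recorded = read_manifest(bag / "manifest-sha256.txt")
    on_disk = {
        p.relative_to(bag).as_posix()
        for p in (bag / "data").rglob("*")
        if p.is_file()
    }
    return {
        "missing": sorted(set(recorded) - on_disk),
        "unexpected": sorted(on_disk - set(recorded)),
        "changed": sorted(
            path
            for path in on_disk & set(recorded)
            if sha256_of(bag / path) != recorded[path]
        ),
    }
```

Calling `compare_payload("my-bag")` returns three lists, which is usually enough to tell which file drifted before deciding whether to restore it or regenerate the manifests with real tooling.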
Why does my bag fail validation after editing a file?
Because BagIt is intentionally rigid. Any byte change to a payload file alters its checksum, so the manifest no longer matches and validation fails — this is the feature working, not a bug. Never edit inside data/ and expect the bag to stay valid. Regenerate instead:
```python
import bagit

bag = bagit.Bag("my-bag")
bag.save(manifests=True)  # recompute manifests after legitimate changes
bag.validate()            # raises BagValidationError on any mismatch
```
If you only need to fix metadata, edit bag-info.txt and re-save; the tag manifest updates and the payload manifest stays intact.
Decoding the most common BagIt errors
| Symptom | Root cause | Fix |
|---|---|---|
| Payload-Oxum mismatch | Files added/removed/resized after bagging | Re-bag or bag.save(manifests=True) |
| Checksum mismatch on one file | That file was edited or corrupted in transfer | Restore from source, re-validate |
| "unexpected payload" file | Stray .DS_Store, Thumbs.db, .tmp | Delete the stray file, re-validate |
| Missing file listed in manifest | File deleted or move interrupted | Recopy the file, re-validate |
| Tag manifest invalid | bag-info.txt hand-edited | Re-save the bag so the tag manifest matches |
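A Payload-Oxum mismatch is the cheapest error to check by hand, because the value is just the total payload byte count and the file count joined by a dot. The sketch below recomputes it and reads the recorded value for comparison. The function names are hypothetical, and it assumes bag-info.txt stores the value on a single `Payload-Oxum:` line.

```python
from pathlib import Path
from typing import Optional


def payload_oxum(bag_dir: str) -> str:
    """Return 'bytes.files' for everything currently under data/."""
    files = [p for p in (Path(bag_dir) / "data").rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return f"{total_bytes}.{len(files)}"


def recorded_oxum(bag_dir: str) -> Optional[str]:
    """Read the Payload-Oxum value from bag-info.txt, or None if absent."""
    info = Path(bag_dir) / "bag-info.txt"
    for line in info.read_text(encoding="utf-8").splitlines():
        label, _, value = line.partition(":")
        if label.strip().lower() == "payload-oxum":
            return value.strip()
    return None
```

If `payload_oxum("my-bag")` differs from `recorded_oxum("my-bag")` in the second number, files were added or removed; if only the first number differs, a file changed size.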
How do I handle the hidden-file problem?
Operating systems litter directories with .DS_Store (macOS), Thumbs.db (Windows), and AppleDouble metadata (.AppleDouble directories or ._ files). If these appear after bagging, validators flag them as unexpected payload. Strip them before bagging:
```bash
find ./source -name '.DS_Store' -delete
find ./source -name 'Thumbs.db' -delete
bagit.py --sha256 ./source  # bag a clean tree
```
Resist the urge to add them to the manifest: they are noise and should not be in the preservation payload at all.
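Where `find` is not available, or when you would rather review stray files before deleting anything, the same sweep can be sketched in Python. The pattern list is an assumption drawn from the file types named in this article (`._*` and `*.tmp` included); adjust it to your environment. `find_stray_files` is a hypothetical helper that only reports and never deletes.

```python
import fnmatch
import os

# Assumed patterns, based on the system files mentioned in this article.
STRAY_PATTERNS = (".DS_Store", "Thumbs.db", "._*", "*.tmp")


def find_stray_files(root: str, patterns=STRAY_PATTERNS) -> list:
    """Return sorted paths under root whose file name matches a stray pattern."""
    strays = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                strays.append(os.path.join(dirpath, name))
    return sorted(strays)
```

Running `find_stray_files("./source")` before bagging gives a list you can inspect and then remove, which is safer than a blind delete on a source tree you cannot recreate.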
What about very large datasets?
For multi-terabyte payloads, a holey bag ships the manifests and a fetch.txt listing download URLs, while the payload itself is retrieved on demand. The bag's structure can be checked up front, but the bag is complete only once the data has been fetched. This is ideal when content lives on remote storage or you want to transfer metadata first. Generate with bag.save() after populating fetch.txt, and remember that a holey bag is incomplete until fetched, so expect full validation to pass only after every file listed in fetch.txt has been downloaded into data/.
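To see how far a holey bag is from complete, it helps to list the fetch.txt entries that are not on disk yet. The sketch below assumes the RFC 8493 line layout of `url length path`, where the length may be `-` when unknown, and it does not handle percent-encoded characters in paths. `FetchEntry`, `read_fetch`, and `still_to_fetch` are hypothetical names, and nothing here downloads anything.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when fetch.txt records "-" (size unknown)
    path: str              # relative to the bag root, e.g. data/big.tif


def read_fetch(bag_dir: str) -> list:
    """Parse fetch.txt into FetchEntry tuples; an absent file means no entries."""
    fetch_file = Path(bag_dir) / "fetch.txt"
    if not fetch_file.exists():
        return []
    entries = []
    for line in fetch_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        url, length, path = line.split(maxsplit=2)
        entries.append(FetchEntry(url, None if length == "-" else int(length), path))
    return entries


def still_to_fetch(bag_dir: str) -> list:
    """Return the fetch.txt paths that do not exist in the bag yet."""
    bag = Path(bag_dir)
    return [e.path for e in read_fetch(bag_dir) if not (bag / e.path).is_file()]
```

An empty list from `still_to_fetch("my-bag")` is a reasonable signal that it is time to run full validation.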
Building a reliable bagging routine
To avoid most troubleshooting entirely: bag from a clean, read-only copy of the source; always use SHA-256; never edit inside data/; validate immediately after creation and again after any transfer; and keep a log of each validation. A bag that validates at both ends of a transfer is your proof the bits arrived intact.
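The validation log mentioned above can be as simple as one JSON line per run. This is a sketch under assumptions: the record fields (`bag`, `checked_at`, `passed`, `problems`) and the `log_validation` helper are illustrative choices, not part of BagIt or of any bagging tool.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_validation(log_path: str, bag_name: str, passed: bool, problems=()) -> dict:
    """Append one JSON line describing a validation run and return the record."""
    record = {
        "bag": bag_name,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "passed": passed,
        "problems": list(problems),
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

Logging once after creation and once after each transfer gives a timestamped pair of records per bag, which is the evidence trail the routine above is meant to produce.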
Key Takeaways
- Most BagIt failures are payload-vs-manifest mismatches: a file changed after bagging.
- Never hand-edit files in data/; regenerate manifests with the tooling instead.
- A Payload-Oxum mismatch means files were added, removed, or resized.
- Strip .DS_Store, Thumbs.db and temp files before bagging, not after.
- Use holey bags with fetch.txt for very large or remotely hosted payloads.
- Default to SHA-256 and validate at both ends of every transfer.
- Use bagit-python, bagit-java, or Bagger — all conform to RFC 8493 and interoperate.
Frequently Asked Questions
What is a BagIt bag?
A BagIt bag is a directory holding your content under a 'data/' folder plus tag files (bagit.txt, manifests, bag-info.txt) that record fixity and metadata, so a transfer's completeness and integrity can be verified anywhere.
Why does my bag fail validation after I edit a file?
Editing any file in the payload changes its checksum, so the stored manifest no longer matches; you must regenerate the manifests (re-bag or update) whenever payload contents change.
What does an 'Oxum' or payload-oxum mismatch mean?
Payload-Oxum records the total byte count and number of payload files; a mismatch means files were added, removed, or changed in size since the bag was created, so the bag is incomplete or altered.
Should I use a holey bag for large data?
A holey bag with fetch.txt lets you ship metadata and manifests while the payload is downloaded from URLs on demand, which is useful for very large or remotely hosted datasets.
Why are hidden files like .DS_Store breaking my validation?
System files created after bagging are not in the manifest, so validators report them as unexpected payload; exclude them before bagging, or delete them from the bag and re-validate rather than adding them to the manifest.
Which tool should I use to create and validate bags?
Use the Library of Congress bagit-python (bagit.py) or bagit-java for scripting, and Bagger for a GUI; all produce interoperable bags conforming to RFC 8493.