
When a BagIt bag fails validation, the cause is almost always a mismatch between the payload and the manifests: a file was edited, added, or removed, or a stray system file appeared after bagging. BagIt (RFC 8493) wraps your content in a data/ directory alongside checksum manifests and tag files, so any drift between what is on disk and what the manifest records surfaces immediately at validation. Fix it by finding which file changed, then either restoring that file from the source or, if the change was intended, regenerating the manifests with your tooling rather than hand-editing them.

What is in a bag, and why does that matter for debugging?

A minimal bag looks like this:

```text
my-bag/
  bagit.txt              # version + encoding declaration
  bag-info.txt           # metadata + Payload-Oxum
  manifest-sha256.txt    # checksum of every payload file
  tagmanifest-sha256.txt # checksum of the tag files themselves
  data/
    document.pdf
    images/scan001.tif
```

Validation recomputes the checksum of each file under data/ and compares it with that file's entry in manifest-sha256.txt, then does the same for the tag files against tagmanifest-sha256.txt. With that in mind, most errors are easy to read: a payload error points at data/, a tag error points at the metadata files.
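The comparison described above can be sketched with the standard library alone. This is a minimal illustration, not how any particular BagIt tool implements it: the helper names `sha256_of` and `checksum_failures` are made up for this example, and it assumes each manifest line has the form `<checksum> <path>` with no percent-encoded characters in the path.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large payload files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_failures(bag_dir: str, manifest_name: str = "manifest-sha256.txt") -> list[str]:
    """Return manifest paths whose file is missing or whose SHA-256 differs."""
    root = Path(bag_dir)
    failures = []
    for line in (root / manifest_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumed line layout: "<checksum> <path>", split on the first whitespace run.
        expected, rel_path = line.split(None, 1)
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

Passing `manifest_name="tagmanifest-sha256.txt"` runs the same check over the tag files, which is a quick way to tell a payload problem from a metadata problem.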

Why does my bag fail validation after editing a file?

Because BagIt is intentionally rigid. Any byte change to a payload file alters its checksum, so the manifest no longer matches and validation fails — this is the feature working, not a bug. Never edit inside data/ and expect the bag to stay valid. Regenerate instead:

```python
import bagit

bag = bagit.Bag("my-bag")
bag.save(manifests=True)   # recompute manifests after legitimate changes
bag.validate()             # raises BagValidationError on any mismatch
```

If you only need to fix metadata, edit bag-info.txt and re-save with bag.save(), leaving out manifests=True: the tag manifest is updated and the payload manifest stays intact.
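The checksum sensitivity behind all of this is easy to demonstrate with the standard library. The sketch below hashes two byte strings that differ by a single character; the sample content is invented for illustration.

```python
import hashlib

original = b"Annual report, final version\n"
edited = b"Annual report, final version.\n"  # one byte added

original_digest = hashlib.sha256(original).hexdigest()
edited_digest = hashlib.sha256(edited).hexdigest()

# The manifest would store the first digest; after the edit the file
# hashes to the second, so the comparison at validation time fails.
print(original_digest == edited_digest)  # False
```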

Decoding the most common BagIt errors

| Symptom | Root cause | Fix |
| --- | --- | --- |
| Payload-Oxum mismatch | Files added/removed/resized after bagging | Re-bag or bag.save(manifests=True) |
| Checksum mismatch on one file | That file was edited or corrupted in transfer | Restore from source, re-validate |
| "unexpected payload" file | Stray .DS_Store, Thumbs.db, .tmp | Delete the stray file, regenerate manifest |
| Missing file listed in manifest | File deleted or move interrupted | Recopy the file, re-validate |
| Tag manifest invalid | bag-info.txt hand-edited | Re-save the bag so the tag manifest matches |
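Several of these symptoms can be told apart without hashing anything, just by comparing file names and sizes. The sketch below is one way to do that triage; the function name `triage` is hypothetical, and it assumes the bag has both a manifest-sha256.txt and a bag-info.txt with a Payload-Oxum line of the form `<total bytes>.<file count>`.

```python
from pathlib import Path


def triage(bag_dir: str) -> dict:
    """Compare data/ with manifest-sha256.txt and Payload-Oxum without hashing."""
    root = Path(bag_dir)

    # Paths the manifest expects, e.g. "data/images/scan001.tif".
    listed = set()
    for line in (root / "manifest-sha256.txt").read_text(encoding="utf-8").splitlines():
        if line.strip():
            listed.add(line.split(None, 1)[1])

    # Paths actually on disk, written with forward slashes like the manifest.
    on_disk = {
        p.relative_to(root).as_posix()
        for p in (root / "data").rglob("*")
        if p.is_file()
    }

    # Recorded Payload-Oxum from bag-info.txt (stays None if the line is absent).
    recorded_oxum = None
    for line in (root / "bag-info.txt").read_text(encoding="utf-8").splitlines():
        if line.lower().startswith("payload-oxum:"):
            recorded_oxum = line.split(":", 1)[1].strip()
    total_bytes = sum((root / p).stat().st_size for p in on_disk)

    return {
        "missing": sorted(listed - on_disk),     # in manifest, not on disk
        "unexpected": sorted(on_disk - listed),  # on disk, not in manifest
        "oxum_recorded": recorded_oxum,
        "oxum_actual": f"{total_bytes}.{len(on_disk)}",
    }
```

An empty "missing" and "unexpected" with matching Oxum values suggests the problem is a changed file of the same size, which only a checksum comparison will find.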

How do I handle the hidden-file problem?

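Operating systems litter directories with .DS_Store (macOS), Thumbs.db (Windows), and AppleDouble ._* files. If these appear after bagging, validators flag them as unexpected payload. If they are already present when you bag, they are checksummed into the manifest like any other file. Strip them before bagging: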

```bash
find ./source -name '.DS_Store' -delete
find ./source -name 'Thumbs.db' -delete
bagit.py --sha256 ./source   # bag a clean tree
```

Resist the urge to add them to the manifest — they are noise and should not be in the preservation payload at all.
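The same cleanup can be scripted in Python when it needs to run inside a larger bagging workflow. This is a sketch under stated assumptions: `STRAY_NAMES` and `remove_stray_files` are names invented here, and treating every file that starts with `._` as an AppleDouble leftover is a guess that may not suit every collection, which is why the function defaults to a dry run.

```python
from pathlib import Path

# Names treated as noise in this sketch; extend the set for your environment.
STRAY_NAMES = {".DS_Store", "Thumbs.db"}


def remove_stray_files(source_dir: str, dry_run: bool = True) -> list[Path]:
    """List stray system files under source_dir; delete them when dry_run is False."""
    strays = [
        p
        for p in Path(source_dir).rglob("*")
        if p.is_file() and (p.name in STRAY_NAMES or p.name.startswith("._"))
    ]
    if not dry_run:
        for path in strays:
            path.unlink()
    return sorted(strays)
```

Calling `remove_stray_files("./source")` first shows what would be deleted; repeating the call with `dry_run=False` removes the files before the tree is bagged.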

What about very large datasets?

For multi-terabyte payloads, a holey bag ships the manifests and a fetch.txt listing download URLs, while the payload itself is retrieved on demand. The bag's structure can be checked up front, but the bag is complete only once every file listed in fetch.txt has been fetched, so full validation will fail until then. This is ideal when content lives on remote storage or you want to transfer metadata first. After populating fetch.txt, re-save with bag.save() and leave out manifests=True, which would rebuild the payload manifest from only the files currently on disk.
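Before validating a holey bag it helps to know what is still outstanding. The sketch below reads fetch.txt and reports the entries whose payload file is not on disk yet. It assumes each line has the form `<url> <length or -> <path>`, and the function name `unfetched_entries` is made up for this example; it does not download anything.

```python
from pathlib import Path


def unfetched_entries(bag_dir: str) -> list[tuple[str, str]]:
    """Return (url, path) pairs from fetch.txt whose payload file is not on disk."""
    root = Path(bag_dir)
    fetch_file = root / "fetch.txt"
    if not fetch_file.exists():
        return []  # no fetch.txt, so this is not a holey bag

    pending = []
    for line in fetch_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumed line layout: "<url> <length or -> <path>"
        url, _length, rel_path = line.split(None, 2)
        if not (root / rel_path).is_file():
            pending.append((url, rel_path))
    return pending
```

An empty result means every listed file is present and the bag is ready for full validation.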

Building a reliable bagging routine

To avoid most troubleshooting entirely: bag from a clean, read-only copy of the source; always use SHA-256; never edit inside data/; validate immediately after creation and again after any transfer; and keep a log of each validation. A bag that validates at both ends of a transfer is your proof the bits arrived intact.
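The "keep a log of each validation" step can be as small as one appended line per check. The sketch below is one possible shape for it, not a feature of any BagIt tool: `log_validation` is a hypothetical helper that takes any callable which raises on failure and appends a JSON record to a log file.

```python
import json
from datetime import datetime, timezone
from typing import Callable


def log_validation(bag_dir: str, validate: Callable[[], object], log_path: str) -> bool:
    """Run validate(), append one JSON line with the outcome, and return success."""
    record = {
        "bag": bag_dir,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "valid": True,
        "error": None,
    }
    try:
        validate()  # expected to raise on any validation failure
    except Exception as exc:  # broad on purpose: every failure is recorded
        record["valid"] = False
        record["error"] = str(exc)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record["valid"]
```

With bagit-python the callable could be `lambda: bagit.Bag("my-bag").validate()`, run once right after bagging and again at the receiving end, so the log holds a matching pair of entries for each transfer.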

Key Takeaways

  • Most BagIt failures are payload-vs-manifest mismatches: a file changed after bagging.
  • Never hand-edit files in data/; regenerate manifests with the tooling instead.
  • A Payload-Oxum mismatch means files were added, removed, or resized.
  • Strip .DS_Store, Thumbs.db and temp files before bagging, not after.
  • Use holey bags with fetch.txt for very large or remotely hosted payloads.
  • Default to SHA-256 and validate at both ends of every transfer.
  • Use bagit-python, bagit-java, or Bagger — all conform to RFC 8493 and interoperate.

Frequently Asked Questions

What is a BagIt bag?

A BagIt bag is a directory holding your content under a 'data/' folder plus tag files (bagit.txt, manifests, bag-info.txt) that record fixity and metadata, so a transfer's completeness and integrity can be verified anywhere.

Why does my bag fail validation after I edit a file?

Editing any file in the payload changes its checksum, so the stored manifest no longer matches; you must regenerate the manifests (re-bag or update) whenever payload contents change.

What does an 'Oxum' or Payload-Oxum mismatch mean?

Payload-Oxum records the total byte count and number of payload files; a mismatch means files were added, removed, or changed in size since the bag was created, so the bag is incomplete or altered.

Should I use a holey bag for large data?

A holey bag with fetch.txt lets you ship metadata and manifests while the payload is downloaded from URLs on demand, which is useful for very large or remotely hosted datasets.

Why are hidden files like .DS_Store breaking my validation?

System files created after bagging are not in the manifest, so validators report them as unexpected payload; strip them from the source before bagging, or delete them from the bag and re-validate.

Which tool should I use to create and validate bags?

Use the Library of Congress bagit-python (bagit.py) or bagit-java for scripting, and Bagger for a GUI; all produce interoperable bags conforming to RFC 8493.