Best Practices to Capture born-digital metadata

To capture born-digital metadata well, automate the technical layer at the moment of transfer or disk imaging — format identifier, fixity checksum, file size, original path and timestamps — record it in a sidecar (DFXML or PREMIS) rather than embedding it, and validate every record against a fixed application profile so the whole collection is consistent and defensible. The single biggest mistake is capturing metadata after files have been opened, by which point original last-access dates and paths are already corrupted.

What counts as born-digital metadata?

Born-digital metadata spans three layers. Technical metadata describes the bytes: format (a PRONOM PUID like fmt/40 for a Word 97 document), byte size, character encoding, and a checksum. Filesystem metadata is the context the operating system held: original path, created/modified/accessed timestamps, and ownership. Descriptive and rights metadata adds human meaning: title, creator, date range, access conditions. For physical archives the descriptive layer dominates; for born-digital the technical and filesystem layers carry most of the evidential weight, because they are easy to destroy and impossible to reconstruct.

When should you capture it?

As early as physically possible. The order is: image the carrier, then derive metadata from the image — never from a working copy you have already browsed. Opening a folder in Windows Explorer rewrites last-access times; copying with a naive cp discards them entirely. Capture during transfer using a write-blocker and a tool that preserves timestamps.

bash

# Preserve timestamps, ownership and permissions on transfer (Linux/macOS)
rsync -a --times --info=stats2 /mnt/source/ /work/sip-117/payload/

# Generate filesystem-level metadata as DFXML from a disk image
fiwalk -X /work/sip-117/dfxml.xml /work/sip-117/disk.E01

How do you automate technical capture?

Chain three open tools. Siegfried identifies formats and emits PUIDs; ExifTool pulls embedded properties; a checksum tool stamps fixity. Run them over the payload and merge into one record.

bash

sf -json /work/sip-117/payload > meta/formats.json          # PRONOM PUIDs
exiftool -json -r /work/sip-117/payload > meta/embedded.json # EXIF/XMP/PDF info
sha256deep -r /work/sip-117/payload > meta/manifest.sha256   # fixity

This produces machine-readable evidence with zero typing, which is the only way to stay consistent across thousands of files.

What fields belong in a minimum profile?

Define these before you touch the first carrier. A tight profile beats an aspirational one nobody fills in.

Field	Source	Example
Persistent identifier	Assigned	`ark:/12345/sip117-0042`
Original filename	Filesystem	`Q3_budget.xls`
Original path	DFXML	`C:\Finance\2003\Q3_budget.xls`
Format (PUID)	Siegfried	`fmt/56`
Byte size	Filesystem	`48128`
Fixity (SHA-256)	sha256deep	`9f2c…`
Created / modified	DFXML	`2003-09-14T11:02:00Z`
Capture agent	Workflow	`BitCurator 4.x`

Embedded or sidecar — which for born-digital?

Sidecar, almost always. Embedding metadata into the original file changes its bytes and invalidates the checksum you just computed, which defeats the point of fixity. Keep a separate DFXML, JSON or PREMIS record alongside the untouched original. Reserve embedded metadata for access derivatives you create on purpose — for example writing rights statements into the XMP of a delivery PDF.

How do you keep a whole collection consistent and defensible?

Consistency is a process problem, not a data-entry problem. Write a metadata application profile, then enforce it: automation removes the temptation to skip fields, and schema validation catches the records that slipped through. Validate PREMIS or your local schema on ingest, and run a final audit query (for example, "list every object with no PUID") before closing the collection. Defensible means you can show how each value was derived and by which agent — so always record the tool and version that produced each field.

Key Takeaways

Capture metadata at transfer/imaging time, before any file is opened, to preserve original timestamps and paths.
Lean on the technical and filesystem layers — they carry the evidential weight and are easy to lose.
Automate format identification (Siegfried), embedded properties (ExifTool) and fixity (SHA-256) over the payload.
Use sidecars (DFXML, PREMIS, JSON) for originals; only embed metadata into derivatives you deliberately create.
Define a minimum application profile first; a small profile that is always complete beats a rich one that is half-filled.
Validate against a schema on ingest and audit the whole collection before closing it.
Record the agent and version behind every value so the metadata is defensible.

Frequently Asked Questions

What metadata must I capture for born-digital records?

At minimum: a persistent identifier, original filename and path, file format (a PRONOM PUID), file size, a fixity checksum, capture date, and the agent that created each derivative. Everything else builds on this technical core.

When should I capture born-digital metadata?

At the earliest possible moment — ideally during disk imaging or transfer, before any file is opened. Captured-after-the-fact metadata loses original timestamps, paths and last-access dates that are often the most evidential parts of the record.

What is the difference between descriptive and technical metadata here?

Technical metadata describes the bits (format, size, checksum, encoding); descriptive metadata describes the meaning (title, creator, dates, subjects). Born-digital work leans heavily on automated technical capture and adds descriptive metadata more sparingly.

Which tools capture born-digital metadata automatically?

Siegfried or DROID for format identification, ExifTool for embedded properties, fiwalk or the DFXML tools for filesystem-level metadata, and bulk_extractor for feature reports. BitCurator bundles most of these into one workflow.

Should I store metadata embedded or in a sidecar?

For born-digital you almost always keep a sidecar (DFXML, JSON, or a PREMIS record), because embedding alters the original bytes and breaks fixity. Reserve embedding for derivatives you create deliberately.

How do I keep metadata consistent across a whole collection?

Define a metadata application profile up front, automate capture so humans cannot skip fields, validate against a schema on ingest, and run a quality audit before the collection is closed.

What counts as born-digital metadata? ​

When should you capture it? ​

How do you automate technical capture? ​

What fields belong in a minimum profile? ​

Embedded or sidecar — which for born-digital? ​

How do you keep a whole collection consistent and defensible? ​

Key Takeaways ​

Frequently Asked Questions ​

What metadata must I capture for born-digital records? ​

When should I capture born-digital metadata? ​

What is the difference between descriptive and technical metadata here? ​

Which tools capture born-digital metadata automatically? ​

Should I store metadata embedded or in a sidecar? ​

How do I keep metadata consistent across a whole collection? ​

Related reading ​