Born-Digital Archives

To identify born-digital file formats reliably, match each file's internal byte signature against the PRONOM registry using a tool such as Siegfried or DROID, and never trust the extension alone. The result is a stable format identity (a PUID) for every file, a list of unidentified outliers to investigate, and the input you need for risk assessment and migration. Identification is the hinge of born-digital work: you cannot preserve, render or migrate what you have not correctly named.

Why are file extensions not enough?

An extension is a label anyone can change. A .doc might be a genuine Word binary, a renamed RTF, a Word 2007 OOXML file that should be .docx, or a corrupt fragment. A .dat could be almost anything. Identification by extension therefore produces confident-looking nonsense. Signature-based identification reads the actual leading bytes (the "magic number") and internal structure, so it reports what a file is, not what it claims to be.
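
As a minimal sketch of the idea, the following reads a file's leading bytes and compares them with a handful of widely documented magic numbers. The `MAGIC` table and the `sniff` function are hypothetical names introduced for illustration; a real tool uses the full PRONOM signature set, including offsets and container rules that this sketch does not model.

```python
# A tiny, hand-picked table of well-known leading-byte signatures.
# Illustrative only: PRONOM holds far more signatures than this.
MAGIC = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"{\\rtf", "RTF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy Word binary)"),
    (b"PK\x03\x04", "ZIP container (e.g. OOXML such as .docx)"),
]


def sniff(path):
    """Return a coarse format family based on the leading bytes, or None."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    for magic, label in MAGIC:
        if head.startswith(magic):
            return label
    return None
```

Calling `sniff` on a file named report.doc that begins with `{\rtf` reports RTF regardless of the name, which is the renamed-RTF case described above.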

What is PRONOM and how does signature matching work?

PRONOM is the format registry maintained by The National Archives (UK). Each format has byte signatures and a persistent identifier called a PUID, for example fmt/40 for a Word 97-2003 document. Identification tools carry the PRONOM signature set and scan each file's bytes for a match, returning the PUID, format name and version plus a confidence basis (signature, container or extension).
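
A small sketch of handling PUIDs in your own scripts, assuming the two prefixes in common use (`fmt` and `x-fmt`) followed by a number. The helper names are hypothetical, and the URL pattern is an assumption about how PRONOM records are addressed that should be checked against the registry itself.

```python
import re

# Assumption: PUIDs look like "fmt/40" or "x-fmt/111".
PUID_PATTERN = re.compile(r"^(x-fmt|fmt)/(\d+)$")


def parse_puid(puid):
    """Split a PUID into (prefix, number); raise ValueError if malformed."""
    match = PUID_PATTERN.match(puid.strip())
    if match is None:
        raise ValueError(f"not a PUID: {puid!r}")
    return match.group(1), int(match.group(2))


def pronom_url(puid):
    """Build a link to the PRONOM record (URL pattern is an assumption)."""
    prefix, number = parse_puid(puid)
    return f"https://www.nationalarchives.gov.uk/PRONOM/{prefix}/{number}"
```

Validating PUIDs early keeps extension guesses and free-text format names from leaking into a catalogue field that is meant to hold a stable identifier.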

How do you run an identification pass?

Siegfried is the fastest route for a whole collection and slots straight into a pipeline:

bash
# Recursively identify everything, output CSV for analysis
sf -csv ./ACC-2025-061 > formats.csv
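# Summarise from the saved CSV instead of rescanning (csvcut is part of csvkit);
# tail drops the header row so it is not counted as a format
csvcut -c id,format,version formats.csv | tail -n +2 | sort | uniq -c | sort -rn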
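
If csvkit is not available, the same summary can be sketched in Python. This assumes formats.csv has `id`, `format` and `version` columns; check the header row of your own export, since column names are an assumption here and `summarise` is a hypothetical helper.

```python
import csv
from collections import Counter


def summarise(csv_path):
    """Count files per (PUID, format name, version), most common first."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            # Column names are assumptions about the CSV layout.
            key = (row.get("id", ""), row.get("format", ""), row.get("version", ""))
            counts[key] += 1
    return counts.most_common()
```

Counting by PUID and version, not by format name alone, keeps distinct versions of the same format from being merged into one line.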

For graphical work and built-in reporting, DROID does the same against PRONOM:

bash
# DROID command-line equivalent: -R recurses into subfolders,
# -E exports a CSV with one row per format identification
droid -R -a ./ACC-2025-061 -p profile.droid
droid -p profile.droid -E formats_report.csv

Run identification on a working copy after imaging, so the source media is never touched.

DROID or Siegfried: which should you use?

Aspect               | DROID          | Siegfried
Interface            | GUI plus CLI   | CLI only
Speed at scale       | Good           | Very fast
Scriptability        | Workable       | Excellent
Reporting            | Rich built-in  | CSV/JSON, you build it
Container inspection | Yes            | Yes

Use Siegfried in automated pipelines over large accessions; reach for DROID when you want point-and-click profiling and ready-made reports. Many archives run both and reconcile the results.
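
Reconciling two tools' results can be sketched as a comparison of PUIDs per file path. The column names are passed in as parameters because they differ between exports; values such as `filename`/`id` for Siegfried and `FILE_PATH`/`PUID` for DROID are assumptions to verify against your own CSV headers. Treating `UNKNOWN` or an empty cell as "no identification" is also an assumption, and both function names are hypothetical.

```python
import csv
import os
from collections import defaultdict

NO_MATCH = {"", "UNKNOWN"}  # assumed markers for "not identified"


def load_puids(csv_path, path_col, puid_col):
    """Map each normalised file path to the set of PUIDs reported for it."""
    puids = defaultdict(set)
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            path = os.path.normcase(os.path.normpath(row[path_col]))
            puid = (row.get(puid_col) or "").strip()
            if puid not in NO_MATCH:
                puids[path].add(puid)
    return dict(puids)


def disagreements(a, b):
    """Return {path: (puids_in_a, puids_in_b)} wherever the two sets differ."""
    out = {}
    for path in sorted(set(a) | set(b)):
        left, right = a.get(path, set()), b.get(path, set())
        if left != right:
            out[path] = (left, right)
    return out
```

Both tools need to have been run against the same root so the paths line up; if one export is absolute and the other relative, strip the common prefix before comparing.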

What do you do with unidentified files?

Unidentified files are signals, not noise. Inspect the header to see what you are dealing with:

bash
# Get a second opinion from the file utility, then view the first bytes of the header in hex
file mystery.bin
xxd mystery.bin | head

If the format is real but simply absent from PRONOM, you can contribute a new signature so the wider community benefits. If it is corruption or truncation, flag it for the donor or for recovery. A high count of unidentified files is one of the clearest early warnings of a preservation problem.
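
To turn that warning into a number, one option is to list the unidentified files in the identification CSV and compute their share of the total. This sketch assumes the CSV has `filename` and `id` columns and that unidentified files carry the id `UNKNOWN` or an empty value; both are assumptions to confirm against your tool's output, and `unidentified` is a hypothetical helper.

```python
import csv


def unidentified(csv_path, unknown_id="UNKNOWN"):
    """Return (list of unidentified file names, their share of all rows)."""
    total = 0
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            total += 1
            if (row.get("id") or "") in ("", unknown_id):
                missing.append(row.get("filename", ""))
    rate = len(missing) / total if total else 0.0
    return missing, rate
```

Recording the rate per accession lets you compare it over time and see whether a signature update has reduced the backlog.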

How does identification feed preservation?

Once every file has a PUID, you can cross-reference against format risk registries and significant-properties guidance to decide what needs migrating, what can stay, and what needs special rendering or emulation. Re-run identification after each migration to confirm the output is what you intended, and periodically as PRONOM grows, because yesterday's mystery file may match a freshly added signature today.
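
The cross-referencing step can be sketched as a lookup from PUID to a planned action. The `POLICY` table below is entirely hypothetical: the PUIDs are examples and the actions are placeholders, not recommendations, and a real table would come from your institution's risk assessment. The function names are also hypothetical.

```python
from collections import defaultdict

# Hypothetical local policy: PUID -> action. Placeholder decisions only.
POLICY = {
    "fmt/40": "migrate",
    "fmt/412": "keep",
}


def plan_actions(puid_counts, policy=POLICY, default="review"):
    """Group (PUID, file count) pairs by the action the policy assigns."""
    plan = defaultdict(list)
    for puid, count in sorted(puid_counts.items()):
        # Anything the policy does not cover falls back to manual review.
        plan[policy.get(puid, default)].append((puid, count))
    return dict(plan)


def off_target(observed, target_puid):
    """After a migration, list files whose re-identified PUID is not the target."""
    return sorted(path for path, puid in observed.items() if puid != target_puid)
```

Defaulting unknown PUIDs to review means a format nobody has assessed yet is surfaced for a decision and not silently left alone.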

Key Takeaways

  • Identify by internal signature, never by extension alone.
  • PRONOM gives each format a stable, citable PUID; DROID and Siegfried match against it.
  • Use Siegfried for fast, scriptable pipelines and DROID for GUI reporting.
  • Run identification on a working copy, after imaging.
  • Treat unidentified files as a risk flag to investigate, not ignore.
  • Re-identify after migrations and periodically as signatures improve.

Frequently Asked Questions

Why not just trust the file extension?

Extensions are user-editable labels, not facts. A file named report.doc may be a renamed RTF, a corrupt object or even an executable, so reliable identification reads the internal signature rather than the name.

What is PRONOM and why does it matter?

PRONOM is The National Archives' registry of file formats and their byte signatures, each with a PUID identifier. Tools like DROID and Siegfried match files against it, giving you a stable, citable format identity rather than a guess.

What is the difference between DROID and Siegfried?

Both identify formats against PRONOM. DROID is a Java application with a graphical interface and reporting; Siegfried is a fast command-line tool that is easy to script and to run over large collections in a pipeline.

What do I do about files with no match?

Investigate them: inspect the header bytes, check whether the signature is simply not in PRONOM yet, and consider submitting a new signature. Unidentified files are a preservation risk flag, not something to ignore.

Does identification tell me if a file is at risk?

It is the first step. Once you have format identities, you cross-reference them against risk registries and significant-properties guidance to decide which need migration or special handling.

How often should I re-run identification?

Re-run it on ingest, after any migration, and periodically as signatures improve. A format that was unidentified last year may match a newly added PRONOM signature today.