
Normalise files on ingest when two things are both true: the incoming format is risky, proliferating, or poorly supported, and conversion to a preferred preservation format preserves the significant properties cleanly. Do not normalise when the original holds authenticity value, when conversion would be lossy, or when the file already arrives in a sound preservation format. Normalisation is a policy choice that trades up-front effort and some risk of loss for a smaller, more sustainable set of formats to manage long-term.

What is normalisation on ingest, really?

Normalisation converts incoming submissions into a small, deliberately chosen set of preservation formats at the point of ingest. Instead of holding two hundred format variants forever, you hold a handful you have committed to support. Tools like Archivematica do this automatically using a format-policy registry that maps each incoming PUID to a normalisation rule. The original submission is kept; the normalised copy becomes the preservation master that downstream processes rely on.
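To make the registry idea concrete, here is a minimal sketch of a PUID-to-rule lookup. It is not Archivematica's data model: the `NormalisationRule` class, the `FORMAT_POLICY` table, the command names, and the choice of targets are all hypothetical, and the PUIDs shown should be checked against PRONOM before use.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class NormalisationRule:
    target_format: str  # human-readable name of the preservation target
    command: str        # name of a conversion command (hypothetical)


# Hypothetical registry keyed by PRONOM PUID. Entries are illustrative only;
# verify each PUID and target against PRONOM and your own policy.
FORMAT_POLICY: Dict[str, NormalisationRule] = {
    "fmt/40": NormalisationRule("PDF/A", "convert_document_to_pdfa"),
    "fmt/43": NormalisationRule("TIFF", "convert_image_to_tiff"),
}


def rule_for(puid: str) -> Optional[NormalisationRule]:
    """Return the registered rule for a PUID, or None if there is none."""
    return FORMAT_POLICY.get(puid)
```

A PUID with no entry returns `None`, which a real workflow would treat as "no rule yet" rather than silently converting.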

When does normalising on ingest pay off?

It pays off when format proliferation or risk is your dominant problem. Signals that favour normalisation:

  • Submissions arrive in many short-lived or proprietary formats.
  • You lack the staff to monitor and migrate dozens of formats individually.
  • A reliable, lossless conversion path to a preferred format exists.
  • Consistency across the collection matters for access and discovery.

In these cases, paying the conversion cost once at ingest is cheaper than facing scattered, urgent migrations across many formats for years.

When should I NOT normalise on ingest?

Skip normalisation when it would destroy value or is simply unnecessary:

| Situation | Why not normalise |
| --- | --- |
| Original is the authentic record | Conversion alters the artefact's evidential value |
| Only lossy conversion exists | You would discard information permanently |
| File is already a preservation format | No risk to mitigate; conversion adds risk |
| Format carries significant behaviour | Static conversion loses interactivity (emulate instead) |
| Legal or rights constraints | You may not transform the bitstream |

For these, keep the original as the preservation copy and document why you exempted it.

How do I decide per format, not per collection?

Make the decision at the format level with a policy registry, not as a blanket rule. A simple decision sketch:

```text
for each incoming PUID:
    if PUID is a preferred preservation format  -> keep as-is
    elif lossless target exists AND not authenticity-critical -> normalise
    elif only lossy target exists               -> keep original, flag for review
    else                                        -> quarantine for manual appraisal
```

In Archivematica this maps to format policy rules with preservation and access commands per format. The point is that one collection can have dozens of correct, different answers.
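The decision sketch translates fairly directly into a small function. The version below is one possible rendering in Python, assuming the four facts about a format are already known as booleans; the `Action` enum and the `decide` function are hypothetical names, not part of any tool.

```python
from enum import Enum


class Action(Enum):
    KEEP_AS_IS = "keep as-is"
    NORMALISE = "normalise"
    KEEP_AND_FLAG = "keep original, flag for review"
    QUARANTINE = "quarantine for manual appraisal"


def decide(
    is_preferred: bool,
    lossless_target: bool,
    lossy_target: bool,
    authenticity_critical: bool,
) -> Action:
    """Apply the per-format decision sketch, branch by branch, in order."""
    if is_preferred:
        return Action.KEEP_AS_IS
    if lossless_target and not authenticity_critical:
        return Action.NORMALISE
    if lossy_target and not lossless_target:  # "only lossy target exists"
        return Action.KEEP_AND_FLAG
    return Action.QUARANTINE
```

Note that, read literally, the sketch sends an authenticity-critical format with a lossless target to manual appraisal, since no earlier branch matches it. That may well be the behaviour you want, but it is worth stating in the policy rather than leaving implicit.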

What does a sound normalisation workflow include?

Whatever the tool, the workflow must do five things or it is not trustworthy:

  1. Identify every file by signature (DROID/Siegfried), not extension.
  2. Decide via a documented format-policy registry.
  3. Convert to the preferred target with a recorded tool and version.
  4. Validate the output (JHOVE/veraPDF) and compare significant properties.
  5. Retain the original plus the normalised copy, logging both as PREMIS events.
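The five steps above can be sketched as a single function. This is an outline under stated assumptions, not a working integration: `identify`, `convert`, and `validate` are caller-supplied stand-ins for whatever wraps DROID/Siegfried, your converter, and JHOVE/veraPDF plus a significant-property comparison. The event dictionaries are only loosely modelled on PREMIS events and do not follow the PREMIS schema.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict, List


def sha256(path: Path) -> str:
    """Fixity value recorded with every event."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def event(event_type: str, path: Path, detail: str) -> Dict[str, str]:
    """Build a simple event record (hypothetical shape, not PREMIS XML)."""
    return {
        "type": event_type,
        "object": str(path),
        "detail": detail,
        "sha256": sha256(path),
        "when": datetime.now(timezone.utc).isoformat(),
    }


def ingest(
    original: Path,
    identify: Callable[[Path], str],       # returns a PUID from file signature
    policy: Dict[str, str],                # PUID -> preferred target format
    convert: Callable[[Path, str], Path],  # writes and returns the new copy
    validate: Callable[[Path], bool],      # validity + significant properties
    tool_version: str,
) -> List[Dict[str, str]]:
    puid = identify(original)                        # 1. identify
    log = [event("identification", original, puid)]
    target = policy.get(puid)                        # 2. decide via registry
    if target is None:
        return log                                   # no rule: original only
    copy = convert(original, target)                 # 3. convert
    if not validate(copy):                           # 4. validate
        log.append(event("validation failed", copy, target))
        return log
    # 5. the original is never touched; both files now appear in the log
    log.append(event("normalisation", copy, f"{target} via {tool_version}"))
    return log
```

Passing the tools in as callables keeps the sketch independent of any particular command line, and makes it easy to swap in test doubles when piloting.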

Pilot this on a representative sample before turning it loose on a whole accession; silent, batch-wide loss is the failure mode you most want to avoid.
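One way to make a pilot sample representative is to draw a few files from every identified format, so rare formats are not crowded out by common ones. The sketch below assumes you already have a list of (path, PUID) pairs from identification; `pilot_sample` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def pilot_sample(
    files: List[Tuple[str, str]],  # (path, PUID) pairs from identification
    per_format: int = 5,
    seed: int = 0,
) -> List[str]:
    """Pick up to `per_format` files from each PUID, reproducibly."""
    by_puid: Dict[str, List[str]] = defaultdict(list)
    for path, puid in files:
        by_puid[puid].append(path)

    rng = random.Random(seed)  # fixed seed so the pilot can be repeated
    sample: List[str] = []
    for puid in sorted(by_puid):
        paths = by_puid[puid]
        sample.extend(rng.sample(paths, min(per_format, len(paths))))
    return sample
```

Sampling per format rather than across the whole accession means a single odd file in an unusual format still reaches the pilot, which is where silent loss tends to hide.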

Key Takeaways

  • Normalise on ingest to shrink the set of formats you must support, not as a reflex.
  • Favour normalisation when formats proliferate or are risky and a lossless path exists.
  • Never normalise when the original is authenticity-critical, only lossy conversion exists, or the format is already sound.
  • Decide per format with a documented policy registry, not one rule for the whole collection.
  • Always keep the original alongside the normalised preservation copy.
  • Validate output and log both files as preservation events; pilot before batching.

Frequently Asked Questions

What does normalising on ingest actually mean?

Normalising on ingest means converting incoming files to a small set of preferred preservation formats at the moment they enter the archive, rather than keeping every format as received. The goal is to reduce the number of formats you must support over time.

Should I always normalise on ingest?

No. Normalise when the source format is risky or proliferating and conversion preserves significant properties cleanly. Do not normalise when the original carries authenticity value, when conversion is lossy, or when the format is already a sound preservation format.

Does normalisation replace the original file?

It should not. Best practice keeps the original (the submission) alongside the normalised preservation copy, so authenticity is retained and you can re-normalise later with better tools.

How does normalisation differ from later migration?

Normalisation happens once, automatically, at ingest to standardise formats up front. Migration happens later, reactively, when a format you are already holding becomes obsolete. Normalising well reduces how much migration you face later.

What are good default preservation targets for normalisation?

Common choices are TIFF or lossless JP2 for images, WAV or FLAC for audio, FFV1/Matroska for video, PDF/A for documents, and CSV plus a structured format for tabular data. Confirm targets preserve your significant properties first.

What is the main risk of normalising on ingest?

There are two: silent information loss during conversion, and a one-size-fits-all policy applied to files that needed special handling. Pilot the workflow, validate output, and exempt formats where the original must be preserved as-is.