Best Practices to Test a Format Migration Safely

To test a format migration safely, run it first on a representative sample, compare the significant properties of each output against its source, validate the output with a format checker, and only scale to the full collection once the sample passes cleanly. A checksum cannot help you here — migration deliberately changes the bytes — so the discipline is property-level verification plus documentation. This guide gives you a repeatable, defensible testing process and a working checklist.

Why is testing a migration different from other QA?

Most integrity checking asks "did this file change?" — and a checksum answers it. Migration inverts the question: the bytes must change, so the real question is "did the right things survive?" That means defining, before you start, which properties matter and verifying them on the output. Skip this and you risk a clean-looking batch that silently dropped a colour profile, an alpha channel, or the last page of every document.

What are the significant properties I must verify?

Significant properties are the characteristics that keep the object meaningful. They differ by type:

Object type	Properties to verify
Raster image	Dimensions, bit depth, colour profile, channel count
Document	Text content, page count, embedded fonts, layout
Audio	Sample rate, bit depth, channels, duration
Video	Resolution, frame count, codec, duration, audio track

Decide these up front. They become your pass/fail criteria, not a vague sense that "it looks fine".
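As a minimal sketch of turning that table into pass/fail criteria, assume each file's properties have already been extracted into a plain text file with one `key=value` line per property. The function name `compare_props`, the file layout and the key names are assumptions made for this illustration, not part of any tool.

```shell
# compare_props SOURCE_PROPS OUTPUT_PROPS KEY...
# Each props file holds one "key=value" line per property (hypothetical layout).
# Prints one PASS or FAIL line per required key and returns non-zero
# if any key is missing or differs between source and output.
compare_props() {
  src=$1; out=$2; shift 2
  rc=0
  for key in "$@"; do
    # Print everything after the first "=" on the line whose key matches exactly.
    a=$(awk -F= -v k="$key" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$src")
    b=$(awk -F= -v k="$key" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$out")
    if [ -n "$a" ] && [ "$a" = "$b" ]; then
      printf 'PASS %s=%s\n' "$key" "$a"
    else
      printf 'FAIL %s source=%s output=%s\n' "$key" "${a:-<missing>}" "${b:-<missing>}"
      rc=1
    fi
  done
  return $rc
}

# Example call (file names are placeholders):
#   compare_props img.tif.props img.jp2.props width height bit_depth
```

Listing the required keys on the command line keeps the criteria explicit: a property that is absent from either file counts as a failure rather than being skipped.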

How big should the test sample be?

Coverage beats percentage. A representative sample deliberately includes the awkward cases:

The largest and smallest files.
Each colour space and bit depth present.
Known-problem files (corrupt headers, unusual encodings).
One of every sub-type in the collection.

Ten well-chosen edge cases expose more than a thousand near-identical pages. Migrate the sample, verify it exhaustively, then scale.
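A starting point for such a sample can be scripted. The sketch below is only a first pass under simple assumptions: it picks the smallest file, the largest file and one file per filename extension, and it assumes paths contain no tabs or newlines. The function names `list_sizes` and `pick_sample` are hypothetical. Colour spaces, bit depths and known-problem files still have to be added by hand or from extracted properties.

```shell
# list_sizes DIR: print "bytes<TAB>path" for every regular file under DIR.
list_sizes() {
  find "$1" -type f | while IFS= read -r f; do
    printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
  done
}

# pick_sample DIR: print a de-duplicated candidate list containing the
# smallest file, the largest file, and one file per filename extension.
pick_sample() {
  sizes=$(list_sizes "$1" | sort -n)
  [ -n "$sizes" ] || return 0          # empty directory: nothing to pick
  {
    printf '%s\n' "$sizes" | head -n 1 | cut -f 2-    # smallest
    printf '%s\n' "$sizes" | tail -n 1 | cut -f 2-    # largest
    # First file seen (in size order) for each extension.
    printf '%s\n' "$sizes" | cut -f 2- | awk '{
      name = $0; sub(/.*\//, "", name)      # strip directories
      n = split(name, parts, ".")
      ext = (n > 1) ? parts[n] : "(none)"
      if (!(ext in seen)) { seen[ext] = 1; print }
    }'
  } | sort -u
}

# Example call (directory name is a placeholder):
#   pick_sample collection/ > sample-list.txt
```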

Which tools verify the output?

Combine property extraction, content comparison, and format validation:

bash

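# 1. Compare image properties before and after (ExifTool).
#    Tag names vary by format: TIFF reports BitsPerSample, JPEG 2000 reports BitsPerComponent.
exiftool -ImageWidth -ImageHeight -BitsPerSample -BitsPerComponent -ICC_Profile:all \
  source/img.tif output/img.jp2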

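# 2. Pixel-level visual difference (ImageMagick): RMSE of 0 for a lossless target, near 0 for lossy.
#    compare prints the metric to stderr and exits non-zero when the images differ.
compare -metric RMSE source/img.tif output/img.jp2 diff.png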

# 3. Validate the output conforms to its target format
jhove -m JPEG2000-hul -h text output/img.jp2

# 4. Document text survived? diff the extracted text
diff <(pdftotext source/doc.pdf -) <(pdftotext output/doc_a.pdf -)

For audio/video, ffprobe -show_streams before and after confirms duration, sample rate and stream count line up.
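The ffprobe comparison can be scripted along these lines. This is a sketch, not a complete audio/video check: the function names are hypothetical, only the first audio stream is inspected, and the 0.05 second tolerance is an assumed value, since container and codec padding often shift the reported duration slightly after re-encoding.

```shell
# probe_duration FILE: container duration in seconds, as reported by ffprobe.
probe_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# probe_audio FILE: sample rate and channel count of the first audio stream.
probe_audio() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=sample_rate,channels \
    -of default=noprint_wrappers=1 "$1"
}

# durations_match A B TOLERANCE: succeed when |A - B| <= TOLERANCE (seconds).
durations_match() {
  awk -v a="$1" -v b="$2" -v tol="$3" 'BEGIN {
    d = a - b; if (d < 0) d = -d
    exit !(d <= tol)      # exit status 0 when within tolerance
  }'
}

# check_av SOURCE OUTPUT: duration within tolerance and identical audio parameters.
check_av() {
  durations_match "$(probe_duration "$1")" "$(probe_duration "$2")" 0.05 &&
    [ "$(probe_audio "$1")" = "$(probe_audio "$2")" ]
}

# Example call (file names are placeholders):
#   check_av source/clip.wav output/clip.flac || echo "review clip"
```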

Why isn't a checksum enough?

Because a checksum of the source and a checksum of the migrated output will always differ — that is the whole point of migration. Fixity is for detecting unwanted change over time, not for judging a transformation that is supposed to change the bytes. Reaching for a checksum here is a category error; you need property comparison and validation instead.
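A contrived demonstration makes the distinction concrete. Here the same two lines of text, stored once with LF and once with CRLF line endings, stand in for a source and its migrated output. The helper names and the choice of POSIX `cksum` are assumptions for this sketch; any checksum tool would show the same effect.

```shell
# Stand-ins for a source and its migrated output: identical text,
# different line endings (a contrived example, not a real migration).
demo_dir=$(mktemp -d)
printf 'page one\npage two\n'     > "$demo_dir/source.txt"
printf 'page one\r\npage two\r\n' > "$demo_dir/output.txt"

# file_checksum FILE: CRC of the raw bytes (first field of POSIX cksum output).
file_checksum() { cksum < "$1" | cut -d ' ' -f 1; }

# text_content FILE: the property that matters here, the text without CRs.
text_content() { tr -d '\r' < "$1"; }

if [ "$(file_checksum "$demo_dir/source.txt")" != "$(file_checksum "$demo_dir/output.txt")" ]; then
  echo "checksums differ, as expected after a migration"
fi
if [ "$(text_content "$demo_dir/source.txt")" = "$(text_content "$demo_dir/output.txt")" ]; then
  echo "text content preserved"
fi
```

The checksum comparison says only that the bytes changed; the property comparison is what says the content survived.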

How do I make the test defensible and repeatable?

Documentation is what turns "I checked it" into evidence. For every test run, record:

Tool name and exact version (results can change between versions).
The exact command used.
The before/after property comparison results.
The validation report for each output.
Counts: files migrated, passed, failed, queued for review.

A simple log row per file makes the whole batch auditable later.
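One way to keep such a log is a tab-separated file with one row per migrated file. The sketch below assumes a six-column layout (timestamp, file, tool, version, command, status) and a status vocabulary of `pass`, `fail` and `review`; both are choices made for this example, as are the function names.

```shell
# log_result LOGFILE FILE TOOL VERSION COMMAND STATUS
# Appends one tab-separated row with a UTC timestamp.
# STATUS is one of pass, fail, review (hypothetical vocabulary).
log_result() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" "$4" "$5" "$6" >> "$1"
}

# summarise_log LOGFILE: print the counts needed for the batch record.
summarise_log() {
  awk -F '\t' '
    { count[$6]++; total++ }
    END {
      printf "migrated=%d passed=%d failed=%d review=%d\n",
        total, count["pass"], count["fail"], count["review"]
    }' "$1"
}

# Example calls (tool name, version and paths are placeholders):
#   log_result migration.tsv img001.tif exampletool 1.2.3 \
#     "exampletool img001.tif img001.jp2" pass
#   summarise_log migration.tsv
```

Because the command is stored as a single field, it should not itself contain tab characters.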

A working safe-migration checklist

Define the significant properties for this object type.
Build a coverage-based sample including edge cases.
Migrate the sample with a pinned tool version; log the command.
Compare properties source-vs-output for every sample file.
Run a pixel/content diff to catch silent visual or text changes.
Validate every output against its target format spec.
Review and resolve all failures before scaling.
Migrate the full collection; keep the masters until the batch is verified.
Record counts, versions and reports as preservation evidence.

Key Takeaways

Migration changes bytes on purpose, so verify properties, not checksums.
Define significant properties first; they are your pass/fail criteria.
Test on a coverage-based sample of edge cases before scaling.
Combine property extraction (ExifTool), content diff (compare/pdftotext), and validation (JHOVE/veraPDF).
Keep the masters until the full migration is verified.
Log tool versions, commands and results so the process is defensible.

Frequently Asked Questions

How do I test a format migration before running it on a whole collection?

Run it on a representative sample first, compare significant properties of the output against the source, validate the output with a tool like JHOVE or veraPDF, and only scale up once the sample passes every check.

What are significant properties in migration testing?

Significant properties are the characteristics that must survive migration to keep the object meaningful — for an image that may be pixel dimensions, bit depth and colour profile; for a document, text content, page count and layout. You verify these, not just that a file opens.

Which tools verify that a migration preserved content?

Use ExifTool to compare image properties and ImageMagick compare for pixel-level differences, pdftotext diffs for documents, ffprobe for audio/video, and JHOVE/veraPDF for format validation. A pixel-level compare catches silent visual changes a checksum cannot.

Why can't I just rely on a checksum after migration?

A checksum confirms a file hasn't changed since the checksum was recorded, but migration deliberately changes the bytes, so source and output checksums will always differ. You need property-level comparison, not fixity, to judge a migration.

How big should my migration test sample be?

Choose a sample that covers the variety in the collection — different sizes, colour spaces, edge cases, known-problem files — rather than a fixed percentage. Ten well-chosen edge cases beat a thousand identical ones.

What should I keep as evidence that a migration was safe?

Keep the tool name and exact version, the exact command, the before/after property comparison results, the validation reports, and counts of files migrated, passed, failed and queued for review, so the process is documented and defensible.
