Skip to content
File Formats & Migration

Run JHOVE when you are ingesting or migrating files in formats with strict specifications that you intend to preserve long term — TIFF, PDF/A, WAVE, JPEG 2000, XML — and skip it for ephemeral files, unsupported formats, or cases where DROID identification plus a checksum already gives you enough assurance. JHOVE answers one narrow question: is this file well-formed and valid against its format specification? That is genuinely useful at the archive boundary, but it is not a content-quality check and it costs time, so the decision is about fit, not reflex.

What does JHOVE actually validate?

JHOVE (JSTOR/Harvard Object Validation Environment) parses a file with a format-specific module and reports three things: the identified format, whether the file is well-formed (its structure parses), and whether it is valid (it meets the spec's semantic rules). A basic run:

bash
jhove -m TIFF-hul -h text masters/img_0042.tif

Output includes a Status: line of Well-Formed and valid, Well-Formed, but not valid, or Not well-formed. That status line is the signal you act on.

When should I run JHOVE?

Reach for JHOVE at two points in the lifecycle:

  1. At ingest, before you commit files to preservation storage, so malformed or non-conformant objects are caught at the door.
  2. After migration, to confirm a normalisation (say, JPEG to JPEG 2000, or PDF to PDF/A) produced a structurally sound output.

It earns its keep most when the format has a real specification to test against and the consequence of a silent defect is high — a broken TIFF that won't render in twenty years is exactly the failure preservation exists to prevent.

When is JHOVE not worth the effort?

Be honest about the trade-offs. JHOVE adds little or no value when:

  • The format has no JHOVE module (most office formats, many proprietary types).
  • The files are one-off or non-archival with no preservation mandate.
  • You already have DROID identification plus fixity and that meets your assurance level.
  • You need content quality (is the scan sharp? is the OCR complete?) — JHOVE cannot tell you that.

A "valid" result is a structural pass, not a guarantee the object is correct.

Well-formed vs valid: which matters more?

StatusMeaningTypical action
Well-Formed and validParses and meets the specAccept
Well-Formed, but not validParses, breaks a spec ruleReview, often accept with a note
Not well-formedStructure is brokenInvestigate; likely reject or repair

"Not well-formed" is the serious flag. "Well-formed but not valid" is frequently benign — a private TIFF tag, a tolerated PDF quirk — and should trigger a human look, not an automatic rejection.

How do I run JHOVE over a whole collection?

Use the headless driver and structured output so results are machine-checkable. JSON or XML handlers feed a report:

bash
for f in masters/*.tif; do
  jhove -m TIFF-hul -h json -o "reports/$(basename "$f").json" "$f"
done

Then filter for any status that is not Well-Formed and valid and queue only those for review. Across a 50,000-image batch this turns validation from a manual chore into an exception list of a few dozen files.

What are JHOVE's known quirks?

The PDF module is the main one to know: complex but perfectly usable PDFs can be flagged Not well-formed because the module is strict about cross-reference tables and encryption. Don't auto-reject on the PDF module alone — corroborate with veraPDF for PDF/A conformance. The TIFF, WAVE and XML modules are mature and you can trust them more directly.

Key Takeaways

  • JHOVE answers one question: is the file well-formed and valid for its format?
  • Run it at ingest and after migration for strict, preservation-grade formats.
  • Skip it for unsupported formats, ephemeral files, or when DROID plus fixity suffices.
  • "Not well-formed" is a hard flag; "not valid" usually warrants review, not rejection.
  • JHOVE checks conformance, never content quality.
  • Trust the TIFF/WAVE/XML modules; corroborate the PDF module with veraPDF.

Frequently Asked Questions

What does JHOVE actually check?

JHOVE checks whether a file is well-formed (syntactically correct for its format) and valid (conforms to the format specification's semantic rules), using format-specific modules for PDF, TIFF, JPEG, JPEG2000, WAVE, XML, HTML, GIF and others.

When should I run JHOVE in my workflow?

Run JHOVE at ingest, before normalisation, to catch malformed or non-conformant files at the door, and again after any migration to confirm the output is valid. It adds most value for formats with strict specs you intend to preserve long term.

When is JHOVE not worth running?

Skip JHOVE for formats it has no module for, for one-off personal files with no preservation mandate, or when DROID identification plus a checksum already meets your assurance needs. It validates conformance, not whether content is correct or complete.

What is the difference between well-formed and valid in JHOVE?

Well-formed means the file's structure parses correctly against the format grammar; valid means it additionally satisfies the specification's higher-level constraints. A file can be well-formed but not valid, which is a softer warning than malformed.

Does a JHOVE 'valid' result mean the file is good?

Not entirely. JHOVE confirms format conformance, not that the image is sharp, the text is complete or the file matches its source. Treat 'valid' as a necessary structural gate, not proof of content quality.

Which JHOVE modules are most reliable?

The TIFF, WAVE and XML modules are mature and trustworthy. The PDF module is useful but can flag complex valid PDFs as not-well-formed, so review its findings rather than auto-rejecting files.