File Formats & Migration

To use PDF/A well, convert born-digital or scanned documents to a self-contained, validated PDF/A file, then verify it with veraPDF before ingest. PDF/A (ISO 19005) is a constrained subset of PDF designed so a document renders identically decades from now: fonts are embedded, colour is unambiguous, and external dependencies are forbidden. The practical work is choosing the right part and level, converting cleanly, and proving conformance.

Which PDF/A part and conformance level do I need?

This guide covers the three established parts: PDF/A-1, PDF/A-2 and PDF/A-3. Levels b (basic) and a (accessible) exist in all three; level u (Unicode) exists only in PDF/A-2 and PDF/A-3. Pick the part by what the document needs:

| Part | Year | Key additions | Use when |
| --- | --- | --- | --- |
| PDF/A-1 | 2005 | Strict baseline | Maximum compatibility, simple documents |
| PDF/A-2 | 2011 | JPEG 2000, transparency, layers, embed PDF/A | Most modern documents |
| PDF/A-3 | 2012 | Embed any file type | Keeping source data with the document |

For most archives, PDF/A-2b is the sensible default: it allows efficient JPEG 2000 image compression and modern features such as transparency while staying strict. Reserve PDF/A-3 for cases where you deliberately bundle a source file (a spreadsheet, an XML record) inside the PDF.
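The rule of thumb above can be sketched as a small shell function. The function name and its two yes/no arguments are hypothetical, chosen only for illustration; the strings it prints follow the flavour identifiers that veraPDF's --flavour option accepts. It deliberately ignores PDF/A-1 and level u, which need a case-by-case decision.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn two yes/no answers into a PDF/A flavour string.
#   $1 = "yes" if a source file (spreadsheet, XML record) is bundled in the PDF
#   $2 = "yes" if the PDF carries tagged structure suitable for level a
choose_flavour() {
  local embeds_source="$1" has_tags="$2"
  local part=2 level=b              # default recommended in the text: PDF/A-2b

  if [ "$embeds_source" = "yes" ]; then
    part=3                          # only PDF/A-3 may embed arbitrary file types
  fi
  if [ "$has_tags" = "yes" ]; then
    level=a                         # level a requires tagged structure
  fi

  printf '%s%s\n' "$part" "$level"
}
```

For example, `choose_flavour no no` selects the default, while `choose_flavour yes no` selects PDF/A-3 at level b.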

How do I convert a document to PDF/A?

For born-digital sources, export directly from the authoring tool to PDF/A where possible. For batch conversion of existing PDFs, Ghostscript is reliable:

```bash
gs -dPDFA=2 -dBATCH -dNOPAUSE -sColorConversionStrategy=UseDeviceIndependentColor \
   -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 \
   -sOutputFile=out_pdfa.pdf PDFA_def.ps input.pdf
```

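-dPDFACompatibilityPolicy=1 tells Ghostscript to leave out any feature that violates PDF/A, with a warning, so that the output stays compliant. The default (0) keeps the feature and produces an ordinary, non-compliant PDF; 2 aborts the conversion with an error, which suits pipelines that must never lose content silently. The PDFA_def.ps file points to the ICC output profile to embed (for example sRGB), which is mandatory.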
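For batch work, the same command can be wrapped in a loop. The following is a minimal sketch, not a hardened tool: the function name, the directory arguments and the "_pdfa" suffix are illustrative assumptions, and PDFA_def.ps is assumed to sit in the current working directory.

```bash
#!/usr/bin/env bash
# Sketch: convert every *.pdf in a source directory with the Ghostscript
# options from the command above. Names and suffix are illustrative.
convert_dir_to_pdfa() {
  local src_dir="$1" out_dir="$2" failed=0 f out
  mkdir -p "$out_dir"

  for f in "$src_dir"/*.pdf; do
    [ -e "$f" ] || continue                    # glob matched nothing
    out="$out_dir/$(basename "${f%.pdf}")_pdfa.pdf"

    if ! gs -dPDFA=2 -dBATCH -dNOPAUSE \
            -sColorConversionStrategy=UseDeviceIndependentColor \
            -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 \
            -sOutputFile="$out" PDFA_def.ps "$f"; then
      echo "conversion failed: $f" >&2         # keep going, report at the end
      failed=$((failed + 1))
    fi
  done

  [ "$failed" -eq 0 ]                          # non-zero status if any file failed
}
```

A call such as `convert_dir_to_pdfa incoming converted` leaves the originals untouched. A zero exit status from Ghostscript only means the run completed, so each output still needs validation.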

How do I validate PDF/A so I can trust it?

veraPDF is the open-source industry reference validator. Validate against the exact flavour you targeted:

```bash
verapdf --flavour 2b --format text out_pdfa.pdf
# exit code 0 = compliant; non-zero = failures listed by ISO clause
```

A passing report is your evidence of conformance. Store the report alongside the file or in your preservation metadata so a future curator can see what was checked and when.
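One way to keep that evidence is to write the report next to the file at validation time. This is a sketch under assumptions: the function name and the ".verapdf.txt" suffix are illustrative, and the first line with the check date is added by the script, not by veraPDF.

```bash
#!/usr/bin/env bash
# Sketch: validate one file with veraPDF and store the report beside it.
#   $1 = PDF to check, $2 = flavour (defaults to 2b)
validate_and_record() {
  local pdf="$1" flavour="${2:-2b}" report status
  report="${pdf%.pdf}.verapdf.txt"             # illustrative naming choice

  {
    # Header added by this script so a later reader sees what was checked and when.
    echo "# checked: $(date -u +%Y-%m-%dT%H:%M:%SZ)  flavour: $flavour"
    verapdf --flavour "$flavour" --format text "$pdf"
  } > "$report"
  status=$?                                    # exit status of verapdf

  if [ "$status" -eq 0 ]; then
    echo "PASS $pdf (report: $report)"
  else
    echo "FAIL $pdf (report: $report)" >&2
  fi
  return "$status"
}
```

Because the function returns veraPDF's own exit status, it can gate an ingest step, for example `validate_and_record out_pdfa.pdf 2b && echo "ready for ingest"`.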

Why does my PDF/A fail validation?

Most failures cluster around a few rules. Work through them in this order:

  • Fonts not embedded — every glyph used must carry its font program. Re-export with full embedding.
  • Encryption present — PDF/A forbids it. Remove the security handler before conversion.
  • Transparency or blend modes in PDF/A-1 — upgrade the target to PDF/A-2 or flatten.
  • No output intent / colour profile — add an ICC profile via the conversion definition.
  • External references — fonts, images or links to files outside the PDF break self-containment.

veraPDF cites the precise failed clause (for example 6.1.13), so fix the named rule rather than guessing.
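The encryption case in the list above can be handled before conversion. The sketch below uses qpdf, which is an assumption of this example (the guide does not name a tool for the step); the function name is also hypothetical. It relies on `qpdf --is-encrypted` exiting with status 0 for an encrypted file and on `qpdf --decrypt` to write an unencrypted copy.

```bash
#!/usr/bin/env bash
# Sketch: produce an unencrypted copy of a PDF ahead of PDF/A conversion.
# Assumes qpdf is installed; it is not part of the toolchain described above.
decrypt_if_needed() {
  local src="$1" dst="$2"

  if qpdf --is-encrypted "$src"; then
    # Succeeds without a password only when the user password is empty;
    # otherwise qpdf needs the password supplied via its --password option.
    qpdf --decrypt "$src" "$dst"
  else
    cp -- "$src" "$dst"                        # nothing to remove, copy as-is
  fi
}
```

Only remove protection from documents you are entitled to unlock, and record the step in your preservation metadata since it changes the file.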

Should scanned documents get an OCR layer?

Add one. An image-only PDF/A is valid but offers no searchable or extractable text, which undermines discovery and accessibility. Run OCR (Tesseract via OCRmyPDF) to produce an invisible text layer that aligns with the image:

```bash
ocrmypdf --output-type pdfa -l eng+lat scanned.pdf searchable_pdfa.pdf
```

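--output-type pdfa makes OCRmyPDF write PDF/A (PDF/A-2b by default) in the same pass as OCR; it does not run a full conformance check, so still validate the result with veraPDF. -l eng+lat handles the mixed English and Latin pages common in historical material.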
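OCR and validation chain naturally into one step. The sketch below assumes both ocrmypdf and verapdf are on the PATH; the function name and argument order are illustrative, and the 2b flavour is an assumption based on OCRmyPDF's default PDF/A output.

```bash
#!/usr/bin/env bash
# Sketch: add an OCR text layer, then check the result with veraPDF.
#   $1 = scanned input, $2 = output file, $3 = Tesseract languages (default eng)
ocr_and_validate() {
  local scan="$1" out="$2" langs="${3:-eng}"

  # Stop here if OCR fails, so a stale or partial output is never validated.
  ocrmypdf --output-type pdfa -l "$langs" "$scan" "$out" || return 1

  # The function's exit status is veraPDF's: 0 only when the file is compliant.
  verapdf --flavour 2b --format text "$out"
}
```

For the mixed-language example above, the call would be `ocr_and_validate scanned.pdf searchable_pdfa.pdf eng+lat`.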

Key Takeaways

  • PDF/A is a constrained, self-contained PDF subset (ISO 19005) for long-term rendering.
  • Default to PDF/A-2b; use PDF/A-3 only when embedding source files; use level A when you can supply tagged structure.
  • Convert with Ghostscript using -dPDFA=2 and an explicit -dPDFACompatibilityPolicy (1 drops violating content, 2 aborts with an error), embedding a mandatory ICC output profile.
  • Always validate with veraPDF and keep the report as conformance evidence.
  • Add an OCR text layer (OCRmyPDF) so scanned documents are searchable and accessible.
  • PDF/A is necessary but not sufficient — pair it with fixity, replication and metadata.

Frequently Asked Questions

What is the difference between PDF/A-1, PDF/A-2 and PDF/A-3?

PDF/A-1 (2005) is the strict baseline with no transparency, layers or JPEG 2000. PDF/A-2 (2011) adds JPEG 2000 compression, transparency, layers and PDF/A file embedding. PDF/A-3 (2012) allows any file type to be embedded, which is useful for keeping source data alongside the document.

Should I choose conformance level A or B?

Level B (basic) only guarantees reliable visual reproduction. Level A (accessible) additionally requires tagged structure and Unicode text mapping for accessibility and reflow. Choose A when you have or can generate good structure; otherwise B is an honest, valid choice.

How do I convert an existing PDF to PDF/A?

Use Ghostscript or another trusted converter, then validate. Ghostscript run with -dPDFA=2 and the PDF/A definition file targets PDF/A-2b; always re-validate the output with veraPDF because conversion can silently fail or downgrade.

Why does my PDF/A fail validation?

The most common causes are non-embedded fonts, transparency in a PDF/A-1 target, encryption, external references, and a missing output intent (colour profile). veraPDF reports the exact failed clause so you can fix the specific rule rather than guessing.

Can PDF/A contain scanned images without OCR text?

Yes. An image-only PDF/A is valid at level B, but it has no searchable or extractable text. Add an OCR text layer to make the document searchable. Level A also requires tagged structure, which an OCR layer alone does not provide.

Does PDF/A guarantee long-term preservation on its own?

No. PDF/A guarantees the document is self-contained and renderable, but you still need fixity checks, replication, format-health monitoring and metadata. Treat PDF/A as one component of an OAIS-style preservation system, not a complete solution.
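As one illustration of the fixity component, checksums can be recorded at ingest and re-checked later. This is a minimal sketch assuming GNU coreutils' sha256sum; the function names and the manifest.sha256 filename are illustrative, and a real repository would normally hold fixity data in its preservation system.

```bash
#!/usr/bin/env bash
# Sketch: record and later verify SHA-256 checksums for the PDFs in a directory.
# Assumes GNU sha256sum; manifest.sha256 is an illustrative filename.
record_fixity() {
  local dir="$1"
  # Subshell so the caller's working directory is unchanged.
  ( cd "$dir" && sha256sum -- *.pdf > manifest.sha256 )
}

verify_fixity() {
  local dir="$1"
  # --quiet prints only files that fail; exit status is non-zero on any mismatch.
  ( cd "$dir" && sha256sum --check --quiet manifest.sha256 )
}
```

Run `record_fixity archive_dir` once after validation, then schedule `verify_fixity archive_dir` and treat any non-zero status as a prompt to restore from a replica.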