To use PDF/A well, convert born-digital or scanned documents to a self-contained, validated PDF/A file, then verify it with veraPDF before ingest. PDF/A (ISO 19005) is a constrained subset of PDF designed so a document renders identically decades from now: fonts are embedded, colour is unambiguous, and external dependencies are forbidden. The practical work is choosing the right part and level, converting cleanly, and proving conformance.
Which PDF/A part and conformance level do I need?
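Three parts are in common use (PDF/A-4, published in 2020, is not covered here). Depending on the part, the conformance levels are b (basic), a (accessible) and u (Unicode, PDF/A-2 and PDF/A-3 only). Pick the part by what the document needs: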
| Part | Year | Key additions | Use when |
|---|---|---|---|
| PDF/A-1 | 2005 | Strict baseline | Maximum compatibility, simple documents |
| PDF/A-2 | 2011 | JPEG 2000, transparency, layers, embed PDF/A | Most modern documents |
| PDF/A-3 | 2012 | Embed any file type | Keeping source data with the document |
For most archives, PDF/A-2b is the sensible default: it allows efficient JP2 image compression and modern features while staying strict. Reserve PDF/A-3 for cases where you deliberately bundle a source file (a spreadsheet, an XML record) inside the PDF.
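The selection rule above can be sketched as a small shell helper, which is handy when a batch script needs a flavour string to pass on to a validator. The function name and its two yes/no arguments are hypothetical, and the rule is a simplification of the table, not a complete decision procedure.

```bash
# Hypothetical helper: map two yes/no answers to a PDF/A flavour string
# such as "2b" (part number followed by conformance level).
#   $1 = "yes" if arbitrary source files must be embedded (needs PDF/A-3)
#   $2 = "yes" if the document carries tagged structure (allows level a)
choose_pdfa_flavour() {
  local embeds_source="${1:-no}" has_tags="${2:-no}"
  local part=2 level=b                 # PDF/A-2b is the default suggested above
  if [ "$embeds_source" = "yes" ]; then part=3; fi
  if [ "$has_tags" = "yes" ]; then level=a; fi
  printf '%s%s\n' "$part" "$level"
}
```

For example, `choose_pdfa_flavour yes no` prints `3b`, and calling it with no arguments prints the `2b` default.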
How do I convert a document to PDF/A?
For born-digital sources, export directly from the authoring tool to PDF/A where possible. For batch conversion of existing PDFs, Ghostscript is reliable:
```bash
gs -dPDFA=2 -dBATCH -dNOPAUSE -sColorConversionStrategy=UseDeviceIndependentColor \
   -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 \
   -sOutputFile=out_pdfa.pdf PDFA_def.ps input.pdf
```
`-dPDFACompatibilityPolicy=1` tells Ghostscript to drop content that cannot be represented in PDF/A, emitting a warning for each dropped feature, rather than writing a file that is not PDF/A. Use `2` instead if you would rather have the conversion abort with an error. The `PDFA_def.ps` file points to the ICC output profile to embed (for example sRGB), which is mandatory.
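For batch work, the same Ghostscript command can be wrapped in a loop. The following is a minimal sketch. It assumes `PDFA_def.ps` sits in the current directory and already points at an ICC profile; the function name, directory arguments and `_pdfa.pdf` suffix are illustrative.

```bash
# Sketch: convert every PDF in a source directory with the gs command above.
# Assumption: PDFA_def.ps is in the current directory and references an ICC profile.
convert_dir_to_pdfa() {
  local src_dir="$1" out_dir="$2" input output failed=0
  mkdir -p "$out_dir"
  for input in "$src_dir"/*.pdf; do
    [ -e "$input" ] || continue        # the glob matched nothing
    output="$out_dir/$(basename "$input" .pdf)_pdfa.pdf"
    if ! gs -dPDFA=2 -dBATCH -dNOPAUSE \
         -sColorConversionStrategy=UseDeviceIndependentColor \
         -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 \
         -sOutputFile="$output" PDFA_def.ps "$input"; then
      echo "conversion failed: $input" >&2
      failed=1                         # keep going, but report failure at the end
    fi
  done
  return "$failed"
}
```

Calling `convert_dir_to_pdfa ./incoming ./pdfa` returns non-zero if any file failed, so a wrapping script can stop before ingest.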
How do I validate PDF/A so I can trust it?
veraPDF is the open-source industry reference validator. Validate against the exact flavour you targeted:
```bash
verapdf --flavour 2b --format text out_pdfa.pdf
# exit code 0 = compliant; non-zero = failures listed by ISO clause
```
A passing report is your evidence of conformance. Store the report alongside the file or in your preservation metadata so a future curator can see what was checked and when.
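One way to keep that evidence is to write the report next to the file at validation time. This is a minimal sketch that assumes `verapdf` is on the `PATH`; the function name, the `.verapdf.txt` suffix and the timestamp header line are illustrative choices, not veraPDF conventions.

```bash
# Sketch: validate one file and keep the report beside it as evidence.
# Assumption: the report name and header line are local conventions.
validate_and_record() {
  local pdf="$1" flavour="${2:-2b}" report
  report="${pdf%.pdf}.verapdf.txt"
  # Record when the check ran and which flavour was targeted.
  echo "# validated: $(date -u +%Y-%m-%dT%H:%M:%SZ) flavour: $flavour" > "$report"
  if verapdf --flavour "$flavour" --format text "$pdf" >> "$report"; then
    echo "PASS $pdf (report: $report)"
  else
    echo "FAIL $pdf (see $report)" >&2
    return 1
  fi
}
```

`validate_and_record out_pdfa.pdf 2b` would leave `out_pdfa.verapdf.txt` alongside the PDF whether the check passes or fails.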
Why does my PDF/A fail validation?
Most failures cluster around a few rules. Work through them in this order:
- Fonts not embedded — every glyph used must carry its font program. Re-export with full embedding.
- Encryption present — PDF/A forbids it. Remove the security handler before conversion.
- Transparency or blend modes in PDF/A-1 — upgrade the target to PDF/A-2 or flatten.
- No output intent / colour profile — add an ICC profile via the conversion definition.
- External references — fonts, images or links to files outside the PDF break self-containment.
veraPDF cites the precise failed clause (for example 6.1.13), so fix the named rule rather than guessing.
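As an example of fixing one named cause, encryption can be detected and removed before conversion. The sketch below uses qpdf, which is an assumption on my part since the text above does not name a tool for this step. The function name is hypothetical, and a file that needs a password to open would also need qpdf's `--password=` option.

```bash
# Sketch: strip the security handler before converting to PDF/A.
# Assumption: qpdf is installed; it is not part of the workflow described above.
strip_encryption() {
  local input="$1" output="$2"
  # qpdf --is-encrypted exits 0 when the file is encrypted, non-zero otherwise.
  if qpdf --is-encrypted "$input"; then
    qpdf --decrypt "$input" "$output"
  else
    cp "$input" "$output"              # not encrypted: pass through unchanged
  fi
}
```

Running `strip_encryption input.pdf clean.pdf` first and converting `clean.pdf` keeps the original untouched.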
Should scanned documents get an OCR layer?
Add one. An image-only PDF/A is valid but offers no searchable or extractable text, which undermines discovery and accessibility. Run OCR (Tesseract via OCRmyPDF) to produce an invisible text layer that aligns with the image:
```bash
ocrmypdf --output-type pdfa -l eng+lat scanned.pdf searchable_pdfa.pdf
```
`--output-type pdfa` makes OCRmyPDF write PDF/A in the same step as the OCR, and `-l eng+lat` handles the mixed English and Latin pages common in historical material. Confirm the result with veraPDF like any other converted file.
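OCR and an independent conformance check can be chained so that a scan is only accepted once both succeed. This is a minimal sketch assuming `ocrmypdf` and `verapdf` are on the `PATH`. The function name is hypothetical, and the `2b` flavour reflects my understanding that OCRmyPDF's `pdfa` output type targets PDF/A-2b, so adjust it if your version differs.

```bash
# Sketch: OCR a scan to PDF/A, then check the result with veraPDF.
# Assumption: OCRmyPDF's "pdfa" output type targets PDF/A-2b.
ocr_scan_to_pdfa() {
  local scan="$1" out="$2" langs="${3:-eng}"
  # Stop early if OCR fails, so a broken file is never validated or ingested.
  ocrmypdf --output-type pdfa -l "$langs" "$scan" "$out" || return 1
  verapdf --flavour 2b --format text "$out"
}
```

`ocr_scan_to_pdfa scanned.pdf searchable_pdfa.pdf eng+lat` mirrors the command above and returns veraPDF's exit status.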
Key Takeaways
- PDF/A is a constrained, self-contained PDF subset (ISO 19005) for long-term rendering.
- Default to PDF/A-2b; use PDF/A-3 only when embedding source files; use level A when you can supply tagged structure.
- Convert with Ghostscript using `-dPDFACompatibilityPolicy=1`, embedding a mandatory ICC output profile.
- Always validate with veraPDF and keep the report as conformance evidence.
- Add an OCR text layer (OCRmyPDF) so scanned documents are searchable and accessible.
- PDF/A is necessary but not sufficient — pair it with fixity, replication and metadata.
Frequently Asked Questions
What is the difference between PDF/A-1, PDF/A-2 and PDF/A-3?
PDF/A-1 (2005) is the strict baseline with no transparency, layers or JPEG 2000. PDF/A-2 (2011) adds JPEG 2000 compression, transparency, layers and PDF/A file embedding. PDF/A-3 allows any file type to be embedded, which is useful for keeping source data alongside the document.
Should I choose conformance level A or B?
Level B (basic) only guarantees reliable visual reproduction. Level A (accessible) additionally requires tagged structure and Unicode text mapping for accessibility and reflow. Choose A when you have or can generate good structure; otherwise B is an honest, valid choice.
How do I convert an existing PDF to PDF/A?
Use Ghostscript or another trusted converter, then validate with veraPDF. Ghostscript with `-dPDFA=2` and the PDF/A definition file targets PDF/A-2b; always re-validate the output because conversion can silently fail or downgrade.
Why does my PDF/A fail validation?
The most common causes are non-embedded fonts, encryption, transparency or blend modes in a PDF/A-1 target, missing colour profiles, and external references. veraPDF reports the exact failed clause so you can fix the specific rule rather than guessing.
Can PDF/A contain scanned images without OCR text?
Yes. An image-only PDF/A is valid at level B, but it has no searchable or extractable text. Add an OCR text layer to make the document searchable. Level A additionally requires tagged structure, which an OCR text layer alone does not supply.
Does PDF/A guarantee long-term preservation on its own?
No. PDF/A guarantees the document is self-contained and renderable, but you still need fixity checks, replication, format-health monitoring and metadata. Treat PDF/A as one component of an OAIS-style preservation system, not a complete solution.