Blog Post

What Is PDF/A? A Complete 2023 Guide

PSPDFKit
Illustration: What Is PDF/A? A Guide to PDF/A and How to Convert to Any Type of PDF/A Programmatically

In this post, you’ll learn about PDF/A and find out everything you need to know to get the most out of the format when safeguarding your electronic documents.

Information

If you’re already PDF/A-savvy and you’re interested in programmatic conversion, you can get started by testing our configurable PDF/A library and supported conversion strategies.

What Is PDF/A and How Can You Use It to Improve Your Document Preservation?

In PDF/A, the A stands for archival. PDF/A is an open, international (ISO) standard suited to the challenges of preserving electronic and digital documents.

So, what is PDF/A? It’s a type of PDF designed to safeguard the longevity of your documents, irrespective of future technological changes. And as a format, it’s a highly flexible and reliable document archival solution, evolving with the demands of today’s global, digital, and increasingly accessibility-focused environment.

Who Uses PDF/A and Why?

The PDF/A family of standards was introduced in 2005 and quickly found a broad user base. Today, many organizations, especially in Europe, mandate PDF/A as their go-to format for electronic document preservation. Many national archives and libraries, such as NARA and PACER in the United States, also recommend PDF/A.

Regulatory compliance is a significant reason many businesses adopt PDF/A. They often face audits and must keep complete, accurate, and unchanged document records for decades — and sometimes forever. These records need to remain usable and accessible across many generations of technology.

Using PDF/A makes this process smoother due to PDF/A’s widespread acceptance and advantages over other archival formats.

PDF/A Benefits

PDF/A is a platform-independent format containing all the necessary resources for universal rendering of the original intended visual appearance of documents.

Beyond visible information, PDF/A also preserves hidden file elements integral to electronic documents. These elements include document and object metadata critical to efficient document management. PDF/A also supports strict file structure and text encoding standards, ensuring a document’s long-term technical accessibility.

In addition, PDF/A shares substantial benefits of its parent format PDF:

  1. PDF/A, governed by the PDF imaging model, itself based on PostScript, combines vector graphics and raster images for accurate display and printing.

  2. It has sophisticated compressions (FlateDecode) and advanced image formats (JPEG2000) to reduce file sizes for storage while preserving quality in both scanned and image-heavy documents.

  3. You can open and view PDF/A consistently using any of the widely available PDF readers, including in any standard browser through simple drag and drop. The same can’t be said for alternatives like TIFF images or Word documents.

When to Use PDF/A

PDF/A is excellent for preserving digitally born documents — such as those created by any desktop publishing program — that include live text, complex formatting, embedded raster images, colors, vector graphics, and pagination.

With its support for advanced compression and composition techniques, PDF/A is also great for scanned documents. In turn, this leads to smaller, storage-friendly files.

PDF/A comes in many versions, or “flavors,” serving niche applications — such as archiving hybrid electronic invoices, engineering documents with embedded 3D models, and emails with file attachments.

That said, PDF/A is less suited to content with dynamic or multimedia aspects, like websites and spreadsheets, as they require removing or “flattening” most dynamic parts into static images, which results in lost information.

Why Can’t I Just Archive Regular PDFs? PDF vs. PDF/A

This section will discuss how a PDF/A differs from a regular PDF and why those differences matter for archiving.

A PDF/A is another type of PDF file. What makes a PDF/A different is that it has met additional, exacting requirements through PDF/A conversion and verification. Verification ensures a document file is completely self-contained and device independent, and thus safe for archiving purposes and submissions.

Why PDF Doesn’t Work for Archiving — The Problem of Malformed Files

The initial concept behind regular PDFs was for them to preserve the original look of documents — that regardless of the system or software used, PDFs would render identically. But, in practice, that’s not always the case, as there are many ways to create a PDF file, even if they look the same.

Yet what many users don’t realize is that this surface-level uniformity can be deceptive.

When viewed, a malformed PDF often appears no different to a well-formed one due to a viewer’s built-in repair function. This function handles issues backstage, so contemporary users don’t experience trouble. But problems can and do crop up years later, making verification critical so as to avoid issues with fonts, for example, and other PDF preservation risks.

Example: Non-Embedded Fonts

Consider the example of fonts. The PDF specification highly recommends embedding font sets for rendering visible characters in a document. But not all PDF generators embed font information; instead, they reference a font that’s somewhere else, such as a user-installed font or a preinstalled system font.

When a modern viewer can’t recover font data, users encounter substituted fonts or synthesized fonts after auto-repair. Users can then encounter modified text or less-than-smooth performance, because synthesizing vector fonts eats up more memory. Afterward, the text is (usually) still readable.

In an extreme scenario, however, text completely disappears or becomes garbled and thus illegible. Even today, a PDF viewer can fail where text data is unrecoverable.

PDF/A non-embedded fonts

PDF/A non-embedded fonts error

Figure 1: The images above show an extreme example from a PDF.js community issue for non-embedded fonts. The parent issue was reported in 2014 and closed in 2021 without an official PDF.js contributor community fix. The recommended solution is for all PDF creators to embed their font sets, or at least a subset — something PDF/A requires.

How PDF/A Works to Safeguard Your PDFs

The reliability of accessing external resources like fonts in PDFs diminishes over time.

So does the ability of developers to compensate for malformed and corrupted files, leading to loss of information and errors in visual appearance if the underlying issues aren’t corrected.

PDF/A solves these preservation risks through its extra restrictions and requirements.

Some examples:

  1. All fonts must be correctly embedded to ensure universal rendering.

  2. External content references are forbidden.

  3. LZW image compression is banned.

  4. Audio and video content are forbidden, meaning no audio or movie annotations are allowed.

  5. Use of a machine-readable and standards-based metadata is required via the Extensible Metadata Platform (XMP).

  6. JavaScript and executable files are forbidden.

  7. Color spaces must be specified in a device-independent manner to preserve color fidelity.

  8. File-level encryption is not allowed.

What Is PDF/A Verification and Why Is It Necessary?

PDF/A verification, or validation, reveals whether or not a PDF is an authentic PDF/A file. In the past, validation tools often disagreed, which is a problem addressed by veraPDF, an open source PDF/A validation suite backed by an industry coalition.

A PDF/A converter program must ensure documents align with the PDF/A standard, and converted files must therefore pass a rigorous PDF/A verification process. It’s then up to developers to ensure their converters work correctly and for users to pick tools following the PDF/A and ISO specifications.

Automated verification is critical because it isn’t possible to tell an archive-safe PDF/A document from an unsafe PDF document by just looking at one, and manual object inspection isn’t suited to high throughput server environments.

Verification’s Nuts and Bolts

During PDF/A verification, a tool or program rigorously inspects objects in a file. The validator then recommends fixes based on errors it encounters, such as elements that need removal (encryption) or information that needs embedding (fonts).

Developers who craft well-formed PDFs will get files through verification more smoothly, with minimum need for repair. However, if you convert files whose creation you cannot control or at a stricter PDF/A conformance, then verification becomes more challenging.

Once a file passes muster, it’s awarded the PDF/A metadata flag to show the world it’s an authentic PDF/A document and it’s safe for archiving — in theory, anyway.

PDF/A Verification Challenges

In the past, PDF/A files converted and verified in one tool would fail verification in another due to different vendor interpretations of the complex PDF/A specifications. This clash of the converters made the PDF/A flag a questionable indicator of a converted file’s actual archive-readiness.

Today, the veraPDF suite offers a reliable and highly detailed PDF/A verification tool, correcting past PDF verification conflicts. We also make use of its precision to fine-tune our own PDF/A conversion and validation engine.

What Is veraPDF and How Can You Use It?

VeraPDF logo

Recognizing the need for a standardized and reliable way to validate the PDF/A format, the PDF industry formed the veraPDF consortium in 2014 with European Union PREFORMA project funding. The coalition, led by the Open Preservation Foundation (OPF) and the PDF Association, developed a precise, open source validator for PDF/A that could then serve as the go-to tool for individuals, vendors, and archivists alike.

Information

veraPDF performs a close, object-by-object examination of documents, positioning it as an ideal benchmarking tool. However, veraPDF’s analysis, while highly precise, isn’t suited for server conversions requiring validation in seconds, as these demand a more efficient approach.

Today, the veraPDF suite includes the following:

  • A graphical user interface (GUI) desktop version to test a single file

  • A command-line interface (CLI) for batch processing

  • An online demo

  • A Java library — An API for developers to embed veraPDF in their Java-based apps

What Are the Different PDF/A Versions and Conformances? Why Do They Matter?

PDF/A comes in different versions to support different archival uses, and the conformance level you pick will impact the ease in which your files pass PDF/A verification and features present in your files.

The PDF/A standard has four versions, each released in a different year and based on a newer version of the core PDF specification.

  • PDF/A-1 (ISO 19005-1:2005), based on PDF 1.4 (no ISO)

  • PDF/A-2 (SO 19005-2:2011), based on PDF 1.7 (ISO 32000-1:2008)

  • PDF/A-3 (ISO 19005-3:2012), also based on PDF 1.7 (ISO 32000-1:2008), with minor changes from PDF/A-2

  • PDF/A-4 (ISO 19005-4:2020), based on PDF 2.0 (ISO 32000-2:2020)

Information

PDF 2.0 was first published by ISO in 2017 as 32000-2:2017. It was then updated in 2020 to 32000-2:2020.

PDF/A also supports several conformance levels; the available conformances vary with the selected version of PDF/A.

Here are the available conformances by PDF/A version:

  • PDF/A-1a and PDF/A-1b

  • PDF/A-2a, PDF/A-2b, and PDF/A-2u

  • PDF/A-3a, PDF/A-3b, and PDF/A-3u

  • PDF/A-4, PDF/A-4e, and PDF/A-4f

What Is the “Right” PDF/A Version and Conformance?

A common challenge for developers and users is choosing the appropriate PDF/A versions and conformances during conversion. Given the myriad of PDF/A types, the distinctions can sometimes appear nuanced. The following sections will provide some clarification to help you make an informed choice.

Choosing a PDF/A Version and Conformance

A common belief is to see PDF/A versions as replacements — like newer versions of software. This is somewhat true; newer specifications bring greater clarity to archival workflows and improvements in the later PDF specifications beneath the different PDF/A versions.

That said, PDF/A versions and conformances can significantly impact your converted file sizes, the types of information safeguarded, and the ease in which your files pass validation. Therefore, PDF/A variants can also be seen as different tools fulfilling slightly different functions. The best practice is to pick the most efficient “tool” — or version — for the job.

So, which “tool” should you use?

PDF/A-2b is the safest bet, as it’s the most compatible variant and the widely recommended default, especially for scanned documents.

In contrast, PDF/A-1 is considered out of date; the only reason some use it is because of a policy or requirement by the receiving party.

If your archival use case is more complex, consider PDF/A-4, which offers enhanced capabilities for complexities such as preservation of logical structure, PDF’s trademark dynamic, interactive forms, 3D model annotations, and digital signatures.

Also, as a final note, all PDF/A standards are backward compatible, so even if you’re forced to create PDF/A-1 files, they’ll still open flawlessly in any ISO-compliant PDF reader.

PDF/A-4 — The Next Generation PDF/A?

Working with the latest specifications to ensure up-to-date functionality, compatibility, and security in any software is a good idea.

And PDF/A-4 (ISO 19005-4:2020) represents more than a decade of experience and collaboration between the major industry players, all focusing on making PDF/A both more cost-effective and flexible for various purposes.

In summary: On top of many clarifications unique to archiving, PDF/A-4 inherits PDF 2.0 improvements to support richer forms of content and improved capabilities for machine-readable text; dynamic, user-fillable forms; and logical structure.

The end result? More cost-effective development of accessible documents, which are easier to repurpose for a variety of reasons, such as:

  • Text extraction and AI

  • Conversion to responsive HTML for small devices

  • Use within assistive technologies like screen readers and text-to-speech systems

PDF/A-4 also inherits PDF 2.0 support for PDF Advanced Electronic Signatures (PAdES), a new type of digital signature that’s verifiable over long periods of time — in addition to several other improvements.

How to Convert Your Documents into Future-Proof PDF/A

Are you keen on adding efficient and high-quality PDF/A functionality into your document processing application?

PSPDKFit offers a configurable PDF/A library that integrates into any web or server app. It supports fast, professional conversion and validation to any PDF/A variant, with individual files converting and validating in seconds during server processing.

You can also convert PDFs, Office documents, and images to PDF/A client-side, in turn offloading processing, reducing server load, and enhancing web app scalability with superior performance during peak traffic. Additionally, it bolsters security by limiting data transmission over a network.

Selecting Your PDF/A Conversion Strategy

Not all PDF instructions are universally compatible with all conformances. So, our PDF/A converter can be configured as needed to any of the following strategies in support of your archival needs and policy goals.

  • Page content stream conversion involves converting instructions within a PDF’s content stream to a compatible format. By preserving the original structure of the content, it ensures accurate rendering and fidelity during conversion. This is the preferred strategy, and when converting to the default (recommended) PDF/A-2b conformance, the success ratio is around 97 percent.

  • Page content vectorization converts page content into scalable vector graphics, enabling high-quality scaling and smooth rendering across different devices and resolutions. Transforming instructions into a mathematical representation allows for precisely reproduced shapes, lines, and text.

  • Page content rasterization converts page content into a grid of pixels, resulting in a raster image representation. This method may sacrifice some scalability, but it offers broader compatibility and ensures consistent visual appearance, regardless of the rendering environment. This approach is required chiefly as a fallback when converting modern documents to older PDF/A conformance formats like PDF/A-1b, which have limited page content composition features.

By dynamically selecting the most appropriate conversion strategy (above), you maximize your chances of successful and accurate conversion, meaning you’ll be expending less elbow grease in getting files to pass.

Our library also comes with additional, purpose-built features to serve your needs and smooth out conversions:

  • Built-in repair — The engine includes its own robust PDF repair capability to recover malformed or corrupted documents. The features seamlessly conduct repair operations whenever necessary. Detecting and addressing issues with PDF files ensures the retrieval of valuable data from potentially problematic records without any disruption or inconvenience.

  • Built-in optimization — Thanks to its PDF size reduction capabilities, font, and resource optimization, developers can leverage PDF/A conversion processes to produce highly optimized documents.

Next Steps

Are you ready to test our PDF/A conversion capabilities? Get started for free by visiting the downloads page for trial information and packages in your preferred language. Then check out the PDF/A documentation for the demo and samples.

PSPDFKit’s PDF/A conversion library includes features to ensure seamless and cost-effective integration in a variety of environments. Let’s briefly recap:

  • ISO-compliant and veraPDF-tested engine

  • Highly configurable in support of your archival strategy and policies

  • Built-in repair engine to recover information from malformed records

  • Advanced compression for highly efficient storage without quality loss

  • All PDF/A versions and conformances supported, including PDF/A-4, PDF/A-4e, and PDF/A-4f

  • Fast and smooth bulk processing, including client-side conversion

When you’re ready to use it in a production environment, you can speak with our Sales team. And if at any point you have questions or concerns, it’s easy to contact the engineers who built the product.

FAQS

What Are PDF/A’s Advantages over Other Formats?

In this section, we’ll cover some PDF/A alternatives for documents and, in contrast, the benefits of PDF/A.

PDF/A vs. Word

The Word format, while common, has notable drawbacks for archiving purposes compared to PDF/A:

  • As a proprietary format, it lacks an ISO standard designated for archiving, and as a result, it doesn’t have the guarantees of international standardization in terms of interoperability and universal viewing.

  • Word has limits for ensuring color consistency and preserving a fixed representation of documents with complex graphics and formatting.

  • PDF has better compression than Word, leading to smaller files.

  • Word’s font handling could also be improved.

PDF/A vs. TIFF

Before PDF/A, archivists used fully rendered image formats such as JPEG and TIFF, which were invented when most archived documents still came in paper form.

TIFF documents might still exist in your archives, and specific business programs could continue generating TIFF imagery. Even files that look like PDFs may only wrap images within PDF coverings.

But TIFF has its known drawbacks for business infrastructures that need consistent document searchability, universal viewing, and efficient storage.

Documents rasterized into TIFF create a bitmap, which is a file storing image data for every pixel. Thus, as a sizable static snapshot of the original, TIFF doesn’t natively support text searchability, and it consumes more disk space.

PDF/A vs. HTML

HTML isn’t inherently bad for archiving, but it has certain limits for long-term preservation compared to PDF/A. HTML depends on live web resources and is subject to changes in web technologies, which can break links or render certain features unusable over time.

What Are Some Other, Common PDF Preservation Risks?

Archiving malformed PDFs leads to many preservation risks encompassing a range of potential pitfalls, with lasting consequences for the integrity and usability of your archived documents.

Here’s a look at some:

  1. Live PDF text becoming static or unreadable — Insufficient character encoding not only hinders text legibility, but it also impedes reliable machine processing, meaning text is also no longer accurately searchable or extractable.

  2. Issues with images, scanned documents, and drawings — Proprietary formats and obsolete compressions, like LZW, can result in content failing to display correctly or at all, because a future viewer won’t know how to correctly interpret the compressed data.

  3. Inaccurate color representation — The absence of any color management leads to incorrect renderings on different devices. For example, branded imagery, such as a Pepsi logo, will change shades when displayed on different screens or when printed.

  4. Lost document histories and keywords — Future users will face frustration if metadata is non-standard or missing, causing a loss of machine-readable keywords used in discovery and retrieval, along with loss of document origin details.

  5. Expiration of digital signatures — Digital signatures break or become invalidated when certificates expire or trusted authorities close shop, making long-term validation (LTV) a challenge.

  6. External content references that return a 404 error — A PDF, not unlike websites, may also embed a reference to an external resource such as a URL, leading to rendering failure as soon as a link breaks.

What Are the Different PDF/A Conformances and What Do They Do?

Earlier conformances added extra requirements, like Unicode (U) or logical structure (A).

In contrast, level B conformance stands for “basic” and is thus considered the easiest to work with, as it doesn’t require Unicode or logical structure, whereas U requires Unicode, and A requires both Unicode and logical structure. Many documents lack these features, and in the case of logical structure, it’s a time-consuming and manual process to add via PDF tagging.

The two new PDF/A-4 conformances (E and F) add provisions to embed non-PDF/A files where you’re typically restricted to embedding PDF/A files only.

E = engineering — This lets you embed 3D data in PDFs using PRC or U3D formats as annotations.

F = file attachment — This lets you embed files of any other format in your PDFs.

Information

All PDF/A-4 conformances now require Unicode for all text, like under A and U in earlier versions. PDF/A-4 also streamlines conformance by eliminating the A, B, and U.

Share Post
Free 60-Day Trial Try PSPDFKit in your app today.
Free Trial

Related Articles

Explore more
COMPANY  |  Company

New Generative AI-Enabled Integrated Document Workflow Capabilities by PSPDFKit

COMPANY  |  Company • Products • Remote Work • Insights

2023: Year in Review

COMPANY  |  Company

PSPDFKit Acquires Low-Code Process Automation Platform Provider Integrify