Why PDF/A validation matters, even if you don’t have PDF/A

This is the first installment of a 2-part blog (part 2 is here). It was prompted by the upcoming Digital Preservation Coalition briefing When is a PDF not a PDF?, for which I was asked to prepare a presentation. My initial idea was to give an overview of the work we did on PDF preservation risk assessment using a PDF/A validator in the SCAPE project. Most of this has already been covered by a series of earlier blog posts. Those blogs very much represent different stages of a work in progress, and I think this makes them somewhat challenging for readers who are new to the subject.

The purpose of this 2-part blog is twofold. First, it is an attempt to give an accessible overview of the earlier work on PDF preservation risks, stressing the importance of PDF/A validator tools in detecting these risks. Second, it provides some tentative suggestions for how the ongoing work on the new VeraPDF PDF/A validator could close some of the gaps and limitations of the SCAPE work.

Preservation risks of PDF

The PDF format has a number of features that don’t sit well with the aims of long-term preservation and accessibility. These include encryption and password protection, external dependencies (e.g. fonts that are not embedded in a document), and reliance on external software. More details can be found in the PDF entry of the OPF File Format Risk Registry. Below are some examples; I included download links, so you can try them out for yourself.

Document Open password

If you try to open file encryption_openpassword.pdf in Adobe Acrobat, you end up with this dialog:

Without the password, the file cannot be opened at all.

File encryption_noprinting.pdf can be opened normally, but you cannot print it:

Embedded Quicktime movie

File embedded_video_quicktime.pdf contains multimedia content in Quicktime format. Acrobat cannot render this format natively, and relies on an external player. This is what happened when I opened the file on my PC:

After I clicked on Get Media Player, I was taken here:

I wasn’t able to configure Acrobat to use a media player that supports Quicktime [1].

External reference to multimedia file

The file movie.pdf contains references to external multimedia files. If you click on any of them you get an error like this one:

Font not embedded

File calistoMTNoFontsEmbedded.pdf uses Calisto MT, but the font is not embedded. Since Calisto MT is a Windows system font, the file looks fine on my Windows PC:

The font does not come pre-installed with common Linux distros, and as a result the file looks quite a bit different on my Linux machine:

3D content

The file digitally_signed_3D_Portfolio.pdf contains 3D artwork. Acrobat correctly renders the 3D content, which can be manipulated interactively by the user:

However, Acrobat aside, the majority of PDF readers don’t support 3D content, with the result that in other readers you may end up with something like this:

Detecting risky features

Archives or libraries may want to check their PDFs for one or more features like those shown above. Reasons for doing so include:

  • Pre-ingest checks against an institutional policy (e.g. an archive may not accept PDFs that are password protected)

  • Profiling of existing collections for preservation risks (e.g. embedded multimedia content in hard-to-render formats)

For this quite a few useful software tools are already available. For example, qpdf gives detailed information about encryption and password protection:
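For readers who want to script this check, here is a minimal Python sketch built around qpdf's `--show-encryption` option. The helper names and the sample report are my own illustrations, not part of qpdf. The exact wording of qpdf's report may differ between versions, so treat the parsing as an assumption to verify against your own installation.

```python
import re
import subprocess

# Matches report lines such as "R = 3" or "print low resolution: not allowed".
REPORT_LINE = re.compile(r"^(?P<key>[^:=]+?)\s*[:=]\s*(?P<value>.*)$")


def show_encryption(path):
    """Run `qpdf --show-encryption` and return whatever it printed to stdout.

    qpdf exits with an error if it cannot open the file (for instance when
    a Document Open password is needed), so the return code is not checked here.
    """
    result = subprocess.run(
        ["qpdf", "--show-encryption", path],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def parse_encryption_report(report):
    """Collect the 'key: value' and 'key = value' lines of a report in a dict."""
    info = {}
    for line in report.splitlines():
        match = REPORT_LINE.match(line.strip())
        if match:
            info[match.group("key")] = match.group("value").strip()
    return info


def restricted_actions(info):
    """Return the actions that the report marks as 'not allowed'."""
    return sorted(key for key, value in info.items() if value == "not allowed")


# Illustrative report text (an assumption, not captured from a real run).
SAMPLE_REPORT = """\
R = 3
P = -3392
User password =
extract for accessibility: allowed
print low resolution: not allowed
print high resolution: not allowed
modify anything: not allowed
"""
```

Calling `restricted_actions(parse_encryption_report(show_encryption("encryption_noprinting.pdf")))` would then list the restricted actions for one of the example files above, provided qpdf is installed and on the path.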

Similarly, the pdffonts tool that is part of xpdf is useful for checking whether fonts in a PDF are embedded:
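The font check can be scripted in the same way. The sketch below assumes the usual tabular pdffonts layout, in which each font row ends with the `emb`, `sub` and `uni` flags followed by the object ID. It also assumes that font names contain no spaces. The function names and the sample table are hypothetical, so check the layout against the pdffonts version you actually use.

```python
import re
import subprocess

# A font row is assumed to end with: emb sub uni <object number> <generation>.
ROW_TAIL = re.compile(r"\s(yes|no)\s+(yes|no)\s+(yes|no)\s+(\d+)\s+(\d+)\s*$")


def run_pdffonts(path):
    """Run pdffonts on a file and return its text output."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def non_embedded_fonts(pdffonts_output):
    """Return the names of fonts whose 'emb' flag is 'no'."""
    missing = []
    for line in pdffonts_output.splitlines():
        match = ROW_TAIL.search(line)
        # Header and separator lines do not match and are skipped.
        if match and match.group(1) == "no":
            missing.append(line.split()[0])
    return missing


# Illustrative table (an assumption, not captured from a real run).
SAMPLE_TABLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEF+TimesNewRoman                 TrueType          WinAnsi          yes yes no      12  0
CalistoMT                            TrueType          WinAnsi          no  no  no       8  0
"""
```

An empty list from `non_embedded_fonts(run_pdffonts("some.pdf"))` would suggest that all fonts are embedded, subject to the layout assumptions above.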

As the number of features you want to check for increases, this approach becomes increasingly cumbersome: most tools only cover some features, so you rapidly end up having to deal with a multitude of software tools and output formats. So you may ask yourself if there’s a way to do this more efficiently.

PDF/A validation

This is where PDF/A enters the picture. The PDF/A standards are nothing more than a set of profiles that impose some restrictions on a PDF, ruling out features that are not well-suited to long-term accessibility. Unsurprisingly, these include the very same features that we are interested in here, such as encryption, non-embedded fonts, multimedia content, and so on. Several tools exist that compare a PDF against PDF/A and report any deviations. These PDF/A validators are typically used to verify “true” PDF/A files; however, they can also be used to detect user-specified risky features in regular PDFs.
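To make the idea of “user-specified risky features” a bit more concrete, here is a rough Python sketch of filtering a validator’s findings against an institutional policy. Everything in it is hypothetical: the `Finding` record, the rule labels, the message texts and the keyword lists are invented for illustration and do not correspond to the output of any actual validator. Matching on keywords is also a deliberately crude stand-in for whatever identifiers a real tool reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One reported deviation from PDF/A (hypothetical structure)."""
    rule: str
    message: str


# Hypothetical policy: risk name -> keywords that must all occur in a message.
POLICY_KEYWORDS = {
    "encryption": ("encrypt",),
    "non-embedded fonts": ("font", "embedded"),
}


def risks_of_interest(findings, policy):
    """Group the findings that match the policy; all other findings are ignored."""
    hits = {risk: [] for risk in policy}
    for finding in findings:
        text = finding.message.lower()
        for risk, keywords in policy.items():
            if all(keyword in text for keyword in keywords):
                hits[risk].append(finding)
    return hits


# Invented example findings, not real validator output.
SAMPLE_FINDINGS = [
    Finding("hypothetical-1", "Document is encrypted"),
    Finding("hypothetical-2", "Font CalistoMT is not embedded"),
    Finding("hypothetical-3", "XMP metadata is missing"),
]

if __name__ == "__main__":
    for risk, matched in risks_of_interest(SAMPLE_FINDINGS, POLICY_KEYWORDS).items():
        print(risk, [finding.rule for finding in matched])
```

The point of the design is that the validator reports every deviation from PDF/A, while the policy decides which of those deviations actually count as a risk for a given collection.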

The professional version of Adobe Acrobat has a PDF/A validator built into its Preflight tool. After opening a PDF in Acrobat, it allows you to verify its compliance with a number of profiles, including PDF/A (currently PDF/A-1, -2 and -3):

This results in output as shown here:

This PDF [2] (which isn’t a PDF/A) violates the PDF/A-1a profile in several ways, but supposing we’re only interested in encryption and non-embedded fonts, the relevant information can be extracted from Preflight’s output quite easily. This example demonstrates the overall feasibility of identifying preservation risks with a PDF/A validator, but this interactive approach is not scalable to situations where you need to verify large volumes of PDFs. Scaling up will be the main focus of the second part of this blog series.

  1. Acrobat’s Preferences do include some options for configuring behavior with multimedia content (explained here), but the Preferred Media Player dropdown list only included Windows Media Player and Adobe Flash Player. Neither of these supports Quicktime. VLC Media player does support Quicktime, but it is not included in the dropdown list, leaving me no way to configure it. Bummer!

  2. At the time of writing the Acrobat Engineering site was down, and this particular PDF is not included in any Wayback crawls either. Bummer again!

5 thoughts on “Why PDF/A validation matters, even if you don’t have PDF/A”

  1. Pingback: Why PDF/A validation matters – Part 2 | KB Research

    • Hmm, interesting. But does it actually play after clicking on it (your screenshot only shows the initial view after opening, and this looks identical to what I’m getting myself)? BTW I also tried this with Acrobat Reader XI, for which I’m getting identical results.

      • Yes, it played – the colour bar scrolled horizontally in an old-school-demo fashion (I’ll tweet the screenshots). I’m using Adobe Reader X 10.1.14, but I think it might be being on a Mac that makes the difference?

  2. It seems that (as illustrated by @anjacks0n’s success with one of your example objects) the main issue you’re describing is caused by not having an appropriate interaction environment. This can (I think we clearly agree) be solved by finding and maintaining that interaction environment.
    It does highlight the question of how we identify what that interaction environment might be if we don’t get that information from the content creator (which ideally we should).
    Validation seems useful for identifying what environment you might need. For example you could map “risks” to commonly related interaction environments to try to identify what environment you should be using to interact with the “risky” content.
