Jpylyzer software shortlisted for digital preservation award

Today the UK-based Digital Preservation Coalition announced the finalists in the running for the Digital Preservation Awards 2014. This award was established in 2004 to draw attention to initiatives that make an important contribution to keeping digital heritage accessible.

In the Research and Innovation category, a software tool developed at the KB by the Research department has been nominated: jpylyzer. Jpylyzer offers a simple way to check whether JP2 (JPEG 2000) image files are technically sound. Within the KB the tool is used, among other things, for the quality control of digitised books, newspapers and periodicals. Jpylyzer is also used by various international peer institutions.

Jpylyzer was partly developed within the European SCAPE project, in which the KB is a project partner. The final winners will be announced on 17 November.

More information about the nomination of jpylyzer can be found on the website of the Digital Preservation Coalition:

http://www.dpconline.org/newsroom/latest-news/1271-dpa-2014finalists

The following article is of interest to anyone who wants to know more about jpylyzer, and why we actually need such a tool:

http://www.kb.nl/research/kb-onderzoek-het-internationale-succes-van-de-jpylyzer-en-wat-is-dat-eigenlijk-voor-ding

Finally, here is the jpylyzer homepage:

http://openplanets.github.io/jpylyzer/

Identification of PDF preservation risks: the sequel

Author: Johan van der Knijff
Originally posted on: http://www.openplanetsfoundation.org/blogs/2013-07-25-identification-pdf-preservation-risks-sequel

Last winter I made a first attempt at identifying preservation risks in PDF files using the Apache Preflight PDF/A validator. This work was later followed up by others in two SPRUCE hackathons in Leeds (see this blog post by Peter Cliff) and London (described here). Much of this later work tacitly assumes that Apache Preflight is able to successfully identify features in PDF that are a potential risk for long-term access. This Wiki page on uses and abuses of Preflight (created as part of the final SPRUCE hackathon) even goes as far as stating that “Preflight is thorough and unforgiving (as it should be)”. But what evidence do we have to support such claims? The only evidence that I’m aware of is the set of results obtained from a small test corpus of custom-created PDFs. Each PDF in this corpus was created in such a way that it includes only one specific feature that is a potential preservation risk (e.g. encryption, non-embedded fonts, and so on). However, PDFs that exist ‘in the wild’ are usually more complex. Also, the PDF specification often allows you to implement similar features in subtly different ways. For these reasons, it is essential to obtain additional evidence of Preflight’s ability to detect ‘risky’ features before relying on this tool in any operational setting.

Adobe Acrobat Engineering test files

Shortly after I completed my initial tests, Adobe released the Acrobat Engineering website, which contains a large volume of test documents that are used by Adobe for testing their products. Although the test documents are not fully annotated, they are subdivided into categories such as Multimedia & 3D Tests and Font tests. This makes these files particularly useful for additional tests on Preflight.

Methodology

The general methodology I used to analyse these files is identical to what I did in my 2012 report: first, each PDF was validated using Apache Preflight. As a control I also validated the PDFs with the Preflight component of Adobe Acrobat, using the PDF/A-1b profile. The table below lists the software versions used:

Software | Version
Apache Preflight | 2.0.0
Adobe Acrobat | 10.14
Acrobat Preflight | 10.1.3 (090)

Re-analysis of PDF Cabinet of Horrors corpus

Because the current analysis is based on a more recent version of Apache Preflight than the one used in the 2012 report (which was 1.8.0), I first re-ran the analysis of the PDFs in the PDF Cabinet of Horrors corpus. The main results are reproduced here. The main differences with respect to that earlier version are:

  1. Apache Preflight now has an option to produce output in XML format (as suggested by William Palmer following the Leeds SPRUCE hackathon)
  2. Better reporting of non-embedded fonts (see also this issue)
  3. Unlike the earlier version, Preflight 2.0.0 does not give any meaningful output in case of encrypted and password-protected PDFs! This is probably a bug, for which I submitted a report here.

Analysis Acrobat Engineering PDFs

Since the Acrobat Engineering site hosts a lot of PDFs, I only focused on a limited subset for the current analysis:

  1. all files in the General section of the Font Testing category;
  2. all files in the Classic Multimedia section of the Multimedia & 3D Tests category.

The results are summarized in two tables (see next sections). For each analysed PDF, the tables list:

  • the error(s) reported by Adobe Acrobat Preflight;
  • the error code(s) reported by Apache Preflight (see Preflight’s source code for a listing of all possible error codes);
  • the error description(s) reported by Apache Preflight in the details output element.

For the sake of readability, the tables only list those error messages/codes that are directly related to font problems, multimedia, encryption and JavaScript. The full output for all tested files can be found here.
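As a rough illustration of how the error codes and descriptions listed above could be pulled out of Preflight’s XML output, here is a minimal Python sketch. The element names (error, code, details) and the sample report are assumptions made for illustration; only the details element is mentioned in this post, and none of this has been checked against real Preflight 2.0.0 output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample report; the layout is an assumption, not verified output.
SAMPLE_REPORT = """<preflight name="notembedded_pm65.pdf">
  <isValid type="PDF/A1-b">false</isValid>
  <errors count="2">
    <error count="1">
      <code>3.1.3</code>
      <details>Invalid Font definition, FontFile entry is missing from FontDescriptor for TimesNewRoman</details>
    </error>
    <error count="1">
      <code>1.2.1</code>
      <details>Body Syntax error</details>
    </error>
  </errors>
</preflight>"""


def extract_errors(report_xml):
    """Return a list of (code, details) tuples from a Preflight-style XML report."""
    root = ET.fromstring(report_xml)
    errors = []
    for error in root.iter("error"):
        # findtext returns None for a missing child, so fall back to an empty string
        code = (error.findtext("code") or "").strip()
        details = (error.findtext("details") or "").strip()
        errors.append((code, details))
    return errors


if __name__ == "__main__":
    for code, details in extract_errors(SAMPLE_REPORT):
        print(code, "-", details)
```

If the real report uses different element names, only the strings passed to iter and findtext would need to change.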

Fonts

The table below summarizes the results of the PDFs in the Font Testing category:

Test file | Acrobat Preflight error(s) | Apache Preflight error code(s) | Apache Preflight details
EmbeddedCmap.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font | 3.1.3 | Invalid Font definition, FontFile entry is missing from FontDescriptor for HeiseiKakuGo-W5
TEXT.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; TrueType font has differences to standard encodings but is not a symbolic font; Wrong encoding for non-symbolic TrueType font | 3.1.5; 3.1.1; 3.1.2; 3.1.3; 3.2.4 | Invalid Font definition, The Encoding is invalid for the NonSymbolic TTF; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial,Italic (repeated for other fonts); Font damaged, The CharProcs references an element which can’t be read
Type3_WWW-HTML.PDF | - | 3.1.6 | Invalid Font definition, The character with CID "58" should have a width equals to 15.56599 (repeated for other fonts)
embedded_fonts.pdf | Font not embedded (and text rendering mode not 3); Type 2 CID font: CIDToGIDMap invalid or missing | 3.1.9; 3.1.11 | Invalid Font definition; Invalid Font definition, The CIDSet entry is missing for the Composite Subset
embedded_pm65.pdf | - | 3.1.6 | Invalid Font definition, Width of the character "110" in the font program "HKPLIB+AdobeCorpID-MyriadRg" is inconsistent with the width in the PDF dictionary (repeated for other font)
notembedded_pm65.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font | 3.1.3 | Invalid Font definition, FontFile entry is missing from FontDescriptor for TimesNewRoman (repeated for other fonts)
printtestfont_nonopt.pdf* | ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type | exceptionThrown | Preflight throws exception, exits with message ‘Invalid ICC Profile Data’
printtestfont_opt.pdf* | ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type | exceptionThrown | Preflight throws exception, exits with message ‘Invalid ICC Profile Data’
substitution_fonts.pdf | Font not embedded (and text rendering mode not 3) | 3.1.1; 3.1.2; 3.1.3 | Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Souvenir-Light (repeated for other fonts)
text_images_pdf1.2.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Width information for rendered glyphs is inconsistent | 3.1.1; 3.1.2 | Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor

* As the documents marked with an asterisk don’t appear to have any font-related issues, it’s unclear why they are in the Font Testing category. The errors related to ICC profiles are reproduced here because of their relevance to the Apache Preflight exception.

General observations

A comparison of the results of Acrobat Preflight and Apache Preflight shows that Apache Preflight’s output may vary in case of non-embedded fonts. In most cases it produces error code 3.1.3 (as was the case with the PDF Cabinet of Horrors dataset), but other errors in the 3.1.x range may occur as well. The 3.1.6 “character width” error is something that was also encountered during the London SPRUCE Hackathon, and according to the information here this is most likely the result of the PDF/A specification not being particularly clear. So, this looks like a non-serious error that can be safely ignored in most cases.
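To make the observation above concrete, here is a minimal Python sketch that flags font-related error codes while skipping the character width error. Treating every code that starts with “3.” as font-related is an assumption based on the codes in the table above, not on Preflight’s documentation, and the function names are hypothetical.

```python
# Code that the observations above suggest can usually be ignored
# (character width mismatch).
IGNORABLE_FONT_CODES = {"3.1.6"}


def is_font_error(code):
    """Assumption: Preflight codes whose first component is 3 are font-related."""
    return code.split(".")[0] == "3"


def font_risk_codes(codes):
    """Return the font-related codes worth a closer look, in numeric order."""
    risky = {c for c in codes if is_font_error(c) and c not in IGNORABLE_FONT_CODES}
    # Sort on the numeric components so that 3.1.9 comes before 3.1.11
    return sorted(risky, key=lambda c: tuple(int(part) for part in c.split(".")))


if __name__ == "__main__":
    # Codes reported for TEXT.pdf in the table above
    print(font_risk_codes(["3.1.5", "3.1.1", "3.1.2", "3.1.3", "3.2.4"]))
    # Codes reported for embedded_pm65.pdf (width error only)
    print(font_risk_codes(["3.1.6"]))
```

A file for which the function returns an empty list may still have other problems; the sketch only looks at font codes.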

Multimedia

The next table shows the results for the Multimedia & 3D Tests category:

Test file | Acrobat Preflight error(s) | Apache Preflight error code(s) | Apache Preflight details
20020402_CALOS.pdf | - | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Disney-Flash.pdf | Contains action of type JavaScript; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Form field does not have appearance dict; Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia-related errors; Preflight did report syntax and body syntax error
Jpeg_linked.pdf | Document is encrypted; Encrypt key present in file trailer; Named action with a value other than standard page navigation used; Incorrect annotation type used (not allowed in PDF/A); Font not embedded (and text rendering mode not 3) | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
MultiMedia_Acro6.pdf | Document is encrypted; EmbeddedFiles entry in Names dictionary; Encrypt key present in file trailer; PDF contains EF (embedded file) entry; Incorrect annotation type used (not allowed in PDF/A) | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
MusicalScore.pdf | CIDset in subset font is incomplete; CIDset in subset font missing; Contains action of type JavaScript; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry; Type 2 CID font: CIDToGIDMap invalid or missing | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
SVG-AnnotAnim.pdf | Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 5.2.1; 1.2.9 | Forbidden field in an annotation definition, The subtype isn’t authorized : SVG; Body Syntax error, EmbeddedFile entry is present in a FileSpecification dictionary
SVG.pdf | Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
ScriptEvents.pdf | Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Service Form_media.pdf | Contains action of type JavaScript; Contains action of type ResetForm; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Incorrect annotation type used (not allowed in PDF/A); Named action with a value other than standard page navigation used; PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Trophy.pdf | Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
VolvoS40V50-Full.pdf | Preflight exits with: “An error occurred while parsing a contents stream. Unable to analyze the PDF file” | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
gXsummer2004-stream.pdf | File cannot be loaded in Acrobat (damaged file) | 1.0; 1.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
phlmapbeta7.pdf | Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
us_population.pdf | Preflight exits with: “An error occurred while parsing a contents stream. Unable to analyze the PDF file” | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
movie.pdf | Incorrect annotation type used (not allowed in PDF/A) | 5.2.1 | Forbidden field in an annotation definition, The subtype isn’t authorized : Movie
movie_down1.pdf | Incorrect annotation type used (not allowed in PDF/A) | 5.2.1 | Forbidden field in an annotation definition, The subtype isn’t authorized : Movie
remotemovieurl.pdf | Font not embedded (and text rendering mode not 3); Incorrect annotation type used (not allowed in PDF/A) | 5.2.1; 3.1.1; 3.1.2; 3.1.3 | Forbidden field in an annotation definition, The subtype isn’t authorized : Movie; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial

General observations

The results from the Multimedia PDFs are interesting for several reasons. First of all, these files include a wide variety of ‘risky’ features, such as multimedia content, embedded files, JavaScript, non-embedded fonts and encryption. These were successfully identified by Acrobat Preflight in most cases. Apache Preflight, on the other hand, only reported non-specific and fairly uninformative errors (1.0 + 1.2.1) for 12 out of 17 files. Even though Preflight was correct in establishing that these files were not valid PDF/A-1b, it wasn’t able to drill down to the level of specific features for the majority of these files.
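The count mentioned above can be reproduced from the error codes in the Multimedia table. The following Python sketch is only an illustration: the codes are copied from the table, while the helper name and the choice to treat 1.0 and 1.2.1 as the “non-specific” set are assumptions that follow this post’s wording.

```python
# Codes this post describes as non-specific (syntax and body syntax errors)
NON_SPECIFIC_CODES = {"1.0", "1.2.1"}

# Apache Preflight error codes per file, copied from the Multimedia table
RESULTS = {
    "20020402_CALOS.pdf": ["1.0", "1.2.1"],
    "Disney-Flash.pdf": ["1.0", "1.2.1"],
    "Jpeg_linked.pdf": ["1.0", "1.2.1"],
    "MultiMedia_Acro6.pdf": ["1.0", "1.2.1"],
    "MusicalScore.pdf": ["1.0", "1.2.1"],
    "SVG-AnnotAnim.pdf": ["5.2.1", "1.2.9"],
    "SVG.pdf": ["1.0", "1.2.1"],
    "ScriptEvents.pdf": ["1.0", "1.2.1"],
    "Service Form_media.pdf": ["1.0", "1.2.1"],
    "Trophy.pdf": ["1.0", "1.2.1"],
    "VolvoS40V50-Full.pdf": ["1.0", "1.2.1"],
    "gXsummer2004-stream.pdf": ["1.0", "1.1"],
    "phlmapbeta7.pdf": ["1.0", "1.2.1"],
    "us_population.pdf": ["1.0", "1.2.1"],
    "movie.pdf": ["5.2.1"],
    "movie_down1.pdf": ["5.2.1"],
    "remotemovieurl.pdf": ["5.2.1", "3.1.1", "3.1.2", "3.1.3"],
}


def only_non_specific(codes):
    """True if at least one error was reported and all of them are non-specific."""
    codes = set(codes)
    return bool(codes) and codes <= NON_SPECIFIC_CODES


uninformative = sorted(name for name, codes in RESULTS.items() if only_non_specific(codes))

if __name__ == "__main__":
    print(len(uninformative), "out of", len(RESULTS), "files")
```

In a profiling workflow, files flagged this way would need a second look with another tool, since the validator output alone says nothing about which risky features they contain.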

Summary and conclusions

The re-analysis of the PDF Cabinet of Horrors corpus and the subsequent analysis of a subset of the Adobe Acrobat Engineering PDFs show a number of things. First, Apache Preflight 2.0.0 does not properly identify encryption and password protection. This looks like a bug that is probably easily fixed. Second, the analysis of the Font Testing PDFs from the Acrobat Engineering site revealed that non-embedded fonts may result in a variety of error codes in Apache Preflight (assuming here that the Acrobat Preflight results are accurate). So, when using Apache Preflight to check font embedding, it’s probably a good idea to treat all font-related errors (perhaps with the exception of character width errors) as a potential risk. Third, the more complex PDFs in the Multimedia category proved to be quite challenging for Apache Preflight: for most files here, it was not able to identify specific features such as multimedia content, embedded files, JavaScript and non-embedded fonts. This is not necessarily a problem if Apache Preflight is used for its intended purpose: verifying whether a PDF conforms to PDF/A-1. However, it does rather limit its use as a tool for profiling heterogeneous PDF collections for specific preservation risks at this stage. This may well change with future versions; in fact the specificity of Preflight’s validation output has already improved considerably since version 1.8.0. In the meantime it’s important to keep expectations about the tool’s capabilities realistic, in order to avoid unintended misuse.


EPUB for archival preservation: an update

Author: Johan van der Knijff
Originally posted on: http://www.openplanetsfoundation.org/blogs/2013-05-23-epub-archival-preservation-update

Last year (2012) the KB released a report on the suitability of the EPUB format for archival preservation. A substantial number of EPUB-related developments have happened since then, and as a result some of the report’s findings and conclusions have become outdated. This applies in particular to the observations on EPUB 3, and the support of EPUB by characterisation tools. This blog post provides an update to those findings. It addresses the following topics in particular:

  • Use of EPUB in scholarly publishing
  • Adoption and use of EPUB 3
  • EPUB 3 reader support
  • Support of EPUB by characterisation tools

In the following sections I will briefly summarise the main developments in each of these areas, after which I will wrap things up in a concluding section.

Use of EPUB in scholarly publishing

Although scholarly publishing is still dominated by PDF, the use of EPUB in this sector is on the rise. This blog post by Todd Carpenter gives the following examples:

At the time of writing, the above publishers are all using EPUB 2.

Adoption and use of EPUB 3

Over the last year a number of organisations that represent the publishing industry have expressed their support of EPUB 3. The Book Industry Study Group (BISG) is a trade association for companies in the publishing industry. Last year (August 2012) BISG released a policy statement in which it endorsed “EPUB 3 as the accepted and preferred standard for representing, packaging, and encoding structured and semantically enhanced Web content — including XHTML, CSS, SVG, images, and other resources — for distribution in a single-file format”. Early this year (March 2013) the International Publishers Association (IPA) issued a press release that also endorsed EPUB 3 as a “preferred standard format for representing HTML and other web content for distribution as single-file publications”. IPA represents over 60 national publishing organisations from more than 50 countries. Finally, the European Booksellers Federation recently released a report on the interoperability of eBook formats. Its authors did a comparison of the features and functionality provided by EPUB 3, Amazon’s KF8 (Kindle) and Apple’s e-book formats. They concluded that EPUB 3 “clearly covers the superset of the expressive abilities of all the formats”, and that there is “no technical or functional reason not to use and establish EPUB 3 as an/the interoperable (open) ebook format standard”. This all suggests that EPUB 3 is widely supported by the publishing industry.

Having said that, the actual use of EPUB 3 is still limited at this stage, even though some publishers have already started using the format. Earlier this year technical publisher O’Reilly started releasing all their new eBook bundles in EPUB 3 format. The announcement mentions that their backlist will be updated as well. Interestingly, they decided to create “hybrid” EPUBs that are backward-compatible with EPUB 2. In November 2012 publisher Hachette also announced the launch of their EPUB 3 program.

EPUB 3 reader support

At this time reader support for EPUB 3 is still limited, but there have been a number of significant developments since the second half of 2012:

Support of EPUB by characterisation tools

The 2012 report concluded that EPUB was not optimally supported by characterisation tools. This situation has improved quite a lot since that time.

Identification

EPUB is now included in PRONOM, and has a corresponding DROID signature. This means that Fido should now be able to identify the format as well. On a side note, PRONOM doesn’t differentiate between EPUB 2 and 3, and it appears that the current record (which is only an outline record anyway) either combines both versions, or only refers to EPUB 2. PRONOM should probably be more specific on this.
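Until identification tools make the distinction, the EPUB version can be read from the file itself. The Python sketch below is an illustration only, and it is not how DROID or Fido work (those rely on signatures). It relies on two features of the EPUB container format: the mimetype entry, and the version attribute on the package document that META-INF/container.xml points to. The function names and the in-memory sample archive are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"


def epub_version(source):
    """Return the package version (e.g. '2.0' or '3.0'), or None if this is not an EPUB."""
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
        if "mimetype" not in names or "META-INF/container.xml" not in names:
            return None
        mimetype = archive.read("mimetype").decode("ascii", "replace").strip()
        if mimetype != "application/epub+zip":
            return None
        # container.xml points to the package document (the .opf file)
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find(f".//{CONTAINER_NS}rootfile")
        if rootfile is None or rootfile.get("full-path") not in names:
            return None
        package = ET.fromstring(archive.read(rootfile.get("full-path")))
        return package.get("version")


def build_sample(version):
    """Build a minimal, hypothetical EPUB-like archive in memory for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr(
            "META-INF/container.xml",
            '<container version="1.0" '
            'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
            '<rootfiles><rootfile full-path="content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles></container>',
        )
        archive.writestr(
            "content.opf",
            f'<package xmlns="http://www.idpf.org/2007/opf" version="{version}"/>',
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(epub_version(build_sample("3.0")))
```

The sample archive is not a valid EPUB (it has no content documents); it contains only what the version lookup needs.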

Validation and feature extraction

The 2012 report included tests of two EPUB validator tools: epubcheck and flightcrew. While testing epubcheck in 2012, I wasn’t entirely happy with the rather unstructured output that the tool produced. Also, I couldn’t find any tool that was capable of extracting technical meta-information about an EPUB, like the presence of encryption or other digital rights management technology (feature extraction). Happily, starting with version 3.0 epubcheck is capable of extracting this kind of information. Moreover, it added an option to report its output in structured XML format that follows the JHOVE schema. I haven’t done any elaborate testing, but a quick run on some of these EPUB 3 samples showed that epubcheck was able to identify font obfuscation, in which case a property hasEncryption (value true) is reported. I wasn’t able to find any EPUB files with DRM, so I cannot confirm if epubcheck detects this as well.
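As a small illustration of how such a property could be picked up from the structured output, here is a Python sketch that looks up a named property in a JHOVE-style XML report. The sample document is hand-written and hypothetical. It assumes the usual JHOVE layout of property elements with name and value children, and has not been checked against actual epubcheck output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the style of a JHOVE report
SAMPLE_REPORT = """<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove">
  <repInfo uri="sample.epub">
    <properties>
      <property>
        <name>hasEncryption</name>
        <values arity="Scalar" type="Boolean">
          <value>true</value>
        </values>
      </property>
    </properties>
  </repInfo>
</jhove>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def get_property(report_xml, wanted):
    """Return the list of values of the first property named `wanted`, or None."""
    root = ET.fromstring(report_xml)
    for element in root.iter():
        if local_name(element.tag) != "property":
            continue
        name = None
        values = []
        for child in element.iter():
            if local_name(child.tag) == "name" and name is None:
                name = (child.text or "").strip()
            elif local_name(child.tag) == "value":
                values.append((child.text or "").strip())
        if name == wanted:
            return values
    return None


if __name__ == "__main__":
    print(get_property(SAMPLE_REPORT, "hasEncryption"))
```

Matching on local names keeps the sketch independent of the exact namespace URI used in the report.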

Flightcrew

As for flightcrew, no new versions of that tool have been released since August 2011, and it looks like it is not under any active development.

Discussion and conclusions

Since the release of the KB report on the suitability of EPUB for archival preservation the EPUB landscape has changed rather a lot. First, a number of academic publishers have started to offer scholarly content in this format. Although EPUB 3 is still in its early stages, various organisations representing the publishing industry have explicitly expressed their support of EPUB 3. A number of software applications now exist that are able to read the format, and work on a high-performance open source EPUB 3 Software Development Kit is backed by major players in the digital publishing industry (including e-reader manufacturers such as Kobo and Sony). EPUB support by characterisation tools has improved as well, mostly thanks to a number of recent enhancements of epubcheck. So, overall, EPUB’s credentials as a preservation format appear to have improved quite a bit over the last year. In the case of EPUB 3 it’s still too early to say anything about actual adoption, but the conditions for adoption to happen look pretty favourable. This is something I will get back to in my next update, perhaps in another year from now.
