Jpylyzer software finalist for digital preservation award

Today the British Digital Preservation Coalition announced the finalists in the running for the Digital Preservation Awards 2014. The prize was established in 2004 to draw attention to initiatives that make an important contribution to keeping digital heritage accessible.

In the Research and Innovation category, a software tool developed at the KB by the Research department has been nominated: jpylyzer. Jpylyzer offers a simple way of checking whether JP2 (JPEG 2000) image files are technically sound. Within the KB the tool is used, among other things, in the quality control of digitised books, newspapers and journals. Jpylyzer is also used by various international sister institutions.
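
For readers who want to try it: a minimal validation check might look like the sketch below. This is a sketch only, assuming the jpylyzer command-line tool is installed and on the PATH, and that its XML output reports validity in an isValidJP2 element, as jpylyzer 1.x does.

    import subprocess
    import xml.etree.ElementTree as ET

    def jp2_is_valid(path):
        """Run jpylyzer on one file and report whether it is valid JP2.

        Sketch: assumes the 'jpylyzer' CLI is installed and that its XML
        output contains an 'isValidJP2' element (as in jpylyzer 1.x).
        """
        xml_out = subprocess.run(["jpylyzer", path],
                                 capture_output=True, check=True).stdout
        root = ET.fromstring(xml_out)
        # Match the tag regardless of XML namespace, which may vary by version.
        for elem in root.iter():
            if elem.tag.endswith("isValidJP2"):
                return (elem.text or "").strip() == "True"
        return False

    if __name__ == "__main__":
        print(jp2_is_valid("example.jp2"))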

Jpylyzer was developed partly within the European SCAPE project, in which the KB is a project partner. The winners will be announced on 17 November.

More information about jpylyzer's nomination can be found on the website of the Digital Preservation Coalition:

http://www.dpconline.org/newsroom/latest-news/1271-dpa-2014finalists

The following article is of interest to anyone who wants to know more about jpylyzer, and why we actually need such a tool:

http://www.kb.nl/research/kb-onderzoek-het-internationale-succes-van-de-jpylyzer-en-wat-is-dat-eigenlijk-voor-ding

Finally, here is the jpylyzer homepage:

http://openplanets.github.io/jpylyzer/

Too early for audits?

Author: Barbara Sierman
Originally posted on: http://digitalpreservation.nl/seeds/too-early-for-audits/

I never realized that the procedure for arriving at an ISO standard could take several years, but this is true for two standards related to the audit and certification of trustworthy digital repositories. Although we have had the ISO 16363 standard on Audit and Certification since 2012, official audits against this standard cannot take place until the related standard Requirements for bodies providing Audit and Certification (ISO 16919), which regulates the appointment of auditors, is approved. This standard, like ISO 16363 compiled by the PTAB group in which I participate, was already finished a few years ago, but the ISO review procedure takes a long time, especially when revisions need to be made. The latest prediction is that ISO 16919 will be approved this summer (2014), after which national standardization bodies can train the future (official) auditors. How many organizations will then apply for official certification against the ISO standard is not yet clear, but if you're planning to do so, it might be worthwhile to have a look at the recent report of the European 4C project, Quality and trustworthiness as economic determinants in digital curation.

The 4C project (Collaboration to Clarify the Cost of Curation) is looking at the costs and benefits of digital curation. Trustworthiness is one of the 15 "economic determinants" the project distinguishes. As quality is seen as a precondition for trustworthiness, the 4C project focuses in this report on the costs and benefits of "standards based quality assurance" and looks at the five current standards related to audit and certification: DSA, DRAMBORA, DIN 31644 of the German nestor group, TRAC and TDR. The first part of the report gives an overview of the current status of these standards.

Woven into this overview are some interesting thoughts about audit and certification. It all starts with the Open Archival Information System (OAIS) Reference Model. The report suggests that the OAIS model is there to help organisations create processes and workflows (page 18), but I think this does not do justice to the OAIS model. If one really reads the OAIS standard from cover to cover (and shouldn't we all do that regularly?), one will recognize that the OAIS model expects a repository to do more than design workflows and processes. Instead, a repository needs to develop a vision of how to do digital preservation, and the OAIS model gives directions. But the OAIS model is not a book of recipes, and we are all trying to find the best way to translate OAIS into practice.

It is this lack of evidence as to which approach will deliver the best-preserved digital objects that made the authors of the report wonder whether an audit taking place now might lead to a risky outcome (either too much confidence in the repository or too little). They use the phrase "dispositional trust": "It is the trustor's belief that it will have a certain goal B in the future and, whenever it will have such a goal and certain conditions obtain, the trustee will perform A and thereby will ensure B." (p. 22). We expect that our actions will lead to a good result in the future, but this is uncertain as we don't have an agreed common approach with evidence that it will be successful. This is a good point to keep in mind, I think, as is the fact that many more standards apply to digital preservation than those mentioned above: security standards, record management standards and standards related to the creation of the digital object, to name just a few.

Based on publicly available audit reports (mainly TRAC and DSA, and test audits on TDR) the report describes the main benefits of audits for organisations as

  • to improve the work processes,
  • to meet a contractual obligation and
  • to provide a publicly understandable statement of quality and reliability (p. 29).

These benefits are rather vague, but one could argue that such vague notions might lead to more tangible benefits in the future, like more (paying) depositors, more funding, etc. By the way, one of the benefits recognized in the test audits was the process of peer review itself and the opportunity for repository management to discuss daily practices with knowledgeable people.

The authors also tried to get more information about the costs related to audit and certification, but had to admit in the end that there is currently hardly any information about the actual costs of performing an audit and/or getting certified (why they mention financial figures of two specific audits on page 23 without any context is unclear to me). They therefore rely mainly on information collected during the test audits performed by the APARSEN project and on the taxonomy of costs that was created there. For cost figures we will need to wait for more audits, and for repositories willing to publish all their costs in relation to this exercise.

Reading between the lines, one could easily conclude that it is not advisable to perform audits yet. But especially now that the digital preservation community is working hard to discover the best way to protect digital material, it is important for any repository to protect its investments and to avoid current funders (often taxpayers) backing off because of costly mistakes. The APARSEN trial audits were performed by experts in the field, and the audited organizations (and these experts) found the discussions and recommendations valuable. As standards evolve and best practices and tools are developed, a regular audit by experts in the field can certainly help organizations minimize the risk to the material. These expert auditors need to be aware of the current state of digital preservation: the uncertainties, the risks, the lack of tools and the best practices that do exist. The audit results, once published, will help the community understand the issues encountered by the audited organizations.

As I noticed while reading a lot of preservation policies for SCAPE, many organisations want to get certified and put this aim in their policies. Publishers want to have their data and publications in trustworthy, certified repositories. But all stakeholders (funders, auditors, repository management) should realise that the outcomes of an audit should be seen in the light of the current state of digital preservation: that of pioneering.

Breaking down walls in digital preservation (Part 2)

Here is part 2 of the report on the digital preservation seminar that identified ways to break down the walls between research & development and daily operations in libraries and archives (continued from Breaking down walls in digital preservation, part 1). The seminar was organised by SCAPE and the Open Planets Foundation in The Hague on 2 April 2014. – Report & photographs by Inge Angevaare, visualisations by Elco van Staveren


Ross King of the Austrian Institute of Technology (and of OPF) kicking off the afternoon session by singlehandedly attacking the wall between daily operations and R&D

Experts meet managers

Ross King of the Austrian Institute of Technology described the features of the (technical) SCAPE project, which intends to help institutions build preservation environments that are scalable: to bigger files, to more heterogeneous files, and to larger volumes of files to be processed. King was the one who identified the wall between daily operations in the digital library and research & development in digital preservation:


The Wall between Production & R&D as identified by Ross King

Zoltan Szatucsek of the Hungarian National Archives shared his experiences with one of the SCAPE tools from a manager's point of view: ‘Even trying out the Matchbox tool from the SCAPE project was too expensive for us.’ King admitted that the Matchbox case had not yet been entirely successful. ‘But our goal remains to deliver tools that can be downloaded and used in practice.’

Maureen Pennock of the British Library sketched her organisation’s journey to embed digital preservation [link to slides to follow]. Her own digital preservation department (now at 6 FTE) was moved around a few times before it was nested in the Collection Care department, which was then merged with Collection Management. ‘We are now where we should be: in the middle of the Collections department and right next to the Document Processing department. And we work closely with IT, strategy development, procurement/licensing and collection security and risk management.’


The British Library’s strategy calls for further embedding of digital preservation, without taking the formal step of certification

Pennock elaborated on the strategic priorities mentioned above (see slides) by noting that the British Library has chosen not to strive for formal certification within the European Framework (unlike, e.g., the Dutch KB). Instead, the BL intends to hold bi-annual audits to measure progress. The BL intends to ensure that ‘all staff working with digital content understand preservation issues associated with it.’ Questioned by the Dutch KB’s Hildelies Balk, Pennock confirmed that the teaching materials the BL is preparing could well be shared with the wider digital preservation community. Here is Pennock’s concluding comment:


Digital preservation is like a bicycle – one size doesn’t fit everyone … but everybody still recognises the bicycle

Marcin Werla from the Poznań Supercomputing and Networking Centre (PSNC) provided an overview of the infrastructure PSNC provides for research institutions and for cultural heritage institutions. It is a distributed network based on the fast (20 Gbit) Polish optical network:


The PSNC network includes facilities for long-term preservation

Interestingly, the network serves mostly smaller institutions. The Polish National Library and Archives have built their own systems.

Werla stressed that proper quality control at the production stage is difficult because of the bureaucratic Polish public procurement system.

Heiko Tjalsma of the Dutch research data archive DANS pitched the 4C project, which was established to ‘create a better understanding of digital curation costs through collaboration.’


Tjalsma: ‘We can only get a better idea of what digital curation costs by collaborating and sharing data’

At the moment there are several cost models available in the community (see, e.g., earlier posts), but they are difficult to compare. The 4C project intends to a) establish an international curation cost exchange framework, and b) build a Cost Concept Model – which will define what to include in the model and what to exclude.

The need for a clearer picture of curation costs is undisputed, but, Tjalsma added, ‘it is very difficult to gather detailed data, even from colleagues.’ Our organisations are reluctant to make their financial data available, and both ‘time’ and ‘scale’ make matters more difficult. The only way to go seems to be anonymisation of data, and for that to work, the project must attract as many participants as possible. So: please register at http://www.4cproject.eu/community-resources/stakeholder-participation – and participate.

Building bridges between expert and manager

The last part of the day was devoted to building bridges between experts and managers. Dirk von Suchodoletz of the University of Freiburg introduced the session with a topic that is often considered ‘expert-only’: emulation.


Dirk von Suchodoletz: ‘The EaaS project intends to make emulation available for a wider audience by providing it as a service.’

The emulation technique has been around for a while, and it is considered one of the few preservation methods available for very complex digital objects – but take-up by the community has been slow, because it is seen as too complex for non-experts. The Emulation as a Service project intends to bridge the gap to practical implementation by taking many of the technical worries away from memory institutions. A demo of Emulation as a Service is available for OPF members. Von Suchodoletz encouraged his audience to have a look at it, because the service can only be made to work if many memory institutions decide to participate.


Getting ready for the last roundtable discussion about the relationship between experts and managers

How R&D and the library business relate

‘What inspired the EaaS project,’ Hildelies Balk (KB) wanted to know from von Suchodoletz, ‘was it your own interest or was there some business requirement to be met?’ Von Suchodoletz admitted that it was his own research interest that kicked off the project; business requirements entered the picture later.

Birgit Henriksen of the Royal Library, Denmark: ‘We desperately need emulation to preserve the games in our collection, but because it is such a niche, funding is hard to come by.’ Jacqueline Slats of the Dutch National Archives echoed this observation: ‘The NA and the KB together developed the emulation tool Dioscuri, but because there was no business demand, development was halted. We may pick it up again as soon as we start receiving interactive material for preservation.’

This is what happened next, as visualised by Elco van Staveren:

Some highlights from the discussions:

  • Timing is of the essence. Obviously, R&D is always ahead of operations, but if it is too far ahead, funding will be difficult. Following user needs is no good either, because then R&D becomes mere procurement. Are there any cases of proper just-in-time development? Barbara Sierman of the KB suggested Jpylyzer (translation of Jpylyzer for managers) – the need arose for quality control in a massive TIFF-to-JP2000 migration at the KB intended to cut costs, and R&D delivered.
  • Another successful implementation: the PRONOM registry. The UK National Archives had a clear business case for developing it. On the other hand, the GDFR technical registry did not tick the boxes of timeliness, impetus and context.
  • For experts and managers to work well together, managers must start accepting a certain amount of failure. We are breaking new ground in digital preservation; failures are inevitable. Can we make managers understand that even failures make us stronger, because the organisation gains a lot of experience and knowledge? And what is an acceptable failure rate? Henriksen suggested that managing expectations can do the trick: ‘Do not expect perfection.’


    Some of the panel members (from left to right) Maureen Pennock (British Library), Hildelies Balk (KB), Mies Langelaar (Rotterdam Municipal Archives), Barbara Sierman (KB) and Mette van Essen (Dutch National Archives)

  • We need a new set of metrics to define success in the ever changing digital world.
  • Positioning the R&D department within Collections can help make collaboration between the two more effective (Andersen, Pennock). Henriksen: ‘At the Danish Royal Library we have started involving both R&D and collections staff in scoping projects.’
  • And then again … von Suchodoletz suggested that sometimes a loose coupling between R&D and business can be more effective, because staff in operations can get too bogged down by daily worries.
  • Sometimes breaking down the wall is just too much to ask, suggested van Essen. We may have to decide to jump the wall instead, at least for the time being.
  • Bridge builders can be key to making projects succeed, staff members who speak both the languages of operations and of R&D. Balk and Pennock stressed that everybody in the organisation should know about the basics of digital preservation.
  • Underneath all of the organisation’s doings must lie a clear common vision to inspire individual actions, projects and collaboration.

In conclusion: participants agreed that this seminar had been a fruitful counterweight to technical hackathons in digital preservation. More seminars may follow. If you participated (or read these blogs), please use the comment box for any corrections and/or follow-up.

‘In an ever changing digital world, we must allow for projects to fail – even failures bring us lots of knowledge.’

 

Breaking down walls in digital preservation (Part 1)

People & knowledge are the keys to breaking down the walls between daily operations and digital preservation (DP) within our organisations. DP is not a technical issue, but information technology must be embraced as a core feature of the digital library. Such were some of the conclusions of the seminar organised by the SCAPE project/Open Planets Foundation at the Dutch National Library (KB) and National Archives (NA) on Wednesday 2 April. – Report & photographs by Inge Angevaare, visualisations by Elco van Staveren

Newcomer questions some current practices


Menno Rasch (KB): ‘Do correct me if I am wrong’

Menno Rasch was appointed Head of Operations at the Dutch KB six months ago – but ‘I still feel like a newcomer in digital preservation.’ His division includes the Collection Care department, which is responsible for DP, but there are close working relationships with the Research and IT departments in the Innovation division. Rasch’s presentation about embedding DP in business practices at the KB posed some provocative questions:

  • We have a tendency to cover up our mistakes and failures rather than expose them and discuss them in order to learn as a community. That is what pilots do. The platform is there – the Atlas of Digital Damages, set up by the KB’s Barbara Sierman – but it is underused. Of course lots of data are protected by copyright or privacy regulations, but there surely must be some way to anonymise the data.
  • In libraries and archives, we still look upon IT as ‘the guys that make tools for us’. ‘But IT = the digital library.’
  • We need to become more pragmatic. Implementing the OAIS standard is a lot of work – perhaps it is better to take this one step at a time.
  • ‘If you don’t do it now, you won’t do it a year from now.’
  • ‘Any software we build is temporary – so keep the data, not the software.’
  • Most metadata are reproducible – so why not store them in a separate database and put only the most essential preservation metadata in the OAIS information package? That way we can continue improving the metadata. Of course these must be backed up too (an annual snapshot?), but they may tolerate a less expensive storage regime than the objects.
  • About developments at the KB: ‘To replace our old DIAS system, we are now developing software to handle all of our digital objects – which is an enormous challenge.’

The SCAPE/OPF seminar on Managing Digital Preservation, 2 April 2014, The Hague

Digital collections and the Titanic

Zoltan Szatucsek from the Hungarian National Archives used the Titanic as the metaphor for his presentation – without necessarily implying that we are headed for the proverbial iceberg, he added. Although … ‘many elements from the Titanic story can illustrate how we think’:

  • Titanic received many warnings about ice formations, and yet it was sailing at full speed when disaster struck.
  • Our ship – the organisation – is quite conservative. It wants to deal with digital records in the same way it deals with paper records. And at the Hungarian National Archives, IT staff and archivists are in the same department, which does not work because they do not speak each other’s language.


    Zoltan Szatucsek argued that putting IT staff and archivists together in the Hungarian National Archives caused ‘language’ problems; his Danish colleagues felt that in their case close proximity had rather helped improve communications

  • The captain must acquire new competences. He must learn to manage staff, funding, technology, equipment, etc. We need processes rather than tools.
  • The crew is in trouble too. Their education has not adapted to digital practices, and underfunding in the sector is a big issue. Strangely enough, staff working with medieval resources were much quicker to adopt digital practices than those working with contemporary material; the latter seem to want to put off any action until the legal transfer to the archives actually occurs (after 15-20 years).
  • Echoing Menno Rasch’s presentation, Szatucsek asked the rhetorical question: ‘Why do we not learn from our mistakes?’ A few months after the Titanic, another ship went down in similar circumstances.
  • Without proper metadata, objects are lost forever.
  • Last but not least: we have learned that digital preservation is not a technical challenge. We need to create a complete environment in which to preserve.
Szatucsek at Digital Preservation seminar

Is DP heading for the iceberg as well? Visualisation of Szatucsek’s presentation.

OPF: trust, confidence & communication

Ed Fay was appointed director of the Open Planets Foundation (OPF) only six weeks ago, but he presented a clear vision of how the OPF should function within the community, smack in the middle, as a steward of tools, a champion of open communications, trust & confidence, and a broker between commercial and non-commercial interests:

Ed Fay Open Planets Foundation vision

Ed Fay’s vision of the Open Planets Foundation’s role in the digital preservation community

Fay also shared some of his experiences in his former job at the London School of Economics:


Ed Fay illustrated how digital preservation was moved around a few times in the London School of Economics Library, until it found its present place in the Library division

So, what works, what doesn’t?

The first round-table discussion was introduced by Bjarne Andersen of Statsbiblioteket Aarhus (DK). He sketched his institution’s experiences in embedding digital preservation.


Bjarne Andersen (right) conferring with Birgit Henriksen (Danish Royal Library, left) and Jan Dalsten Sorensen (Danish National Archives). ‘SCRUM has helped move things along’

He mentioned the recently introduced SCRUM-based methodology as really having helped to move things along – it is an agile way of working which allows for flexibility. The concept of ‘user stories’ helps to make staff think about the ‘why’. Menno Rasch (KB) agreed: ‘SCRUM works especially well if you are not certain where to go. It is a step-by-step methodology.’

Some other lessons learned at Aarhus:

  • The responsibility for digital preservation cannot be with the developers implementing the technical solutions
  • The responsibility needs to be close to ‘the library’
  • Don’t split the analogue and digital library entirely – the two have quite a lot in common
  • IT development and research are necessary activities to keep up with a changing landscape of technology
  • Changing the organisation a few times over the years helped us educate the staff by bringing traditional collection/library staff close to IT for a period of time.

Group discussion. From the left: Jan Dalsten Sorensen (DK), Ed Fay (OPF), Menno Rasch (KB), Marcin Werla (PL), Bjarne Andersen (DK), Elco van Staveren (KB, visualising the discussion), Hildelies Balk (KB) and Ross King (Austria)

And here is how Elco van Staveren visualised the group discussion in real time:

Some highlights from the discussion:

  • Embedding digital preservation is about people
  • It really requires open communication channels.
  • A hierarchical organisation and/or an organisation with silos only builds up the wall. Engaged leadership is called for. And result-oriented incentives for staff rather than hierarchical incentives.
  • Embedding digital preservation in the organisation requires a vision that is shared by all.
  • Clear responsibilities must be defined.
  • Move the budgets to where the challenges are.
  • The organisation’s size may be a relevant factor in deciding how to organise DP. In large organisations, the wheels move slowly (staff numbers: Hungarian National Archives 700; British Library 1,500; Austrian National Library 400; KB Netherlands 300; London School of Economics 120; Statsbiblioteket Aarhus 200).
  • Most organisations favour bringing analogue and digital together as much as possible.
  • When it comes to IT experts and librarians/archivists learning each other’s languages, it was suggested that perhaps hard-core IT staff need not get too deeply involved in library issues – in fact, some IT staff might consider it bad for their careers. Software developers, however, do need to get involved in library/archive affairs.
  • Management must also be taught the language of the digital library and digital preservation.

(Continued in Breaking down walls in digital preservation, part 2)

Seminar agenda and links to presentations

Keep Calm 'cause Titanic is Unsinkable

Identification of PDF preservation risks: the sequel

Author: Johan van der Knijff
Originally posted on: http://www.openplanetsfoundation.org/blogs/2013-07-25-identification-pdf-preservation-risks-sequel

Last winter I made a first attempt at identifying preservation risks in PDF files using the Apache Preflight PDF/A validator. This work was later followed up by others in two SPRUCE hackathons in Leeds (see this blog post by Peter Cliff) and London (described here). Much of this later work tacitly assumes that Apache Preflight is able to successfully identify features in PDF that are a potential risk for long-term access. This Wiki page on uses and abuses of Preflight (created as part of the final SPRUCE hackathon) even goes as far as stating that “Preflight is thorough and unforgiving (as it should be)”. But what evidence do we have to support such claims? The only evidence that I’m aware of is the results obtained from a small test corpus of custom-created PDFs. Each PDF in this corpus was created in such a way that it includes only one specific feature that is a potential preservation risk (e.g. encryption, non-embedded fonts, and so on). However, PDFs that exist ‘in the wild’ are usually more complex. Also, the PDF specification often allows you to implement similar features in subtly different ways. For these reasons, it is essential to obtain additional evidence of Preflight‘s ability to detect ‘risky’ features before relying on the tool in any operational setting.

Adobe Acrobat Engineering test files

Shortly after I completed my initial tests, Adobe released the Acrobat Engineering website, which contains a large volume of test documents that are used by Adobe for testing their products. Although the test documents are not fully annotated, they are subdivided into categories such as Multimedia & 3D Tests and Font tests. This makes these files particularly useful for additional tests on Preflight.

Methodology

The general methodology I used to analyse these files is identical to what I did in my 2012 report: first, each PDF was validated using Apache Preflight. As a control I also validated the PDFs with the Preflight component of Adobe Acrobat, using the PDF/A-1b profile. The table below lists the software versions used:

Software          | Version
Apache Preflight  | 2.0.0
Adobe Acrobat     | 10.1.4
Acrobat Preflight | 10.1.3 (090)
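
For larger batches this kind of run can easily be scripted. The sketch below is hypothetical: it assumes the Preflight command-line application JAR (here named preflight-app-2.0.0.jar) and a testfiles directory are available locally, and the exact arguments, report format and exit-code conventions depend on the Preflight version used.

    import subprocess
    from pathlib import Path

    # Hypothetical local path to the Preflight command-line application JAR.
    PREFLIGHT_JAR = "preflight-app-2.0.0.jar"

    def preflight_report(pdf_path):
        """Run Apache Preflight on one PDF and return its textual report.

        Sketch only: command-line arguments and report format depend on
        the Preflight version; adjust before using this operationally.
        """
        proc = subprocess.run(["java", "-jar", PREFLIGHT_JAR, str(pdf_path)],
                              capture_output=True, text=True)
        return proc.stdout + proc.stderr

    # Collect reports for a directory of test files (directory name assumed).
    for pdf in sorted(Path("testfiles").glob("*.pdf")):
        print("==", pdf.name, "==")
        print(preflight_report(pdf))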

Re-analysis of PDF Cabinet of Horrors corpus

Because the current analysis is based on a more recent version of Apache Preflight than the one used in the 2012 report (which was 1.8.0), I first re-ran the analysis of the PDFs in the PDF Cabinet of Horrors corpus. The main results are reproduced here. The main differences with respect to that earlier version are:

  1. Apache Preflight now has an option to produce output in XML format (as suggested by William Palmer following the Leeds SPRUCE hackathon)
  2. Better reporting of non-embedded fonts (see also this issue)
  3. Unlike the earlier version, Preflight 2.0.0 does not give any meaningful output in case of encrypted and password-protected PDFs! This is probably a bug, for which I submitted a report here.

Analysis Acrobat Engineering PDFs

Since the Acrobat Engineering site hosts a lot of PDFs, I only focused on a limited subset for the current analysis:

  1. all files in the General section of the Font Testing category;
  2. all files in the Classic Multimedia section of the Multimedia & 3D Tests category.

The results are summarized in two tables (see next sections). For each analysed PDF, the table lists:

  • the error(s) reported by Adobe Acrobat Preflight;
  • the error code(s) reported by Apache Preflight (see Preflight’s source code for a listing of all possible error codes);
  • the error description(s) reported by Apache Preflight in the details output element.

For the sake of readability, the tables only list those error messages/codes that are directly related to font problems, multimedia, encryption and JavaScript. The full output for all tested files can be found here.

Fonts

The table below summarizes the results of the PDFs in the Font Testing category:

Test file | Acrobat Preflight error(s) | Apache Preflight Error Code(s) | Apache Preflight Details
EmbeddedCmap.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font | 3.1.3 | Invalid Font definition, FontFile entry is missing from FontDescriptor for HeiseiKakuGo-W5
TEXT.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; TrueType font has differences to standard encodings but is not a symbolic font; Wrong encoding for non-symbolic TrueType font | 3.1.5; 3.1.1; 3.1.2; 3.1.3; 3.2.4 | Invalid Font definition, The Encoding is invalid for the NonSymbolic TTF; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial,Italic (repeated for other fonts); Font damaged, The CharProcs references an element which can’t be read
Type3_WWW-HTML.PDF | – | 3.1.6 | Invalid Font definition, The character with CID “58” should have a width equals to 15.56599 (repeated for other fonts)
embedded_fonts.pdf | Font not embedded (and text rendering mode not 3); Type 2 CID font: CIDToGIDMap invalid or missing | 3.1.9; 3.1.11 | Invalid Font definition; Invalid Font definition, The CIDSet entry is missing for the Composite Subset
embedded_pm65.pdf | – | 3.1.6 | Invalid Font definition, Width of the character “110” in the font program “HKPLIB+AdobeCorpID-MyriadRg” is inconsistent with the width in the PDF dictionary (repeated for other fonts)
notembedded_pm65.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font | 3.1.3 | Invalid Font definition, FontFile entry is missing from FontDescriptor for TimesNewRoman (repeated for other fonts)
printtestfont_nonopt.pdf* | ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type | exceptionThrown | Preflight throws exception, exits with message ‘Invalid ICC Profile Data’
printtestfont_opt.pdf* | ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type | exceptionThrown | Preflight throws exception, exits with message ‘Invalid ICC Profile Data’
substitution_fonts.pdf | Font not embedded (and text rendering mode not 3) | 3.1.1; 3.1.2; 3.1.3 | Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Souvenir-Light (repeated for other fonts)
text_images_pdf1.2.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Width information for rendered glyphs is inconsistent | 3.1.1; 3.1.2 | Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor

* As these documents don’t appear to have any font-related issues, it is unclear why they are in the Font Testing category. The errors related to ICC profiles are reproduced here because of their relevance to the Apache Preflight exception.

General observations

A comparison of the results of Acrobat Preflight and Apache Preflight shows that Apache Preflight’s output may vary in the case of non-embedded fonts. In most cases it produces error code 3.1.3 (as was the case with the PDF Cabinet of Horrors dataset), but other errors in the 3.1.x range may occur as well. The 3.1.6 “character width” error is something that was also encountered during the London SPRUCE hackathon, and according to the information here this is most likely the result of the PDF/A specification not being particularly clear. So this looks like a non-serious error that can be safely ignored in most cases.
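
In practice, that advice could be turned into a simple filter over the reported error codes, along the lines of the sketch below. The codes are taken from the tables in this post; treating only 3.1.6 as ignorable is an assumption that may need tuning.

    def font_embedding_risk(error_codes, ignore=("3.1.6",)):
        """Flag a PDF as a potential font-embedding risk.

        Sketch: treats any font-related Apache Preflight error (3.1.x) as
        a risk, except 'character width' errors such as 3.1.6, which can
        usually be ignored. error_codes is an iterable of code strings
        collected from Preflight's (XML) output, e.g. ["3.1.1", "3.1.3"].
        """
        return any(code.startswith("3.1.") and code not in ignore
                   for code in error_codes)

    # Example, using the codes reported above for substitution_fonts.pdf:
    print(font_embedding_risk(["3.1.1", "3.1.2", "3.1.3"]))  # True
    print(font_embedding_risk(["3.1.6"]))                    # False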

Multimedia

The next table shows the results for the Multimedia & 3D Tests category:

Test file | Acrobat Preflight error(s) | Apache Preflight Error Code(s) | Apache Preflight Details
20020402_CALOS.pdf | – | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Disney-Flash.pdf | Contains action of type JavaScript; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Form field does not have appearance dict; Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia-related errors; Preflight did report syntax and body syntax error
Jpeg_linked.pdf | Document is encrypted; Encrypt key present in file trailer; Named action with a value other than standard page navigation used; Incorrect annotation type used (not allowed in PDF/A); Font not embedded (and text rendering mode not 3) | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
MultiMedia_Acro6.pdf | Document is encrypted; EmbeddedFiles entry in Names dictionary; Encrypt key present in file trailer; PDF contains EF (embedded file) entry; Incorrect annotation type used (not allowed in PDF/A) | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
MusicalScore.pdf | CIDset in subset font is incomplete; CIDset in subset font missing; Contains action of type JavaScript; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry; Type 2 CID font: CIDToGIDMap invalid or missing | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
SVG-AnnotAnim.pdf | Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 5.2.1; 1.2.9 | Forbidden field in an annotation definition, The subtype isn’t authorized : SVG; Body Syntax error, EmbeddedFile entry is present in a FileSpecification dictionary
SVG.pdf | Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
ScriptEvents.pdf | Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Service Form_media.pdf | Contains action of type JavaScript; Contains action of type ResetForm; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Incorrect annotation type used (not allowed in PDF/A); Named action with a value other than standard page navigation used; PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Trophy.pdf | Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
VolvoS40V50-Full.pdf | Preflight exits with: “An error occurred while parsing a contents stream. Unable to analyze the PDF file” | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
gXsummer2004-stream.pdf | File cannot be loaded in Acrobat (damaged file) | 1.0; 1.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
phlmapbeta7.pdf | Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
us_population.pdf | Preflight exits with: “An error occurred while parsing a contents stream. Unable to analyze the PDF file” | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
movie.pdf | Incorrect annotation type used (not allowed in PDF/A) | 5.2.1 | Forbidden field in an annotation definition, The subtype isn’t authorized : Movie
movie_down1.pdf | Incorrect annotation type used (not allowed in PDF/A) | 5.2.1 | Forbidden field in an annotation definition, The subtype isn’t authorized : Movie
remotemovieurl.pdf | Font not embedded (and text rendering mode not 3); Incorrect annotation type used (not allowed in PDF/A) | 5.2.1; 3.1.1; 3.1.2; 3.1.3 | Forbidden field in an annotation definition, The subtype isn’t authorized : Movie; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial

General observations

The results from the Multimedia PDFs are interesting for several reasons. First of all, these files include a wide variety of ‘risky’ features, such as multimedia content, embedded files, JavaScript, non-embedded fonts and encryption. These were successfully identified by Acrobat Preflight in most cases. Apache Preflight, on the other hand, only reported non-specific and fairly uninformative errors (1.0 + 1.2.1) for 12 out of 17 files. Even though Preflight was correct in establishing that these files were not valid PDF/A-1b, it was not able to drill down to the level of specific features for the majority of these files.

Summary and conclusions

The re-analysis of the PDF Cabinet of Horrors corpus, and the subsequent analysis of a subset of the Adobe Acrobat Engineering PDFs, show a number of things. First, Apache Preflight 2.0.0 does not properly identify encryption and password protection. This looks like a bug that is probably easily fixed. Second, the analysis of the Font Testing PDFs from the Acrobat Engineering site revealed that non-embedded fonts may result in a variety of error codes in Apache Preflight (assuming here that the Acrobat Preflight results are accurate). So, when using Apache Preflight to check font embedding, it is probably a good idea to treat all font-related errors (perhaps with the exception of character width errors) as a potential risk. The more complex PDFs in the Multimedia category proved to be quite challenging for Apache Preflight: for most files here, it was not able to identify specific features such as multimedia content, embedded files, JavaScript and non-embedded fonts. This is not necessarily a problem if Apache Preflight is used for its intended purpose: verifying whether a PDF conforms to PDF/A-1. However, it does rather limit the tool’s use for profiling heterogeneous PDF collections for specific preservation risks at this stage. This may well change with future versions; in fact, the specificity of Preflight‘s validation output has already improved considerably since version 1.8.0. In the meantime it is important to keep expectations about the tool’s capabilities realistic, in order to avoid unintended misuse.

Links

KB joins the leading Big Data conference in Europe!

On March 20-21, Hadoop Summit 2013, the leading big data conference, made its first ever appearance on European soil. The Beurs van Berlage in Amsterdam provided a splendid venue for the gathering of about 500 international participants interested in the newest trends around Big Data and Hadoop. The main hosts, Hortonworks and Yahoo, did an excellent job of putting together an exciting programme: two days full of enticing sessions divided into four distinct tracks: Applied Hadoop, Operating Hadoop, Hadoop Futures and Integrating Hadoop.

Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

The open-source Hadoop software framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from single servers to thousands of machines.
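
To give a flavour of the programming model: the canonical word-count example can be written as one small Python script and run with Hadoop Streaming. This is a minimal sketch; the input/output paths and the streaming JAR location are placeholders.

    #!/usr/bin/env python
    """Word count for Hadoop Streaming. Run the same script as mapper and
    reducer (paths and JAR name are placeholders), e.g.:

      hadoop jar hadoop-streaming.jar \
          -input /data/in -output /data/out \
          -mapper 'python wordcount.py map' \
          -reducer 'python wordcount.py reduce' \
          -file wordcount.py
    """
    import sys

    def mapper():
        # Emit one '<word> TAB 1' line per word; Hadoop sorts by key
        # between the map and reduce phases.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word.lower())

    def reducer():
        # Input arrives grouped by word, so a running total per key works.
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rstrip("\n").rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = word, 0
            count += int(n)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1:2] == ["map"] else reducer()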

In his keynote, Hortonworks VP Shaun Connolly pointed out that by 2015 more than half the world’s data will be processed using Hadoop! There were further keynotes by 451 Research Director Matt Aslett (What is the point of Hadoop?) and Hortonworks co-founder Eric Baldeschwieler (Hadoop Now, Next and Beyond), and a live panel that discussed Real-World insight into Hadoop in the Enterprise.

Vendor area at Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

Many interesting talks followed on the use of, and benefits derived from, Hadoop at companies like Facebook, Twitter, eBay and LinkedIn, as well as on exciting upcoming technologies that further enrich the Hadoop ecosystem, such as the Apache projects Drill and Ambari, or YARN, the next-generation MapReduce implementation.

The Koninklijke Bibliotheek and the Austrian National Library jointly presented their recent experiences with Hadoop in the SCAPE project. Clemens Neudecker and Sven Schlarb spoke about the potential of integrating Hadoop into digital libraries in their talk “The Elephant in the Library” (video: coming soon).


In the SCAPE project, partners are experimenting with integrating Hadoop into library workflows for various large-scale data processing scenarios related to web archiving, file format migration and analytics – you can find out more about the Hadoop-related activities in SCAPE here:
http://www.scape-project.eu/news/scape-hadoop
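
By way of illustration, such a workflow can be as simple as a map-only streaming job that runs a characterisation tool over a list of file paths. The sketch below is hypothetical: the paths, the job options and the choice of the Unix file utility are placeholders for whatever tool a real SCAPE-style workflow would use.

    #!/usr/bin/env python
    """Hypothetical map-only Hadoop Streaming job for format identification.
    Each input line is a file path on shared storage; the mapper emits
    '<path> TAB <mime type>' using the Unix 'file' utility, e.g.:

      hadoop jar hadoop-streaming.jar \
          -D mapred.reduce.tasks=0 \
          -input /data/paths.txt -output /data/mimetypes \
          -mapper 'python identify.py' -file identify.py
    """
    import subprocess
    import sys

    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        # 'file --brief --mime-type' prints just the MIME type, e.g. 'image/jp2'.
        mime = subprocess.run(["file", "--brief", "--mime-type", path],
                              capture_output=True, text=True).stdout.strip()
        print("%s\t%s" % (path, mime))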

After two very successful days the Hadoop Summit concluded, and participants agreed that there needs to be another one next year – quite likely to be held again in the amazing city of Amsterdam!

Find out more about Hadoop Summit 2013 in Amsterdam:

Web:             http://hadoopsummit.org/amsterdam/
Facebook:    https://www.facebook.com/HadoopSummit
Pictures:      http://www.flickr.com/photos/timoelliott/
Tweets:       https://twitter.com/search/?q=hadoopsummit
Slides:          http://www.slideshare.net/Hadoop_Summit/
Videos:        http://www.youtube.com/user/HadoopSummit/videos
Blogs:           http://hortonworks.com/blog/hadoop-summit-2013-amsterdam-its-a-wrap/
                     http://www.sentric.ch/blog/hello-europe-hadoop-has-landed
                     http://janbruecher.blogspot.nl/2013/03/2013-hadoop-summit-day-1.html
                     http://janbruecher.blogspot.nl/2013/03/2013-hadoop-summit-day-2.html