Jpylyzer software finalist for digital preservation award

Today the British Digital Preservation Coalition announced the finalists in the running for the Digital Preservation Awards 2014. The award was established in 2004 to draw attention to initiatives that make an important contribution to keeping digital heritage accessible.

In the Research and Innovation category, a software tool developed by the KB's Research department has been nominated: jpylyzer. Jpylyzer offers a simple way to check whether JP2 (JPEG 2000) image files are technically sound. Within the KB, the tool is used for, among other things, quality control of digitised books, newspapers and periodicals. Jpylyzer is also used by various international peer institutions.

Jpylyzer was developed partly within the European project SCAPE, in which the KB is a project partner. The eventual winners will be announced on 17 November.

More information about jpylyzer's nomination can be found on the website of the Digital Preservation Coalition:

http://www.dpconline.org/newsroom/latest-news/1271-dpa-2014finalists

The following article is of interest to anyone who wants to know more about jpylyzer, and why we actually need such a tool:

http://www.kb.nl/research/kb-onderzoek-het-internationale-succes-van-de-jpylyzer-en-wat-is-dat-eigenlijk-voor-ding

Finally, here is the jpylyzer homepage:

http://openplanets.github.io/jpylyzer/
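
To give a rough idea of what "technically sound" can mean at the very lowest level, here is a minimal Python sketch. It is not jpylyzer's code or API, and jpylyzer checks far more than this. The sketch only tests whether a file starts with the 12-byte signature box that the JP2 format requires; the function names are hypothetical.

```python
# The 12-byte JPEG 2000 signature box that a JP2 file must start with:
# box length (0x0000000C), box type 'jP  ', and the fixed content 0D 0A 87 0A.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def has_jp2_signature(data: bytes) -> bool:
    """Return True if the byte string begins with the JP2 signature box."""
    return data[:12] == JP2_SIGNATURE


def file_has_jp2_signature(path: str) -> bool:
    """Read the first 12 bytes of a file and test them (hypothetical helper)."""
    with open(path, "rb") as handle:
        return has_jp2_signature(handle.read(12))
```

A file that fails this test is certainly not a valid JP2; a file that passes it may still be broken further on, which is where a full validator comes in.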

Breaking down walls in digital preservation (Part 1)

People & knowledge are the keys to breaking down the walls between daily operations and digital preservation (DP) within our organisations. DP is not a technical issue, but information technology must be embraced as a core feature of the digital library. Such were some of the conclusions of the seminar organised by the SCAPE project/Open Planets Foundation at the Dutch National Library (KB) and National Archives (NA) on Wednesday 2 April. – Report & photographs by Inge Angevaare, visualisations by Elco van Staveren

Newcomer questions some current practices

Menno Rasch (KB): ‘Do correct me if I am wrong’

Menno Rasch was appointed Head of Operations at the Dutch KB six months ago, but ‘I still feel like a newcomer in digital preservation.’ His division includes the Collection Care department, which is responsible for DP, and there are close working relationships with the Research and IT departments in the Innovation Division. Rasch’s presentation about embedding DP in business practices at the KB posed some provocative questions:

  • We have a tendency to cover up our mistakes and failures rather than expose them and discuss them in order to learn as a community. Exposing and discussing mistakes is what pilots do. The platform is there, the Atlas of Digital Damages set up by the KB’s Barbara Sierman, but it is being underused. Of course lots of data are protected by copyright or privacy regulations, but there surely must be some way to anonymise the data.
  • In libraries and archives, we still look upon IT as ‘the guys that make tools for us’. ‘But IT = the digital library.’
  • We need to become more pragmatic. Implementing the OAIS standard is a lot of work – perhaps it is better to take this one step at a time.
  • ‘If you don’t do it now, you won’t do it a year from now.’
  • ‘Any software we build is temporary – so keep the data, not the software.’
  • Most metadata are reproducible – so why not store them in a separate database and put only the most essential preservation metadata in the OAIS information package? That way we can continue improving the metadata. Of course these must be backed up too (an annual snapshot?), but may tolerate a less expensive storage regime than the objects.
  • About developments at the KB: ‘To replace our old DIAS system, we are now developing software to handle all of our digital objects – which is an enormous challenge.’

The SCAPE/OPF seminar on Managing Digital Preservation, 2 April 2014, The Hague

Digital collections and the Titanic

Zoltan Szatucsek from the Hungarian National Archives used the Titanic as the metaphor for his presentation – without necessarily implying that we are headed for the proverbial iceberg, he added. Although … ‘many elements from the Titanic story can illustrate how we think’:

  • Titanic received many warnings about ice formations, and yet it was sailing at full speed when disaster struck.
  • Our ship – the organisation – is quite conservative. It wants to deal with digital records in the same way it deals with paper records. And at the Hungarian National Archives IT and archivist staff are in the same department, which does not work because they do not speak each other’s language.

    Zoltan Szatucsek argued that putting IT staff and archivists together in the Hungarian National Archives caused ‘language’ problems; his Danish colleagues felt that in their case close proximity had rather helped improve communications

  • The captain must acquire new competences. He must learn to manage staff, funding, technology, equipment, etc. We need processes rather than tools.
  • The crew is in trouble too. Their education has not adapted to digital practices. Underfunding in the sector is a big issue. Strangely enough, staff working with medieval resources were much quicker to adopt digital practices than those working with contemporary material. They seem to want to put off any action until legal transfer to the archives actually occurs (15-20 years).
  • Echoing Menno Rasch’s presentation, Szatucsek asked the rhetorical question: ‘Why do we not learn from our mistakes?’ A few months after Titanic, another ship went down in similar circumstances.
  • Without proper metadata, objects are lost forever.
  • Last but not least: we have learned that digital preservation is not a technical challenge. We need to create a complete environment in which to preserve.

Is DP heading for the iceberg as well? Visualisation of Szatucsek’s presentation.

OPF: trust, confidence & communication

Ed Fay was appointed director of the Open Planets Foundation (OPF) only six weeks ago. But he presented a clear vision of how the OPF should function within the community, right in the middle, as a steward of tools, a champion of open communications, trust & confidence, and a broker between commercial and non-commercial interests:

Ed Fay’s vision of the Open Planets Foundation’s role in the digital preservation community

Fay also shared some of his experiences in his former job at the London School of Economics:

Ed Fay illustrated how digital preservation was moved around a few times in the London School of Economics Library, until it found its present place in the Library division

So, what works, what doesn’t?

The first round-table discussion was introduced by Bjarne Andersen of the Statsbiblioteket Aarhus (DK). He sketched his institution’s experiences in embedding digital preservation.

Bjarne Andersen (right) conferring with Birgit Henriksen (Danish Royal Library, left) and Jan Dalsten Sorensen (Danish National Archives). ‘SCRUM has helped move things along’

He mentioned the recently introduced SCRUM-based methodology as really having helped to move things along – it is an agile way of working which allows for flexibility. The concept of ‘user stories’ helps to make staff think about the ‘why’. Menno Rasch (KB) agreed: ‘SCRUM works especially well if you are not certain where to go. It is a step-by-step methodology.’

Some other lessons learned at Aarhus:

  • The responsibility for digital preservation cannot be with the developers implementing the technical solutions
  • The responsibility needs to be close to ‘the library’
  • Don’t split the analogue and digital library entirely – the two have quite a lot in common
  • IT development and research are necessary activities to keep up with a changing landscape of technology
  • Changing the organisation a few times over the years helped us educate the staff by bringing traditional collection/library staff close to IT for a period of time.

Group discussion. From the left: Jan Dalsten Sorensen (DK), Ed Fay (OPF), Menno Rasch (KB), Marcin Werla (PL), Bjarne Andersen (DK), Elco van Staveren (KB, visualising the discussion), Hildelies Balk (KB) and Ross King (Austria)

And here is how Elco van Staveren visualised the group discussion in real time:

Some highlights from the discussion:

  • Embedding digital preservation is about people
  • It really requires open communication channels.
  • A hierarchical organisation and/or an organisation with silos only builds up the wall. Engaged leadership is called for. And result-oriented incentives for staff rather than hierarchical incentives.
  • Embedding digital preservation in the organisation requires a vision that is shared by all.
  • Clear responsibilities must be defined.
  • Move the budgets to where the challenges are.
  • The organisation’s size may be a relevant factor in deciding how to organise DP. In large organisations, the wheels move slowly (no. of staff in the Hungarian National Archives 700; British Library 1,500; Austrian National Library 400; KB Netherlands 300, London School of Economics 120, Statsbiblioteket Aarhus 200).
  • Most organisations favour bringing analogue and digital together as much as possible.
  • When it comes to IT experts and librarians/archivists learning each other’s languages, it was suggested that maybe hard IT staff need not get too deeply involved in library issues – in fact, some IT staff might consider it bad for their careers. Software developers, however, do need to get involved in library/archive affairs.
  • Management must also be taught the language of the digital library and digital preservation.

(Continued in Breaking down walls in digital preservation, part 2)

Seminar agenda and links to presentations

Keep Calm 'cause Titanic is Unsinkable

Identification of PDF preservation risks: the sequel

Author: Johan van der Knijff
Originally posted on: http://www.openplanetsfoundation.org/blogs/2013-07-25-identification-pdf-preservation-risks-sequel

Last winter I made a first attempt at identifying preservation risks in PDF files using the Apache Preflight PDF/A validator. This work was later followed up by others in two SPRUCE hackathons in Leeds (see this blog post by Peter Cliff) and London (described here). Much of this later work tacitly assumes that Apache Preflight is able to successfully identify features in PDF that are a potential risk for long-term access. This Wiki page on uses and abuses of Preflight (created as part of the final SPRUCE hackathon) even goes as far as stating that “Preflight is thorough and unforgiving (as it should be)”. But what evidence do we have to support such claims? The only evidence that I’m aware of is the set of results obtained from a small test corpus of custom-created PDFs. Each PDF in this corpus was created in such a way that it includes only one specific feature that is a potential preservation risk (e.g. encryption, non-embedded fonts, and so on). However, PDFs that exist ‘in the wild’ are usually more complex. Also, the PDF specification often allows you to implement similar features in subtly different ways. For these reasons, it is essential to obtain additional evidence of Preflight’s ability to detect ‘risky’ features before relying on this tool in any operational setting.

Adobe Acrobat Engineering test files

Shortly after I completed my initial tests, Adobe released the Acrobat Engineering website, which contains a large volume of test documents that are used by Adobe for testing their products. Although the test documents are not fully annotated, they are subdivided into categories such as Multimedia & 3D Tests and Font tests. This makes these files particularly useful for additional tests on Preflight.

Methodology

The general methodology I used to analyse these files is identical to what I did in my 2012 report: first, each PDF was validated using Apache Preflight. As a control I also validated the PDFs with the Preflight component of Adobe Acrobat, using the PDF/A-1b profile. The table below lists the software versions used:

Software            Version
Apache Preflight    2.0.0
Adobe Acrobat       10.14
Acrobat Preflight   10.1.3 (090)

Re-analysis of PDF Cabinet of Horrors corpus

Because the current analysis is based on a more recent version of Apache Preflight than the one used in the 2012 report (which was 1.8.0), I first re-ran the analysis of the PDFs in the PDF Cabinet of Horrors corpus. The main results are reproduced here. The main differences with respect to that earlier version are:

  1. Apache Preflight now has an option to produce output in XML format (as suggested by William Palmer following the Leeds SPRUCE hackathon)
  2. Better reporting of non-embedded fonts (see also this issue)
  3. Unlike the earlier version, Preflight 2.0.0 does not give any meaningful output in case of encrypted and password-protected PDFs! This is probably a bug, for which I submitted a report here.

Analysis Acrobat Engineering PDFs

Since the Acrobat Engineering site hosts a lot of PDFs, I only focused on a limited subset for the current analysis:

  1. all files in the General section of the Font Testing category;
  2. all files in the Classic Multimedia section of the Multimedia & 3D Tests category.

The results are summarized in two tables (see next sections). For each analysed PDF, the table lists:

  • the error(s) reported by Adobe Acrobat Preflight;
  • the error code(s) reported by Apache Preflight (see Preflight’s source code for a listing of all possible error codes);
  • the error description(s) reported by Apache Preflight in the details output element.

For the sake of readability, the tables only list those error messages/codes that are directly related to font problems, multimedia, encryption and JavaScript. The full output for all tested files can be found here.

Fonts

The table below summarizes the results of the PDFs in the Font Testing category:

Test file | Acrobat Preflight error(s) | Apache Preflight error code(s) | Apache Preflight details
EmbeddedCmap.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font | 3.1.3 | Invalid Font definition, FontFile entry is missing from FontDescriptor for HeiseiKakuGo-W5
TEXT.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; TrueType font has differences to standard encodings but is not a symbolic font; Wrong encoding for non-symbolic TrueType font | 3.1.5; 3.1.1; 3.1.2; 3.1.3; 3.2.4 | Invalid Font definition, The Encoding is invalid for the NonSymbolic TTF; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial,Italic (repeated for other fonts); Font damaged, The CharProcs references an element which can’t be read
Type3_WWW-HTML.PDF | - | 3.1.6 | Invalid Font definition, The character with CID “58” should have a width equals to 15.56599 (repeated for other fonts)
embedded_fonts.pdf | Font not embedded (and text rendering mode not 3); Type 2 CID font: CIDToGIDMap invalid or missing | 3.1.9; 3.1.11 | Invalid Font definition; Invalid Font definition, The CIDSet entry is missing for the Composite Subset
embedded_pm65.pdf | - | 3.1.6 | Invalid Font definition, Width of the character “110” in the font program “HKPLIB+AdobeCorpID-MyriadRg” is inconsistent with the width in the PDF dictionary (repeated for other font)
notembedded_pm65.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font | 3.1.3 | Invalid Font definition, FontFile entry is missing from FontDescriptor for TimesNewRoman (repeated for other fonts)
printtestfont_nonopt.pdf* | ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type | - | Preflight throws exception (exceptionThrown), exits with message ‘Invalid ICC Profile Data’
printtestfont_opt.pdf* | ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type | - | Preflight throws exception (exceptionThrown), exits with message ‘Invalid ICC Profile Data’
substitution_fonts.pdf | Font not embedded (and text rendering mode not 3) | 3.1.1; 3.1.2; 3.1.3 | Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Souvenir-Light (repeated for other fonts)
text_images_pdf1.2.pdf | Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Width information for rendered glyphs is inconsistent | 3.1.1; 3.1.2 | Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor

* As this document doesn’t appear to have any font-related issues, it’s unclear why it is in the Font Testing category. The errors related to ICC profiles are reproduced here because of their relevance to the Apache Preflight exception.

General observations

A comparison of the results of Acrobat Preflight and Apache Preflight shows that Apache Preflight’s output may vary in the case of non-embedded fonts. In most cases it produces error code 3.1.3 (as was the case with the PDF Cabinet of Horrors dataset), but other errors in the 3.1.x range may occur as well. The 3.1.6 “character width” error is something that was also encountered during the London SPRUCE Hackathon, and according to the information here this is most likely the result of the PDF/A specification not being particularly clear. So, this looks like a non-serious error that can be safely ignored in most cases.
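
The rule of thumb that follows from these observations can be written down as a small Python sketch. It is only an illustration under stated assumptions: the function name is hypothetical, and treating every code that starts with "3." as font-related is my reading of the table above (3.1.x "Invalid Font definition", 3.2.4 "Font damaged"), not something taken from Preflight's documentation.

```python
def font_risk_codes(error_codes, ignore=("3.1.6",)):
    """Return the font-related codes that should be treated as a potential risk.

    Assumption: codes starting with "3." are font-related, as in the table
    above. The character width error 3.1.6 is ignored by default.
    """
    return [code for code in error_codes
            if code.startswith("3.") and code not in ignore]


# Error codes reported by Apache Preflight for three of the test files.
reported = {
    "EmbeddedCmap.pdf": ["3.1.3"],
    "Type3_WWW-HTML.PDF": ["3.1.6"],
    "substitution_fonts.pdf": ["3.1.1", "3.1.2", "3.1.3"],
}

flagged = {name: font_risk_codes(codes) for name, codes in reported.items()}
at_risk = sorted(name for name, codes in flagged.items() if codes)
print(at_risk)
```

With this rule, Type3_WWW-HTML.PDF is not flagged, because its only font error is the character width error.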

Multimedia

The next table shows the results for Multimedia & 3D Tests category:

Test file | Acrobat Preflight error(s) | Apache Preflight error code(s) | Apache Preflight details
20020402_CALOS.pdf | - | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Disney-Flash.pdf | Contains action of type JavaScript; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Form field does not have appearance dict; Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia-related errors; Preflight did report syntax and body syntax error
Jpeg_linked.pdf | Document is encrypted; Encrypt key present in file trailer; Named action with a value other than standard page navigation used; Incorrect annotation type used (not allowed in PDF/A); Font not embedded (and text rendering mode not 3) | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
MultiMedia_Acro6.pdf | Document is encrypted; EmbeddedFiles entry in Names dictionary; Encrypt key present in file trailer; PDF contains EF (embedded file) entry; Incorrect annotation type used (not allowed in PDF/A) | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
MusicalScore.pdf | CIDset in subset font is incomplete; CIDset in subset font missing; Contains action of type JavaScript; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry; Type 2 CID font: CIDToGIDMap invalid or missing | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
SVG-AnnotAnim.pdf | Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 5.2.1; 1.2.9 | Forbidden field in an annotation definition, The subtype isn’t authorized : SVG; Body Syntax error, EmbeddedFile entry is present in a FileSpecification dictionary
SVG.pdf | Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
ScriptEvents.pdf | Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Service Form_media.pdf | Contains action of type JavaScript; Contains action of type ResetForm; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Incorrect annotation type used (not allowed in PDF/A); Named action with a value other than standard page navigation used; PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Trophy.pdf | Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
VolvoS40V50-Full.pdf | Preflight exits with: “An error occurred while parsing a contents stream. Unable to analyze the PDF file” | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
gXsummer2004-stream.pdf | File cannot be loaded in Acrobat (damaged file) | 1.0; 1.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
phlmapbeta7.pdf | Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
us_population.pdf | Preflight exits with: “An error occurred while parsing a contents stream. Unable to analyze the PDF file” | 1.0; 1.2.1 | No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
movie.pdf | Incorrect annotation type used (not allowed in PDF/A) | 5.2.1 | Forbidden field in an annotation definition, The subtype isn’t authorized : Movie
movie_down1.pdf | Incorrect annotation type used (not allowed in PDF/A) | 5.2.1 | Forbidden field in an annotation definition, The subtype isn’t authorized : Movie
remotemovieurl.pdf | Font not embedded (and text rendering mode not 3); Incorrect annotation type used (not allowed in PDF/A) | 5.2.1; 3.1.1; 3.1.2; 3.1.3 | Forbidden field in an annotation definition, The subtype isn’t authorized : Movie; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial

General observations

The results from the Multimedia PDFs are interesting for several reasons. First of all, these files include a wide variety of ‘risky’ features, such as multimedia content, embedded files, JavaScript, non-embedded fonts and encryption. These were successfully identified by Acrobat Preflight in most cases. Apache Preflight, on the other hand, only reported non-specific and fairly uninformative errors (1.0 + 1.2.1) for 12 out of 17 files. Even though Preflight was correct in establishing that these files were not valid PDF/A-1b, it wasn’t able to drill down to the level of specific features for the majority of these files.
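
For readers who want to check the count, here is a small Python sketch that transcribes the Apache Preflight error codes from the Multimedia table and counts the files for which only the generic pair 1.0 + 1.2.1 was reported. The variable names are mine; the data are copied from the table.

```python
# The non-specific pair of codes (syntax error + body syntax error).
GENERIC = {"1.0", "1.2.1"}

# Apache Preflight error codes per test file, transcribed from the table.
apache_codes = {
    "20020402_CALOS.pdf": {"1.0", "1.2.1"},
    "Disney-Flash.pdf": {"1.0", "1.2.1"},
    "Jpeg_linked.pdf": {"1.0", "1.2.1"},
    "MultiMedia_Acro6.pdf": {"1.0", "1.2.1"},
    "MusicalScore.pdf": {"1.0", "1.2.1"},
    "SVG-AnnotAnim.pdf": {"5.2.1", "1.2.9"},
    "SVG.pdf": {"1.0", "1.2.1"},
    "ScriptEvents.pdf": {"1.0", "1.2.1"},
    "Service Form_media.pdf": {"1.0", "1.2.1"},
    "Trophy.pdf": {"1.0", "1.2.1"},
    "VolvoS40V50-Full.pdf": {"1.0", "1.2.1"},
    "gXsummer2004-stream.pdf": {"1.0", "1.1"},
    "phlmapbeta7.pdf": {"1.0", "1.2.1"},
    "us_population.pdf": {"1.0", "1.2.1"},
    "movie.pdf": {"5.2.1"},
    "movie_down1.pdf": {"5.2.1"},
    "remotemovieurl.pdf": {"5.2.1", "3.1.1", "3.1.2", "3.1.3"},
}

# Files for which nothing beyond the generic pair was reported.
generic_only = sorted(name for name, codes in apache_codes.items()
                      if codes == GENERIC)
print(len(generic_only), "of", len(apache_codes))
```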

Summary and conclusions

The re-analysis of the PDF Cabinet of Horrors corpus, and the subsequent analysis of a sub-set of the Adobe Acrobat Engineering PDFs, show a number of things. First, Apache Preflight 2.0.0 does not properly identify encryption and password-protection. This looks like a bug that is probably easily fixed. Second, the analysis of the Font Testing PDFs from the Acrobat Engineering site revealed that non-embedded fonts may result in a variety of error codes in Apache Preflight (assuming here that the Acrobat Preflight results are accurate). So, when using Apache Preflight to check font embedding, it’s probably a good idea to treat all font-related errors (perhaps with the exception of character width errors) as a potential risk. Third, the more complex PDFs in the Multimedia category proved to be quite challenging for Apache Preflight: for most files here, it was not able to identify specific features such as multimedia content, embedded files, JavaScript and non-embedded fonts. This is not necessarily a problem if Apache Preflight is used for its intended purpose: verifying whether a PDF conforms to PDF/A-1. However, it does rather limit its use as a tool for profiling heterogeneous PDF collections for specific preservation risks at this stage. This may well change with future versions; in fact the specificity of Preflight’s validation output has already improved considerably since version 1.8.0. In the meantime it’s important to keep expectations about the tool’s capabilities realistic, in order to avoid potential unintended misuse.

KB joins the leading Big Data conference in Europe!

On March 20-21, Hadoop Summit 2013, the leading big data conference, made its first ever appearance on European soil. The Beurs van Berlage in Amsterdam provided a splendid venue for the gathering of about 500 international participants interested in the newest trends around Big Data and Hadoop. The main hosts, Hortonworks and Yahoo, did an excellent job of putting together an exciting programme, with two days full of enticing sessions divided into four distinct tracks: Applied Hadoop, Operating Hadoop, Hadoop Futures and Integrating Hadoop.

Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

The open-source Hadoop software framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from single servers to thousands of machines.
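
The best-known of these programming models is MapReduce. As a rough illustration of the idea only, here is a word count written in plain Python that runs in a single process. It does not use Hadoop or any Hadoop API, and all function names are hypothetical. In a real cluster, the framework would run the map and reduce steps in parallel on many machines and take care of the grouping in between.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big Hadoop"]))
```

The point of the model is that the programmer only writes the map and reduce functions; distributing the work is left to the framework.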

In his keynote, Hortonworks VP Shaun Connolly pointed out that more than half the world’s data will already be processed using Hadoop in 2015! Further on, there were keynotes by 451 Research Director Matt Aslett (What is the point of Hadoop?), Hortonworks founder and CEO Eric Baldeschwieler (Hadoop Now, Next and Beyond) and a live panel that discussed Real-World insight into Hadoop in the Enterprise.

Vendor area at Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

Many interesting talks followed on the use of Hadoop, and the benefits derived from it, at companies like Facebook, Twitter, Ebay, LinkedIn and the like, as well as on exciting upcoming technologies that further enrich the Hadoop ecosystem, such as the Apache projects Drill and Ambari or the next-generation MapReduce implementation YARN.

The Koninklijke Bibliotheek and the Austrian National Library jointly presented their recent experiences with Hadoop in the SCAPE project. Clemens Neudecker and Sven Schlarb spoke about the potential of integrating Hadoop into digital libraries in their talk “The Elephant in the Library” (video: coming soon).


In the SCAPE project, partners are experimenting with integrating Hadoop into library workflows for different large-scale data processing scenarios related to web archiving, file format migration or analytics. You can find out more about the Hadoop-related activities in SCAPE here:
http://www.scape-project.eu/news/scape-hadoop.

After two very successful days the Hadoop Summit concluded and participants agreed there needs to be another one next year – likely again to be held in the amazing city of Amsterdam!

Find out more about Hadoop Summit 2013 in Amsterdam:

Web:             http://hadoopsummit.org/amsterdam/
Facebook:    https://www.facebook.com/HadoopSummit
Pictures:      http://www.flickr.com/photos/timoelliott/
Tweets:       https://twitter.com/search/?q=hadoopsummit
Slides:          http://www.slideshare.net/Hadoop_Summit/
Videos:        http://www.youtube.com/user/HadoopSummit/videos
Blogs:           http://hortonworks.com/blog/hadoop-summit-2013-amsterdam-its-a-wrap/
                     http://www.sentric.ch/blog/hello-europe-hadoop-has-landed
                     http://janbruecher.blogspot.nl/2013/03/2013-hadoop-summit-day-1.html
                     http://janbruecher.blogspot.nl/2013/03/2013-hadoop-summit-day-2.html

Digital preservation – The cost of doing nothing

Author: Barbara Sierman
Originally posted on: http://digitalpreservation.nl/seeds/the-cost-of-doing-nothing

Lately there has been much debate about the fact that, over the years, the digital preservation community has managed to create more than a dozen cost models, making the confusion for everyone starting out in digital preservation even bigger. Maybe this is simply the way things go: everyone sees their own situation as something special with special needs. The solution? Tailoring an existing model or developing a new one. We can expect help from the recently started European project 4C, “The Collaboration to Clarify the Costs of Curation”. In their introduction they state that “4C reminds us that the point of this investment [in digital preservation] is to realise a benefit”. Less emphasis on the complexity of digital preservation, and more on the benefits.

Some people think that talking about digital preservation in terms of complexity and costs sounds more negative than thinking in terms of opportunities (or challenges) and benefits. But in both cases, you will need the same hard figures about the costs you incur as an organisation and the benefits that arise from them. The latter are not easy to quantify, but the work of Neil Beagrie and his team shows that it is possible to measure the benefits.

If we had better figures on the benefits of preserving digital material, we would be in a better position to estimate what it will cost us if digital material is not preserved: the cost of letting digital objects die, be it intentionally or not. How much damage is done to society if crucial information is not preserved? Recently the concern was raised that some interesting websites, containing the research results of a project that lasted for several years, might not be harvested and preserved in a digital archive. The consequence would be a tremendous loss for the community in the related research discipline. This is clearly an incentive for preservation!

I remember that when the Planets project was proposed, it was argued that the obsolescence of digital information in Europe, if no action were taken to preserve it, could cost the community an astonishing 3 billion euro a year. I could not find a source for this assumption, only a reference to some articles. One of them described the amount of data that was created worldwide. The other described the costs for an organization that lacks proper tools to manage data (getting access, searching, not finding, etc.). It could be that the Planets assumption was derived from this information and used as an illustration to make the case for digital preservation (the number of stories in the Atlas of Digital Damages does not prove this assumption).

But in essence, it is this kind of figure (and the related evidence) that we also need to have at hand: not only demonstrating the costs of digital preservation, but also demonstrating what it would cost society if we did not preserve things.

Trusted access to scholarly publications

In December 2012 the 3rd Cultural Heritage online conference was held in Florence. The theme of the conference was “Trusted Digital Repositories and Trusted Professionals”. At the conference a presentation was given on the KB international e-Depot, titled “The international e-Depot to guarantee permanent access to scholarly publications”.

The international e-Depot of the KB is the long-term archive for international academic literature for Dutch scholars, operating since 2003. This archival role is of importance because it enables us to guarantee permanent access to scholarly literature. National libraries have a depository role for national publications. The KB goes a step further and also preserves publications from international, academic publishers that do not have a clear country of origin. The next step for the KB is to position the international e-Depot as a European service, which guarantees permanent access to international, academic publications for the entire community of European researchers.

The trend towards e-only access for scholarly journals is continuing rapidly, and a growing number of journals are ‘born digital’ and have no printed counterpart. For researchers there is a huge benefit because they have online access to journal articles, anywhere, any time. The downside is an increasing dependency on digital access. Without permanent access to information, scholarly activities are no longer possible. But there is a danger that e-journals become “ephemeral” unless we take active steps to preserve the bits and bytes that increasingly represent our collective knowledge.

We are all familiar with examples of hardware and software becoming obsolete. On top of this threat of technical obsolescence there is the changing role of libraries. In the past libraries assumed preservation responsibility for the material they collected, while publishers supplied the material libraries needed. These well-understood divisions of labour do not work in a digital environment, especially when dealing with e-journals.

Research and development in digital preservation have matured. Tools and services are being developed to help perform digital preservation activities. In addition, third-party organizations and archiving solutions have been established to help the academic community preserve publications and advance research in sustainable ways. As permanent access to digital information is expensive, co-operation is essential, with each organization having its own role and responsibility.

The KB has invested in order to take its place within the research infrastructure at European level and the international e-Depot serves as a trustworthy digital archive for scholarly information for the European research community.

Sustainability is more than saving the bits

Author: Barbara Sierman
Originally posted on: http://digitalpreservation.nl/seeds/uncategorized/sustainability-is-more-then-saving-the-bits/

The subject of the JISC/SCA report Sustaining our digital Future: Institutional strategies for digital content, by Nancy L. Maron, Jason Yun and Sarah Pickle (2013), is the sustainability of digitised collections in general, illustrated with the experiences of three different organisations: University College London, the Imperial War Museum and the National Library of Wales. I was especially interested in the fact that the report mentions digital preservation, but not as a goal in itself (“saving the bits”). Instead, the authors broaden the scope of digital preservation to include activities that go beyond bit preservation or even beyond “functional preservation”.

Nowadays many digitisation projects are undertaken and interesting material comes to life for a large audience, often with a fancy website, a press release, a blog (and a big investment), and immediately attracts an interested public. But the problematic phase starts when the project is finished. In organizations like universities, which run a variety of digitisation projects, a lack of central coordination could cause project results to “disappear”, simply because hardly anyone knew about them. We all know these stories, and this report describes the ways these three organizations try to avoid that risk.

Internal coordination seems to be a key factor in this process. One organisation integrated more than a hundred databases into a central catalogue; another drew together several teaching collections. Both efforts made the collections more visible. But this is not enough to achieve permanent (long-term) access. The data will be stored safely, but who is taking care of all the related products that support the visibility of the data? In other (digital preservation jargon) words, who is monitoring the Designated Community and its changing environment?

The report describes interesting activities. Take this one, for example: the intended audience needs to be reminded constantly of the existence of the digitised material through promotional activities, otherwise the collections will not be used at all. Who is planning this activity as part of digital preservation? That a changing environment requires updates sounds familiar. But there are more than just technical reasons to do this. Websites need to be redesigned to stay attractive and to adapt to changing user experiences. And who is monitoring whether there might be a new group of interested visitors?

Or, as Lyn Lewis Dafgis of the National Library of Wales said, there is an assumption that

once digitised, the content is sustainable just by virtue of living in the digital asset management system and by living in the central catalogue.

And this needs to change.

Digital preservation is often seen as something that deals with access to digital collections somewhere in the future. Permanent access, which is the goal of digital preservation, is often considered solved by “bit preservation” and, if you do a really good job, “functional preservation”. This report illustrates with some good examples what more needs to be done, and adds colour to the not always well understood OAIS phrase “monitoring the Designated Community”.

Peeking over the wall

Author: Barbara Sierman
Originally posted on: http://digitalpreservation.nl/seeds/tag/digital-forensics/

Recently a very interesting report was published in the series of DPC Technology Watch Reports: Digital Forensics and Preservation, by Jeremy Leighton John of the British Library. I knew the phrase “digital forensics” and its potential for digital preservation, especially in the archival community. This report shows clearly, with practical examples, what digital preservationists can learn from digital forensics. One could think that, as this is often related to personal archives, it might not be of interest to organizations that don’t have personal archives in their collections. But this report shows that there is enough common ground to raise their interest.

The digital forensics process has some starting points that are very similar to what the digital preservation community refers to as “authenticity” and “the original”:

  1. Acquire the evidence without altering or damaging the original,
  2. Establish and demonstrate that the examined evidence is the same as that which was originally obtained and
  3. Analyse the evidence in an accountable and repeatable fashion (p. 14).
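
The second starting point is, in practice, usually supported by checksums (often called "fixity" in preservation circles). The following is only a minimal sketch of that idea, not something taken from the report: the function names `sha256_of` and `is_unchanged` are hypothetical, and the choice of SHA-256 is an assumption made for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large disk images do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Compare the file's current digest with the digest recorded at acquisition."""
    return sha256_of(path) == recorded_digest.lower()
```

The digest would be recorded once, at acquisition, and stored separately from the file. Any later call to `is_unchanged` with that recorded value then shows whether the examined copy is still bit-for-bit identical to what was originally obtained.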

The fact that forensic practice has close links with legal authorities forces practitioners to meet criteria that are not always present in the environments of cultural heritage organisations. This might make the tools and approaches used in digital forensics stricter and more reliable.

As noted in the report (p. 17), a distinction between digital forensics and digital preservation is that the latter aims to keep the material accessible over time and to many different users, while digital forensics often focuses on one specific goal: the court case. Of course this influences the methodology used in digital forensics, referred to in this report as the “lifecycle”, but the similarities in approach between digital preservation and digital forensics are striking (p. 21), especially in the steps that prepare the material before “archival storage”: the ingest and pre-ingest steps. As digital forensics is often confronted with handhelds, smartphones, tablets and the like (a relatively new category for libraries), the methods and insights it has developed could be of tremendous help to libraries, especially those with personal collections.

The benefit of this report is its practical approach, with references to and descriptions of (open source) tools that are used in the digital forensics community. A wide range of examples that underpin the case for digital forensics is described, and I frequently experienced an “aha Erlebnis”: recognition of similar challenges and areas of interest, such as cloud computing, large scale, emulation, privacy, and the need for test environments with reliable corpora (always difficult for libraries because of copyright). The report summarizes a long list of conclusions (one of them the “inertia” of libraries and archives in preserving personal archives, but maybe we can extend that to the hesitation to preserve offline digital material in general) and finishes with a set of Recommended Actions, from which I distil one generic theme: collaboration.

Collaboration will be most beneficial when the parties involved are aware of their needs and of what they want to achieve. I think that, although we have talked a lot about digital preservation and much is not yet clear, we have a set of starting points that will support us in collaboration activities. The OAIS model still offers a very clear and understandable set of coherent concepts. For those who need a more practical explanation, the series of audit materials such as DSA, DIN and RAC can offer support. In a way digital preservation has grown up and is able to look around in other, less obviously adjacent disciplines. The people interested in emulation learned a lot from the open source community that rescued games (EU project KEEP; website no longer available). Data visualisation, as was mentioned in a blog post at The Signal, could help us identify patterns in collections and perhaps identify risks, if applied in a clever way. Human Computer Interaction (HCI) science was mentioned by Luciana Duranti as being involved in her research into Records in the Cloud.

Sometimes people wonder where all the investments in digital preservation (research) have brought us so far. There seems to be no end to the challenges posed by rapid technology changes. But I like the view that seems to be emerging: that there are rich opportunities for collaboration between (established) disciplines. Peeking over the wall around your own garden into the neighbour’s courtyard can offer some interesting views. Picking some of the seeds could make your border a stunning one!

OAIS in het Nederlands! (OAIS in Dutch!)

Finally, an article on the OAIS model has been written in Dutch. Barbara Sierman of the Research department of the KB, National Library of the Netherlands, wrote “Het OAIS-model, een leidraad voor duurzame toegankelijkheid” in Handboek Informatiewetenschap, issue 62, December 2012. The article describes the most important concepts of the latest version of the standard for digital preservation (2012) in clear terms.

Within the KB, the OAIS model guides the design of the new digital repository, and is important to everyone involved in the long-term preservation of digital material, from acquisition to metadata and from IT to online access. The article will also appear on www.iwabase.nl

The Elephant in the Library: KB at Hadoop Summit Europe

Clemens Neudecker (technical coordinator in the Research department at the National Library of the Netherlands) and Sven Schlarb (Austrian National Library) will present the paper ‘The Elephant in the Library’ at the upcoming Hadoop Summit Europe, the leading conference for the Apache Hadoop community.

The paper, which is based on the work being done in the SCAPE project, discusses the role Apache Hadoop is playing in the mass digitization of cultural heritage in the MLA sector. Clemens and Sven were recently interviewed about their participation at this large-scale event – the interview is available from the Hadoop website: Meet the Presenters.

Paper abstract:
Libraries collect books, magazines and newspapers. Yes, that’s what they always did. But today, the amount of digital information resources is growing at dizzying speed. Facing the demand of digital information resources available 24/7, there has been a significant shift regarding a library’s core responsibilities. Today’s libraries are curating large digital collections, indexing millions of full-text documents, preserving Terabytes of data for future generations, and at the same time exploring innovative ways of providing access to their collections. 

This is exactly where Hadoop comes into play. Libraries have to process a rapidly increasing amount of data as part of their day-to-day business and computing tasks like file format migration, text recognition, linguistic processing, etc., require significant computing resources. Many data processing scenarios emerge where Hadoop might become an essential part of the digital library’s ecosystem. Hadoop is sometimes referred to as a hammer where you have to throw away everything that is not a nail. To remain in that metaphor: we will present some actual use cases for Hadoop in libraries, how we determine what are the nails in a library and what not, and some initial results.