1st Succeed hackathon @ KB

Throughout recent weeks, rumors spread at the KB National Library of the Netherlands that a party of programmers would be coming to the library to participate in a so-called “hackathon”. In the beginning, the IT department in particular was rather curious: would we have to expect port scans from within the National Library’s network? Would we need to apply special security measures? Fortunately, none of that was necessary.

A “hackathon” is normally nothing to be afraid of. On the contrary: these informal gatherings of software developers, who work collaboratively on creating and improving new or existing software tools and data, have emerged as a prominent pattern in recent years. In particular, the hack4Europe series of hack days organized by Europeana has shown that this model can also be successfully applied in the context of cultural heritage digitization.

Once that was sorted, the KB’s facilities department deployed a network switch with static IP addresses, ensuring that participants had a fast and robust internet connection at all times, with simultaneous access to the public internet and the restricted research infrastructure of the KB – which received immediate praise from the hackers. Well done, KB!

So when the software developers from Austria, England, France, Poland, Spain and the Netherlands gathered at the KB last Thursday, everyone already knew they were indeed here to work collaboratively on one of the European projects the KB is involved in: the Succeed project. The project had invited software developers from all over Europe to participate in the 1st Succeed hackathon, to work on the interoperability of tools and workflows for text digitization.

There was a good mix of people from the digitization as well as the digital preservation communities, with some additional Taverna expertise tossed in. While about half of the attendees had been involved in Planets, IMPACT or SCAPE, the other half were new to the field and eager to learn about the outcomes of these projects and how Succeed will address them.

And so, after some introduction followed by coffee and fruit, the 15 participants dived straight into the various topics that had been suggested prior to the event as needing attention. And indeed, the results presented by the various groups after 1.5 days (but only 8 hours of effective working time) were pretty impressive…

Hackers at work @ KB Succeed hackathon

The developers from INL were able to integrate some of the servlets they created in IMPACT and Namescape with the interoperability-framework – although some bugs were also uncovered while doing so. They will be fixed as soon as possible, rest assured! Also, with the help of the PSNC digital libraries team, Bob and Jesse were able to create a small training set for Tesseract that outperformed the standard dictionary, despite some problems that were found in training Tesseract version 3.02. Fortunately it was possible to apply the training to version 3.0 and then run the generated classifier in Tesseract version 3.02, which is the current stable(?) release.

Even better: the colleagues from Poznań (who have a track record of successful participation in hackathons) had already done some Tesseract training earlier and developed some supporting tools for it. Piotr quickly created a tool description for the “cutouts” tool, which automatically creates binarized clippings of characters from a source image. On the second day another feature of the cutouts application was added: creating an artificial image, suitable for training Tesseract, from the binarized character clippings. Time eventually ran out while the two operations were being wrapped in a Taverna workflow, but since only a little work remained, we look forward to seeing the Taverna workflow for Tesseract training become available shortly! Certainly this is also of interest to the eMOP project in the US, in which the KB is a partner as well.
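For readers curious what "training Tesseract" involves, the classic Tesseract 3.0x box-file training sequence can be sketched as below. This is a simplified outline of the standard procedure from the Tesseract 3 training documentation (the font_properties step and the manual box-file correction are glossed over), and file names such as eng.kbtest.exp0 are placeholders, not the actual training set built at the hackathon.

```python
import shutil
import subprocess

def training_commands(lang="eng", name="kbtest", page=0):
    """The standard Tesseract 3.0x training steps as subprocess commands.
    File names are illustrative placeholders."""
    base = "%s.%s.exp%d" % (lang, name, page)
    return [
        # 1. generate a box file from the training image (correct it by hand)
        ["tesseract", base + ".tif", base, "batch.nochop", "makebox"],
        # 2. run Tesseract in training mode on the corrected boxes
        ["tesseract", base + ".tif", base, "box.train"],
        # 3. extract the character set from the box file
        ["unicharset_extractor", base + ".box"],
        # 4. cluster shape and character-normalization features
        ["mftraining", "-U", "unicharset", "-O", lang + ".unicharset", base + ".tr"],
        ["cntraining", base + ".tr"],
        # 5. bundle the renamed output files into <lang>.traineddata
        ["combine_tessdata", lang + "."],
    ]

def run_training():
    # Only attempt a real run when the Tesseract binaries are installed.
    if shutil.which("tesseract") is None:
        print("tesseract not installed - skipping actual run")
        return
    for cmd in training_commands():
        subprocess.run(cmd, check=True)
```

As the post notes, a model trained with one 3.0x release could then be loaded by another release, which is how the team worked around the training problems in 3.02.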

Meanwhile, another colleague from Poznań was investigating the process of creating packages for Debian-based Linux operating systems from existing (open source) tools. And despite using a laptop running OS X Mountain Lion, Tomasz managed to present a valid Debian package (even including an icon and a man page) – kudos! Certainly the help of Carl from the Open Planets Foundation was also partly to blame for that… next steps will include creating a change log straight from GitHub. To be continued!
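The basic layout that dpkg-deb expects for a binary package can be staged in a few lines. The sketch below is a minimal illustration of that layout only (the package name, maintainer and contents are hypothetical), not the package Tomasz built:

```python
import os
import shutil
import subprocess
import tempfile

def build_deb(pkg="demo-tool", version="0.1"):
    """Stage a minimal binary-package tree and, if dpkg-deb is
    available, build a .deb from it. All names are placeholders."""
    staging = tempfile.mkdtemp()
    root = os.path.join(staging, "%s_%s" % (pkg, version))
    os.makedirs(os.path.join(root, "DEBIAN"))
    os.makedirs(os.path.join(root, "usr", "bin"))
    # The mandatory control file: dpkg-deb refuses to build without it.
    control = (
        "Package: %s\n"
        "Version: %s\n"
        "Section: utils\n"
        "Priority: optional\n"
        "Architecture: all\n"
        "Maintainer: Example Dev <dev@example.org>\n"
        "Description: demo package\n"
        " A minimal example of the dpkg-deb binary package layout.\n"
    ) % (pkg, version)
    with open(os.path.join(root, "DEBIAN", "control"), "w") as f:
        f.write(control)
    # Icons and man pages would go under usr/share/... in the same tree.
    if shutil.which("dpkg-deb"):
        subprocess.run(["dpkg-deb", "--build", root], check=True)
    return root
```

Because the staging tree is just files and directories, it can be prepared on any platform – only the final `dpkg-deb --build` step needs the Debian tooling, which may explain how this worked from an OS X laptop.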

Two colleagues from PSNC-dl working on a Tesseract training workflow

Another group attending the event was the team from the LITIS lab at the University of Rouen. Thierry demonstrated the newest PLaIR tools, such as the newspaper segmenter capable of automatically separating articles in scanned newspaper images. The PLaIR tools use GEDI as their encoding format, so David immediately set to work adding support for the PAGE format, the predominant document encoding format used in the IMPACT tools, thereby in principle establishing interoperability between IMPACT and PLaIR applications. In addition, since the PLaIR tools are mostly available as web services already, Philippine started creating Taverna workflows for these methods. We look forward to complementing the existing IMPACT workflows with these additional modules from PLaIR!

Screenshot of the PLaIR system for post-correction of newspaper OCR

All this was done without requiring any help from the PRImA group at the University of Salford, Greater Manchester, who maintain the PAGE format and a number of tools to support it. So with some free time on his hands, Christian from PRImA instead had a deeper look at Taverna and at the PAGE serialization of the recently released open source OCR evaluation tool from the University of Alicante (the technical lead of the Centre of Competence), and found it to be working quite well. Good to finally have an open source community tool for OCR evaluation with support for PAGE – and more features shall be added soon: we’re thinking word accuracy rate, bag-of-words evaluation and more – send us your feature requests (or even better: pull requests).

We were particularly glad that some developers beyond the usual MLA community suspects also found their way to the KB on those two days: a team from the Leiden University Medical Centre was attending as well, keen to learn how they could use the T2-Client for their purposes. Initially slowed down by some issues encountered in deploying Taverna 2 Server on a Windows machine (don’t do it!), Reinout and Eelke were eventually able to resolve them simply by using Linux instead. We hope further collaboration among Dutch Taverna users will arise from this!

Besides all the exciting new tools and features, it was good to see some others getting their hands dirty with (essential) engineering tasks – work progressed well on several issues from the interoperability-framework’s issue tracker: support for output directories is close to being fully implemented thanks to Willem Jan, and a good start was made on future MTOM support. Quique from the Centre of Competence was also able to improve the integration between the IMPACT services and the Demonstrator Platform website.

Without the help of experienced developers Carl from the Open Planets Foundation and Sven from the Austrian National Library (who had just conducted a training event for the SCAPE project in London earlier the same week, and quickly decided to cross the Channel for yet one more workshop), all this would not have been so easily possible. While Carl was helping out everywhere at once, Sven found time to fit in a Taverna training session after lunch on Friday, which was hugely appreciated by the audience.

Sven Schlarb from the Austrian National Library delivering Taverna training

After seeing all the powerful capabilities of Taverna in combination with the interoperability-framework web services and scripts in a live demo, no one needed further reassurance that it was well worth spending the time to integrate this technology and work with the interoperability-framework and its various components.

Everyone said they really enjoyed the event and found plenty of valuable things that they had learned and wanted to continue working with. So watch out for the next Succeed hackathon in sunny Alicante next year!

Trusted access to scholarly publications

In December 2012 the 3rd Cultural Heritage online conference was held in Florence. The theme of the conference was “Trusted Digital Repositories and Trusted Professionals”. At the conference a presentation was given on the KB international e-Depot, entitled “The international e-Depot to guarantee permanent access to scholarly publications”.

conference room

The international e-Depot of the KB has been the long-term archive of international academic literature for Dutch scholars since 2003. This archival role is important because it enables us to guarantee permanent access to scholarly literature. National libraries have a depository role for national publications. The KB goes a step further and also preserves publications from international, academic publishers that do not have a clear country of origin. The next step for the KB is to position the international e-Depot as a European service, which guarantees permanent access to international, academic publications for the entire community of European researchers.

The trend towards e-only access for scholarly journals is continuing rapidly, and a growing number of journals are ‘born digital’ and have no printed counterpart. For researchers there is a huge benefit because they have online access to journal articles, anywhere, any time. The downside is an increasing dependency on digital access. Without permanent access to information, scholarly activities are no longer possible. But there is a danger that e-journals become “ephemeral” unless we take active steps to preserve the bits and bytes that increasingly represent our collective knowledge.

We are all familiar with examples of hardware and software becoming obsolete. On top of this threat of technical obsolescence there is the changing role of libraries. In the past, libraries assumed preservation responsibility for the material they collected, while publishers supplied the material libraries needed. These well-understood divisions of labour do not work in a digital environment, especially when dealing with e-journals.

Research and development in digital preservation has matured. Tools and services are being developed to help perform digital preservation activities. In addition, third-party organizations and archiving solutions have been established to help the academic community preserve publications and to advance research in sustainable ways. As permanent access to digital information is expensive, co-operation is essential, with each organization having its own role and responsibility.

The KB has invested in taking its place within the research infrastructure at the European level, and the international e-Depot serves as a trustworthy digital archive of scholarly information for the European research community.

Sustainability is more than saving the bits

Author: Barbara Sierman
Originally posted on: http://digitalpreservation.nl/seeds/uncategorized/sustainability-is-more-then-saving-the-bits/


The subject of the JISC/SCA report Sustaining our Digital Future: Institutional Strategies for Digital Content, by Nancy L. Maron, Jason Yun and Sarah Pickle (2013), is the sustainability of digitised collections in general, illustrated with the experiences of three different organisations: University College London, the Imperial War Museum and the National Library of Wales. I was especially interested in the fact that the report mentions digital preservation, but not as a goal in itself (“saving the bits”). Instead, the authors broaden the scope of digital preservation to activities beyond bit preservation, or even beyond “functional preservation”.

Nowadays many digitisation projects are undertaken, and interesting material comes to life for a large audience, often with a fancy website, a press release, a blog (and a big investment), immediately attracting an interested public. But the problematic phase starts when the project is finished. In organisations like universities, with a variety of digitisation projects, a lack of central coordination could cause project results to “disappear”, simply because hardly anyone knows about them. We all know these stories, and this report describes the ways these three organisations try to avoid that risk.

Internal coordination seems to be a key factor in this process. One organisation integrated more than a hundred databases into a central catalogue; another drew together several teaching collections. Both efforts resulted in better visibility of the collections. But this is not enough to achieve permanent (long-term) access. The data will be stored safely, but who is taking care of all the related products that support the visibility of the data? In other words (in digital preservation jargon): who is monitoring the Designated Community and their changing environment?

The report describes interesting activities. Take for example this one: the intended public needs to be constantly reminded of the existence of the digitised material through promotional actions, otherwise the collections will not be used at all. Who plans this activity as part of digital preservation? That a changing environment requires updates sounds familiar, but there are reasons to act beyond the technical ones: websites need to be redesigned to stay attractive and to adapt to changing user expectations. And who is monitoring whether there might be a new group of interested visitors?

Or, as Lyn Lewis Dafgis of the National Library of Wales said, there is an assumption that

once digitised, the content is sustainable just by virtue of living in the digital asset management system and by living in the central catalogue.

And this needs to change.

Digital preservation is not seldom seen as something that deals with access to digital collections somewhere in the future. Permanent access, which is the goal of digital preservation, is often considered solved by “bit preservation” and, if you do a really good job, “functional preservation”. This report illustrates with some good examples what more needs to be done, and gives colour to the not always well-understood OAIS phrase “monitoring the Designated Community”.