KB Research

Research at the National Library of the Netherlands

Tag: hadoop

The Elephant Returns to the Library…with a Pig!

Hadoop Driven Digital Preservation Hackathon in Vienna

Organised by: SCAPE project  and  Open Planets Foundation

By Clemens Neudecker & René van der Ark

These days, libraries are no longer exclusively collecting physical books and publications, but are also investing in digitisation on a massive scale. At the same time they are harvesting born-digital publications and websites alike. The problem of how to preserve all that digital information in the (very) long run has received a lot of attention. So called “preservation risks” pose a severe threat to the long-term availability of these digital assets. To name but a few: bitrot, format obsolescence and lack of open tools and frameworks.

To tackle these problems now and in the future, beefing up server performance and storage space is no longer a viable option. The growth of data is simply too fast to keep up. Therefore, in recent years, computer scientists tend to opt for scaling out, as opposed to scaling up. This is where buzzwords like the cloud and big data come in. To demystify: instead of scaling up with expensive hardware, scale out by setting up a cluster of cheap machines and doing distributed parallel calculations on them. Scaling out is now often the solution for fast processing of big data, but the principle might just as well be applied to safe (redundant) storage.

The European research project SCAPE (SCAlable Preservation Environments) has been set up in order to help the GLAM community lift their preservation technology to the big data needs of the 21st century digital library. One of the key ideas in SCAPE is to leverage big data technologies like Hadoop, and to apply them in order to scale out the preservation tools and technologies currently in use.

The hackathon on “Hadoop Driven Digital Preservation” at the Austrian National Library (ONB) in Vienna therefore provided a great opportunity for the KB to further its understanding of Hadoop and all its applications. Especially because Hadoop guru Jimmy Lin from the University of Maryland joined us not only as keynote, but also as teacher and co-hacker. Jimmy Lin has past experience working for Twitter and Cloudera and shared many insights in using Hadoop on a productive scale, several dimensions greater than what libraries are currently struggling with. One of his recent projects was to implement a web archive browser on Hadoop’s HBase called WarcBase. A great initiative which might just turn into the next generation Wayback Machine.

Besides, Vienna in December is always worth a visit!

At the event

The event started out with an introduction to the two use cases that were provided upfront by the organisers:

1) Web-Archiving: File Format Identification/Characterisation

2) Digital Books: Quality Assurance, text mining (OCR Quality)

However, participants were free to dive into either of these issues, continue developing their own projects or just investigate completely fresh ideas. So it was not a big surprise when soon after Jimmy’s first presentation introducing Pig as an alternative to writing MapReduce jobs “for lazy people”, many of the participants decided to work on creating small Pig scripts for various preservation related tools.

The nice thing about Pig is that it somewhat resembles common query languages like SQL. This makes it quite readable for most IT savy people. Also it is extensible with custom functions, which can be implemented in Java. Writing some of these user defined functions (UDF) is what we decided to focus on.

This event distinguished itself by the great amount of collaboration. As code reuse was greatly encouraged we decided to fork Jimmy Lin’s WarcBase project on github and extend it with UDF’s for language detection and MIME-type detection using Apache TIKA. The UDF’s we wrote were then in turn again used by many of the other participants’ projects.

The rest of the time we used to get more familiar with writing Pig scripts to apply on actual ARC/WARC files. While unfortunately there was a lack of publicly available ARC/WARC files for testing our MIME-type and language detection UDF’s, we were lucky that colleague Per Møldrup-Dalum from the SB in Aarhus had a cluster and a large collection from the Danish web archive  available for us to test these on:

HadoopVersion   PigVersion            UserId    StartedAt               FinishedAt
2.0.0-cdh4.0.1     0.9.2-cdh4.0.1      scape      14:19:53                 14:22:22

JobId:    job_201308301115_0151

Maps      Reduces
172         19

MaxMapTime       MinMapTime      AvgMapTime
106                         18                           77

MaxReduceTime MinReduceTime  AvgReduceTime
19                           17                           19

Alias                       Feature
a,b,c,d,e,f,raw      GROUP_BY,COMBINER


Successfully read 547093 records (3985472612 bytes) from:


Successfully stored 90 records (1613 bytes) in:


Total records written : 90
Total bytes written : 1613
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Honestly, we were a bit surprised ourselves when we got these numbers from Per – did our simple Pig script of 7 lines really just process the almost 550,000 ARC/WARC records in only 2:29 minutes? Indeed it did!

To learn more about the various results and outcomes of the event, make sure to check the blog post by Sven Schlarb. It must be mentioned that the colleagues from the ONB and OPF did an outstanding job in terms of event preparation – next to two real-life use cases, they also provided a virtual machine with a pseudo-distributed Hadoop and some more helpful tools from the Hadoop ecosystem that could be used for experimentation. In fact, there was even a cluster with some data ready to execute MapReduce jobs against and really test out how well they would scale. Thanks again!


While the KB has been a member of PLANETS, a predecessor to SCAPE, as well as a member of the OPF and the SCAPE project, so far we have only had little time to experiment with Hadoop in our library.

Currently we are looking into using Hadoop for migrating around 150 TB of TIF images from the Metamorfoze Programme to the JPEG2000 format. Following the example of the British Library, we started experimenting with our own implementation of a TIF → JP2 workflow using Hadoop. Will Palmer from the British Library Digital Preservation team has already successfully built such a workflow and published it on github.

However, the TIF → JP2 migration is just about the hardest scenario to optimally implement on Hadoop. The encoding algorithm is very complex and would actually need to be rewritten entirely for parallel processing to make use of the power of Hadoop. Nevertheless, we believe that Hadoop has serious potential  – so the KB is also investigating some more use cases for Hadoop, currently in at least three different areas:

1. Webarchiving

The KB is one of the partners in the WebART project, where together with researchers from the UvA and the CWI new tools and methods to maximize the web archive’s utility for research are being created. Hadoop and HBase are also amongst the applications used here. Together with CWI and UvA the KB hopes to start up a new project for establishing an instance of WarcBase running on top of a scalable HBase cluster – this would really enable new, scalable ways of researching the Dutch web history. Colleague Thaer Sammar from CWI also participated in the hackathon, and the results of his efforts were quite convincing. Also, Jimmy is again one of the collaborators – we keep your fingers crossed!

2. Content enrichment

In the Europeana Newspaper project, the KB is currently creating a framework for named entity recognition (NER) on historical newspapers from all over Europe. Around 10 million pages of full-text will be created in the project, and the KB will provide named entities software for materials in Dutch, German and French language. It is expected that at least 2 million pages will be processed with NER, but the total collection of digitised newspapers at the KB is 8.5 million. Plus there are other collections (books, journals, radiobulletins) for which OCR exists. The KB aims to have all the entities in its digital collections detected, disambiguated and linked within the next 5 years. Given that all of this is text, and the sofware for NER is in Java, it would be interesting to see in how far Hadoop could be used to scale out the processing of all this data.

3. Business processes

From the organization, we are aware of a few particular scalability issues with some of the business processes, such as:

  • Generating business reports quickly: i.e. counting all the KB’s newspapers per publisher is now a painstakingly slow and somewhat unreliable process.
  • Acting as a data provider: harvesting the newspapers’ metadata sequentially takes a week now, and will only take longer when the collection is expanded. Harvesting records in parallel (16 requests / second) created serious stress on the current middleware and even came close to crashing it.
  • Parallelizing (pre-) ingest processes, like metadata validation, file characterization and checksum validation.

Finally, we have also looked at HDFS for scalable, but durable storage. However, for preservation purposes storing files on the Hadoop cluster might introduce some risk, because the files would be partitioned and scattered across the cluster. Then again, the concept of scaling out can still be useful, for example to:

  • prevent bitrot: by replication of files and using agents which repair corrupt replica’s on a regular basis;
  • applying file migrations: here processing data on one machine in sequential batches is again not viable in the long run.

As you see, while there is no Hadoop cluster running in the KB (yet?), it seems there are sufficient ideas and use cases for continuing to work with these technologies, and to build up expertise with Hadoop, HBase etc. We are also very interested in exchanging ideas and use cases with other libraries that are already using Hadoop productively. Last but not least, Hadoopsummit Europe will be held in Amsterdam again next year – so, we’ll see you there perhaps?

Useful links:



KB joins the leading Big Data conference in Europe!

hadoopsummitOn March 20-21, Hadoop Summit 2013, the leading big data conference, made its first ever appearance on European soil. The Beurs van Berlage in Amsterdam provided a splendid venue for the gathering of about 500 international participants interested in the newest trends around Big Data and Hadoop. The main hosts Hortonworks and Yahoo did an excellent job in putting together an exciting programme with two days full of enticing sessions divided by four distinct tracks: Applied Hadoop, Operating Hadoop, Hadoop Futures and Integrating Hadoop.

audienceHadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

The open-source Hadoop software framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from single servers to thousands of machines.

In his keynote, Hortonworks VP Shaun Connolly’s pointed out that already more than half the world’s data will be processed using Hadoop in 2015! Further on, there were keynotes by 451 Research Director Matt Aslett (What is the point of Hadoop?), Hortonworks founder and CEO Eric Baldeschwieler (Hadoop Now, Next and Beyond) and a live panel that discussed Real-World insight into Hadoop in the Enterprise.

vendorsVendor area at Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

Many interesting talks followed on the use and benefit derived from Hadoop at companies like Facebook, Twitter, Ebay, LinkedIn and alike, as well as on exciting upcoming technologies further enriching the Hadoop ecosystem such as Apache projects Drill, Ambari or the next-generation MapReduce implementation YARN.

The Koninklijke Bibliotheek and the Austrian National Library jointly presented their recent experiences with Hadoop in the SCAPE project. Clemens Neudecker and Sven Schlarb spoke about the potential of integrating Hadoop into digital libraries in their talk “The Elephant in the Library” (video: coming soon).

[slideshare id=18002904&doc=neudeckerschlarb20march1550administratiezaalv2-130401120136-phpapp01]

In the SCAPE project partners are experimenting with integrating Hadoop into library workflows for different large-scale data processing scenarios related to web archiving, file format migration or analytics – you can find out more about the Hadoop related activities in SCAPE here: 

After two very successful days the Hadoop Summit concluded and participants agreed there needs to be another one next year – likely again to be held in the amazing city of Amsterdam!

Find out more about Hadoop Summit 2013 in Amsterdam:

Web:             http://hadoopsummit.org/amsterdam/
Facebook:    https://www.facebook.com/HadoopSummit
Pictures:      http://www.flickr.com/photos/timoelliott/
Tweets:       https://twitter.com/search/?q=hadoopsummit
Slides:          http://www.slideshare.net/Hadoop_Summit/
Videos:        http://www.youtube.com/user/HadoopSummit/videos
Blogs:           http://hortonworks.com/blog/hadoop-summit-2013-amsterdam-its-a-wrap/

The Elephant in the Library: KB at Hadoop Summit Europe

Clemens Neudecker (technical coordinator in the Reseach department at the National Library of the Netherlands) and Sven Schlarb (Austrian National Library) will present the paper ‘The Elephant in the Library’ at the upcoming Hadoop Summit Europe, the leading conference for the Apache Hadoop community.

The paper, which is based on the work being done in the SCAPE project, discusses the role Apache Hadoop is playing in the mass digitization of cultural heritage in the MLA sector. Clemens and Sven were recently interviewed about their participation at this large-scale event – the interview is available from the Hadoop website: Meet the Presenters.

Paper abstract:
Libraries collect books, magazines and newspapers. Yes, that’s what they always did. But today, the amount of digital information resources is growing at dizzying speed. Facing the demand of digital information resources available 24/7, there has been a significant shift regarding a library’s core responsibilities. Today’s libraries are curating large digital collections, indexing millions of full-text documents, preserving Terabytes of data for future generations, and at the same time exploring innovative ways of providing access to their collections. 

This is exactly where Hadoop comes into play. Libraries have to process a rapidly increasing amount of data as part of their day-to-day business and computing tasks like file format migration, text recognition, linguistic processing, etc., require significant computing resources. Many data processing scenarios emerge where Hadoop might become an essential part of the digital library’s ecosystem. Hadoop is sometimes referred to as a hammer where you have to throw away everything that is not a nail. To remain in that metaphor: we will present some actual use cases for Hadoop in libraries, how we determine what are the nails in a library and what not, and some initial results.

© 2018 KB Research

Theme by Anders NorenUp ↑