Clemens Neudecker (technical coordinator in the Reseach department at the National Library of the Netherlands) and Sven Schlarb (Austrian National Library) will present the paper ‘The Elephant in the Library’ at the upcoming Hadoop Summit Europe, the leading conference for the Apache Hadoop community.

The paper, which is based on the work being done in the SCAPE project, discusses the role Apache Hadoop is playing in the mass digitization of cultural heritage in the MLA sector. Clemens and Sven were recently interviewed about their participation at this large-scale event – the interview is available from the Hadoop website: Meet the Presenters.

Paper abstract:
Libraries collect books, magazines and newspapers. Yes, that’s what they always did. But today, the amount of digital information resources is growing at dizzying speed. Facing the demand of digital information resources available 24/7, there has been a significant shift regarding a library’s core responsibilities. Today’s libraries are curating large digital collections, indexing millions of full-text documents, preserving Terabytes of data for future generations, and at the same time exploring innovative ways of providing access to their collections. 

This is exactly where Hadoop comes into play. Libraries have to process a rapidly increasing amount of data as part of their day-to-day business and computing tasks like file format migration, text recognition, linguistic processing, etc., require significant computing resources. Many data processing scenarios emerge where Hadoop might become an essential part of the digital library’s ecosystem. Hadoop is sometimes referred to as a hammer where you have to throw away everything that is not a nail. To remain in that metaphor: we will present some actual use cases for Hadoop in libraries, how we determine what are the nails in a library and what not, and some initial results.