The KB participates in the Europeana Newspapers project, which started in February 2012. The project will enrich 18 million pages of digitised newspapers from all over Europe with Optical Character Recognition (OCR), Optical Layout Recognition (OLR) and Named Entity Recognition (NER) and deliver them to Europeana. The project consortium consists of 18 partners from all over Europe: some will provide (technical) support, while others will provide their digitised newspapers. The KB has two roles: we will not only deliver 2 million of our newspaper pages to Europeana, but we will also enrich our own newspapers and those of other partners with NER.
In recent months, the project has welcomed 11 new associated partners. To help them benefit as much as possible from the experiences of the project partners, the University Library of Belgrade and LIBER jointly organised a workshop on refinement and aggregation on 13 and 14 June. Here, the KB (Clemens Neudecker and I) presented the work that is currently being done to produce Named Entities for several partners. To ensure that the work being done in the project also benefits our direct colleagues, we were joined by someone from our Digitisation department.
The workshop started with a warm welcome in Belgrade by the director of the library, Prof. Aleksandar Jerkov. After a short introduction to the project by the project leader, Hans-Jörg Lieder from the State Library Berlin, Clemens Neudecker from the KB presented the refinement process of the project. All presentations will be shared on the project’s Slideshare account. The refinement of the newspapers has already started and is being done by the University of Innsbruck and the company CCS in Hamburg. Even so, it was a big surprise when Hans-Jörg Lieder announced a present for the director of the University Library Belgrade: the first batch of their processed newspapers!
The day continued with an introduction to the importance of evaluating OCR and OLR and a demonstration of the tools used for this by Stefan Pletschacher and Christian Clausner from the University of Salford. This sparked some interesting discussions in the break-out sessions on the evaluation methods used by libraries that are digitising their collections. For example, do you tell your service provider what you will be checking when you receive a batch? You could argue that the service provider would then only fix what you check. On the other hand, if that is what you need to reach your goal, it would save a lot of time and avoid rejected batches.
After a short getting-to-know-each-other session, the whole workshop party moved to the nearby Nikola Tesla Museum, where we were introduced to their newspaper clippings project. All newspaper clippings collected by Nikola Tesla are now being digitised for publication on the museum’s website. A nice tour through the museum followed, with several demonstrations (don’t worry, no one was electrocuted), and the day concluded with a dinner in the bohemian quarter.
The second day of the workshop was dedicated solely to refinement. I kicked off the day with the question ‘What is a named entity?’. This sounds easy, but it can present some dilemmas as well. For example, a dog’s name is a name, but do you want it to be tagged as an NE? And what do you do with a title such as Romeo and Juliet? Consistency is key here: as long as you keep your goal in mind while training your software, you should end up with the results you are looking for.
Claus Gravenhorst followed me with his presentation on OLR at CCS using docWorks, with which they will process 2 million pages. It was then our turn again, with a hands-on session about the tools we’re using, which are also available on GitHub. The last session of the workshop was a collaboration between Claus Gravenhorst from CCS and Günter Mühlberger from the University of Innsbruck, who gave us a nice insight into their tools and the considerations made when working with digitised newspapers. For example, how many categories would you need to tag every article?
All in all, it was a very successful workshop and I hope that all participants enjoyed it as much as I did. I, at least, am happy to have spoken to so many interesting people with new experiences from other digitisation projects. There is still much to learn from each other, and projects like Europeana Newspapers contribute to a good exchange of knowledge between libraries, so that our users get the best experience when browsing through the rich digital collections.