Extrablatt! Final Report Europeana Newspapers published!

Reblogged from http://www.europeana-newspapers.eu/final-report/ 

All things must come to an end eventually – even the Europeana Newspapers project. The good news is that in every end, there is also a new beginning. But more on this later.

After 38 months of hard, but also very fun and rewarding work with our network, the project has officially come to a close in March 2015. But as usual with such endeavours, there are still many activities which go on beyond the lifetime of the project, such as reporting and reviewing, disseminating and spreading the results as well as fostering takeup and new initiatives around the use and exploitation of said outcomes.

Continue reading

Named entity recognition for digitised historical newspapers

Europeana NewspapersThe refinement partners in the Europeana Newspapers project will produce the astonishing amount of 10 million pages of full-text from historical newspapers from all over Europe. What could be done to further enrich that full-text?

The KB National Library of the Netherlands has been investigating named entity recognition (NER) and linked data technologies for a while now in projects such as IMPACT and STITCH+, and we felt it was about time to approach this on a production scale. So we decided to produce (open source) software, trained models as well as raw training data for NER software applications specifically for digitised historical newspapers as part of the project.

What is named entity recognition (NER)?

Named entity recognition is the process of identifying and classifying entities such as persons, locations and organisations in the full-text in order to enhance searchability. There are basically two types of approaches, a statistical and a rule based one. Rule based systems rely mostly on grammar rules defined by linguists, while statistical systems require large amounts of manually produced training data that they can learn from. While both approaches have their benefits and drawbacks, we decided to go for a statistical tool, the CRFNER system from Stanford University. In comparison, this software proved to be the most reliable, and it is supported by an active user community. Stanford University has an online demo where you can try it out: http://nlp.stanford.edu:8080/ner/.

ner

Example of Wikipedia article for Albert Einstein, tagged with the Stanford NER tool

Requirements & challenges

There are some particular requirements and challenges when applying these techniques to digital historical newspapers. Since full-text for more than 10 million pages will be produced in the project, one requirement for our NER tool was that it should be able to process large amounts of texts in a rather short time. This is possible with the Stanford tool,  which as of version 1.2.8 is “thread-safe”, i.e. it can run in parallel on a multi-core machine. Another requirement was to preserve the information about where on a page a named entity has been detected – based on coordinates. This is particularly important for newspapers: instead of having to go through all the articles on a newspaper page to find the named entity, it can be highlighted so that one can easily spot it even on very dense pages.

Then there are also challenges of course – mainly due to the quality of the OCR and the historical spelling that is found in many of these old newspapers. In the course of 2014 we will thus collaborate with the Dutch Institute for Lexicology (INL), who have produced modules which can be used in a pre-processing step before the Stanford system and that can to some extent mitigate problems caused by low quality of the full-text or the appearance of historical spelling variants.

The Europeana Newspapers NER workflow

For Europeana Newspapers, we decided to focus on three languages: Dutch, French and German. The content in these three languages makes up for about half of the newspaper pages that will become available through Europeana Newspapers. For the French materials, we cooperate with LIP6-ACASA, for Dutch again with INL. The workflow goes like this:

  1. We receive OCR results in ALTO format (or METS/MPEG21-DIDL containers)
  2. We process the OCR with our NER software to derive a pre-tagged corpus
  3. We upload the pre-tagged corpus into an online Attestation Tool (provided by INL)
  4. Within the Attestation Tool, the libraries make corrections and add tags until we arrive at a “gold corpus”, i.e. all named entities on the pages have been manually marked
  5. We train our NER software based on the gold corpus derived in step (4)
  6. We process the OCR again with our NER software trained on the gold corpus
  7. We repeat steps (2) – (6) until the results of the tagging won’t improve any further

    NER slide

    Screenshot of the NER Attestation Tool

Preliminary results

Named entity recognition is typically evaluated by means of Precision/Recall and F-measure. Precision gives an account of how many of the named entities that the software found are in fact named entities of the correct type, while Recall states how many of the total amount of named entities present have been detected by the software. The F-measure then combines both scores into a weighted average between 0 – 1. Here are our (preliminary) results for Dutch so far:

Dutch

Persons

Locations

Organizations

Precision

0.940

0.950

0.942

Recall

0.588

0.760

0.559

F-measure

0.689

0.838

0.671

These figures have been derived from a k-fold cross-evaluation of 25 out of 100 manually tagged pages of Dutch newspapers from the KB. The results confirm the fact that the Stanford system tends to be a bit “conservative”, i.e. it has a somewhat lower recall for the benefit of higher precision, which is also what we wanted.

Conclusion and outlook

Within this final year of the project we are looking forward to see in how far we can still boost these figures by adopting the extra modules from INL, and what results we can achieve on the French and German newspapers. We will also investigate software for linking the named entities to additional online resource descriptions and authority files such as DBPedia or VIAF to create Linked Data. The crucial question will be how well we can disambiguate the named entities and find a correct match in these resources. Besides, if there is time, we would also want to experiment with NER in other languages, such as Serbian or Latvian. And, if all goes well, you might already hear more about this at the upcoming IFLA newspapers conference “Digital transformation and the changing role of news media in the 21st Century“.

References

The KB, Big data and digital humanities at the kick off of the Dutch weekend of Science

The KB, Big data and digital humanities at the kick off of the Dutch weekend of  Science

The KB gave a presentation at  the Science dinner, the official kick off of the Dutch weekend of Science. Main theme of the walking dinner was digital treasures.

The Science dinner at the Van Nelle fabriek

The Science dinner at the Van Nelle fabriek

In between courses there were presentations which all related to this theme.  The first presentation was delivered by the KB.

The future of the KB is digital. Material is being digitized at a fast pace and important progress is made in the area of digital services. The aim is to increase the outreach and actively encourage the use of the rich KB collection.

To show what can be done with all this new data the KB invited three guests to give their vision on the use of big data in their field of work:

What is their relationship with Big Data and Digital Humanities? How do they see the future of  Digital Humanities and the use of Big data? What fascinates them when it comes to new possibilities?

To illustrate their relationship with Big data  introductory films have been made:

(English subtitles available by clicking  the Watch on Youtube button)

“For heritage research Big Data is a whole new and exciting field”

Julia Noordegraaf, Professor of Heritage and Digital Culture, University of Amsterdam

 

 “Science asks the question: what is knowledge? Art approaches this theme poetically by speculating and creating things.” Geert Mul, Media artist

“I have a love-hate relationship with the use of computers for language research” Professor Marc van Oostendorp of  Leiden University and the first digital humanities fellow of the KB

1st Succeed hackathon @ KB

Throughout recent weeks, rumors spread at KB National Library of the Netherlands that there would be a party of programmers coming to the library to participate in a so-called “hackathon”. In the beginning, especially the IT department was rather curious: will we have to expect port scans being done from within the National Library’s network? Do we need to apply special security measures? Fortunately, none of that was necessary.

A “hackathon” is nothing to be afraid of, normally. On the contrary: the informal gatherings of software developers to work collaboratively on creating and improving new or existing software tools and/or data have emerged as a prominent pattern in recent years – in particular the hack4Europe series of hack days that is organized by Europeana has shown that this model can also be successfully applied in the context of cultural heritage digitization.

After that was sorted, a network switch with static IP addresses was deployed by the facilities department of the KB, thereby ensuring that participants of the event had a fast and robust internet connection at all times and allowing access to the public parts of the internet and the restricted research infrastructure of the KB at the same time – which received immediate praise from the hackers. Well done, KB!

So when the software developers from Austria, England, France, Poland, Spain and the Netherlands gathered at the KB last Thursday, everyone already knew they were indeed here to collaboratively work on one of the European projects the KB is involved in: the Succeed project. The project had called in software developers from all over Europe to participate in the 1st Succeed hackathon to work on interoperability of tools and workflows for text digitization.

There was a good mix of people from the digitization as well as digital preservation communities, with some additional Taverna expertise tossed in. While about half of the participants had participated in either Planets, IMPACT or SCAPE, the other half of them were new to the field and eager to learn about the outcomes of these projects and how Succeed will address them.

And so after some introduction followed by coffee and fruit, the 15 participants immersed straight away into the various topics that were suggested prior to the event as needing attention. And indeed, the results that were presented by the various groups after 1.5 days (but only 8 hours of effective working time) were pretty impressive…

hack
Hackers at work @ KB Succeed hackathon

The developers from INL were able to integrate some of the servlets they created in IMPACT and Namescape with the interoperability-framework – although also some bugs were uncovered while doing so. They will be fixed asap, rest assured!  Also, with the help of the PSNC digital libraries team, Bob and Jesse were able to create a small training set for Tesseract, outperforming the standard dictionary despite some problems that were found in training Tesseract version 3.02. Fortunately it was possible to apply the training to version 3.0and then run the generated classifier in Tesseract version 3.02, which is the current stable(?) release.

Even better: the colleagues from Poznań (who have a track record of successful participation at hackathons) had already done some training with Tesseract earlier and developed some supporting tools for it. Quickly Piotr created a tool description for the “cutouts” tool that automatically creates binarized clippings of characters from a source image. On the second day another feature of the cutouts application was added: creating an artificial image suitable for training Tesseract from the binarized character clippings. When finally wrapping the two operations in a Taverna workflow time eventually ran out, but given only little work remained we look forward to see the Taverna workflow for Tesseract training becoming available shortly! Certainly this is also of interest to the eMOP project in the US, in which the KB is a partner as well.

Meanwhile, another colleague from Poznań was investigating the process of creating packages for Debian-based Linux operating systems from existing (open source) tools. And despite using a laptop with OSX Mountain Lion, Tomasz managed to present a valid Debian package (including even icon and man page) – kudos! Certainly the help of Carl from the Open Planets Foundation was also partly to blame for that…next steps will include creating a change log straight off github. To be continued!

psnc
Two colleagues from PSNC-dl working on a Tesseract training workflow

Another group attending the event were the team from LITIS lab at the University of Rouen. Thierry demonstrated the newest PLaIR tools such as the newspaper segmenter capable of automatically separating articles in scanned newspaper images.  The PLaIR tools use GEDI as the encoding format, so some work was immediately invested by David to also support the PAGE format, the predominant format for document encoding used in the IMPACT tools, thereby in principle establishing interoperability between IMPACT and PLaIR applications. In addition, since the PLaIR tools are mostly already available as web services, Philippine started with creating Taverna workflows for these methods. We look forward to complement the existing IMPACT workflows with those additional modules from PLaIR!

plairScreenshot of the PLaIR system for post-correction of newspaper OCR

All this was done without requiring any help from the PRImA group at the University of Salford, Greater Manchester, who are maintaining the PAGE format and a number of tools to support it. So with some free time on his hand, Christian from PRImA instead had a deeper look at Taverna and the PAGE serialization of the recently released open source OCR evaluation tool from the University of Alicante, the technical lead of the Centre of Competence, and found it to be working quite fine. Good to finally have an open source community tool for OCR evaluation with support for PAGE – and more features shall be added soon: we’re thinking word accuracy rate, bag-of-words evaluation and more – send us your feature requests (or even better: pull request).

We were particularly glad also that some developers beyond the usual MLA community suspects have found the way to the KB on those 2 days: a team from the Leiden University Medical Centre was also attending, keen on learning how they could use the T2-Client for their purposes. Initially slowed down by some issues encountered in deploying Taverna 2 Server on a Windows machine (don’t do it!), eventually Reinout and Eelke were able to resolve it simply by using Linux instead. We hope a further collaboration of Dutch Taverna users will arise from this!

Besides all the exciting new tools and features it was good to also see some others getting their hands dirty with (essential) engineering tasks – work progressed well on several issues from the interoperability-framework’s issue tracker: support for output directories is close to being fully implemented thanks to Willem Jan, and a good start was made on future MTOM support. Also Quique from the Centre of Competence was able to improve the integration between IMPACT services and the website Demonstrator Platform.

Without the help of experienced developers Carl from the Open Planets Foundation and Sven from the Austrian National Library (who had just conducted a training event for the SCAPE project earlier in the same week in London, and quickly decided to cross the channel for yet one more workshop), this would not have been so easily possible. While Carl was helping out everywhere at once, Sven found some time to fit in a Taverna training session after lunch on Friday, which was hugely appreciated from the audience.

sven
Sven Schlarb from the Austrian National Library delivering Taverna training

After seeing all the powerful capabilities of Taverna in combination with the interoperability-framework web services and scripts in a live demo, no one needed further reassurance that it was well worth spending the time to integrate this technology and work with the interoperability-framework and it’s various components.

Everyone said they really enjoyed the event and found plenty of valuable things that they had learned and wanted to continue working with. So watch out for the next Succeed hackathon in sunny Alicante next year!

Europeana Newspapers Refinement & Aggregation Workshop

The KB participates in the Europeana Newspapers project that has started in February 2012. The project will enrich 18 million pages of digitised newspapers with Optical Character Recognition (OCR), Optical Layout Recognition (OLR) and Named Entity Recognition (NER) from all over Europe and deliver them to Europeana. The project consortium consists of 18 partners from all over Europe: some will provide (technical) support, while other will provide their digitised newspapers. The KB has two roles: we will not only deliver 2 million of our newspaper pages to Europeana, but we will also enrich ours and the newspapers of other partners with NER.

Untitled

Europeana Newspapers Workshop in Belgrade

In the last months, the project has welcomed 11 new associated partners and to make sure they can benefit as much as possible from the experiences of the project partners the University Library of Belgrade and LIBER jointly organised a workshop on refinement and aggregation on 13 and 14 June. Here, the KB (Clemens Neudecker and I) presented the work that is currently being done to make sure that we will have Named Entities for several partners. To make sure that the work that is being done in the project also benefits our direct colleagues, we were joined by someone from our Digitisation department.

The workshop started with a warm welcome in Belgrade by the director of the library, Prof. Aleksandar Jerkov. After a short introduction into the project by the project leader Hans-Jörg Lieder from the State Library Berlin, Clemens Neudecker from the KB presented the refinement process of the project. All presentations will be shared on the project’s Slideshare account. The refinement of the newspapers has already started and is being done by the University of Innsbruck and the company CCS in Hamburg. However, it was still a big surprise when Hans-Jörg Lieder announced a present for the director of the University Library Belgrade; the first batch of their processed newspapers!

Giving a gift of 200,000 digitised and refined newspapers to our Belgrade hosts

Giving a gift of 200,000 digitised and refined newspapers to our Belgrade hosts

The day continued with an introduction into the importance of evaluation of OCR and OLR and a demonstration of the tools used for this by Stefan Pletschacher and Cristian Clausner from the University of Salford. This sparked some interesting discussions in the break-out sessions on methods of evaluation in the libraries digitising their collections. For example, do you tell your service provider what you will be checking when you receive a batch? You could argue that the service provider would then only fix what you check. On the other hand if that is what you need to reach your goal it would save a lot of time and rejected batches.

After a short getting-to-know-each-other session the whole workshop party moved to the Nikola Tesla Museum nearby where we were introduced to their newspaper clippings project. All newspaper clippings collected by Nikola Tesla are now being digitised for publication on the museum’s website. A nice tour through the museum followed with several demonstrations (don’t worry, no one was electrocuted) and the day was concluded with a dinner in the bohemian quarter.

Breakout groups at the Belgrade Workshop

The second day of the workshop was dedicated solely to refinement. I kicked off the day with the question ‘What is a named entity?’. This sounds easy, but can provide you with some dilemmas as well. For example, a dog’s name is a name, but do you want it to be tagged as a NE? And what do you do with a title such as Romeo and Juliet? Consistency is key in this and as long as you keep your goal in mind while training your software you should end up with the results you are looking for.

Claus Gravenhorst followed me with his presentation on OLR at CCS, by using docWorks, with which they will process 2 million pages. It was then again our turn with a hands-on session about the tools we’re using, which are also available on Github. The last session of the workshop was a collaboration between Claus Gravenhorst from CCS and Günter Mühlberger from the University of Innsbruck who gave us a nice insight into their tools and the considerations made when working with digitised newspapers. For example, how many categories would you need to tag every article?

Group photo from the Europeana Newspapers workshop in Belgrade

All in all, it was a very successful workshop and I hope that all participants enjoyed it as much as I have. I at least am happy to have spoken to so many interesting people with new experiences from other digitisation projects. There is still much to learn from each other and projects like Europeana Newspapers contribute towards a good exchange of knowledge between libraries to ensure our users get the best experience when browsing through the rich digital collections.

Succeed Project launched

Author: Clemens Neudecker
Originally posted on: http://www.openplanetsfoundation.org/blogs/2013-02-05-succeed-project-launched

The kick-off meeting of the Succeed project (http://www.succeed-project.eu) took place on Friday 1 February in Paris.

RTEmagicC_succeed.jpg

Succeed is a project coordinated by the Universidad de Alicante and supported by the European Commission with a contribution of 1.8 mio. €.


The core objective of Succeed is to promote the take-up of the research results generated by technological companies and research centres in Europe in a strategic field for Europe: digitisation and preservation of its cultural heritage.


Succeed will foster the take-up of the most recent tools and techniques by libraries, museums and archives through the organisation of meetings of experts in digitisation, competitions to evaluate techniques, technical conferences to broadcast results and through the maintenance of an online platform for the demonstration and evaluation of tools.


Succeed will contribute in this way to the coordination of efforts for the digitisation of cultural heritage and to the standardisation of procedures. It will also propose measures to the European Union to foster the dissemination of European knowledge through centres of competence in digitisation, such as Open Planets FoundationPrestoCentreAPARSEN3D-COFORM Virtual Competence Centre, and V-MusT.net.


In addition to the University of Alicante, the consortium includes the following European institutions: the National Library of the Netherlands, the Dutch Institute of Lexicology, the Fraunhofer Gesellschaft, the Poznań Supercomputing Centre, the University of Salford, the Foundation Biblioteca Virtual Miguel de Cervantes Savedra, the French National Library and the British Library.


For additional information, please contact Rafael Carrasco (Universidad de Alicante) or send an email to succeed@ua.es.