Named entity recognition for digitised historical newspapers

The refinement partners in the Europeana Newspapers project will produce an astonishing 10 million pages of full-text from historical newspapers from all over Europe. What could be done to further enrich that full-text?

The KB National Library of the Netherlands has been investigating named entity recognition (NER) and linked data technologies for a while now in projects such as IMPACT and STITCH+, and we felt it was about time to approach this on a production scale. So we decided to produce (open source) software, trained models as well as raw training data for NER software applications specifically for digitised historical newspapers as part of the project.

What is named entity recognition (NER)?

Named entity recognition is the process of identifying and classifying entities such as persons, locations and organisations in the full-text in order to enhance searchability. There are basically two types of approach: statistical and rule-based. Rule-based systems rely mostly on grammar rules defined by linguists, while statistical systems require large amounts of manually produced training data that they can learn from. While both approaches have their benefits and drawbacks, we decided to go for a statistical tool, the CRFNER system from Stanford University. In our comparison, this software proved to be the most reliable, and it is supported by an active user community. Stanford University has an online demo where you can try it out: http://nlp.stanford.edu:8080/ner/.


Example of Wikipedia article for Albert Einstein, tagged with the Stanford NER tool

Requirements & challenges

There are some particular requirements and challenges when applying these techniques to digitised historical newspapers. Since full-text for more than 10 million pages will be produced in the project, one requirement for our NER tool was that it should be able to process large amounts of text in a rather short time. This is possible with the Stanford tool, which as of version 1.2.8 is “thread-safe”, i.e. it can run in parallel on a multi-core machine. Another requirement was to preserve the coordinates that record where on a page a named entity has been detected. This is particularly important for newspapers: instead of having to go through all the articles on a newspaper page to find the named entity, it can be highlighted so that one can easily spot it even on very dense pages.
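To illustrate the coordinate requirement, here is a minimal Python sketch, not our actual tool. It assumes the OCR arrives as ALTO, where each word is a String element carrying CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes. The sample fragment, the function names and the token positions of the entity are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO fragment; real files carry a namespace
# and many more elements and attributes.
ALTO_SAMPLE = """<alto><Layout><Page><PrintSpace><TextBlock><TextLine>
<String CONTENT="Albert" HPOS="100" VPOS="200" WIDTH="60" HEIGHT="20"/>
<SP/>
<String CONTENT="Einstein" HPOS="170" VPOS="198" WIDTH="90" HEIGHT="24"/>
<SP/>
<String CONTENT="spoke" HPOS="270" VPOS="200" WIDTH="50" HEIGHT="20"/>
</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"""


def read_tokens(alto_xml):
    """Return a list of (text, (left, top, right, bottom)), one per ALTO String."""
    tokens = []
    for elem in ET.fromstring(alto_xml).iter():
        # Strip any "{namespace}" prefix so the sketch works across ALTO versions.
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        left = float(elem.get("HPOS"))
        top = float(elem.get("VPOS"))
        right = left + float(elem.get("WIDTH"))
        bottom = top + float(elem.get("HEIGHT"))
        tokens.append((elem.get("CONTENT"), (left, top, right, bottom)))
    return tokens


def entity_box(tokens, start, end):
    """Bounding box around tokens[start:end], e.g. a multi-word person name."""
    boxes = [box for _, box in tokens[start:end]]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


tokens = read_tokens(ALTO_SAMPLE)
# Assume the tagger marked tokens 0 and 1 ("Albert Einstein") as one person.
print(entity_box(tokens, 0, 2))
```

Keeping the box next to each token means that whatever the tagger decides about a word can be mapped straight back to a region of the page image for highlighting.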

Then there are also challenges of course – mainly due to the quality of the OCR and the historical spelling that is found in many of these old newspapers. In the course of 2014 we will thus collaborate with the Dutch Institute for Lexicology (INL), who have produced modules which can be used in a pre-processing step before the Stanford system and that can to some extent mitigate problems caused by low quality of the full-text or the appearance of historical spelling variants.

The Europeana Newspapers NER workflow

For Europeana Newspapers, we decided to focus on three languages: Dutch, French and German. The content in these three languages makes up about half of the newspaper pages that will become available through Europeana Newspapers. For the French materials we cooperate with LIP6-ACASA, and for Dutch again with INL. The workflow goes like this:

  1. We receive OCR results in ALTO format (or METS/MPEG21-DIDL containers)
  2. We process the OCR with our NER software to derive a pre-tagged corpus
  3. We upload the pre-tagged corpus into an online Attestation Tool (provided by INL)
  4. Within the Attestation Tool, the libraries make corrections and add tags until we arrive at a “gold corpus”, i.e. all named entities on the pages have been manually marked
  5. We train our NER software based on the gold corpus derived in step (4)
  6. We process the OCR again with our NER software trained on the gold corpus
  7. We repeat steps (2) – (6) until the results of the tagging won’t improve any further


    Screenshot of the NER Attestation Tool
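The iterative workflow above can be sketched as a simple loop. This is only an illustration under stated assumptions: the four callables (tag, correct, train, evaluate) are hypothetical stand-ins for the NER software, the manual work in the Attestation Tool, the training run and the scoring, and the stopping threshold min_gain is an arbitrary choice, not a value from the project.

```python
def refine_until_stable(pages, tag, correct, train, evaluate,
                        min_gain=0.005, max_rounds=10):
    """Repeat tag -> correct -> train until the score stops improving.

    All four callables are hypothetical placeholders:
      tag(model, page)  -> pre-tagged page (model is None in the first round)
      correct(page)     -> manually corrected "gold" page
      train(gold_pages) -> newly trained model
      evaluate(model)   -> score between 0 and 1, e.g. an F-measure
    """
    model = None
    best_score = 0.0
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        pre_tagged = [tag(model, page) for page in pages]   # steps 2 and 6
        gold = [correct(page) for page in pre_tagged]       # steps 3 and 4
        candidate = train(gold)                             # step 5
        score = evaluate(candidate)
        if score - best_score < min_gain:                   # step 7: no real gain
            break
        model, best_score = candidate, score
    return model, best_score, rounds


# Toy run with stub callables and made-up scores for three rounds.
scores = iter([0.60, 0.70, 0.702])
model, best, rounds = refine_until_stable(
    pages=["page-1", "page-2"],
    tag=lambda model, page: page,
    correct=lambda page: page,
    train=lambda gold: f"model trained on {len(gold)} pages",
    evaluate=lambda model: next(scores),
)
print(best, rounds)
```

In practice the expensive part is the manual correction in step 4, so the loop would only run a handful of times.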

Preliminary results

Named entity recognition is typically evaluated by means of Precision, Recall and F-measure. Precision states how many of the named entities that the software found are in fact named entities of the correct type, while Recall states how many of the named entities actually present have been detected by the software. The F-measure then combines both scores into a single value between 0 and 1, the weighted harmonic mean of the two. Here are our (preliminary) results for Dutch so far:

Dutch        Persons   Locations   Organizations
Precision    0.940     0.950       0.942
Recall       0.588     0.760       0.559
F-measure    0.689     0.838       0.671

These figures were derived from a k-fold cross-validation of 25 out of 100 manually tagged pages of Dutch newspapers from the KB. The results confirm that the Stanford system tends to be a bit “conservative”, i.e. it trades somewhat lower recall for higher precision, which is also what we wanted.
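To make the measures and the evaluation set-up more concrete, here is a minimal Python sketch. It assumes the balanced F-measure (F1, the plain harmonic mean of precision and recall) and a simple round-robin split into folds. The post states neither the weighting of the F-measure nor the value of k, and the counts and page names below are made up for illustration.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and balanced F1 from entity counts."""
    found = true_positives + false_positives      # everything the tagger marked
    present = true_positives + false_negatives    # everything in the gold corpus
    precision = true_positives / found if found else 0.0
    recall = true_positives / present if present else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def k_fold_splits(items, k):
    """Yield (train, test) lists; every item is in exactly one test fold."""
    for fold in range(k):
        test = items[fold::k]
        train = [item for i, item in enumerate(items) if i % k != fold]
        yield train, test


# Made-up counts: 60 correct entities, 20 wrong ones, 40 missed ones.
print(precision_recall_f1(true_positives=60, false_positives=20, false_negatives=40))

# Made-up page list, split into k=5 folds (an assumed value of k).
pages = [f"page-{n}" for n in range(1, 11)]
for train, test in k_fold_splits(pages, k=5):
    print(len(train), "training pages,", len(test), "test pages")
```

A “conservative” tagger in this sense is one with few false positives but many false negatives, which pushes precision up and recall down. If scores are averaged over folds, the averaged F-measure does not have to equal the harmonic mean of the averaged precision and recall.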

Conclusion and outlook

In this final year of the project we are looking forward to seeing how far we can still boost these figures by adopting the extra modules from INL, and what results we can achieve on the French and German newspapers. We will also investigate software for linking the named entities to additional online resource descriptions and authority files such as DBpedia or VIAF to create Linked Data. The crucial question will be how well we can disambiguate the named entities and find a correct match in these resources. Besides, if there is time, we would also like to experiment with NER in other languages, such as Serbian or Latvian. And, if all goes well, you might already hear more about this at the upcoming IFLA newspapers conference “Digital transformation and the changing role of news media in the 21st Century”.


Presenting European Historic Newspapers Online

As was posted earlier on this blog, the KB participates in the European project Europeana Newspapers. In this project, we are working together with 17 other institutions (libraries, technical partners and networking partners) to make 18 million European newspaper pages available via Europeana on title level. In addition, The European Library (TEL) is working on a purpose-built portal to also make the newspapers available as full-text. However, many of the libraries do not have OCR for their newspapers yet, which is why the project is working together with the University of Innsbruck, CCS Content Conversion Specialists GmbH from Hamburg and the KB to enrich these pages with OCR, Optical Layout Recognition (OLR), and Named Entity Recognition (NER).


Hans-Jörg Lieder of the Berlin State Library presents the Europeana Newspapers Project at our September 2013 workshop in Amsterdam.

In June, the project had a workshop on refinement, but it was now time to discuss aggregation and presentation. This workshop took place in Amsterdam on 16 September, during The European Library Annual Event. There was a good group of people, not only from the project partners and the associated partners, but also from outside the consortium. After the project, TEL hopes to be able to also offer these institutions a chance to send in their newspapers for Europeana, so we were very happy to have them join us.

The workshop kicked off with an introduction from Marieke Willems of LIBER and Hans-Jörg Lieder of the Berlin State Library. They were followed by Markus Muhr from TEL, who introduced the aggregation plan and the schedule for the project partners. With so many partners, it can be quite difficult to find a schedule that works well and ensures that everyone sends in their material on time. After the aggregation, TEL will then have to do some work on the metadata to convert it to the Europeana Data Model. Markus was followed by a presentation from Channa Veldhuijsen from the KB, who unfortunately could not be there in person. However, her elaborate presentation on usability testing provided some good insights on how to get your website to be the best it can be and how to find out what your users really think when they are browsing your site.

It was then time for Alastair Dunning from TEL to showcase the portal that they have been preparing for Europeana Newspapers. Unfortunately, the wifi connection was not up to so many visitors and only some people could follow his presentation along on their own devices. However, there were some valuable feedback points which TEL will use to improve the portal. Unfortunately, the portal is not yet available from outside, so people who missed the presentation need to wait a bit longer to be able to see and browse the European newspapers.

What we can already see, however, are the websites of partners that have been online for some time. It was very interesting to see the different choices each partner made to showcase their collection. We heard from people from the British Library, the National and University Library of Iceland, the National and University Library of Slovenia, the National Library of Luxembourg and the National Library of the Czech Republic.


Yves Mauer from the National Library of Luxembourg presenting their newspaper portal

The day ended with a lovely presentation by Dean Birkett of Europeana, who, partly with Channa’s notes, went through all the previously presented websites and offered comments on how to improve them. The videos he used in his talk are available on YouTube. His key points were:

  1. Make the type size large: 16px is the recommended size.
  2. Be careful with colours. Some online newspaper sites use red to highlight important information, but red is normally associated with warning signals and errors in the user’s mind.
  3. Use words to indicate language choices (e.g. ‘English’, ‘français’), not flags. The Spanish flag won’t necessarily be interpreted to mean ‘click here for Spanish’ if the user is from Mexico.
  4. Cut down on unnecessary text. Make it easy for users to skim (e.g. through the use of bullet points).

All in all, it was a very useful afternoon in which I learned a lot about what users want from a website. If you want to see more, all presentations can be found on the Slideshare account of Europeana Newspapers. You can also join us at one of the following events:

  • Workshop on Newspapers in Europe and the Digital Agenda. British Library, London. September 29-30th, 2014.
  • National Information Days.
    • National Library of Austria. March 25-26th, 2014.
    • National Library of France. April 3rd, 2014.
    • British Library. June 9th, 2014.

Europeana Newspapers Refinement & Aggregation Workshop

The KB participates in the Europeana Newspapers project, which started in February 2012. The project will enrich 18 million pages of digitised newspapers from all over Europe with Optical Character Recognition (OCR), Optical Layout Recognition (OLR) and Named Entity Recognition (NER) and deliver them to Europeana. The project consortium consists of 18 partners from all over Europe: some will provide (technical) support, while others will provide their digitised newspapers. The KB has two roles: we will not only deliver 2 million of our newspaper pages to Europeana, but we will also enrich our own newspapers and those of other partners with NER.


Europeana Newspapers Workshop in Belgrade

In recent months, the project has welcomed 11 new associated partners. To make sure they can benefit as much as possible from the experiences of the project partners, the University Library of Belgrade and LIBER jointly organised a workshop on refinement and aggregation on 13 and 14 June. Here, the KB (Clemens Neudecker and I) presented the work that is currently being done to make sure that we will have named entities for several partners. To make sure that this work also benefits our direct colleagues, we were joined by someone from our Digitisation department.

The workshop started with a warm welcome in Belgrade by the director of the library, Prof. Aleksandar Jerkov. After a short introduction to the project by the project leader Hans-Jörg Lieder from the State Library Berlin, Clemens Neudecker from the KB presented the refinement process of the project. All presentations will be shared on the project’s Slideshare account. The refinement of the newspapers has already started and is being done by the University of Innsbruck and the company CCS in Hamburg. However, it was still a big surprise when Hans-Jörg Lieder announced a present for the director of the University Library Belgrade: the first batch of their processed newspapers!

Giving a gift of 200,000 digitised and refined newspapers to our Belgrade hosts


The day continued with an introduction to the importance of evaluating OCR and OLR and a demonstration of the tools used for this by Stefan Pletschacher and Christian Clausner from the University of Salford. This sparked some interesting discussions in the break-out sessions on methods of evaluation in the libraries digitising their collections. For example, do you tell your service provider what you will be checking when you receive a batch? You could argue that the service provider would then only fix what you check. On the other hand, if that is what you need to reach your goal, it would save a lot of time and rejected batches.

After a short getting-to-know-each-other session the whole workshop party moved to the Nikola Tesla Museum nearby where we were introduced to their newspaper clippings project. All newspaper clippings collected by Nikola Tesla are now being digitised for publication on the museum’s website. A nice tour through the museum followed with several demonstrations (don’t worry, no one was electrocuted) and the day was concluded with a dinner in the bohemian quarter.

Breakout groups at the Belgrade Workshop

The second day of the workshop was dedicated solely to refinement. I kicked off the day with the question ‘What is a named entity?’. This sounds easy, but can provide you with some dilemmas as well. For example, a dog’s name is a name, but do you want it to be tagged as a NE? And what do you do with a title such as Romeo and Juliet? Consistency is key in this and as long as you keep your goal in mind while training your software you should end up with the results you are looking for.

Claus Gravenhorst followed me with his presentation on OLR at CCS using docWorks, with which they will process 2 million pages. It was then our turn again with a hands-on session about the tools we’re using, which are also available on GitHub. The last session of the workshop was a collaboration between Claus Gravenhorst from CCS and Günter Mühlberger from the University of Innsbruck, who gave us a nice insight into their tools and the considerations made when working with digitised newspapers. For example, how many categories would you need to tag every article?

Group photo from the Europeana Newspapers workshop in Belgrade

All in all, it was a very successful workshop and I hope that all participants enjoyed it as much as I did. I, at least, am happy to have spoken to so many interesting people with new experiences from other digitisation projects. There is still much to learn from each other, and projects like Europeana Newspapers contribute to a good exchange of knowledge between libraries to ensure our users get the best experience when browsing through the rich digital collections.