Linked Open Data at the National library of the Netherlands

Authors: Theo van Veen and Sieta Neuerburg

According to the National library of the Netherlands, libraries are positioned at the very core of the Semantic Web. In the words of Hans Jansen, Head of the library’s Innovation and Development department, “Linking data is the way forward for libraries. Any cultural heritage institution that does not invest in linking data will become obsolete.”

Linked Open Data video by Europeana

The National library completed several successful Linked Data projects based on metadata linking. For example, for our newspaper app Here was the news (available in Dutch only), we added latitude and longitude data to our Dutch historical newspaper articles. The app allows users to search the newspaper collection by location. Moreover, the library has added links between its own journal collection and the TV and radio recordings of the Netherlands Institute for Sound and Vision. (This feature is not yet available on our website.)

While the work on linking our metadata continues, the library’s Research department is currently involved in an effort to link named entities in our full text collections. The purpose of this project is to contribute to a fully Linked Open Data-enabled library, entailing an enhanced user experience and improved discovery based on semantic relations. To further contribute to the progress of the semantic web, we will offer our full text enrichments as open data.

Linking named entities

To achieve our objectives, we need to be able to identify and link relevant named entities in our text collections. As part of the Europeana Newspapers project, we programmed a machine-learning tool to identify named entities in our full text collections. This software will allow us to extract all named entities from our full text collections and link them to related resources and resource descriptions. The software and documentation are available on GitHub.

The Research department created an enrichment database to collect information about the named entities and to store links to external resources. We are currently linking the named entities to DBpedia, while simultaneously storing links to related resource descriptions in Freebase and VIAF. This will be further extended to other resources, such as genealogy databases. Additionally, we will develop software for ‘socially enhanced linking’, i.e. tools allowing users to validate or reject links that were obtained automatically and to create new links for resources.
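As an illustration only, a record in such an enrichment database could look roughly like the sketch below. This is not the library's actual schema: the class names, field names and status values are assumptions made for the example, and the VIAF identifier is a placeholder. Each entity keeps one link per external source, and every link carries a status so that users can later validate or reject an automatically generated link.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Link:
    """A link to one external resource description (hypothetical structure)."""
    uri: str
    status: str = "automatic"  # "automatic", "validated" or "rejected"


@dataclass
class EnrichmentRecord:
    """One named entity and its links, keyed by source name (hypothetical)."""
    name: str
    entity_type: str  # e.g. "person", "location", "organisation"
    links: Dict[str, Link] = field(default_factory=dict)

    def add_link(self, source: str, uri: str) -> None:
        # Links found by software start out with the status "automatic".
        self.links[source] = Link(uri)

    def review(self, source: str, accepted: bool) -> None:
        # A user confirms or rejects an automatically generated link.
        self.links[source].status = "validated" if accepted else "rejected"

    def trusted_links(self) -> Dict[str, str]:
        # Everything that has not been explicitly rejected.
        return {s: l.uri for s, l in self.links.items() if l.status != "rejected"}


record = EnrichmentRecord("Albert Einstein", "person")
record.add_link("dbpedia", "http://dbpedia.org/resource/Albert_Einstein")
record.add_link("viaf", "viaf:PLACEHOLDER-ID")  # placeholder, not a real identifier
record.review("dbpedia", accepted=True)
record.review("viaf", accepted=False)
print(record.trusted_links())
```

Storing one identifier per source, rather than a single identifier per entity, reflects the fact that each resource description database uses its own identifiers.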

Challenges

Within the project, we still face some important challenges. A first problem is the issue of incomplete external coverage. Not all named entities are covered by resource description databases such as DBpedia and Freebase. Historical entities are especially neglected, and international databases such as DBpedia frequently omit even well-known Dutch figures. Moreover, there is no single global identifier for a resource. Resource description databases – such as DBpedia, Freebase and Geonames – all use their own identifiers, resulting in single resources leading to multiple resource descriptions.

Another issue that has to be dealt with is that of intellectual property rights. Ownership issues can hinder progress towards openness. Then there are difficulties with text recognition. These are mostly related to OCR issues, but they also stem from historical language variations, name variants and other types of ambiguity. And finally, manual intervention is indispensable. We will need crowdsourcing to check, validate and correct links which have been automatically generated.

Expected results

The National library expects to achieve the most important results in two areas: first, enhanced discoverability and second, data enrichments. Both are essential to keep the library relevant in the digital age.

Linked Open Data is a powerful method to link digital heritage at a national or international level, especially because the data is openly available. Relations between resources transcend organisational, national and language barriers. The identification and linking of data helps to transform library collections into (machine and human readable) information and knowledge. This will allow for much richer search and discovery opportunities.

The National library’s Senior researcher Theo van Veen sees the future of Linked Open Data as “a single worldwide resource description database which will replace most or all bibliographic thesauri, with all resources mentioned in metadata or text linking to the same single identifier”.

Named entity recognition for digitised historical newspapers

The refinement partners in the Europeana Newspapers project will produce an astonishing 10 million pages of full text from historical newspapers from all over Europe. What could be done to further enrich that full text?

The KB National Library of the Netherlands has been investigating named entity recognition (NER) and linked data technologies for a while now in projects such as IMPACT and STITCH+, and we felt it was about time to approach this on a production scale. So we decided to produce (open source) software, trained models as well as raw training data for NER software applications specifically for digitised historical newspapers as part of the project.

What is named entity recognition (NER)?

Named entity recognition is the process of identifying and classifying entities such as persons, locations and organisations in the full text in order to enhance searchability. There are basically two types of approaches: statistical and rule-based. Rule-based systems rely mostly on grammar rules defined by linguists, while statistical systems require large amounts of manually produced training data that they can learn from. While both approaches have their benefits and drawbacks, we decided to go for a statistical tool, the CRFNER system from Stanford University. Compared with the alternatives, this software proved to be the most reliable, and it is supported by an active user community. Stanford University has an online demo where you can try it out: http://nlp.stanford.edu:8080/ner/.

Example of Wikipedia article for Albert Einstein, tagged with the Stanford NER tool

Requirements & challenges

There are some particular requirements and challenges when applying these techniques to digitised historical newspapers. Since full text for more than 10 million pages will be produced in the project, one requirement for our NER tool was that it should be able to process large amounts of text in a rather short time. This is possible with the Stanford tool, which as of version 1.2.8 is “thread-safe”, i.e. it can run in parallel on a multi-core machine. Another requirement was to preserve the coordinates of each detected named entity, i.e. the information about where on a page it was found. This is particularly important for newspapers: instead of having to go through all the articles on a newspaper page to find the named entity, it can be highlighted so that one can easily spot it even on very dense pages.
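To make the coordinate requirement concrete, here is a minimal sketch of how word positions from an ALTO file can be carried through tagging. It is not the project's software: the sample XML is invented, `toy_tagger` is a hypothetical stand-in for a real NER model, and the sketch assumes the usual ALTO convention that each word is a `String` element with `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes.

```python
import xml.etree.ElementTree as ET

# Invented, namespace-free sample; real ALTO files are namespaced and far larger.
ALTO_SAMPLE = """<alto>
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Albert" HPOS="100" VPOS="200" WIDTH="60" HEIGHT="20"/>
    <String CONTENT="Einstein" HPOS="170" VPOS="200" WIDTH="90" HEIGHT="20"/>
    <String CONTENT="spoke" HPOS="270" VPOS="200" WIDTH="55" HEIGHT="20"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def read_tokens(alto_xml):
    """Return (word, (hpos, vpos, width, height)) for every String element."""
    tokens = []
    for elem in ET.fromstring(alto_xml).iter():
        # Compare only the local tag name so namespaced files also work.
        if elem.tag.rsplit("}", 1)[-1] == "String":
            box = tuple(float(elem.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            tokens.append((elem.get("CONTENT"), box))
    return tokens


def merge_boxes(boxes):
    """Merge word boxes into one bounding box (hpos, vpos, width, height)."""
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[0] + b[2] for b in boxes)
    bottom = max(b[1] + b[3] for b in boxes)
    return (left, top, right - left, bottom - top)


def toy_tagger(words):
    """Hypothetical stand-in for a trained model: knows one hard-coded name."""
    known = {"Albert", "Einstein"}
    return ["PERSON" if w in known else "O" for w in words]


def locate_entities(alto_xml, tagger=toy_tagger):
    """Tag the words and return (entity text, label, bounding box) triples."""
    tokens = read_tokens(alto_xml)
    labels = tagger([word for word, _ in tokens])
    entities, current = [], None
    for (word, box), label in zip(tokens, labels):
        if label != "O" and current and current["label"] == label:
            # Same label as the previous word: extend the running entity.
            current["words"].append(word)
            current["boxes"].append(box)
        else:
            if current:
                entities.append(current)
            current = {"label": label, "words": [word], "boxes": [box]} if label != "O" else None
    if current:
        entities.append(current)
    return [(" ".join(e["words"]), e["label"], merge_boxes(e["boxes"])) for e in entities]


print(locate_entities(ALTO_SAMPLE))
```

The resulting bounding box is what a viewer would need in order to highlight the entity on the page image. An entity that wraps across two lines would need one box per line, which this sketch does not handle.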

Then there are also challenges of course – mainly due to the quality of the OCR and the historical spelling that is found in many of these old newspapers. In the course of 2014 we will thus collaborate with the Dutch Institute for Lexicology (INL), who have produced modules which can be used in a pre-processing step before the Stanford system and that can to some extent mitigate problems caused by low quality of the full-text or the appearance of historical spelling variants.

The Europeana Newspapers NER workflow

For Europeana Newspapers, we decided to focus on three languages: Dutch, French and German. The content in these three languages makes up about half of the newspaper pages that will become available through Europeana Newspapers. For the French materials, we cooperate with LIP6-ACASA, for Dutch again with INL. The workflow goes like this:

  1. We receive OCR results in ALTO format (or METS/MPEG21-DIDL containers)
  2. We process the OCR with our NER software to derive a pre-tagged corpus
  3. We upload the pre-tagged corpus into an online Attestation Tool (provided by INL)
  4. Within the Attestation Tool, the libraries make corrections and add tags until we arrive at a “gold corpus”, i.e. all named entities on the pages have been manually marked
  5. We train our NER software based on the gold corpus derived in step (4)
  6. We process the OCR again with our NER software trained on the gold corpus
  7. We repeat steps (2) – (6) until the tagging results no longer improve

    Screenshot of the NER Attestation Tool
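The iterative character of this workflow can be sketched in a few lines. Everything here is a toy under stated assumptions: the "model" is a simple word-to-label lookup rather than a trained CRF, the manual correction in the Attestation Tool is simulated by a lookup in a `truth` table, and all function names and sample data are invented for the example.

```python
def tag(model, words):
    """Step 2/6: pre-tag a page with the current model."""
    return [(w, model.get(w, "O")) for w in words]


def correct(tagged, truth):
    """Steps 3-4: simulated manual correction (a human would do this)."""
    return [(w, truth.get(w, "O")) for w, _ in tagged]


def train(gold):
    """Step 5: 'train' the toy model by memorising the gold labels."""
    return {w: label for page in gold for w, label in page if label != "O"}


def accuracy(model, pages, truth):
    """Share of held-out words that receive the correct label."""
    pairs = [(model.get(w, "O"), truth.get(w, "O")) for page in pages for w in page]
    return sum(p == t for p, t in pairs) / len(pairs)


def bootstrap(batches, truth, held_out, min_gain=0.001):
    """Repeat tag, correct, retrain until the score stops improving (step 7)."""
    model, gold, best = {}, [], -1.0
    for batch in batches:
        pretagged = [tag(model, page) for page in batch]
        gold.extend(correct(page, truth) for page in pretagged)
        model = train(gold)
        score = accuracy(model, held_out, truth)
        if score - best < min_gain:
            break  # no further improvement
        best = score
    return model, best


truth = {"Einstein": "PER", "Leiden": "LOC", "Amsterdam": "LOC"}
batches = [
    [["Einstein", "visited", "Leiden"]],
    [["Amsterdam", "and", "Leiden"]],
    [["Einstein", "in", "Amsterdam"]],
]
held_out = [["Einstein", "left", "Amsterdam", "for", "Leiden"]]

model, best = bootstrap(batches, truth, held_out)
print(model, best)
```

In this toy run the score rises over the first two rounds and the loop stops in the third, when a further batch of corrected pages no longer improves the result on the held-out page.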

Preliminary results

Named entity recognition is typically evaluated by means of Precision/Recall and F-measure. Precision gives an account of how many of the named entities that the software found are in fact named entities of the correct type, while Recall states how many of the total number of named entities present have been detected by the software. The F-measure then combines both scores into a single value between 0 and 1, the (weighted) harmonic mean of Precision and Recall. Here are our (preliminary) results for Dutch so far:

Dutch        Persons   Locations   Organizations
Precision    0.940     0.950       0.942
Recall       0.588     0.760       0.559
F-measure    0.689     0.838       0.671

These figures have been derived from a k-fold cross-evaluation of 25 out of 100 manually tagged pages of Dutch newspapers from the KB. The results confirm that the Stanford system tends to be a bit “conservative”, i.e. it trades some recall for higher precision, which is also what we wanted.
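For readers unfamiliar with these metrics, the following sketch shows how Precision, Recall and F-measure are computed from counts of correct, spurious and missed entities. The counts in the example are invented and are not taken from our evaluation; the function name and the `beta` parameter (which weights Recall against Precision, with 1.0 giving the balanced F1) are choices made for this illustration.

```python
def precision_recall_f(true_pos, false_pos, false_neg, beta=1.0):
    """Return (precision, recall, F-measure) from entity counts.

    true_pos:  entities found with the correct type
    false_pos: entities reported by the software that are wrong
    false_neg: entities present in the text that were missed
    """
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    # Weighted harmonic mean of precision and recall.
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Invented counts: 47 correct, 3 spurious, 33 missed.
p, r, f = precision_recall_f(47, 3, 33)
print(round(p, 3), round(r, 3), round(f, 3))
```

Note that when scores are averaged over the folds of a cross-validation, the averaged F-measure will generally differ from the F-measure computed from the averaged Precision and Recall.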

Conclusion and outlook

Within this final year of the project we are looking forward to seeing how far we can still boost these figures by adopting the extra modules from INL, and what results we can achieve on the French and German newspapers. We will also investigate software for linking the named entities to additional online resource descriptions and authority files such as DBpedia or VIAF to create Linked Data. The crucial question will be how well we can disambiguate the named entities and find a correct match in these resources. Besides, if there is time, we would also like to experiment with NER in other languages, such as Serbian or Latvian. And, if all goes well, you might already hear more about this at the upcoming IFLA newspapers conference “Digital transformation and the changing role of news media in the 21st Century”.
