KB Research

Research at the National Library of the Netherlands

Month: August 2014

Linked Open Data at the National Library of the Netherlands

Authors: Theo van Veen and Sieta Neuerburg

According to the National Library of the Netherlands, libraries are positioned at the very core of the Semantic Web. In the words of Hans Jansen, Head of the library’s Innovation and Development department, “Linking data is the way forward for libraries. Any cultural heritage institution that does not invest in linking data will become obsolete.”

[Video: Linked Open Data, by Europeana]

The National Library has completed several successful Linked Data projects based on metadata linking. For example, for our newspaper app Here was the news (available in Dutch only), we added latitude and longitude data to our Dutch historical newspaper articles, so that users can search the newspaper collection by location. The library has also added links between its own journal collection and the TV and radio recordings of the Netherlands Institute for Sound and Vision. (This feature is not yet available on our website.)

While the work on linking our metadata continues, the library’s Research department is currently involved in an effort to link named entities in our full text collections. The purpose of this project is to contribute to a fully Linked Open Data-enabled library, entailing an enhanced user experience and improved discovery based on semantic relations. To further contribute to the progress of the semantic web, we will offer our full text enrichments as open data.

Linking named entities

To achieve our objectives, we need to be able to identify and link relevant named entities in our text collections. As part of the Europeana Newspapers project, we programmed a machine-learning tool to identify named entities in our full text collections. This software will allow us to extract all named entities from our full text collections and link them to related resources and resource descriptions. The software and documentation are available on GitHub.

The Research department created an enrichment database to collect information about the named entities and to store links to external resources. We are currently linking the named entities to DBpedia, while simultaneously storing links to related resource descriptions in Freebase and VIAF. This will be further extended to other resources, such as genealogy databases. Additionally, we will develop software for ‘socially enhanced linking’, i.e. tools allowing users to validate or reject links that were obtained automatically and to create new links for resources.
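To make the idea of an enrichment record concrete, here is a minimal sketch in Python. It is not the library's actual database schema: the class names, field names, status values and placeholder identifiers are all assumptions made for illustration. It shows one named entity holding links to several external sources, plus a review step of the kind that 'socially enhanced linking' would need.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Link:
    """One link from a named entity to an external resource description."""
    source: str                  # hypothetical labels, e.g. "dbpedia", "freebase", "viaf"
    identifier: str              # the identifier that source uses for the entity
    status: str = "automatic"    # assumed values: "automatic", "validated", "rejected"


@dataclass
class EntityRecord:
    """A named entity found in the full text, with all of its external links."""
    name: str
    links: List[Link] = field(default_factory=list)

    def add_link(self, source: str, identifier: str) -> Link:
        # Links produced by software start out unreviewed ("automatic").
        link = Link(source, identifier)
        self.links.append(link)
        return link

    def review(self, source: str, accepted: bool) -> None:
        # A user confirms or rejects every link that points to the given source.
        for link in self.links:
            if link.source == source:
                link.status = "validated" if accepted else "rejected"

    def accepted_links(self) -> List[Link]:
        # Everything a user has not explicitly rejected remains usable.
        return [link for link in self.links if link.status != "rejected"]


# Placeholder identifiers, not real DBpedia or VIAF entries.
record = EntityRecord("Example Person")
record.add_link("dbpedia", "example:dbpedia-id")
record.add_link("viaf", "example:viaf-id")
record.review("viaf", accepted=False)
print([link.source for link in record.accepted_links()])
```

Storing one link per source, rather than a single identifier per entity, reflects the fact that each resource description database uses its own identifiers.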

Challenges

Within the project, we still face some important challenges. The first is incomplete external coverage: not all named entities are covered by resource description databases such as DBpedia and Freebase. Historical entities are especially neglected, and international databases such as DBpedia frequently omit even well-known Dutch figures. Moreover, there is no single global identifier for a resource. Resource description databases such as DBpedia, Freebase and Geonames all use their own identifiers, so a single resource ends up with multiple resource descriptions.

Another issue that has to be dealt with is that of intellectual property rights. Ownership issues can hinder progress towards openness. Text recognition poses further difficulties. These mostly stem from OCR errors, but they also include historical language variation, name variants and other types of ambiguity. And finally, manual intervention is indispensable. We will need crowdsourcing to check, validate and correct links that have been generated automatically.

Expected results

The National Library expects to achieve its most important results in two areas: enhanced discoverability and data enrichment. Both are essential to keep the library relevant in the digital age.

Linked Open Data is a powerful method to link digital heritage at a national or international level, especially because the data is openly available. Relations between resources transcend organisational, national and language barriers. The identification and linking of data helps to transform library collections into (machine and human readable) information and knowledge. This will allow for much richer search and discovery opportunities.

The National Library’s senior researcher Theo van Veen sees the future of Linked Open Data as “a single worldwide resource description database which will replace most or all bibliographic thesauri, with all resources mentioned in metadata or text linking to the same single identifier”.

OCR improvement: helping and hindering researchers

Author: Tineke Koster

As I am writing this, volunteers are rekeying our 17th-century newspaper articles. Optical character recognition of the gothic typeface in use at the time has yielded poor results, making this part of our digital collection nearly inaccessible for full-text search. The Meertens Institute, which has an excellent track record when it comes to crowdsourcing, has developed the editor (Dutch). Together with them we are working towards a full update of all newspaper issues from 1618 to 1700 that are available on our website Delpher.

Great news and, for some researchers, an eagerly awaited development. A bright future beckons in which our digital text corpus is 100% correct, just waiting to be mined for dynamic phenomena and paradigm shifts.

But we have to realize that without the proper precautions, correcting digital texts may also hinder researchers in their work. How so? These texts may have been used (browsed, mined, cited, etc.) by researchers in their earlier form. The improvement or enrichment may have consequences for the reproducibility of their research results.

For all researchers the need to reproduce research results is growing, with new guidelines due to new laws. There is also a specific group of researchers who need sustained access to older versions of digital text. The need is greatest for research where the goal is to develop an algorithm and to assess its quality relative to previous versions of the same algorithm or to other algorithms. Without sustained access to older versions, these researchers cannot do their work.
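As an illustration of what sustained access to older versions could mean in practice, the sketch below keeps every version of a text and lets a researcher retrieve the exact version they used. It is a thought experiment, not a description of how the National Library stores its texts. The class name `VersionedText` and its methods are hypothetical.

```python
import hashlib


class VersionedText:
    """Keeps every version of a digital text so earlier states stay retrievable."""

    def __init__(self):
        # List of (sha256 digest, text) pairs, oldest first.
        self._versions = []

    def update(self, text: str) -> int:
        """Store a corrected text and return its version number (starting at 1)."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # Re-submitting identical text does not create a new version.
        if self._versions and self._versions[-1][0] == digest:
            return len(self._versions)
        self._versions.append((digest, text))
        return len(self._versions)

    def get(self, version=None) -> str:
        """Return the latest text, or the text as it was at a given version."""
        if not self._versions:
            raise LookupError("no text has been stored yet")
        if version is None:
            return self._versions[-1][1]
        if not 1 <= version <= len(self._versions):
            raise ValueError("unknown version: %r" % (version,))
        return self._versions[version - 1][1]


article = VersionedText()
ocr_version = article.update("Amfterdam den 3 Junij")        # invented OCR output
rekeyed_version = article.update("Amsterdam den 3 Junij")    # invented correction
print(article.get(ocr_version))   # the text as an earlier study would have seen it
print(article.get())              # the current, corrected text
```

A researcher who cites the version number alongside the article can then rerun an analysis on exactly the text that was used at the time, even after later corrections.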

Is it our role to provide this access? I hope to explain how the National Library of the Netherlands is thinking about this issue in a later blog post (soon!). Meanwhile, I would be very interested to hear your experiences. How is this subject discussed in your organization? Does your organization have a policy in place to deal with this?
