Authors: Theo van Veen and Sieta Neuerburg
According to the National Library of the Netherlands, libraries are positioned at the very core of the Semantic Web. In the words of Hans Jansen, Head of the library’s Innovation and Development department, “Linking data is the way forward for libraries. Any cultural heritage institution that does not invest in linking data will become obsolete.”
Linked Open Data video by Europeana
The National library completed several successful Linked Data projects based on metadata linking. For example, for our newspaper app Here was the news (available in Dutch only), we added latitude and longitude data to our Dutch historical newspaper articles. The app allows users to search the newspaper collection by location. Moreover, the library has added links between its own journal collection and the TV and radio recordings of the Netherlands Institute for Sound and Vision. (This feature is not yet available on our website.)
While the work on linking our metadata continues, the library’s Research department is currently involved in an effort to link named entities in our full text collections. The purpose of this project is to contribute to a fully Linked Open Data-enabled library, entailing an enhanced user experience and improved discovery based on semantic relations. To further contribute to the progress of the semantic web, we will offer our full text enrichments as open data.
Linking named entities
To achieve our objectives, we need to be able to identify and link relevant named entities in our text collections. As part of the Europeana Newspapers project, we programmed a machine-learning tool to identify named entities in our full text collections. This software will allow us to extract all named entities from our full text collections and link them to related resources and resource descriptions. The software and documentation are available on GitHub.
The Research department created an enrichment database to collect information about the named entities and to store links to external resources. We are currently linking the named entities to DBpedia, while simultaneously storing links to related resource descriptions in Freebase and VIAF. This will be further extended to other resources, such as genealogy databases. Additionally, we will develop software for ‘socially enhanced linking’, i.e. tools allowing users to validate or reject links that were obtained automatically and to create new links for resources.
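As a rough illustration of what one record in such an enrichment database might hold, the sketch below stores a named entity together with its external links and a status that users can change. It is not the library's actual schema: the class names, field names, status values and the example.org URIs are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class LinkStatus(Enum):
    AUTOMATIC = "automatic"  # produced by software, not yet checked
    VALIDATED = "validated"  # confirmed by a user
    REJECTED = "rejected"    # marked as wrong by a user


@dataclass
class Link:
    source: str  # hypothetical source label, e.g. "dbpedia" or "viaf"
    uri: str     # identifier of the external resource description
    status: LinkStatus = LinkStatus.AUTOMATIC


@dataclass
class EntityRecord:
    name: str         # the named entity as found in the text
    entity_type: str  # e.g. "person", "location", "organisation"
    links: List[Link] = field(default_factory=list)

    def add_link(self, source, uri, status=LinkStatus.AUTOMATIC):
        """Attach a link; user-created links can pass VALIDATED directly."""
        link = Link(source, uri, status)
        self.links.append(link)
        return link

    def set_status(self, uri, status):
        """Record a user's decision to validate or reject an existing link."""
        for link in self.links:
            if link.uri == uri:
                link.status = status
                return link
        raise KeyError(uri)

    def accepted_links(self):
        """Return every link that has not been rejected."""
        return [l for l in self.links if l.status is not LinkStatus.REJECTED]


# Placeholder URIs only; these are not real identifiers.
record = EntityRecord("Vincent van Gogh", "person")
record.add_link("dbpedia", "https://example.org/dbpedia/placeholder-1")
record.add_link("viaf", "https://example.org/viaf/placeholder-2")
record.set_status("https://example.org/viaf/placeholder-2", LinkStatus.REJECTED)
```

Keeping rejected links in the record, rather than deleting them, would let the linking software avoid proposing the same wrong link again.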
Within the project, we still face some important challenges. The first is incomplete external coverage: not all named entities are covered by resource description databases such as DBpedia and Freebase. Historical entities are especially neglected, and international databases such as DBpedia frequently omit even well-known Dutch figures. Moreover, there is no single global identifier for a resource. Resource description databases such as DBpedia, Freebase and GeoNames all use their own identifiers, so a single resource ends up with multiple resource descriptions.
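One common way to cope with multiple identifiers for the same resource is to record pairwise equivalences between them and then group everything that is connected, directly or transitively. The sketch below does this with a small union-find structure. It is only an illustration, not a description of the library's software: the function name and the prefixed identifiers are made up for this example.

```python
def group_identifiers(equivalences):
    """Group identifiers that are declared equivalent, directly or transitively.

    `equivalences` is an iterable of (identifier_a, identifier_b) pairs.
    Returns a list of sets, one set per distinct resource.
    """
    parent = {}

    def find(uri):
        # Follow parent pointers to the representative of this group.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # shorten the path as we go
            uri = parent[uri]
        return uri

    for a, b in equivalences:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


# Hypothetical identifiers; the prefixes only indicate the source database.
pairs = [
    ("dbpedia:Resource_X", "freebase:id_x"),
    ("freebase:id_x", "geonames:id_x"),
    ("dbpedia:Resource_Y", "viaf:id_y"),
]
groups = group_identifiers(pairs)
```

Here the three identifiers for the first resource end up in one group even though DBpedia and GeoNames were never linked directly, because both are linked to the same Freebase identifier.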
Another issue is intellectual property rights: ownership issues can hinder progress towards openness. Then there are difficulties with text recognition. These are mostly related to OCR errors, but they also stem from historical language variations, name variants and other types of ambiguity. Finally, manual intervention is indispensable. We will need crowdsourcing to check, validate and correct links that have been generated automatically.
The National library expects to achieve the most important results in two areas: first, enhanced discoverability and second, data enrichments. Both are essential to keep the library relevant in the digital age.
Linked Open Data is a powerful method to link digital heritage at a national or international level, especially because the data is openly available. Relations between resources transcend organisational, national and language barriers. The identification and linking of data helps to transform library collections into (machine and human readable) information and knowledge. This will allow for much richer search and discovery opportunities.
The National library’s Senior researcher Theo van Veen sees the future of Linked Open Data as “a single worldwide resource description database which will replace most or all bibliographic thesauri, with all resources mentioned in metadata or text linking to the same single identifier”.