KB Research

Research at the National Library of the Netherlands

Tag: Linked Data

Symposium Announcement: ‘Open Data for the Social Sciences and Humanities’

Date: Friday, October 30th 2015, 12.00 – 17.00
Location: National Library of the Netherlands in The Hague (KB), auditorium

The Talk of Europe – Travelling CLARIN Campus project aims to facilitate and stimulate pan-European collaboration in the Humanities, Social Sciences and Computer Science around the proceedings of the European Parliament (EP), by organising three international creative camps in 2014 and 2015. These proceedings are a rich source for humanities and social sciences researchers who focus on areas such as European history, integration and politics. Given their multilinguality, they are also a rich source for linguists. The Talk of Europe (TOE) project team has made these proceedings available as Linked Data for reuse and research purposes. The creative camps intend to stimulate and explore this rich source by bringing together academics from the humanities, social sciences, computer science and related disciplines. The Talk of Europe project, an initiative of CLARIN ERIC and CLARIN-NL made possible by NWO and OCW support, is a collaboration of the Erasmus University Rotterdam (EUR), VU University Amsterdam (VUA), National Research Institute for Mathematics and Computer Science (CWI), DANS and the Netherlands Institute for Sound and Vision (NISV).
For more information, see: http://www.talkofeurope.eu/

The third and final Creative Camp will be organised from 26 – 30 October 2015 at the National Library of the Netherlands in The Hague. On Friday, October 30th a free public symposium will be held, titled ‘Open Data for the Social Sciences and Humanities’. All those interested are invited to attend. Participants can look forward to the following invited talks, which are sure to inspire and ignite discussion and debate:

12.00-13.00         Lunch buffet

13.00-13.45        ‘Measuring Political and Social Phenomena on the Web’
Presentation by prof. dr. Markus Strohmaier

Markus Strohmaier is a Full Professor of Web Science at the Faculty of Computer Science at the University of Koblenz-Landau and Scientific Director of the Computational Social Science department at GESIS – the Leibniz Institute for the Social Sciences. His main research interests include Web Science, Social and Semantic Computing, Social Software Engineering, Networks and Data Mining.

See: http://markusstrohmaier.info/

14.00-14.45       Presentations by two teams participating in Talk of Europe Creative Camp #3

15.00-15.30       ‘Who killed whom in the Gaza war? Using syntactic information for relational corpus analysis’
Presentation by Wouter van Atteveldt and Kasper Welbers

Wouter van Atteveldt is assistant professor at the VU University of Amsterdam, department of Communication Sciences. He studies political communication, especially the antecedents and consequences of mass media coverage of political discourse. His research has a strong methodological focus on using AI and computational NLP techniques to improve automatic text (content) analysis. For more information, see: http://vanatteveldt.com/

Kasper Welbers works at the VU University of Amsterdam as a PhD candidate. In his research he focuses on changes in the gatekeeping process due to the proliferation of digital media technologies. Specifically, he studies the interaction between gatekeepers by using automatic content analysis to trace news diffusion patterns.

15.30-16.00         Presentation by Maarten Brinkerink 

Maarten Brinkerink is Specialist Public Participation and Innovative Access for the Department of Knowledge and Innovation at the Netherlands Institute for Sound and Vision. He coordinates the contribution of the institute to (inter)national research projects and contributes to its strategic policy. Brinkerink strengthens the wider heritage sector by participating in initiatives such as Open Culture Data and the Network Digital Heritage.
For more information, see: http://www.beeldengeluid.nl/en/kennis/experts/maarten-brinkerink

15.45-16.30       Drinks

There is no charge for this symposium (lunch included), but registration is requested. If you would like to attend the event, please send a short message to Jill Briggeman (briggeman@eshcc.eur.nl).
For more information, see: http://www.talkofeurope.eu/2015/10/symposium-announcement/

Address National Library (auditorium):
Prins Willem-Alexanderhof  5
2595 BE The Hague
Directions can be found here: https://www.kb.nl/en/visitors/address-and-directions

PoliMedia project with KB data wins LODLAM Open Data Prize

This post was written by Martijn Kleppe and adapted for the blog by Lotte Wilms

During the LODLAM Challenge in Sydney this summer, CLARIN-NL project PoliMedia has won the Open Data Prize. PoliMedia assists researchers in analysing the media coverage of debates in the Dutch Parliament. The system automatically generates links between the debates and coverage about those debates in newspapers and radio bulletins, which have been provided by the KB. Out of 39 entries, the jury judged PoliMedia to be the most innovative project because of its optimal use of semantic techniques and transparent process in opening up the generated links to other researchers.

Previously, PoliMedia won the Veni LinkedUp Challenge and was a finalist of the Semantic Web Challenge. The project was a joint effort of the Erasmus University Rotterdam, TU Delft, Vrije Universiteit, the Netherlands Institute for Sound and Vision and the Koninklijke Bibliotheek, National Library of the Netherlands, and was financed by CLARIN-NL.

For more information, please visit www.polimedia.nl or see the video at http://polimedia.nl/about. Do you also want to use a KB data set for a project? Leave a comment below or contact us at dh @ kb . nl.

What’s happening with our digitised newspapers?

The KB has about 10 million digitised newspaper pages, ranging from 1650 to 1995. We have negotiated rights to make these pages available for research, and over the past years more and more research projects have made use of them. We thought that many of these projects might be interested in knowing what others are doing, and we wanted to provide a networking opportunity for them to share their results. This is why we organised a symposium focusing on the digitised newspapers of the KB, which was a great success!

Prof. dr. Huub Wijfjes (RUG/UvA) showing word clouds used in his research.

Linked Open Data at the National Library of the Netherlands

Authors: Theo van Veen and Sieta Neuerburg

According to the National Library of the Netherlands, libraries are positioned at the very core of the Semantic Web. In the words of Hans Jansen, Head of the library’s Innovation and Development department, “Linking data is the way forward for libraries. Any cultural heritage institution that does not invest in linking data will become obsolete.”

Linked Open Data video by Europeana: https://vimeo.com/36752317

The National Library has completed several successful Linked Data projects based on metadata linking. For example, for our newspaper app Here was the news (available in Dutch only), we added latitude and longitude data to our Dutch historical newspaper articles. The app allows users to search the newspaper collection by location. Moreover, the library has added links between its own journal collection and the TV and radio recordings of the Netherlands Institute for Sound and Vision. (This feature is not yet available on our website.)
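
To illustrate what such a geographical enrichment can look like as Linked Data, here is a minimal sketch using the Python rdflib library and the W3C WGS84 vocabulary; the article URI and coordinates are made up for the example and do not reflect the KB's actual data model:

# Hypothetical geo-enrichment: attach WGS84 coordinates to an article
# so that the collection can be searched by location.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")
article = URIRef("http://data.example.org/newspapers/article/123")  # illustrative URI

g = Graph()
g.bind("geo", GEO)
g.add((article, GEO.lat, Literal("52.08", datatype=XSD.decimal)))   # roughly The Hague
g.add((article, GEO["long"], Literal("4.31", datatype=XSD.decimal)))
print(g.serialize(format="turtle"))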

While the work on linking our metadata continues, the library’s Research department is currently involved in an effort to link named entities in our full text collections. The purpose of this project is to contribute to a fully Linked Open Data-enabled library, entailing an enhanced user experience and improved discovery based on semantic relations. To further contribute to the progress of the semantic web, we will offer our full text enrichments as open data.

Linking named entities

To achieve our objectives, we need to be able to identify and link relevant named entities in our text collections. As part of the Europeana Newspapers project, we programmed a machine-learning tool to identify named entities in our full text collections. This software will allow us to extract all named entities from our full text collections and link them to related resources and resource descriptions. The software and documentation are available on GitHub.
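
As an illustration of what such an extraction step does, here is a minimal stand-in using the off-the-shelf spaCy library rather than the KB's own tool, which is trained specifically for historical newspaper text and will behave differently:

# Stand-in NER step: find named entities in a stretch of digitised text.
# Requires: pip install spacy && python -m spacy download nl_core_news_sm
import spacy

nlp = spacy.load("nl_core_news_sm")  # assumes the Dutch model is installed

def extract_entities(fulltext):
    """Return (surface form, entity type) pairs found in the text."""
    doc = nlp(fulltext)
    return [(ent.text, ent.label_) for ent in doc.ents]

print(extract_entities("Koningin Wilhelmina bezocht gisteren Rotterdam."))
# e.g. [('Koningin Wilhelmina', 'PER'), ('Rotterdam', 'LOC')]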

The Research department created an enrichment database to collect information about the named entities and to store links to external resources. We are currently linking the named entities to DBpedia, while simultaneously storing links to related resource descriptions in Freebase and VIAF. This will be further extended to other resources, such as genealogy databases. Additionally, we will develop software for ‘socially enhanced linking’, i.e. tools allowing users to validate or reject links that were obtained automatically and to create new links for resources.
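
A minimal sketch of the DBpedia linking step, assuming the SPARQLWrapper library and a plain label match against the public endpoint; real entity linking also requires disambiguation based on context and entity type:

# Look up candidate DBpedia resources for a named entity by its label.
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_candidates(name, lang="nl"):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?resource WHERE {
            ?resource rdfs:label "%s"@%s .
        } LIMIT 10
    """ % (name, lang))
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["resource"]["value"] for b in results["results"]["bindings"]]

print(dbpedia_candidates("Rotterdam"))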

Challenges

Within the project, we still face some important challenges. A first problem is the issue of incomplete external coverage. Not all named entities are covered by resource description databases such as DBpedia and Freebase. Historical entities are especially neglected, and international databases such as DBpedia frequently omit even well-known Dutch figures. Moreover, there is no single global identifier for a resource. Resource description databases – such as DBpedia, Freebase and Geonames – all use their own identifiers, resulting in single resources leading to multiple resource descriptions.
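
One common way to cope with this multiplicity is to tie the various external descriptions together with owl:sameAs links. A minimal sketch, assuming the enrichment database mints its own entity URIs (all URIs below are illustrative):

# Reconcile identifiers: one local entity, several external descriptions.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
entity = URIRef("http://data.example.org/entity/42")    # assumed local URI
for external in (
    "http://dbpedia.org/resource/Example_Person",       # illustrative
    "http://viaf.org/viaf/0000000",                     # illustrative
):
    g.add((entity, OWL.sameAs, URIRef(external)))
print(g.serialize(format="turtle"))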

Another issue that has to be dealt with is that of intellectual property rights: ownership issues can hinder progress towards openness. There are also textual recognition difficulties, mostly related to OCR issues but also to historical language variation, name variants and other types of ambiguity. And finally, manual intervention is indispensable: we will need crowdsourcing to check, validate and correct links which have been generated automatically.

Expected results

The National Library expects to achieve the most important results in two areas: first, enhanced discoverability, and second, data enrichments. Both are essential to keep the library relevant in the digital age.

Linked Open Data is a powerful method to link digital heritage at a national or international level, especially because the data is openly available. Relations between resources transcend organisational, national and language barriers. The identification and linking of data helps to transform library collections into (machine and human readable) information and knowledge. This will allow for much richer search and discovery opportunities.

The National Library’s senior researcher Theo van Veen sees the future of Linked Open Data as “a single worldwide resource description database which will replace most or all bibliographic thesauri, with all resources mentioned in metadata or text linking to the same single identifier”.

Linked Open Data and the STCN

Author: Fernie Maas, VU University f.g.t.maas@vu.nl

In 2012, several projects were funded at VU University, the University of Amsterdam and the Royal Netherlands Academy of Arts and Sciences (KNAW), all under the umbrella of the Centre for Digital Humanities. The short and intensive research projects (approximately 9 months) combined methodologies from the traditional humanities disciplines with tools provided by computing and digital publishing. One of these projects, Innovative Strategies in a Stagnating Market. Dutch Book Trade 1660-1750, was based at VU University. Historians worked together with computer scientists of the Knowledge Representation and Reasoning Group in dealing with a specific dataset: the Short Title Catalogue, Netherlands (STCN). The project description and the research report (plus appendix) can be found here.

The project was set up within the research focus of the Golden Age cultural industries and dealt with the way early modern book producers interacted with the market, especially in times of stagnating demand and increasing competition. Book historians and economic historians, as well as scholars dealing with modern-day cultural industries, have described several strategies that often occur when times get tough. A common denominator seems to be the constant search for a balance between inventing new products on the one hand and appealing to recognizable concepts on the other. In short: differentiating, rather than revolutionizing, was (and is) seen as a key to survival. A case study was set up around the fictitious Marteau imprint, which was used to cover up the provenance of controversial books. Contemporary book producers and authors had already noticed that the prohibition of, or suspicion around, certain books could spark a desire for exactly those books, eventually influencing sales.

Records in the STCN [http://bit.ly/1aHtBWs]

The STCN is an important dataset for studying early modern Dutch book trade and production, offering information about 200,000 titles from the period 1540-1800 (see ill. 1). The project team was provided with a bulk download of the STCN data to work and play around with. This dataset was converted to the Resource Description Framework (RDF), a set of W3C specifications designed as a metadata data model. RDF is used as a conceptual description method in computing: described entities of the world are represented as nodes (e.g. “Dante Alighieri” or “The Divine Comedy”), while the relationships between these nodes are represented as edges connecting them (e.g. “Dante Alighieri” “wrote” “The Divine Comedy”). The redactiebladen (i.e. records) of the STCN have a very specific syntax of KMCs (kenmerkcodes, field codes), which contain information about author, title, place of publication, year of publication, etc. This syntax is interpreted by a program that reads the redactiebladen and extracts the relevant properties about authors, titles, publishers, places, and the like. It then generates the RDF graph, linking all these entities together, and writes the results to a file. This file is exposed online and can be queried live using the query language SPARQL.
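
As a minimal sketch of that conversion step, assuming Python with the rdflib library; the namespace, property names and book URI are illustrative stand-ins for the project's actual KMC mapping:

# Convert one (heavily simplified) parsed redactieblad into RDF triples.
from rdflib import Graph, Literal, Namespace

STCN = Namespace("http://example.org/stcn/")  # assumed namespace
record = {                                    # a parsed record, reduced to three fields
    "author": "Dante Alighieri",
    "title": "The Divine Comedy",
    "placeOfPublication": "Amsterdam",
}

g = Graph()
book = STCN["book/0001"]                      # illustrative book URI
for prop, value in record.items():
    g.add((book, STCN[prop], Literal(value)))
print(g.serialize(format="turtle"))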

Size of titles under imprint of Marteau in the STCN [http://bit.ly/1bdzDsR]

The RDF conversion makes it possible to query the data independently of the interface the STCN offers. The regular STCN interface offers multiple ways of querying the data, especially in its ‘advanced search’ setting. However, the possibilities to filter and sort the data on different properties are limited to three fields, in combination with filtering on years of publication. A question such as ‘In which sizes were publications under the Marteau imprint mostly published?’ has to be broken down into several steps in the STCN, namely retrieving a list (and consequently a count) of Marteau publications for each size used, separately. By querying the RDF graph, this output can be retrieved in one go (see ill. 2). This query structure also allows information to be visualized quite quickly, for example the occurrence of Marteau titles in the STCN over time (see ill. 3).
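
For illustration, such a ‘one go’ query could look like the following sketch; the endpoint URL and property names are assumptions, since the published STCN graph may use different terms:

# Count Marteau titles per size/format in a single grouped SPARQL query.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/stcn/sparql")  # assumed endpoint
sparql.setQuery("""
    PREFIX stcn: <http://example.org/stcn/>
    SELECT ?size (COUNT(?book) AS ?n) WHERE {
        ?book stcn:publisher "Pierre Marteau" ;
              stcn:format ?size .
    }
    GROUP BY ?size
    ORDER BY DESC(?n)
""")
sparql.setReturnFormat(JSON)
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["size"]["value"], b["n"]["value"])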

Titles with the fictitious imprint of Marteau in the STCN [http://bit.ly/1dJFDzq]

Publishing structured data by means of RDF is a component of the Linked Open Data approach, which means the converted STCN dataset can be linked to other datasets. In linking the datasets, the provenance of the data stays intact, allowing, for example, updates of the dataset to be integrated. Lists of forbidden, prohibited and condemned books (e.g. Knuttel) are in the process of being connected to the STCN, a link that could answer questions about the actual number of Marteau titles under investigation or suspicion. Also, combining and comparing the years and reasons of prohibition from the lists of forbidden books with the date and place of publication in the STCN could reconstruct a timeline of prohibition and publication, revealing a publisher’s strategy when the date of prohibition precedes the date of publication.
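
Once the prohibition lists are linked, the timeline question could be answered with a combined query along these lines; the fb: vocabulary for the forbidden-book lists is hypothetical, as are the endpoint and property names:

# Pair each Marteau title's publication year with its prohibition year,
# so cases where prohibition precedes publication stand out.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/stcn/sparql")  # assumed endpoint
sparql.setQuery("""
    PREFIX stcn: <http://example.org/stcn/>
    PREFIX fb:   <http://example.org/forbidden-books/>
    SELECT ?book ?published ?prohibited WHERE {
        ?book stcn:publisher "Pierre Marteau" ;
              stcn:yearOfPublication ?published .
        ?ban  fb:concerns ?book ;
              fb:year     ?prohibited .
    }
    ORDER BY ?published
""")
sparql.setReturnFormat(JSON)
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["book"]["value"], b["published"]["value"], b["prohibited"]["value"])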

The report mentioned above describes more examples, queries, and the overall rather exploratory course of the project. The pilot character of the project has allowed the team to explore the (im)possibilities of the dataset, to become aware of the importance of expert knowledge, and to strengthen the collaboration between humanities researchers and computer scientists. Further research and collaboration with the STCN and book historians will be aimed at improving the infrastructure of the dataset, a better understanding of the statistical relevance of our queries, and a conceptualization of the relation between the publications, their producers, and their settings and editions.
