PoliMedia project with KB data wins LODLAM Open Data Prize

This post was written by Martijn Kleppe and adapted for the blog by Lotte Wilms

During the LODLAM Challenge in Sydney this summer, the CLARIN-NL project PoliMedia won the Open Data Prize. PoliMedia assists researchers in analysing the media coverage of debates in the Dutch Parliament. The system automatically generates links between the debates and coverage of those debates in newspapers and radio bulletins, which have been provided by the KB. Out of 39 entries, the jury judged PoliMedia to be the most innovative project because of its effective use of semantic techniques and its transparent process for opening up the generated links to other researchers.

Previously, PoliMedia won the Veni LinkedUp Challenge and was a finalist in the Semantic Web Challenge. The project was a joint effort of the Erasmus University Rotterdam, TU Delft, the Vrije Universiteit, the Netherlands Institute for Sound and Vision and the Koninklijke Bibliotheek (National Library of the Netherlands), and was financed by CLARIN-NL.

For more information, please visit www.polimedia.nl or see the video at http://polimedia.nl/about. Do you also want to use a KB data set for a project? Leave a comment below or contact us at dh @ kb . nl.

Digital Humanities at the KB

As promised, a blog post about our poster at DH Benelux, but I didn't want to simply publish the poster, so here is the explanation that goes with it. The abstract we submitted was a very general story about what happens at the KB with regard to Digital Humanities, and the poster we developed out of this is one we hope you'll see more often, because we love talking to you and promoting our stuff! But what was on this poster, and what is actually happening with DH at the KB? In our strategic plan for 2015-2018 we refer to Digital Humanists as the top layer of our user pyramid:

The top layer is formed by a relatively small, but growing group. They are researchers and developers who use the large textual data sets that the KB has built up with its partners during the past few years. More and more humanities researchers use tools to extract information and visualize data, to get a grip on data sets that can no longer be analyzed in the traditional way (big data). The KB actively supports this form of Humanities, Digital Humanities. (p. 10)


Continue reading

DH Benelux 2015: Tools, research, reflection

This blog post was written by Adeline van den Berg and Lotte Wilms

On 8 and 9 June 2015, the second DH Benelux conference took place, bringing approx. 150 Digital Humanists together in the beautiful building of the University of Antwerp. Apart from great lunches, conversations and a poster reception alongside penguins and flamingos, a few things stood out for us. Below we sum up what we think were common threads, with the help of some tweets.

Continue reading

Supporting History Research with Temporal Topic Previews at Querying Time

This post was written by Dr. Jiyin He, Researcher-in-residence at the KB Research Lab from June to October 2014.

Being able to study primary sources is pivotal to the work of historians. Today's mass digitisation of historical records such as books, newspapers, and pamphlets provides researchers with the opportunity to study an unprecedented amount of material without the need for physical access to archives. Access to this material is provided through search systems; however, the effectiveness of such systems seems to lag behind that of the major web search engines. Web search engines owe much of their effectiveness to the redundancy of information, to the fact that popular material is often relevant material, and to the possibility of using the preferences of other users to determine what you would find relevant. These properties do not hold, or are unavailable, for collections of historical material. For the past three months I have worked at the KB as a guest researcher. Together with Dr. Samuël Kruizinga, a historian, I explored how we could enhance the search system at the KB to address the search challenges of the historian. In this blog post, I will share our experience of working together, the system we have developed, and the lessons learnt during this project.

Continue reading

Researcher-in-residence at the KB

At DH2013, we presented a poster asking researchers what they need from a National Library. The responses varied from 'Nothing, just give us your data' to 'We'd like to be fully supported with tools and services', showing once again that different users have different requirements. In order to accommodate all groups of researchers, the Collections department of the KB, which 'owns' the data, and the Research department, where tools and services are developed, combined efforts and spoke to scholars to discuss the best way of supporting their work. However, we noticed that it was still quite difficult to get a good idea of how they used our data and in what way our actions and decisions would benefit them. It also seemed that researchers were often not aware of the activities we undertake in this respect, which led to work being done twice.

Continue reading

What’s happening with our digitised newspapers?

The KB has about 10 million digitised newspaper pages, ranging from 1650 to 1995. We have negotiated the rights to make these pages available for research, and over the past few years more and more projects have made use of them. We thought that many of these projects might be interested in knowing what the others are doing, and we wanted to provide a networking opportunity for them to share their results. This is why we organised a newspapers symposium focusing on the digitised newspapers of the KB, which was a great success!

Prof. dr. Huub Wijfjes (RUG/UvA) showing word clouds used in his research.

Continue reading

Workshop topic modelling with MALLET at KB

[A] topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents (Wikipedia).

Topic modelling is a very popular method in the Digital Humanities for discovering more about a large set of data, and it is also used by many researchers working with data from the KB. Unfortunately, not all topic modelling tools are easy to get started with, for example because of the technical skills they require or because of limited access to the data. The current guest researcher at the KB, Dr. Samuël Kruizinga, ran into such problems while doing his research into the memory of the First World War in the KB newspapers. Not only was it difficult for him to select a corpus to work with, but he was also unfamiliar with the go-to tool MALLET. Luckily, his university (Universiteit van Amsterdam) wanted to help and provided funds to organise a workshop, not only for him but also for other academics interested in topic modelling.
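For readers who have never run a topic model, the snippet below is a minimal sketch in Python of what such a model produces. It uses gensim's LDA implementation rather than MALLET itself (MALLET is a Java command-line toolkit), and the four toy "documents" merely stand in for a real newspaper corpus; nothing here is code from the workshop.

```python
# Minimal topic-modelling sketch with gensim (an alternative to MALLET).
from gensim import corpora, models

# Toy documents standing in for OCR'd newspaper articles.
documents = [
    "soldiers trenches war memorial armistice",
    "parliament debate minister budget vote",
    "war veterans commemoration armistice ceremony",
    "minister parliament coalition budget debate",
]
texts = [doc.split() for doc in documents]

dictionary = corpora.Dictionary(texts)                  # word -> integer id
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

# Train a small LDA model; num_topics would normally be tuned per corpus.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Print the top words per discovered topic.
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```

On a real corpus the hard part is the preparation: selecting the documents, cleaning the OCR and choosing the number of topics.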

Continue reading

Linked Open Data and the STCN

Author: Fernie Maas, VU University f.g.t.maas@vu.nl

In 2012, several projects were funded at VU University, the University of Amsterdam and the Royal Netherlands Academy of Arts and Sciences (KNAW), all under the umbrella of the Centre for Digital Humanities. The short and intensive research projects (approximately 9 months) combined methodologies from the traditional humanities disciplines with tools provided by computing and digital publishing. One of these projects, Innovative Strategies in a Stagnating Market. Dutch Book Trade 1660-1750, was based at VU University. Historians worked together with computer scientists of the Knowledge Representation and Reasoning Group in dealing with a specific dataset: the Short Title Catalogue, Netherlands (STCN). The project description and the research report (plus appendix) can be found here.

The project was set up within the research focus on Golden Age cultural industries and dealt with the way early modern book producers interacted with the market, especially in times of stagnating demand and increasing competition. Book historians and economic historians, as well as scholars dealing with modern-day cultural industries, have described several strategies that often occur when times get tough. A common denominator seems to be the constant search for a balance between inventing new products on the one hand and appealing to recognizable concepts on the other. In short: differentiating, rather than revolutionizing, was (and is) seen as a key to survival. A case study was set up around the fictitious imprint of Marteau, an imprint used to cover up the provenance of controversial books. Contemporary book producers and authors had already noticed that the prohibition of, or suspicion around, certain books could spark a desire for exactly those books, eventually influencing sales.

Records in the STCN [http://bit.ly/1aHtBWs]

The STCN is an important dataset for studying early modern Dutch book trade and production, offering information about 200,000 titles from the period 1540-1800 (see ill. 1). The project team was provided with a bulk download of the STCN data to work and play around with. This dataset was converted to RDF (Resource Description Framework), a set of W3C specifications designed as a metadata data model. RDF is used as a conceptual description method in computing: entities of the world are represented as nodes (e.g. "Dante Alighieri" or "The Divine Comedy"), while the relationships between these nodes are represented as edges connecting them (e.g. "Dante Alighieri" "wrote" "The Divine Comedy"). The redactiebladen (i.e. records) of the STCN follow a very specific syntax of KMCs (kenmerkcodes), which contain information about author, title, place of publication, year of publication, and so on. A program interprets this syntax: it reads the redactiebladen, extracts the relevant properties about authors, titles, publishers, places and the like, generates the RDF graph that links all these entities together, and writes the result to a file. This file is exposed online and can be queried live using the query language SPARQL.
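To make the conversion step more concrete, here is a minimal sketch in Python using the rdflib library. It is not the project's actual converter: the namespace, property names and record values are invented for the example, and a real run would parse the KMC fields of the redactiebladen instead of hard-coding them.

```python
# Hedged sketch: turning one STCN-like record into RDF triples with rdflib.
from rdflib import Graph, Literal, Namespace, URIRef

STCN = Namespace("http://example.org/stcn/")        # hypothetical namespace
DC = Namespace("http://purl.org/dc/elements/1.1/")  # Dublin Core properties

g = Graph()
book = URIRef("http://example.org/stcn/record/123456")  # invented identifier

# Properties that would normally be read from the KMC fields of a redactieblad.
g.add((book, DC.creator, Literal("Dante Alighieri")))
g.add((book, DC.title, Literal("The Divine Comedy")))
g.add((book, STCN.placeOfPublication, Literal("Amsterdam")))
g.add((book, STCN.yearOfPublication, Literal("1700")))

# Write the graph to a Turtle file that could sit behind a SPARQL endpoint.
g.serialize(destination="stcn-sample.ttl", format="turtle")
```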

Size of titles under imprint of Marteau in the STCN [http://bit.ly/1bdzDsR]

The RDF conversion makes it possible to query the data independently of the interface the STCN offers. The regular STCN interface offers multiple ways of querying the data, especially in its 'advanced search' setting. However, the possibilities to filter and sort the data on different properties are limited to three fields, in combination with filtering on years of publication. A question such as 'in which sizes were publications under the Marteau imprint mostly published?' has to be broken down into several steps in the STCN, namely retrieving a list (and consequently a count) of Marteau publications for each size separately. By querying the RDF graph, this output can be retrieved in one go (see ill. 2). This query structure also allows information to be visualized quite quickly, for example the occurrence of Marteau titles in the STCN over time (see ill. 3).
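For illustration, the 'one go' query could look roughly like the sketch below, sent from Python with the SPARQLWrapper library. The endpoint URL and property names are assumptions made for this example and do not match the actual STCN conversion; the point is the GROUP BY, which returns a count per book size in a single request.

```python
# Hedged sketch: count Marteau-imprint titles per book size in one SPARQL query.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/stcn/sparql")  # hypothetical endpoint
sparql.setQuery("""
    PREFIX stcn: <http://example.org/stcn/>
    SELECT ?size (COUNT(?book) AS ?titles)
    WHERE {
        ?book stcn:publisher "Pierre Marteau" ;
              stcn:bookSize  ?size .
    }
    GROUP BY ?size
    ORDER BY DESC(?titles)
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["size"]["value"], row["titles"]["value"])
```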

Titles with the fictitious imprint of Marteau in the STCN [http://bit.ly/1dJFDzq]

Publishing structured data by means of RDF is a component of the Linked Open Data approach, which means the converted STCN dataset can be linked to other datasets. In linking the datasets, the provenance of the data stays intact, which allows, for example, updates of the dataset to be integrated. Lists of forbidden, prohibited and condemned books (e.g. Knuttel) are in the process of being connected to the STCN, a link that could answer questions about the actual number of Marteau titles under investigation or suspicion. Also, combining and comparing the information about the years and reasons of prohibition from the lists of forbidden books with the information about the date and place of publication in the STCN could reconstruct a timeline of prohibition and publication, revealing a publisher's strategy when the date of prohibition precedes the date of publication.

The report mentioned above describes more examples and queries, and the overall, rather exploratory course of the project. The pilot character of the project has allowed the team to explore the (im)possibilities of the dataset, to become aware of the importance of expert knowledge, and to strengthen the collaboration between humanities researchers and computer scientists. Further research and collaboration with the STCN and book historians will be aimed at improving the infrastructure of the dataset, a better understanding of the statistical relevance of our queries, and a conceptualization of the relation between the publications, their producers, and their settings and editions.

Presenting European Historic Newspapers Online

As was posted earlier on this blog, the KB participates in the European project Europeana Newspapers. In this project, we are working together with 17 other institutions (libraries, technical partners and networking partners) to make 18 million European newspaper pages available via Europeana at title level. In addition, The European Library is building a dedicated portal to make the newspapers available as full text as well. However, many of the libraries do not yet have OCR for their newspapers, which is why the project is working with the University of Innsbruck, CCS Content Conversion Specialists GmbH from Hamburg and the KB to enrich these pages with OCR, Optical Layout Recognition (OLR) and Named Entity Recognition (NER).

Hans-Jörg Lieder of the Berlin State Library presents the Europeana Newspapers Project at our September 2013 workshop in Amsterdam.

In June, the project had a workshop on refinement, but it was now time to discuss aggregation and presentation. This workshop took place in Amsterdam on 16 September, during The European Library Annual Event. There was a good group of people, not only from the project partners and the associated partners, but also from outside the consortium. After the project, TEL hopes to be able to also offer these institutions a chance to send in their newspapers for Europeana, so we were very happy to have them join us.

The workshop kicked off with an introduction from Marieke Willems of LIBER and Hans-Jörg Lieder of the Berlin State Library. They were followed by Markus Muhr from TEL, who introduced the aggregation plan and the schedule for the project partners. With so many partners, it can be quite difficult to find a schedule that works well and ensures everyone sends in their material on time. After the aggregation, TEL will then have to do some work on the metadata to convert it to the Europeana Data Model. Markus was followed by a presentation from Channa Veldhuijsen of the KB, who unfortunately could not be there in person. Nevertheless, her elaborate presentation on usability testing provided some good insights into how to make your website the best it can be and how to find out what your users really think while they are browsing your site.

It was then time for Alastair Dunning from TEL to showcase the portal they have been preparing for Europeana Newspapers. Unfortunately, the wifi connection was not up to so many visitors, and only some people could follow his presentation along on their own devices. Still, there were some valuable feedback points which TEL will use to improve the portal. The portal is not yet publicly available, so people who missed the presentation will need to wait a bit longer to see and browse the European newspapers.

What we can already see, however, are some partner websites that have been online for a while. It was very interesting to see the different choices each partner made to showcase their collection. We heard from people from the British Library, the National and University Library of Iceland, the National and University Library of Slovenia, the National Library of Luxembourg and the National Library of the Czech Republic.

Yves Mauer from the National Library of Luxembourg presenting their newspaper portal

The day ended with a lovely presentation by Dean Birkett of Europeana, who, partly using Channa's notes, went through all the previously presented websites and offered comments on how to improve them. The videos he used in his talk are available on YouTube. His key points were:

  1. Make the type size large: 16px is the recommended size.
  2. Be careful with colours. Some online newspaper sites use red to highlight important information, but red is normally associated with warning signals and errors in the user's mind.
  3. Use words to indicate language choices (e.g. 'English', 'Français'), not flags. The Spanish flag won't necessarily be interpreted to mean 'click here for Spanish' if the user is from Mexico.
  4. Cut down on unnecessary text. Make it easy for users to skim (e.g. through the use of bullet points).

All in all, it was a very useful afternoon in which I learned a lot about what users want from a website. If you want to see more, all presentations can be found on the SlideShare account of Europeana Newspapers, or join us at one of the following events:

  • Workshop on Newspapers in Europe and the Digital Agenda. British Library, London. September 29-30th, 2014.
  • National Information Days.
    • National Library of Austria. March 25-26th, 2014.
    • National Library of France. April 3rd, 2014.
    • British Library. June 9th, 2014.

KB at DH2013

So, how do you summarise a 4-day conference with 159 papers, 52 posters, 13 workshops and 9 panels in one blogpost? You don’t… But I am going to try anyway!

I had the pleasure of attending and presenting a poster at the DH2013 conference this year, which took place two weeks ago in Lincoln, Nebraska. It was my first time attending the event and I was not disappointed! After a 14-hour trip (and a good night's sleep), I started off my DH2013 experience with a wonderful workshop about Voyant, a web-based reading and analysis environment for digital texts. All the material from the workshop is available online: http://hermeneuti.ca/workshop/dh13.

DH2013 logo

After introductions of both the people there and the tool itself, we formed groups to discuss how Voyant could be of use in our work. I was happy to see that we had quite a big group of librarians there, so of course we discussed both how we could show our own data to our users and how we could introduce the tool to students or professors at the university libraries. I'd love to see what our data looks like, so that's a nice task for the coming months!

The main part of the conference started on my second day, and I mainly spent my hours in the various short paper sessions held in the conference hotel. There were five papers in each session, grouped around a similar topic. Being a text junkie, I listened to a lot of text analysis and stylometry, but I also found the time to visit papers on how to best serve researchers with tools, environments and other helpful equipment.

Some of my highlights included the paper by Anna Jobin and Frédéric Kaplan on Google's AdWords lexicon, which consists not only of expensive and cheap words, but also of misspelled and non-existent words. The team at the DHLab of EPFL is undertaking a case study on the linguistic effects of autocompletion algorithms and keyword bidding.

During the poster presentation

Another paper that stuck with me was that of our colleague from the British Library, Nora McGregor, who introduced their Digital Scholarship Training Programme. The BL has set up a training programme consisting of 15 courses, all about anything digital. Their (not so digital?) curators can take a class on, for example, HTML, metadata formats, the BL's digital collections or linked data. These entry-level courses can be taken by anyone in the library with an interest in the digital world.

Most people took about three courses, but there was also someone who completed the entire programme. Food for thought here at the KB! We have what we call 'kennis sessies' (knowledge exchange sessions) and also offer some excellent in-house courses on copyright and digital preservation, but we have never looked at these as an entire course load. Perhaps we should!

KB and BL Poster

And then the absolute top highlight of my DH2013 experience: our poster! The organisation arranged a very nice and spacious room where we had our poster up on its own board, giving us the opportunity to gather a big crowd of people around us. Luckily, that is exactly what happened! I spoke to many people about our data (there actually is an interest in Dutch data in the US!) and about what they would expect from a national library like ours or the BL. Want to leave your feedback as well? Please fill out our survey and help us improve!

So, with this blog post I have NOT done justice to the conference at all, but I have given you a very short overview of some of my wonderful experiences in Nebraska. Please do read other posts to learn about the rest of the talks, such as the wonderful keynotes by David Ferriero, Willard McCarty and my favourite, Isabel Galina.

I met many very interesting people while there and was happy to find out I was accompanied by a lot of librarians. However, most of them were from university libraries. So, national librarians, we need you! Want to share experiences on how we work with digital humanists? Please do get in touch! And researchers? I just want to mention our survey once more.