Bridging the gap between quantitative and qualitative research in digital newspaper archives

This blog post is written by Thomas Smits, KB Researcher-in-residence from May 2017

One of the central and most far-reaching promises of the so-called Digital Humanities has been the possibility of analysing large datasets of cultural production, such as books, periodicals, and newspapers, in a quantitative way. Since the early 2000s, humanities 3.0, as Rens Bod has called it, has been posited as being able to discover new patterns, mostly over long periods of time, that were overlooked by traditional qualitative approaches.[1] In the last couple of weeks a study by a team of academics led by Professor Nello Cristianini of the University of Bristol made headlines: “This AI found trends hidden in British history for more than 150 years” (Wired) and “What did Big Data find when it analysed 150 years of British history?” Did Big Data and Humanities 3.0 finally deliver on their promise? And could the KB’s collection of digitised newspapers be used for similar research?

Detecting broken ISO images: introducing Isolyzer

In my previous blog post I addressed the detection of broken audio files in an automated workflow for ripping audio CDs. For (data) CD-ROMs and DVDs that are imaged to an ISO image, a similar problem exists: how can we be reasonably sure that the created image is complete? In this blog post I will discuss some possible ways of doing this using existing tools, along with their limitations. I then introduce Isolyzer, a new tool that might be a useful addition to the existing methods.
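As a rough illustration of what such a completeness check can look like (a minimal sketch, and not necessarily how Isolyzer or any of the existing tools work), the size that an ISO 9660 image declares in its Primary Volume Descriptor can be compared against the actual file size. The function names below are hypothetical, and the sketch assumes a plain ISO 9660 image whose Primary Volume Descriptor sits at sector 16 with 2048-byte sectors.

```python
import os
import struct

SECTOR_SIZE = 2048               # assumed sector size of the image
PVD_OFFSET = 16 * SECTOR_SIZE    # ISO 9660 volume descriptors start at sector 16


def declared_size(header: bytes) -> int:
    """Return the image size in bytes declared by the Primary Volume Descriptor.

    `header` must contain at least the first 17 sectors of the image.
    """
    pvd = header[PVD_OFFSET:PVD_OFFSET + SECTOR_SIZE]
    # Descriptor type 1 plus the standard identifier "CD001" marks a PVD.
    if len(pvd) < SECTOR_SIZE or pvd[0] != 1 or pvd[1:6] != b"CD001":
        raise ValueError("no ISO 9660 Primary Volume Descriptor at sector 16")
    # Volume space size (number of logical blocks), little-endian copy at byte 80.
    (volume_space_size,) = struct.unpack_from("<I", pvd, 80)
    # Logical block size, little-endian copy at byte 128.
    (logical_block_size,) = struct.unpack_from("<H", pvd, 128)
    return volume_space_size * logical_block_size


def is_truncated(actual_size: int, header: bytes) -> bool:
    """True if the file is smaller than the size the image itself declares."""
    # Images are often slightly larger than declared (padding), so only a
    # smaller-than-declared file is treated as suspicious here.
    return actual_size < declared_size(header)


def image_is_truncated(path: str) -> bool:
    """Convenience wrapper that reads the header from a file on disk."""
    with open(path, "rb") as f:
        header = f.read(PVD_OFFSET + SECTOR_SIZE)
    return is_truncated(os.path.getsize(path), header)
```

A check like this only catches images that are shorter than they claim to be. It says nothing about corrupted content inside an image of the right length, and other file systems on the disc (for example UDF or HFS) would need their own logic.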

Tackling problems and making progress

Our current Researcher-in-Residence, Frank Harbers, is well under way with his project “Discerning Journalistic Styles. Exploring Automated Analysis of Journalism’s Modes of Expression”. In this blog post he gives an update on the project and its progress.

Frank Harbers

It has been several months since I wrote the first blog post about my work as researcher-in-residence, and the research project is in full swing by now. The first phase of the project, connecting the metadata from my own database to the historical newspaper data (and metadata) in Delpher, is finished, and we are fully immersed in the main part of the project: training a classifier to automatically determine the genre of historical newspaper articles.

FAQ Call for Proposals Researcher-in-Residence 2017

 I don’t live or work in the Netherlands. Can I apply? 
Probably! Contact us and we’ll discuss your options.

I want to use my own dataset. Is that possible?
Sure! As long as you also use one of the datasets of the KB and it doesn’t limit the publication of the project end results.

I don’t know how to code, is that a problem?
Not at all. We have skilled programmers who can help you with your project or we will try to find a match for you if you prefer someone else. This would mean submitting as a team and will cut the budget in half. Reach out to us to discuss the options.

KB at DHBenelux 2016

This week, the annual DHBenelux conference will take place in Belval, Luxembourg. It will bring together practically all DH scholars from Belgium (BE), the Netherlands (NE) and Luxembourg (LUX). You can read the full program and all abstracts on the website. Two presentations are by members of our DH team (Steven Claeyssens & Martijn Kleppe) and one presentation is by our current researcher in residence (Puck Wildschut – Radboud University Nijmegen). Please find the first paragraphs of their abstracts below:

Valid, but not accessible EPUB: crazy fixed layouts

EpubCheck is an invaluable tool for assessing the quality of EPUB files. Still, it is possible that EPUBs that are valid according to the format specification (and thus EpubCheck) are nevertheless inaccessible to some users. Some weeks ago a colleague sent me an EPUB 2 file that produced some really strange behaviour across a number of viewer applications. For a start, the text wouldn’t reflow properly after re-sizing the viewer window, and increasing the font size resulted in garbled text. Running the file through EpubCheck did return some validation errors, but none of these were related to the behaviour I was getting. Closer inspection revealed some very peculiar stylesheet and HTML use.

The future of EPUB? A first look at the EPUB 3.1 Editor’s draft


About a month ago the International Digital Publishing Forum (IDPF), the standards body behind the EPUB format, published an Editor’s Draft of EPUB 3.1. This is meant to be the successor of the current 3.0.1 version. IDPF has set up a community review, which allows interested parties to comment on the draft. The proposed changes relative to EPUB 3.0.1 are summarised in this document. A note at the top states (emphasis added by me):

The EPUB working group has opted for a radical change approach to the addition and deletion of features in the 3.1 revision to move the standard aggressively forward with the overarching goals of alignment with the Open Web Platform and simplification of the core specifications.

As Gary McGath pointed out earlier, this is a pretty bold statement for what is essentially a minor version. The authors of the draft also mention that they expect it “will provoke strong reactions both for and against”, and that changes that raise “strong negative reactions” from the community “will be reviewed for future drafts”.

This blog post is an attempt to identify the main implications of the current draft for libraries and archives: to what degree would the proposed changes affect (long-term) accessibility? Since the current draft is particularly notable for its aggressive removal of various existing EPUB features, I will focus on these. These observations are all based on the 30 January 2016 draft of the changes document.

Digital Humanities at the KB

As promised, a blog post about our poster at DHBenelux. I didn’t want to simply publish the poster, so here is the explanation that goes with it. The abstract we submitted was a very general story about what happens in the KB with regard to Digital Humanities, and the poster we developed out of it is one we hope you’ll see more often, because we love talking to you and promoting our stuff! But what was on this poster, and what is actually happening with DH in the KB? In our strategic plan for 2015-2018 we refer to the Digital Humanists as the top layer of our user pyramid:

The top layer is formed by a relatively small, but growing group. They are researchers and developers who use the large textual data sets that the KB has built up with its partners during the past few years. More and more humanities researchers use tools to extract information and visualize data, to get a grip on data sets that can no longer be analyzed in the traditional way (big data). The KB actively supports this form of Humanities, Digital Humanities. (p. 10)


