This blog post is written by Thomas Smits, KB Researcher-in-residence from May 2017
One of the central and most far-reaching promises of the so-called Digital Humanities has been the possibility to analyse large datasets of cultural production, such as books, periodicals, and newspapers, in a quantitative way. Since the early 2000s, humanities 3.0, as Rens Bod has called it, was posited as being able to discover new patterns, mostly over long periods of time, that were overlooked by traditional qualitative approaches. In the last couple of weeks a study by a team of academics led by Professor Nello Christianini of the University of Bristol made headlines: “This AI found trends hidden in British history for more than 150 years” (Wired) and “What did Big Data find when it analysed 150 years of British history? (Phys.org). Did Big Data and Humanities 3.0 finally deliver on its promise? And could the KB’s collection of digitised newspapers be used for similar research?
At the end of December our current researcher-in-residence dr. Frank Harbers of Groningen University ended his project ‘Discerning Journalistic Styles’. In this blogpost he describes the outcomes and plans for the future.
It is January 2017, meaning my period as researcher-in-residence at the KB has come to an end. It also means that my project Discerning Journalistic Styles (DJS) has come to an end. It was a really nice and valuable experience and a fruitful project in which we (I couldn’t have done it without the expertise of KB programmer Juliette Lonij) have managed to create a classification tool that automatically determines the genre of news articles. You can try the tool yourself at: http://www.kbresearch.nl/genre. Just paste a Dutch news article in the text box, press the button below and the result will appear on the right side; simple as that!
In my previous blog post I addressed the detection of broken audio files in an automated workflow for ripping audio CDs. For (data) CD-ROMs and DVDs that are imaged to an ISO image, a similar problem exists: how can we be reasonably sure that the created image is complete? In this blog post I will discuss some possible ways of doing this using existing tools, along with their limitations. I then introduce Isolyzer, a new tool that might be a useful addition to the existing methods.
At the KB we have a large collection of offline optical media. Most of these are CD-ROMs, but we also have a sizeable proportion of audio CDs. We’re currently in the process of designing a workflow for stabilising the contents of these materials using disk imaging. For audio CDs this involves ‘ripping’ the tracks to audio files. Since the workflow will be automated to a high degree, basic quality checks on the created audio files are needed. In particular, we want to be sure that the created audio files are complete, as it is possible that some hardware failure during the ripping process could result in truncated or otherwise incomplete files.
To get a better idea of what software tool(s) are best suitable for this task, I created a small dataset of audio files which I deliberately damaged. I subsequently ran each of these files through a set of candidate tools, and then looked which tools were able to detect the faulty files. The first half of this blog post focuses on the WAVE format; the second half covers the FLAC format (at the moment we haven’t decided on which format to use yet).