KB Research

Research at the National Library of the Netherlands

Month: April 2013

Performance measurement – A state of the art review

Author: Henk Voorbij

The Handboek Informatiewetenschap (Handbook Information Science) contains about 150 contributions from experts on various subjects, such as budgeting, collection management, open access, website archiving, digitization of heritage collections, information retrieval software and gaming in the library. The Handbook is published as a loose-leaf paper version and as an online version (http://www.iwabase.nl/iwa-base/). Unfortunately, it is not well known.


Articles are updated every ten years. Recently, I was asked to update the article on Performance Indicators, originally published in 2000 and authored by Peter te Boekhorst. My contribution starts with definitions of core concepts (for example: what is the difference between statistics and performance indicators?). It continues with general guidelines that may be helpful to libraries that aim to develop their own instrument for performance measurement. Among these are models such as the classical system approach (input – throughput – output – outcomes) and the Balanced Scorecard, and international standards, such as those provided by ISO and IFLA. In the Netherlands, there are well-developed benchmark instruments for university libraries and for libraries of universities of applied sciences. I have been involved in the development of these systems and the analysis of their data for many years, and I describe my experiences in depth in order to provide an example of the caveats and benefits of performance measurement. The last three chapters address potential additions to traditional performance indicators: user surveys, web statistics and outcomes.

Updating an earlier version offers an excellent opportunity to depict the progress in the field. Two things struck me most. One is the fast rise of the concept ‘Key Performance Indicators’. There is no agreement in the literature on what this concept actually means. Some use it loosely and do not make a genuine distinction between performance indicators and key performance indicators. Others have very pronounced ideas of its meaning: there should be no more than ten KPIs, they should be measured daily or weekly, they should be coupled to critical success factors rather than strategic goals, and they should be understandable to a fourteen-year-old child. The other is the growing interest in outcomes, the popularity of LibQUAL+ as an instrument to measure user satisfaction, and the upsurge of new technologies such as web analytics. I can’t wait to see the 2023 version of the paper.

“Retracted”, so: no longer accessible?

Author: Barbara Sierman
Originally posted on: http://digitalpreservation.nl/seeds/retracted-so-no-longer-accessible/

Some time ago I blogged about a fraud case in the Netherlands. The author was a well-known expert in his field and published in various scientific journals. Many of his articles turned out to be based on fraudulent data and should not have been published. I briefly described the existing policies of publishers for retracting information from their databases in these kinds of cases.

Stapel in Science Direct

Well, this has now happened: the Science Direct database no longer shows a number of articles by Diederik Stapel. Instead, you are warned that the article has been retracted and told the reason why. If you request the article itself, you will only see part of the first page.

What Elsevier did is in line with its policy. But there is another side to the coin. Other scientists based publications on the insights that Stapel described in his articles, and cited these articles. These citations can no longer be checked via Science Direct. If, say in 20 years’ time, someone wants to investigate what all the fuss was about in 2012 and study Stapel’s scientific publications, he/she will not find the original articles in Science Direct, only perhaps the censored versions. It is not likely that his own university repository has preserved the original digital article, as it only has a subscription to the Science Direct e-journal and does not own a digital copy.

This is exactly one of the reasons why some major players are collecting the “world” digital scientific output. Organizations like LOCKSS, CLOCKSS, Portico and the International e-Depot of the National Library of the Netherlands all have the mission to preserve these e-journals and their articles. In these collections one should be able to find the original articles. They will have a policy of not deleting articles once they have acquired them for long-term preservation. The future researcher/detective should go to one of these repositories for his/her investigation.

KB joins the leading Big Data conference in Europe!

On March 20-21, Hadoop Summit 2013, the leading big data conference, made its first ever appearance on European soil. The Beurs van Berlage in Amsterdam provided a splendid venue for the gathering of about 500 international participants interested in the newest trends around Big Data and Hadoop. The main hosts, Hortonworks and Yahoo, did an excellent job in putting together an exciting programme with two days full of enticing sessions divided into four distinct tracks: Applied Hadoop, Operating Hadoop, Hadoop Futures and Integrating Hadoop.

Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

The open-source Hadoop software framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from single servers to thousands of machines.
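
As a rough illustration of the kind of “simple programming model” meant here, the following Python sketch imitates the map, shuffle and reduce phases of a word count on a single machine. It does not use Hadoop itself; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would do between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine (hypothetical helper)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Sample input invented for this sketch.
    sample = ["the elephant in the library", "the library and the archive"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps would run in parallel on many machines, with the framework taking care of the grouping in between; the sketch only shows the shape of the model.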

In his keynote, Hortonworks VP Shaun Connolly pointed out that, as early as 2015, more than half of the world’s data will be processed using Hadoop! Further on, there were keynotes by 451 Research Director Matt Aslett (What is the point of Hadoop?) and Hortonworks founder and CEO Eric Baldeschwieler (Hadoop Now, Next and Beyond), and a live panel that discussed real-world insight into Hadoop in the enterprise.

Vendor area at Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

Many interesting talks followed on the use of Hadoop, and the benefits derived from it, at companies like Facebook, Twitter, eBay, LinkedIn and the like, as well as on exciting upcoming technologies that further enrich the Hadoop ecosystem, such as the Apache projects Drill and Ambari, or YARN, the next-generation MapReduce implementation.

The Koninklijke Bibliotheek and the Austrian National Library jointly presented their recent experiences with Hadoop in the SCAPE project. Clemens Neudecker and Sven Schlarb spoke about the potential of integrating Hadoop into digital libraries in their talk “The Elephant in the Library” (video: coming soon).

In the SCAPE project, partners are experimenting with integrating Hadoop into library workflows for different large-scale data processing scenarios related to web archiving, file format migration or analytics – you can find out more about the Hadoop-related activities in SCAPE here:

After two very successful days the Hadoop Summit concluded and participants agreed there needs to be another one next year – likely again to be held in the amazing city of Amsterdam!

Find out more about Hadoop Summit 2013 in Amsterdam:

Web:             http://hadoopsummit.org/amsterdam/
Facebook:    https://www.facebook.com/HadoopSummit
Pictures:      http://www.flickr.com/photos/timoelliott/
Tweets:       https://twitter.com/search/?q=hadoopsummit
Slides:          http://www.slideshare.net/Hadoop_Summit/
Videos:        http://www.youtube.com/user/HadoopSummit/videos
Blogs:           http://hortonworks.com/blog/hadoop-summit-2013-amsterdam-its-a-wrap/

IMPACT across the pond


Large numbers of historical books and documents are continuously being brought online through the many mass digitisation projects in libraries, museums and archives around the globe. While the availability of digital facsimiles has already made these historical collections much more accessible, the key to unlocking their full potential for scholarly research is making these documents fully searchable and editable – and this is still a largely problematic process.

From 2007 to 2012, the Koninklijke Bibliotheek coordinated the large-scale integrating project IMPACT (Improving Access to Text), which explored different approaches to innovating OCR technology and significantly lowered the barriers that stand in the way of the mass digitisation of European cultural heritage. The project concluded in June 2012 and led to the conception of the IMPACT Centre of Competence in Digitisation.


Texas A&M University campus, home of the “Aggies”

The Early Modern OCR Project (eMOP) is a new project established by the Initiative for Digital Humanities, Media and Culture at Texas A&M University. Funded by the Andrew W. Mellon Foundation, it will run from October 2012 through September 2014. The eMOP project draws upon the experiences and solutions from IMPACT to create technical resources for improving OCR for early modern English texts from Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO), in order to make them available to scholars through the Advanced Research Consortium (ARC). The integration of post-correction and collation tools will enable scholars of the early modern period to exploit the more than 300,000 documents to their full potential. The eMOP Zotero library is already the place to find anything you ever wanted to know about OCR and related technologies.


eMOP is using the Aletheia tool from IMPACT partner PRImA to create ground truth for the historical texts

MELCamp 2013 provided a good opportunity to gather some of the technical collaborators on the eMOP project, such as Clemens Neudecker from the Koninklijke Bibliotheek and Nick Laiacona from Performant Software, for a meeting in College Station, Texas, with the eMOP team at the IDHMC. From 25 to 28 March, lively discussions revolved around finding the ideal setup for training the open-source OCR engine Tesseract to recognise English from the early modern period, fixing line segmentation in Gamera (thanks to Bruce Robertson), creating word frequency lists for historical English, and the question of how to combine all the various processing steps in a simple-to-use workflow using the Taverna workflow system.

A tour of Cushing Memorial Library and Archives, which holds a rich collection of early prints and is the official repository for George R.R. Martin’s writings, wrapped up a nice and inspiring week in sunny Texas – to be continued!

Find out more about the Early Modern OCR project:

Web:                http://emop.tamu.edu/
Wiki:                http://emopwiki.tamu.edu/index.php/Main_Page
Video:              http://idhmc.tamu.edu/projects/Mellon/why.html
Blog:                http://emop.tamu.edu/blog
