KB Research

Research at the National Library of the Netherlands

Month: April 2014

‘We learn so much from each other’ – Hildelies Balk about the Digitisation Days (19-20 May)

The Digitisation Days will take place in Madrid on 19-20 May. What can you expect from them, and why should you go? To answer these questions we interviewed Hildelies Balk of the National Library of the Netherlands (KB), who is also a member of the executive board of the organising institution, the IMPACT Centre of Competence (IMPACT CoC). – Interview and photo by Inge Angevaare

Hildelies Balk in the National Library’s Reading Rooms

The Digitisation Days will be of interest to …?

‘Anyone who is working with digitised historical texts. These are often difficult to use because the software cannot decipher damaged originals or illegible characters. For example:

example OCR historical text

‘The software used to ‘read’ this (Dutch) text produces the following result:

VVt Venetien den 1.Junij, Anno 1618.
DJgn i f paffato te S’ aö’Jifeert mo?üen/bah
.)etgi’uotbciraetail)i.r/JtmelchontDecht
te / sbnbe bele btr felbrr geiufttceert baer bnber
eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu
enbeeemgljen bifet Cbeiiupcen berbonbru befe

‘The Dutch National Library and many other libraries are striving to make these types of historical text more usable to researchers by enhancing the quality of the OCR (optical character recognition). Since 2008, we have been involved in European projects set up to improve the usability of OCR’d texts – preferably automatically. The IMPACT Centre of Competence as well as the Digitisation Days are quite unique in that they bring together three interest groups:

  • institutions with digitised collections (libraries, archives, museums)
  • researchers working on means to improve access to digitised text (image recognition, pattern recognition, language technology)
  • companies providing products and services in the field of digitisation and OCR.

‘Representatives of all of these groups will be taking part in the Digitisation Days, and together they offer participants a complete overview of the state of the art in document analysis, language technology and post-correction of OCR.’

What are the most important benefits from the Centre of Competence and the Digitisation Days, in your opinion?

‘The IMPACT Centre of Competence assists heritage institutions in taking important decisions. We evaluate available tools and report about them. Evaluation software of good quality is available as well. We also provide institutions with guidance and advice in digitisation issues by answering questions such as: what would be the best tools and methods for this particular institution? What quality can you expect from a solution? And what will it cost?’
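A core metric behind such OCR evaluation tools is the character error rate (CER): the edit distance between the OCR output and a hand-corrected ground truth, divided by the length of the ground truth. The sketch below is a generic illustration of that metric, not the Centre’s actual evaluation software, and the “corrected” reading of the sample line is hypothetical:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def character_error_rate(ocr: str, truth: str) -> float:
    """CER = edit distance divided by the length of the ground truth."""
    return edit_distance(ocr, truth) / len(truth)

# The garbled first line from the example above, compared with a
# plausible (hypothetical) hand-corrected reading:
print(round(character_error_rate("VVt Venetien den 1.Junij",
                                 "Uyt Venetien den 1. Junij"), 2))  # → 0.12
```

A CER of 0.12 means roughly one character in eight is wrong; evaluation tools report this per page or per collection so institutions can compare tools and vendors.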

‘The Digitisation Days offer a perfect opportunity for heritage institutions to get together and share experience and knowledge on issues such as: how to embed digitisation in your institution? How to deal with providers? Also: how do we start up new projects? Where do we find funding? On the second day, those who are interested are invited to join a workshop on the research agenda for digitisation. What should be the focus for the coming years? Should we focus on quantity or quality? How can we help shape European plans and budgets?’

Now that you mention Europe: IMPACT, IMPACT Centre of Competence, SUCCEED – the announcement of the Digitisation Days is packed with acronyms. Can you give us a bit of help here?

‘IMPACT was the first European research project aimed at improving access to historical texts. It started in 2008, at the initiative of, among others, the Dutch KB. When the project ended, a number of IMPACT partners set up the IMPACT Centre of Competence to ensure that the project results would be supported and developed. The Centre is not a project, but a standing organisation.’

‘Succeed is another European project, and, by definition, temporary. Its objectives are in line with those of the IMPACT CoC, and the project involves some of the same partners. The aim is to raise awareness of the results of European projects related to the digital library and to stimulate their implementation. Before the CoC, it was not uncommon for prototypes to be left on the shelf after a project was completed, so the investments did not pay off.’

Will you really turn theory into practice?

‘Yes, most definitely! It is our prime focus for the conference. This is why we instituted the Succeed awards, which recognise the best implementations of innovative technologies and will be handed out during the Digitisation Days. The board has recently announced the winners.’

What do you personally look forward to most during the Digitisation Days?

‘To meeting everybody, to bringing together all these different parties. Colleagues from other institutions, researchers – this is exactly the right kind of meeting for generating exciting ideas and solutions.’


Working together to improve text digitisation techniques

2nd Succeed hackathon at the University of Alicante

https://www.flickr.com/photos/116354723@N02/13757270124/

Is there anyone out there who still thinks a hackathon is a malicious break-in? Far from it. It is the best way for developers and researchers to get together and work on new tools and innovations. The 2nd developers workshop / hackathon, organised on 10-11 April by the Succeed Project, was a case in point: bringing people together to work on new ideas and fresh inspiration for better OCR. The event was held in the “Claude Shannon” aula of the Department of Software and Computing Systems (DLSI) of the University of Alicante, Spain. Claude Shannon was a famous mathematician and engineer, known as the “father of information theory” – a fitting place for a hackathon!

Clemens explains what a hackathon is and what we hope to achieve with it for Succeed.

Same as last year, we provided a wiki upfront with information about possible topics to work on, as well as a number of tools and data sets that participants could experiment with before and during the event. Unfortunately there was an unexpectedly high number of no-shows this time – we try to keep these events free and open to everyone, but since places are limited, we may have to consider charging at least a no-show fee in the future. Or did those hackers simply have to stay home to fix the Heartbleed bug on their servers? We will probably never find out.

Collaboration, open source tools, open solutions

Nevertheless, a large enough group of programmers and researchers from Germany, Poland, the Netherlands and various parts of Spain was eager to immerse themselves in a diverse list of topics. In the introduction we agreed to work on open tools and solutions, and quickly identified some areas in which open source tool support for text digitisation is still lacking (see below). In fact, one of the first things we did was set up a local git repository, and throughout both days people pushed code samples, prototypes and interesting projects to share with the group.

https://www.flickr.com/photos/116354723@N02/13775777003/

What’s the status of open source OCR?

Accordingly, Jesús Dominguez Muriel from Digibís (the company that also made http://www.digibis.com/dpla-europeana/) started an investigation into open source OCR tools and frameworks. He made a really detailed analysis of the status of open source OCR, which you can find here. Thanks a lot for that summary, Jesús! At the end of his presentation, Jesús also suggested an “algorithm wikipedia” – something similar to RosettaCode, but specifically for OCR. This would indeed be very useful for sharing algorithms as well as implementations, and for preventing us from reinventing (or reimplementing) the wheel. Something for our new OCRpedia, perhaps?

A method for assessing OCR quality based on ngrams

As it turned out on the second day, a very promising idea was to use ngrams for assessing the quality of an OCR’d text without the need for ground truth. Well, in fact you do still need some correct text to create the ngram model, but one can use texts from e.g. Project Gutenberg or aspell for that. Two groups started to work on this: while Willem Jan Faber from the KB experimented with a simple Python script for the purpose, Rafael Carrasco, Sebastian Kirch and Tomasz Parkola decided to implement it as a new feature in the Java ocrevalUAtion tool (check the work-in-progress “wip” branch).
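The underlying idea can be sketched in a few lines of Python. This is neither Willem Jan’s script nor the ocrevalUAtion implementation, just a minimal illustration: build a set of character trigrams from known-good text, then score an OCR’d text by the share of its trigrams that the model has seen before – garbled OCR produces many character sequences that never occur in real text:

```python
def ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_model(clean_texts, n=3):
    """Character trigram inventory built from known-good text
    (e.g. Project Gutenberg e-texts in the same language)."""
    model = set()
    for t in clean_texts:
        model.update(ngrams(t.lower(), n))
    return model

def ocr_quality(ocr_text, model, n=3):
    """Fraction of the text's trigrams that occur in the clean-text
    model; garbled OCR yields many unseen trigrams and a low score."""
    grams = ngrams(ocr_text.lower(), n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in model) / len(grams)

# Toy model from two (made-up) clean Dutch sentences:
model = build_model(["de koninklijke bibliotheek bewaart vele historische teksten",
                     "venetien den eersten junij anno zestienhonderd achttien"])
good = ocr_quality("de historische teksten", model)
bad = ocr_quality("btr felbrr geiufttceert baer bnber", model)
print(good > bad)  # a cleaner text scores higher
```

A real implementation would use frequency-weighted ngrams and a much larger reference corpus, but even this crude score separates clean from garbled pages without any ground truth.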

https://www.flickr.com/photos/116354723@N02/13775774723/

Jesús in the front, Rafael, Sebastian and Tomasz discussing ngrams in the back.

Aligning text and segmentation results

Another very promising development was started by Antonio Corbi from the University of Alicante. He worked on a tool to align plain text with segmentation results. The idea is to first identify all the lines in a document, segment them into words and eventually individual characters, and then align the character outlines with the text in the ground truth. This would allow (among other things) creating a large corpus of training material for an OCR classifier, based on the more than 50,000 images with ground truth produced in the IMPACT Project, for which correct text is available but segmentation could only be done at the level of regions. Another great feature of Antonio’s tool is that while he uses D as a programming language, he also makes use of GTK, which has the nice effect that his tool not only works on the desktop, but also runs as a web application in a browser.
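Antonio’s tool is written in D; purely as an illustration of the alignment step, here is a rough Python sketch that maps the characters of a (garbled) OCR string onto positions in the ground truth. In the real tool those positions would correspond to segmented character outlines in the page image:

```python
import difflib

def align(ocr, truth):
    """Map each matching character in the OCR output to its position
    in the ground truth; unmatched (garbled) OCR characters get None."""
    sm = difflib.SequenceMatcher(None, ocr, truth, autojunk=False)
    mapping = [None] * len(ocr)
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping

# 't Venetien' aligns one-to-one; the garbled 'VV' stays unmatched:
m = align("VVt Venetien", "Uyt Venetien")
print(m)
```

With character outlines in place of string indices, every aligned pair becomes a labelled training sample (image snippet + correct character) for an OCR classifier.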

[screenshot of the aligner]

OCR is complicated, but don’t worry – we’re on it!

Gustavo Candela works for the Biblioteca Virtual Miguel de Cervantes, the largest Digital Library in the Spanish speaking world. Usually he is busy with Linked Data and things like FRBR, so he was happy to expand his knowledge and learn about the various processes involved in OCR and what tools and standards are commonly used. His findings: there is a lot more complexity involved in OCR than appears at first sight. And again, for some problems it would be good to have more open source tool support.

In fact, at the same time as the hackathon, the ‘Mining Digital Repositories’ conference was going on at the KB in The Hague, where the problem of bad OCR was discussed from a scholarly perspective. There, too, the need for more open technologies and methods was apparent:

[embedded tweet]

Open source border detection

One of the many technologies for text digitisation available in the IMPACT Centre of Competence for image pre-processing is border removal. This technique is typically applied to remove the black borders that end up in a digital image when a document is scanned. The borders don’t contain any information, yet they take up expensive storage space, so removing them without removing any other relevant information from a scanned document page is a desirable thing to do. However, there is no simple open source tool or implementation for this at the moment. So Daniel Torregrosa from the University of Alicante started to research the topic. After some quick experiments with tools like imagemagick and unpaper, he eventually decided to work on his own algorithm. You can find the source here. He probably also earns the award for the best slide in a presentation… showing us two black pixels on a white background!
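To give an idea of what such an algorithm involves – this is a naive sketch, not Daniel’s algorithm – treat the scan as a grid of pixel values and crop away outer rows and columns that are almost entirely black:

```python
def remove_border(image, black=0, tolerance=0.95):
    """Crop away outer rows/columns that are almost entirely black.
    `image` is a list of rows of pixel values (0 = black, 255 = white).
    Real scans also need noise handling, skew correction, etc."""
    def mostly_black(pixels):
        return sum(p == black for p in pixels) / len(pixels) >= tolerance

    top, bottom = 0, len(image)
    while top < bottom and mostly_black(image[top]):
        top += 1
    while bottom > top and mostly_black(image[bottom - 1]):
        bottom -= 1
    left, right = 0, len(image[0])
    while left < right and mostly_black([row[left] for row in image[top:bottom]]):
        left += 1
    while right > left and mostly_black([row[right - 1] for row in image[top:bottom]]):
        right -= 1
    return [row[left:right] for row in image[top:bottom]]

# A 4x6 'scan' with a one-pixel black border around white content:
scan = [[0] * 6,
        [0, 255, 255, 255, 255, 0],
        [0, 255, 255, 255, 255, 0],
        [0] * 6]
print(remove_border(scan))  # [[255, 255, 255, 255], [255, 255, 255, 255]]
```

The hard part in practice is exactly what this sketch ignores: borders that are skewed, broken by noise, or that touch dark content on the page.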

A great venue

All in all, I think we can be quite happy with these results. And the University of Alicante did a great job hosting us – there was an excellent internet connection available via cable and wifi, plenty of space and tables for group discussions, and we were far enough from the classrooms not to be disturbed by the students, or vice versa. There was also excellent, light Spanish food at all times – gazpacho, couscous with vegetables, assorted montaditos, fresh fruit… nowadays you won’t make hackers happy with just pizza anymore! Of course there were also ice-cooled drinks and hot coffee, and rumours spread that there were some (alcohol-free?) beers in the cooler, but (un)fortunately there is no documentary evidence of that…

To be continued!

If you want to try out any of the software yourself, just visit our github and have a go! Make sure to also take a look at the videos that were made with participants Jesús, Sebastian and Tomasz, explaining their intentions and expectations for the hackathon. And at the next hackathon, maybe we can welcome you too amongst the participants?

How to maximise usage of digital collections

Libraries want to understand the researchers who use their digital collections and researchers want to understand the nature of these collections better. The seminar ‘Mining digital repositories’ brought them together at the Dutch Koninklijke Bibliotheek (KB) on 10-11 April, 2014, to discuss both the good and the bad of working with digitised collections – especially newspapers. And to look ahead at what a ‘digital utopia’ might look like. One easy point to agree on: it would be a world with less restrictive copyright laws. And a world where digital ‘portals’ are transformed into ‘platforms’ where researchers can freely ‘tinker’ with the digital data. – Report & photographs by Inge Angevaare, KB.

Hans-Jorg Lieder of the Berlin State Library (front left) is given an especially warm welcome by conference chair Toine Pieters (Utrecht), ‘because he was the only guy in Germany who would share his data with us in the Biland project.’

Libraries and researchers: a changing relationship

‘A lot has changed in recent years,’ Arjan van Hessen of the University of Twente and the CLARIN project told me. ‘Ten years ago someone might have suggested that perhaps we should talk to the KB. Now we are practically in bed together.’

But each relationship has its difficult moments. Researchers are not happy when they discover gaps in the data on offer, such as missing issues or volumes of newspapers. Or incomprehensible transcriptions of texts because of inadequate OCR (optical character recognition). Conference organisers Toine Pieters and Jaap Verheul (University of Utrecht) invited Hans-Jorg Lieder of the Berlin State Library to explain why he ‘could not give researchers everything everywhere today’.

Lieder & Thomas: ‘Digitising newspapers is difficult’

Both Deborah Thomas of the Library of Congress and Hans-Jorg Lieder stressed how complicated it is to digitise historical newspapers. ‘OCR does not recognise the layout in columns, or the “continued on page 5”. Plus the originals are often in a bad state – brittle and sometimes torn paper, or they are bound in such a way that text is lost in the middle. And there are all these different fonts, e.g., Gothic script in German, and the well-known long-s/f confusion.’ Lieder provided the ultimate proof of how difficult digitising newspapers is: ‘Google only digitises books, they don’t touch newspapers.’

Thomas: ‘The stuff we are digitising is often damaged’

Another thing researchers should be aware of: ‘Texts are liquid things. Libraries enrich and annotate texts, and versions may differ.’ Libraries do their best to connect and cluster collections of newspapers (e.g., in Europeana Newspapers), but ‘the truth of the matter is that most newspaper collections are still analogue; at this moment we have only bits and pieces in digital form, and there is a lot of bad OCR.’ There is no question that libraries are working on improving the situation, but funding is always a problem. And the choices to be made with bad OCR are sometimes difficult: ‘Should we manually correct it all, or maybe retype it, or maybe even wait a couple of years for OCR technology to improve?’

Librarians and researchers discuss what is possible and what not. From the left, Steven Claeyssens, KB Data Services, Arjan van Hessen, CLARIN, and Tom Kenter, Translantis.

Researchers: how to mine for meaning

Researchers themselves are debating how they can fit these new digital resources into their academic work. Obviously, being able to search millions of newspaper pages from different countries in a matter of days opens up a lot of new research possibilities. Conference organisers Toine Pieters and Jaap Verheul (University of Utrecht) are both involved in the HERA Translantis project which is taking a break from traditional ‘national’ historical research by looking at transnational influences of so-called ‘reference cultures’:

Definition of Reference Cultures in the Translantis project which mines digital newspaper collections

In the 17th century the Dutch Republic was such a reference culture. In the 20th century the United States developed into a reference culture and Translantis digs deep into the digital newspaper archives of the Netherlands, the UK, Belgium and Germany to try and find out how the United States is depicted in public discourse:

Jaap Verheul (Translantis) shows how the US is depicted in Dutch newspapers

Joris van Eijnatten introduced another transnational HERA project, ASYMENC, which is exploring cultural aspects of European identity with digital humanities methodologies.

All of this sounds straightforward enough, but researchers themselves have yet to develop a scholarly culture around the new resources:

  • What type of research questions do the digital collections allow? Are these new questions or just old questions to be researched in a new way?
  • What is scientific ‘proof’ if the collections you mine have big gaps and faulty OCR?
  • How to interpret the findings? You can search words and combinations of words in digital repositories, but how can you assess what the words mean? Meanings change over time. Also: how can you distinguish between irony and seriousness?
  • How do you know that a repository is trustworthy?
  • How to deal with language barriers in transnational research? Mere translations of concepts do not reflect the sentiment behind the words.
  • How can we analyse what newspapers do not discuss (also known as the ‘Voldemort’ phenomenon)?
  • How sustainable is digital content? Long-term storage of digital objects is uncertain and expensive. (Microfilms are much easier to keep, but then again, they do not allow for text mining …)
  • How do available tools influence research questions?
  • Researchers need a better understanding of text mining per se.
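To make the word-searching point concrete: the simplest form of such mining is counting how often a term occurs per year, relative to the total number of words published. Everything beyond that – meaning, irony, context – is where the hard scholarly questions above begin. A toy sketch with made-up data:

```python
from collections import Counter

def term_frequency_by_year(articles, term):
    """Relative frequency of `term` per year in a corpus of
    (year, text) pairs; a first step in tracing public discourse."""
    hits, totals = Counter(), Counter()
    for year, text in articles:
        words = text.lower().split()
        totals[year] += len(words)
        hits[year] += words.count(term)
    return {y: hits[y] / totals[y] for y in sorted(totals)}

# Invented miniature 'newspaper archive':
corpus = [(1919, "america sends grain to europe"),
          (1919, "debate in the chamber"),
          (1939, "war looms over europe"),
          (1939, "america stays neutral says america")]
print(term_frequency_by_year(corpus, "america"))
```

Projects like Translantis work on the same principle, but at the scale of millions of pages, with lemmatisation, spelling normalisation and OCR errors all complicating the counts.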

Some humanities scholars have yet to be convinced of the need to go digital

Rens Bod, Director of the Dutch Centre for Digital Humanities enthusiastically presented his ideas about the value of parsing (analysing parts of speech) for uncovering deep patterns in digital repositories. If you want to know more: Bod recently published a book about it.

Professor Rens Bod: ‘At the University of Amsterdam we offer a free course in working with digital data.’

But in the context of this blog, his remarks about the lack of big data awareness and competencies among many humanities scholars, including young students, were perhaps more striking. The University of Amsterdam offers a crash course in working with digital data to bridge the gap. The free one-week course deals with all aspects of working with data, from ‘gathering data’ to ‘cooking data’.

As the scholarly dimensions of working with big data are not this blogger’s expertise, I will not delve into these further but gladly refer you to an article Toine Pieters and Jaap Verheul are writing about the scholarly outcomes of the conference [I will insert a link when it becomes available].

Conference hosts Jaap Verheul (left) and Toine Pieters taking analogue notes for their article on Mining Digital Repositories. And just in case you wonder: the meeting rooms are probably the last rooms in the KB to be migrated to Windows 7

More data providers: the ‘bad’ guys in the room

It was the commercial data providers in the room themselves that spoke of ‘bad guys’ or ‘bogey man’ – an image both Ray Abruzzi of Cengage Learning/Gale and Elaine Collins of DC Thomson Family History were hoping to at least soften a bit. Both companies provide huge quantities of digitised material. And, yes, they are in it for the money, which would account for their bogeyman image. But, they both stressed, everybody benefits from their efforts:

Value proposition of DC Thomson Family History

Cengage Learning is putting 25-30 million pages online annually. Thomson is digitising 750 million (!) newspaper & periodical pages for the British Library. Collins: ‘We take the risk, we do all the work, in exchange for certain rights.’ If you want to access the archive, you have to pay.

In and of itself, this is quite understandable. Public funding just doesn’t cut it when you are talking billions of pages. Both the KB’s Hans Jansen and Rens Bod (U. of Amsterdam) stressed the need for public/private partnerships in digitisation projects.

And yet.

Elaine Collins readily admitted that researchers ‘are not our most lucrative stakeholders’; that most of Thomson’s revenue comes from genealogists and the general public. So why not give digital humanities scholars free access to their resources for research purposes, if need be under the strictest conditions that the information does not go anywhere else? Both Abruzzi and Collins admitted that such restricted access is difficult to organise. ‘And once the data are out there, our entire investment is gone.’

Libraries to mediate access?

Perhaps, Ray Abruzzi allowed, access to certain types of data, e.g., metadata, could be allowed under certain conditions, but, he stressed, individual scholars who apply to Cengage for access do not stand a chance. Their requests for data are far too varied for Cengage to have any kind of business proposition. And there is the trust issue. Abruzzi recommended that researchers turn to libraries to mediate access to certain content. If libraries give certain guarantees, then perhaps …

You think OCR is difficult to read? Try human handwriting!

What do researchers want from libraries?

More data, of course, including more contemporary data (… ah, but copyright …)

And better quality OCR, please.

What if libraries have to choose between quality and quantity? That is when things get tricky, because the answer depends on which researcher you ask. Some may choose quantity, others quality.

Should libraries build tools for analysing content? The researchers in the room seemed to agree that libraries should concentrate on data rather than tools. Tools are very temporary, and researchers often need to build the tools around their specific research questions.

But it would be nice if libraries started allowing users to upload enrichments to the content, such as better OCR transcriptions and/or metadata.

Researchers and libraries discussing what is desirable and what is possible. In the front row, from the left, Irene Haslinger (KB), Julia Noordegraaf (U. of Amsterdam), Toine Pieters (Utrecht), Hans Jansen (KB); further down the front row James Baker (British Library) and Ulrich Tiedau (UCL). Behind the table Jaap Verheul (Utrecht) and Deborah Thomas (Library of Congress).

And there is one more urgent request: that libraries become more transparent in what is in their collections and what is not. And be more open about the quality of the OCR in the collections. Take, e.g., the new Dutch national search service Delpher. A great project, but scholars must know exactly what’s in it and what’s not for their findings to have any meaning. And for scientific validity they must be able to reconstruct such information in retrospect. So a full historical overview of what is being added at what time would be a valuable addition to Delpher. (I shall personally communicate this request to the Delpher people, who are, I may add, working very hard to implement user requests).

Deborah Thomas of the US Library of Congress: ‘This digital age is a bit like the American Wild West. It is a frontier with lots of opportunities and hopes for striking it rich. And maybe it is a bit unruly.’

New to the library: labs for researchers

Deborah Thomas of the Library of Congress made no bones about her organisation’s strategy towards researchers: we put out the content, and you do with it whatever you want. In addition to APIs (Application Programming Interfaces), the Library also allows downloads of bulk content. The basic content is available free of charge, but additional metadata levels may come at a price.

The British Library (BL) is taking a more active approach. The BL’s James Baker explained how the BL is trying to bridge the gap between researchers and content by providing special labs for researchers. As I (unfortunately!) missed that parallel session, let me mention the KB’s own efforts to set up a KB lab where researchers are invited to experiment with KB data using open source tools. The lab is still in its ‘pre-beta phase’, as Hildelies Balk of the KB explained. If you want the full story, by all means attend the Digital Humanities Benelux Conference in The Hague on 12-13 June, where Steven Claeyssens and Clemens Neudecker of the KB are scheduled to launch the beta version of the platform. Here is a sneak preview of the lab, a scansion machine built by KB Data Services in collaboration with phonologist Marc van Oostendorp (audio in Dutch):

https://www.youtube.com/watch?v=FcTufco9P3A

Europeana: the aggregator

“Portals are for visiting; platforms are for building on.”

Another effort by libraries to facilitate transnational research is the aggregation of their content in Europeana, especially Europeana Newspapers. For the time being only the metadata are being aggregated, but in Alistair Dunning’s vision, Europeana will grow from an end-user portal into a data brain: a cloud platform that will include the content itself and allow for metadata enrichment:

Alistair Dunning: ‘Europeana must grow into a data brain to bring disparate data sets together.’

Dunning’s vision of Europeana 3.0

Dunning also indicated that Europeana might develop brokerage services to clear content for non-commercial purposes. In a recent interview, Toine Pieters said that researchers would welcome Europeana taking on such a role, ‘because individual researchers should not be bothered with all these access/copyright issues.’ In the United States, the Library of Congress is not contemplating a move in that direction, Deborah Thomas told her audience. ‘It is not our mission to negotiate with publishers.’ And recent ‘Mickey Mouse’ legislation, said to have been inspired by Disney interests, seems to be leading to less rather than more access.

Dreaming of digital utopias

What would a digital utopia look like for the conference attendees? Jaap Verheul invited his guests to dream of what they would do if they were granted, say, €100 million to spend as they pleased.

Deborah Thomas of the Library of Congress would put her money into partnerships with commercial companies to digitise more material, especially the post-1922 stuff (less restrictive copyright laws being part and parcel of the dream). And she would build facilities for uploading enrichments to the data.

James Baker of the British Library would put his money into the labs for researchers.

Researcher Julia Noordegraaf of the University of Amsterdam (heritage and digital culture) would rather put the money towards improving OCR quality.

Joris van Eijnatten’s dream took the Europeana plans a few steps further. His dream would be of a ‘Globiana 5.0’ – a worldwide, transnational repository filled with material in standardised formats, connected to bilingual and multilingual dictionaries and researched by a network of multilingual, big data-savvy researchers. In this context, he suggested that ‘Google-like companies might not be such a bad thing’ in terms of sustainability and standardisation.

Joris van Eijnatten: ‘Perhaps – and this is a personal observation – Google-like companies are not such a bad thing after all in terms of sustainability and standardisation of formats.’

At the end of the two-day workshop, perhaps not all of the ambitious agenda had been covered. But, then again, nobody had expected that.

Mining Digital Repositories 2014 – the ambitious agenda

The trick is for providers and researchers to keep talking and conquer this ‘unruly’ Wild West of digital humanities bit by bit, step by step.

And, by all means, allow researchers to ‘tinker’ with the data. Verheul: ‘There is a certain serendipity in working with big data that allows for playfulness.’


Breaking down walls in digital preservation (Part 2)

Here is part 2 of the digital preservation seminar which identified ways to break down walls between research & development and daily operations in libraries and archives (continued from Breaking down walls in digital preservation, part 1). The seminar was organised by SCAPE and the Open Planets Foundation in The Hague on 2 April 2014. – Report & photographs by Inge Angevaare, visualisations by Elco van Staveren

Ross King of the Austrian Institute of Technology (and of OPF) kicking off the afternoon session by single-handedly attacking the wall between daily operations and R&D

Experts meet managers

Ross King of the Austrian Institute of Technology described the features of the (technical) SCAPE project which intends to help institutions build preservation environments which are scalable – to bigger files, to more heterogeneous files, to a large volume of files to be processed. King was the one who identified the wall that exists between daily operations in the digital library and research & development (in digital preservation):
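
SCAPE’s scalability goal can be illustrated with a minimal sketch (this is not SCAPE code; the folder, file names and the choice of a fixity check as the preservation action are my own illustrative assumptions): run the same action over an arbitrary number of files in parallel, so that one script handles ten files or ten million.

```python
# Minimal sketch of scalable preservation processing (not SCAPE code):
# run one preservation action, here a fixity (checksum) check, over all
# files in a folder, several files at a time.
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def fixity(path: Path) -> tuple[str, str]:
    """Return (filename, SHA-256 digest), reading the file in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return path.name, digest.hexdigest()

def fixity_report(folder: str) -> dict[str, str]:
    """Checksum every file directly inside a folder, in parallel."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(pool.map(fixity, files))
```

Chunked reading keeps memory use flat for very large files; swapping the thread pool for a cluster scheduler is what a project like SCAPE adds on top of this basic pattern.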

The Wall between Production & R&D as identified by Ross King

Zoltan Szatucsek of the Hungarian National Archives shared his experiences with one of the SCAPE tools from a manager’s point of view: ‘Even trying out the Matchbox tool from the SCAPE project was too expensive for us.’ King admitted that the Matchbox case had not yet been entirely successful. ‘But our goal remains to deliver tools that can be downloaded and used in practice.’

Maureen Pennock of the British Library sketched her organisation’s journey to embed digital preservation [link to slides to follow]. Her own digital preservation department (now at 6 FTE) was moved around a few times before it was nested in the Collection Care department, which was then merged with Collection Management. ‘We are now where we should be: in the middle of the Collections department and right next to the Document Processing department. And we work closely with IT, strategy development, procurement/licensing and collection security and risk management.’

The British Library’s strategy calls for further embedding of digital preservation, without taking the formal step of certification

Pennock elaborated on the strategic priorities mentioned above (see slides) by noting that the British Library has chosen not to strive for formal certification within the European Framework (unlike, e.g., the Dutch KB). Instead, the BL intends to hold bi-annual audits to measure progress. The BL intends to ensure that ‘all staff working with digital content understand preservation issues associated with it.’ Questioned by the Dutch KB’s Hildelies Balk, Pennock confirmed that the teaching materials the BL is preparing could well be shared with the wider digital preservation community. Here is Pennock’s concluding comment:

Digital preservation is like a bicycle – one size doesn’t fit everyone … but everybody still recognises the bicycle

Marcin Werla from the Polish Supercomputing & Networking Centre PSNC provided an overview of the infrastructure PSNC is providing for research institutions and for cultural heritage institutions. It is a distributed network based on the fast (20 Gb/s) Polish optical network:

The PSNC network includes facilities for long-term preservation

Interestingly, the network mostly serves smaller institutions. The Polish National Library and Archives have built their own systems.

Werla stressed that proper quality control at the production stage is difficult because of the bureaucratic Polish public procurement system.

Heiko Tjalsma of the Dutch research data archive DANS pitched the 4C project which was established to  ‘create a better understanding of digital curation costs through collaboration.’

Tjalsma: ‘We can only get a better idea of what digital curation costs by collaborating and sharing data’

At the moment there are several cost models available in the community (see, e.g., earlier posts), but they are difficult to compare. The 4C project intends to a) establish an international curation cost exchange framework, and b) build a Cost Concept Model – which will define what to include in the model and what to exclude.

The need for a clearer picture of curation costs is undisputed, but, Tjalsma added, ‘it is very difficult to gather detailed data, even from colleagues.’ Our organisations are reluctant to make their financial data available. And both ‘time’ and ‘scale’ make matters more difficult. The only way to go seems to be anonymisation of data, and for that to work, the project must attract as many participants as possible. So: please register at http://www.4cproject.eu/community-resources/stakeholder-participation – and participate.

Building bridges between expert and manager

The last part of the day was devoted to building bridges between experts and managers. Dirk von Suchodeletz of the University of Freiburg introduced the session with a topic that is often considered an ‘expert-only’ topic: emulation.

Dirk von Suchodeletz: ‘The EaaS project intends to make emulation available for a wider audience by providing it as a service.’

The emulation technique has been around for a while, and it is considered one of the few preservation methods available for very complex digital objects – but take-up by the community has been slow, because it is seen as too complex for non-experts. The Emulation as a Service project intends to bridge the gap to practical implementation by taking away many of the technical worries from memory institutions. A demo of Emulation as a Service is available for OPF members. Von Suchodeletz encouraged his audience to have a look at it, because the service can only be made to work if many memory institutions decide to participate.

Getting ready for the last roundtable discussion about the relationship between experts and managers

How R&D and the library business relate

‘What inspired the EaaS project,’ Hildelies Balk (KB) wanted to know from von Suchodeletz, ‘was it your own interest or was there some business requirement to be met?’ Von Suchodeletz admitted that it was his own research interest that kicked off the project; business requirements entered the picture later.

Birgit Henriksen of the Royal Library, Denmark: ‘We desperately need emulation to preserve the games in our collection, but because it is such a niche, funding is hard to come by.’ Jacqueline Slats of the Dutch National Archives echoed this observation: ‘The NA and the KB together developed the emulation tool Dioscuri, but because there was no business demand, development was halted. We may pick it up again as soon as we start receiving interactive material for preservation.’

This is what happened next, as visualised by Elco van Staveren:

Some highlights from the discussions:

  • Timing is of the essence. Obviously, R&D is always ahead of operations, but if it is too far ahead, funding will be difficult. Following user needs is no good either, because then R&D becomes mere procurement. Are there any cases of proper just-in-time development? Barbara Sierman of the KB suggested Jpylyzer (translation of Jpylyzer for managers) – the need arose for quality control in a massive TIFF-to-JPEG 2000 migration at the KB intended to cut costs, and R&D delivered.
  • Another successful implementation: the Pronom registry. The National Archives had a clear business case for developing it. On the other hand, the GDFR technical registry did not tick the boxes of timeliness, impetus and context.
  • For experts and managers to work well together, managers must start accepting a certain amount of failure. We are breaking new ground in digital preservation; failures are inevitable. Can we make managers understand that even failures make us stronger because the organisation gains a lot of experience and knowledge? And what is an acceptable failure rate? Henriksen suggested that managing expectations can do the trick. ‘Do not expect perfection.’

    Some of the panel members (from left to right) Maureen Pennock (British Library), Hildelies Balk (KB), Mies Langelaar (Rotterdam Municipal Archives), Barbara Sierman (KB) and Mette van Essen (Dutch National Archives)

  • We need a new set of metrics to define success in the ever changing digital world.
  • Positioning the R&D department within Collections can help make collaboration between the two more effective (Andersen, Pennock). Henriksen: ‘At the Danish Royal Library we have started involving both R&D and collections staff in scoping projects.’
  • And then again … von Suchodeletz suggested that sometimes a loose coupling between R&D and business can be more effective, because staff in operations can get too bogged down by daily worries.
  • Sometimes breaking down the wall is just too much to ask, suggested van Essen. We may have to decide to jump the wall instead, at least for the time being.
  • Bridge builders can be key to making projects succeed, staff members who speak both the languages of operations and of R&D. Balk and Pennock stressed that everybody in the organisation should know about the basics of digital preservation.
  • Underneath all of the organisation’s doings must lie a clear common vision to inspire individual actions, projects and collaboration.

In conclusion: participants agreed that this seminar had been a fruitful counterweight to technical hackathons in digital preservation. More seminars may follow. If you participated (or read these blogs), please use the comments box for any corrections and/or follow-up.

‘In an ever changing digital world, we must allow for projects to fail – even failures bring us lots of knowledge.’

 

Breaking down walls in digital preservation (Part 1)

People & knowledge are the keys to breaking down the walls between daily operations and digital preservation (DP) within our organisations. DP is not a technical issue, but information technology must be embraced as a core feature of the digital library. Such were some of the conclusions of the seminar organised by the SCAPE project/Open Planets Foundation at the Dutch National Library (KB) and National Archives (NA) on Wednesday 2 April. – Report & photographs by Inge Angevaare, visualisations by Elco van Staveren

Newcomer questions some current practices

Menno Rasch (KB): ‘Do correct me if I am wrong’

Menno Rasch was appointed Head of Operations at the Dutch KB six months ago – but ‘I still feel like a newcomer in digital preservation.’ His division includes the Collection Care department, which is responsible for DP. But there are close working relationships with the Research and IT departments in the Innovation Division. Rasch’s presentation about embedding DP in business practices in the KB posed some provocative questions:

  • We have a tendency to cover up our mistakes and failures rather than expose them and discuss them in order to learn as a community. That is what pilots do. The platform is there – the Atlas of Digital Damages, set up by the KB’s Barbara Sierman – but it is being underused. Of course lots of data are protected by copyright or privacy regulations, but there surely must be some way to anonymise the data.
  • In libraries and archives, we still look upon IT as ‘the guys that make tools for us’. ‘But IT = the digital library.’
  • We need to become more pragmatic. Implementing the OAIS standard is a lot of work – perhaps it is better to take this one step at a time.
  • ‘If you don’t do it now, you won’t do it a year from now.’
  • ‘Any software we build is temporary – so keep the data, not the software.’
  • Most metadata are reproducible – so why not store them in a separate database and put only the most essential preservation metadata in the OAIS information package? That way we can continue improving the metadata. Of course these must be backed up too (an annual snapshot?), but may tolerate a less expensive storage regime than the objects.
  • About developments at the KB: ‘To replace our old DIAS system, we are now developing software to handle all of our digital objects – which is an enormous challenge.’
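
Rasch’s suggestion about metadata could be sketched as follows. This is only a minimal illustration of the idea, not the KB’s actual schema: the field names and the choice of which fields count as ‘essential’ are my own assumptions.

```python
# Sketch of the suggestion above: keep a small, stable core of preservation
# metadata inside the OAIS information package, and hold the reproducible,
# still-improvable metadata (e.g. OCR-derived fields) in a separate store.
# Field names and the ESSENTIAL_FIELDS set are illustrative assumptions.

ESSENTIAL_FIELDS = {"identifier", "checksum", "file_format", "ingest_date"}

def split_metadata(record: dict) -> tuple[dict, dict]:
    """Split one metadata record into (package_core, enrichable_rest)."""
    core = {k: v for k, v in record.items() if k in ESSENTIAL_FIELDS}
    rest = {k: v for k, v in record.items() if k not in ESSENTIAL_FIELDS}
    return core, rest
```

The `rest` store can then be corrected and re-indexed at will (with, say, an annual snapshot backed up on cheaper storage), while the information packages themselves stay untouched.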
The SCAPE/OPF seminar on Managing Digital Preservation, 2 April 2014, The Hague

Digital collections and the Titanic

Zoltan Szatucsek from the Hungarian National Archives used the Titanic as the metaphor for his presentation – without necessarily implying that we are headed for the proverbial iceberg, he added. Although … ‘many elements from the Titanic story can illustrate how we think’:

  • The Titanic received many warnings about ice formations, and yet it was sailing at full speed when disaster struck.
  • Our ship – the organisation – is quite conservative. It wants to deal with digital records in the same way it deals with paper records. And at the Hungarian National Archives IT and archivist staff are in the same department, which does not work because they do not speak each other’s language.

    Zoltan Szatucsek argued that putting IT staff and archivists together in the Hungarian National Archives caused ‘language’ problems; his Danish colleagues felt that in their case close proximity had rather helped improve communications

  • The captain must acquire new competences. He must learn to manage staff, funding, technology, equipment, etc. We need processes rather than tools.
  • The crew is in trouble too. Their education has not adapted to digital practices. Underfunding in the sector is a big issue. Strangely enough, staff working with medieval resources were much quicker to adopt digital practices than those working with contemporary material. The latter seem to want to put off any action until legal transfer to the archives actually occurs (15-20 years).
  • Echoing Menno Rasch’s presentation, Szatucsket asked the rhetorical question: ‘Why do we not learn from our mistakes?’ A few months after Titanic, another ship went down in similar circumstances
  • Without proper metadata, objects are lost forever.
  • Last but not least: we have learned that digital preservation is not a technical challenge. We need to create a complete environment in which to preserve.
Is DP heading for the iceberg as well? Visualisation of Szatucsek’s presentation.

OPF: trust, confidence & communication

Ed Fay was appointed director of the Open Planets Foundation (OPF) only six weeks ago. But he presented a clear vision of how the OPF should function within the community, smack in the middle, as a steward of tools, a champion of open communications, trust & confidence, and a broker between commercial and non-commercial interests:

Ed Fay’s vision of the Open Planets Foundation’s role in the digital preservation community

Fay also shared some of his experiences in his former job at the London School of Economics:

Ed Fay illustrated how digital preservation was moved around a few times in the London School of Economics Library, until it found its present place in the Library division

So, what works, what doesn’t?

The first round-table discussion was introduced by Bjarne Andersen of the Statsbiblioteket Aarhus (DK). He sketched his institution’s experiences in embedding digital preservation.

Bjarne Andersen (right) conferring with Birgit Henriksen (Danish Royal Library, left) and Jan Dalsten Sorensen (Danish National Archives). ‘SCRUM has helped move things along’

He mentioned the recently introduced SCRUM-based methodology as really having helped to move things along – it is an agile way of working which allows for flexibility. The concept of ‘user stories’ helps to make staff think about the ‘why’. Menno Rasch (KB) agreed: ‘SCRUM works especially well if you are not certain where to go. It is a step-by-step methodology.’

Some other lessons learned at Aarhus:

  • The responsibility for digital preservation cannot be with the developers implementing the technical solutions
  • The responsibility needs to be close to ‘the library’
  • Don’t split the analogue and digital library entirely – the two have quite a lot in common
  • IT development and research are necessary activities to keep up with a changing landscape of technology
  • Changing the organisation a few times over the years helped us educate the staff by bringing traditional collection/library staff close to IT for a period of time.
Group discussion. From the left: Jan Dalsten Sorensen (DK), Ed Fay (OPF), Menno Rasch (KB), Marcin Werla (PL), Bjarne Andersen (DK), Elco van Staveren (KB, visualising the discussion), Hildelies Balk (KB) and Ross King (Austria)

And here is how Elco van Staveren visualised the group discussion in real time:

Some highlights from the discussion:

  • Embedding digital preservation is about people
  • It really requires open communication channels.
  • A hierarchical organisation and/or an organisation with silos only builds up the wall. Engaged leadership is called for. And result-oriented incentives for staff rather than hierarchical incentives.
  • Embedding digital preservation in the organisation requires a vision that is shared by all.
  • Clear responsibilities must be defined.
  • Move the budgets to where the challenges are.
  • The organisation’s size may be a relevant factor in deciding how to organise DP. In large organisations, the wheels move slowly (no. of staff in the Hungarian National Archives 700; British Library 1,500; Austrian National Library 400; KB Netherlands 300, London School of Economics 120, Statsbiblioteket Aarhus 200).
  • Most organisations favour bringing analogue and digital together as much as possible.
  • When it comes to IT experts and librarians/archivists learning each other’s languages, it was suggested that maybe hard-core IT staff need not get too deeply involved in library issues – in fact, some IT staff might consider it bad for their careers. Software developers, however, do need to get involved in library/archive affairs.
  • Management must also be taught the language of the digital library and digital preservation.

(Continued in Breaking down walls in digital preservation, part 2)

Seminar agenda and links to presentations
