This week, the annual DHBenelux conference will take place in Belval, Luxembourg. It will bring together practically all DH scholars from Belgium (BE), the Netherlands (NE) and Luxembourg (LUX). You can read the full program and all abstracts on the website. Two presentations are by members of our DH team (Steven Claeyssens & Martijn Kleppe) and one presentation is by our current researcher in residence (Puck Wildschut – Radboud University Nijmegen). Please find the first paragraphs of their abstracts below:
Succeed Interoperability Workshop
2 October 2014, National library of the Netherlands, The Hague
Speaking the same language is one thing, understanding what the other is saying is another… – Carl Wilson at the Succeed technical workshop on the interoperability of digitisation platforms.
Interoperability is a term widely used to describe the ability to make systems and organisations work together (inter-operate). However, interoperability is not just about the technical requirements for information exchange. In a broader definition, it also includes social, political, and organisational factors that impact system-to-system performance and is related to questions of (commercial) power and market dominance (See http://en.wikipedia.org/wiki/Interoperability).
On 2 October 2014, the Succeed project organised a technical workshop on the interoperability of digitisation platforms at the National library of the Netherlands in The Hague. Nineteen researchers, librarians, and computer scientists from several European countries participated in the workshop (see SUCCEED Interoperability Workshop_Participants). In preparation for the workshop, the Succeed project team asked participants to fill out a questionnaire containing several questions on the topic of interoperability. The questionnaire was filled out by 12 participants; the results were presented during the workshop. The programme included a number of presentations and several interactive sessions to come to a shared view on what interoperability is about, what the main issues and barriers are, and how we should approach them.
The main goals of the workshop were:
- Establishing a baseline for interoperability based on the questionnaire and presentations of the participants
- Formulating a common statement on the value of interoperability
- Defining the ideal situation with regard to interoperability
- Identifying the most important barriers
- Formulating an agenda
Presentation by Carl Wilson
To establish a baseline (what is interoperability and what is its current status in relation to digitisation platforms), our programme included a number of presentations. We invited Carl Wilson of the Open Preservation Foundation (previously the Open Planets Foundation) for the opening speech. He set the scene by sharing a number of historical examples (in IT and beyond) of interoperability issues. Carl made clear that interoperability in IT has many dimensions:
- Technical dimensions
Within the technical domain, two types of interoperability can be discerned, i.e.:
Syntactical interoperability (aligning metadata formats), “speaking the same language”, and semantical interoperability, “understanding each other”
- Organizational/Political dimensions
- Legal (IPR) dimensions
- Financial dimensions
When approaching interoperability issues, it might help to take into account these basic rules:
- Test early (automated testing, virtualisation)
- Test often
Finally, Carl stressed that the importance of interoperability will further increase with the rise of the Internet of Things, as it involves more frequent information exchange between more and more devices.
The Succeed Interoperability platform
After Carl Wilson’s introductory speech, Enrique Molla from the University of Alicante (UA is project leader of the Succeed project) presented the Succeed Interoperability framework, which allows users to test and combine a number of digitisation tools. The tools are made available as web services by a number of different providers, which allows the user to try them out online without having to install any of them locally. The Succeed project ran into a number of interoperability-related issues when developing the platform. For instance, the web services come from a number of different suppliers, some of whom are not maintaining their services. Moreover, the providers of the web services often have commercial interests, which means that they impose limits such as a maximum number of users or of pages tested through the tools.
Presentations by participants
After the demonstration of the Succeed Interoperability platform, the floor was open for the other participants, many of whom had prepared a presentation about their own project and their experience with issues of interoperability.
Bert Lemmens presented the first results of the Preforma Pre-Commercial Procurement project (running January 2014 to December 2017). A survey performed by the project made clear that (technically) open formats are in many cases not the same as libre/open source formats. Moreover, even when standard formats are used across different projects, they are often implemented in multiple ways. And finally, when a project or institution has found a technically appropriate format, it may often find that limited support is available on how to adopt the format.
Gustavo Candela Romero gave an overview of the services provided by the Biblioteca Virtual Miguel de Cervantes (BVMC). The BVMC developed its service-oriented architecture with the purpose of facilitating online access to Hispanic culture. The BVMC offers its data via OAI-PMH, allowing other institutions or researchers to harvest its content. Moreover, the BVMC is working towards publishing its resources in RDF and making them available through a SPARQL endpoint.
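For readers who have not harvested over OAI-PMH before, here is a minimal Python sketch of what a harvest request looks like. It only builds the request URL. The `verb`, `metadataPrefix` and `set` parameters come from the OAI-PMH protocol itself, but the function name `list_records_url` and the base URL in the usage note are placeholders for illustration, not the actual BVMC endpoint.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository.

    base_url is the repository's OAI-PMH endpoint (a placeholder here);
    oai_dc (unqualified Dublin Core) is the format every OAI-PMH
    repository is required to support.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict the harvest to one set of the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

Calling `list_records_url("https://example.org/oai")` yields a URL that a harvester could fetch to retrieve the first batch of Dublin Core records from that (hypothetical) endpoint.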
Alastair Dunning and Pavel Kats explained how Europeana and The European Library are working towards a shared storage system for aggregators with shared tools for the ingestion and mapping process. This will have practical and financial benefits, as shared tools will reduce workflow complexity, are easier to sustain and, finally, cheaper.
Clara Martínez Cantón presented the work of the Digital Humanities Innovation Lab (LINHD), the research centre on Digital Humanities at the National Distance Education University (UNED) in Spain. The LINHD encourages researchers to make use of Linked Data. Clara showed the advantages of using Linked Data in a number of research projects related to metrical repertoires. In these projects, a number of interoperability issues (such as a variety of structures of the data, different systems used, and variation in the levels of access) were by-passed by making use of a Linked Data Model.
Marc Kemps-Snijders made clear how the Meertens Institute strives to make collections and technological advances available to the research community and the general public by providing technical support and developing applications. Moreover, the Meertens Institute is involved in a number of projects related to interoperability, such as Nederlab and CLARIN.
Menzo Windhouwer further elaborated on the projects deployed by CLARIN (Common Language Resources and Technology Infrastructure). CLARIN is a European collaborative effort to create, coordinate and make language resources and technology available and readily useable. CLARIN is involved in setting technical standards and creating recommendations for specific topics. CLARIN has initiated the Component MetaData Infrastructure (CMDI), which is an integrated semantic layer to achieve semantic interoperability and overcome the differences between different metadata structures.
Presentation of responses to the Succeed questionnaire and overview of issues mentioned
To wrap up the first part of the programme, and to present an overview of the experiences and issues described by the participants, Rafael Carrasco from the University of Alicante presented the results of the Succeed online questionnaire (see also below).
Most institutions which filled out the questionnaire made clear that they are already addressing interoperability issues. They are mainly focusing on technical aspects, such as the normalization of resources or data and the creation of an interoperable architecture and interface. The motives for striving for interoperability were threefold: there is a clear demand by users; interoperability means an improved quality of service; and interoperability through cooperation with partner institutions brings many benefits to the institutions themselves. The most important benefits mentioned were: to create a single point of access (i.e., a better service to users), and to reduce the cost of software maintenance.
Tomasz Parkola and Sieta Neuerburg proceeded by recapping the issues raised in the presentations. Clearly, all issues mentioned by participants could be placed in one of the dimensions introduced by Carl Wilson, i.e. Technical, Organizational/Political, Financial, or Legal.
2. What is the value of interoperability?
Having established our baseline of the current status of interoperability, the afternoon programme of the workshop further included a number of interactive sessions, which were led by Irene Haslinger of the National library of the Netherlands. To start off, we asked the participants to write down their notion of the value of interoperability.
The following topics were brought up:
- Increased synergy
- More efficient/ effective allocation of resources
- Cost reduction
- Improved usability
- Improved data accessibility
3. How would you define the ideal situation with regard to interoperability?
After defining the value of interoperability, the participants were asked to describe their ‘ideal situation’.
The participants mainly mentioned their technical ideals, such as:
- Real time/ reliable access to data providers
- Incentives for data publishing for researchers
- Improved (meta)data quality
- Use of standards
- Ideal data model and/ or flexibility in data models
- Only one exchange protocol
- Automated transformation mechanism
- Unlimited computing capacity
- All tools are “plug and play” and as simple as possible
- Visualization analysis
Furthermore, a number of organizational ideals were brought up:
- The right skills reside in the right place/ with the right people
- Brokers (machines & humans) help to achieve interoperability
4. Identifying existing barriers
After describing the ‘ideal world’, we asked the participants to go back to reality and identify the most important barriers which – in their view – stop us from achieving the interoperability ideals described above.
In his presentation of the responses to the questionnaire, Rafael Carrasco had already identified the four issues considered to be the most important barriers for the implementation of interoperability:
- Insufficient expertise by users
- Insufficient documentation
- The need to maintain and/ or adapt third party software or webservices
- Cost of implementation
The following barriers were added by the participants:
Technical issues (in order of relevance)
- Pace of technological developments/ evolution
- Legacy systems
- Persistence; permanent access to data
- Stabilizing standards
Organizational/ Political issues (in order of relevance)
- Communication and knowledge management
- Lack of 21st century skills
- No willingness to share knowledge
- “Not invented here”-syndrome
- Establishment of trust
- Bridging the innovation gap; responsibility as well as robustness of tools
- Conflicts of interest between all stakeholders (e.g. different standards)
- Decision making/ prioritizing
- Current (EU) funding system hinders interoperability rather than helping it (funding should support interoperability between rather than within projects)
Financial issues (in order of relevance)
- Return on investment
- Commercial interests often go against interoperability
- Issues related to Intellectual Property Rights
5. Formulate an agenda: Who should address these issues?
Having identified the most important issues and barriers, we concluded the workshop with an open discussion centring on the question: who should address these issues?
In the responses to the questionnaire, the participants had identified three main groups:
- Standardization bodies
- The research community
- Software developers
During the discussion, the participants added some more concrete examples:
- Centres of Competence established by the European Commission should facilitate standardization bodies both by influencing the agenda (facilitating resources) and by helping institutions to find the right experts for their interoperability issues (and vice versa)
- Governmental institutions, including universities and other educational institutions, should strive to improve education in “21st century skills”, to improve users’ understanding of technical issues
At the end of our workshop, we concluded that, to achieve a real impact on the implementation of interoperability, there needs to be a demand from the side of the users, while the institutions and software developers need to be facilitated both organizationally and financially. Most probably, European Centres of Competence, such as Impact, have a role to play in this field. This is also most relevant in relation to the Succeed project. One of the project deliverables will be a Roadmap for funding Centres of Competence in work programmes. The role of Centres of Competence in relation to interoperability is one of the topics discussed in this document. As such, the results of the Succeed workshop on interoperability will be used as input for this roadmap.
We would like to thank all participants for their contribution during the workshop and look forward to working with you on interoperability issues in the future!
More pictures on Flickr
On 19 and 20 May, the National Library of the Netherlands (KB) visited the Digitisation Days which were held at the Biblioteca Nacional in Madrid. The conference was supported by the European Commission, and organised by the Support Action Centre of Competence in Digitisation (Succeed) project and the IMPACT Centre of Competence (IMPACT CoC) with the cooperation of Biblioteca Nacional de España.
For the National Library, being a collection holder, the Succeed awards ceremony was one of the highlights of the conference, because it showed the application of technology to actual collections. The Succeed awards aim to recognise successful digitisation programmes in the field of historical texts, especially those using the latest technology.
Two prizes went to the Hill Museum and Manuscript Library and the Centre d’Études Supérieures de la Renaissance, while two Commendations of Merit were awarded to the London Metropolitan Archives/ University College London and to Tecnilógica.
In her role as a member of the IMPACT CoC executive board, the KB’s Head of Research, Hildelies Balk, took part in the ceremony and awarded the Commendation of Merit to the London Metropolitan Archives/ University College London for their Great Parchment Book project. You will find a short video about the project here.
Moreover, on 20 May the KB hosted an interesting and fruitful round table workshop on the future of research and funding in digitisation and the possible roles of Centres of Competence. Some 30 librarians and researchers joined this workshop and discussed the following topics:
- What research is needed to further the development of the Digital Library?
- How can Centres of Competence assist your research or development?
- In digitisation, are we ready to move the focus from quantity to quality?
- What enrichments, e.g. in Named Entity Recognition, Linked Data services, or crowdsourcing for OCR correction, would be most beneficial for digitisation?
- What’s your take on Labs and Virtual Research Environments?
- What would you like to do in these types of research settings?
- What do you expect to get out of them?
The preliminary outcomes of the workshop show that the main goal for institutions is to give users unrestricted access to data. During the workshop, the participants discussed the many-layered aspects of these three topics, i.e. ‘users’, ‘access’, and ‘data’. Moreover, the participants gave their view on the following questions in relation to these topics:
- What stops us from making progress?
- What helps us to make progress?
- And what role could CoCs play in this?
The outcomes of the workshop have been documented and will be used as a starting point for the roadmap to further development of digitisation and the digital library, which will be produced within the Succeed project. This roadmap will serve to support the European Commission in preparing the 2014–2020 Work Programme for Research and Innovation.
On 19-20 May, the Digitisation Days will be held in Madrid. What is there to experience, and why should you go? We asked Hildelies Balk of the Koninklijke Bibliotheek, who chairs the board of the organiser, the IMPACT Centre of Competence (IMPACT CoC). – interview and photo by Inge Angevaare
Who are the Digitisation Days of interest to?
‘To anyone who deals with digitised historical texts. These are often hard to use because the recognition software makes many mistakes. That happens, for example, because the original print was already poor, or because the typeface is hard to read:
‘The software that has to convert the images into readable text turns that into:
VVt Venetien den 1.Junij, Anno 1618.
DJgn i f paffato te S’ aö’Jifeert mo?üen/bah
te / sbnbe bele btr felbrr geiufttceert baer bnber
eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu
enbeeemgljen bifet Cbeiiupcen berbonbru befe
‘The KB and other libraries want to offer this kind of text to researchers in a usable form. So since 2008 we have been looking, at a European level, for methods to improve the texts, preferably automatically. What is unique about the IMPACT Centre of Competence and about the Digitisation Days is that three interest groups come together there and reinforce one another:
- institutions with collections that have been digitised (libraries, archives, museums)
- researchers who develop methods to improve digitised text (image recognition and enhancement, pattern recognition, language technology)
- suppliers of products and services for digitisation and OCR (optical character recognition).
‘With all these people present, visitors get a complete overview in two days of everything that is currently possible – in document analysis, language technology and post-correction of OCR.’
What do you see as the greatest benefit of the Centre of Competence and the Digitisation Days?
‘The IMPACT Centre of Competence helps heritage institutions make important decisions. We evaluate existing tools and publish about them. There is even very good evaluation software. And we provide guidance; if an institution wants to start digitising, we can advise them. What are the best tools and methods in their specific case? What quality can you expect? What will it cost?’
‘The Digitisation Days are a perfect way for heritage institutions to meet one another and to share experiences and knowledge at length. For example: How do you deal with suppliers? How do you give digitisation a place in your organisation? But also: how do we set up new projects? How do we find funding streams? On the second day there is a workshop in which we will talk with interested parties about the research agenda for digitisation. Where should we put the emphasis? More quantity or more quality? How can we influence Europe’s plans and budgets?’
Speaking of Europe: IMPACT, IMPACT Centre of Competence, SUCCEED – the announcement of the Digitisation Days is full of abbreviations. Can you bring some order to that chaos?
‘IMPACT was the first European research project on improving access to historical texts; it started in 2008, partly on the initiative of the KB. When the project ended, a number of IMPACT partners joined forces to make sure that the project’s results would be maintained and developed further. That is the IMPACT Centre of Competence. Not a project, but a standing organisation.’
‘Succeed is again a European project and therefore temporary. Its objectives are entirely in line with the IMPACT CoC, which is why some of the same partners are involved. The aim is to make sure that the end results of European projects in the field of the digital library are properly promoted so that they come to be used in practice. In the past, prototypes would sometimes be left on the shelf. That is a waste of the investment.’
Is the step from theory to practice really being taken?
‘Absolutely! That is exactly what we want to put in the spotlight. That is why we present the Succeed awards during the Digitisation Days – prizes for the best applications of innovative solutions. The jury recently announced the candidates and the winners.’
What are you yourself looking forward to most during the Digitisation Days?
‘The meeting, the bringing together of all those stakeholders. Colleagues from other institutions, the researchers – exciting ideas and solutions often come precisely out of such encounters.’
2nd Succeed hackathon at the University of Alicante
Is there anyone still out there who thinks a hackathon is a malicious break-in? Far from it. It is the best way for developers and researchers to get together and work on new tools and innovations. The 2nd developers workshop / hackathon organised on 10-11 April by the Succeed Project was a case in point: bringing together people to work on new ideas and new inspiration for better OCR. The event was held in the “Claude Shannon” aula of the Department of Software and Computing Systems (DLSI) of the University of Alicante, Spain. Claude Shannon was a famous mathematician and engineer and is also known as the “father of information theory”. So it seems like a good place to have a hackathon!
Clemens explains what a hackathon is and what we hope to achieve with it for Succeed.
As last year, we provided a wiki upfront with some information about possible topics to work on, as well as a number of tools and data that participants could experiment with before and during the event. Unfortunately there was an unexpectedly high number of no-shows this time – we try to keep these events free and open to everyone, but may have to think about charging at least a no-show fee in the future, as places are usually limited. Or did those hackers simply have to stay home to fix the Heartbleed bug on their servers? We will probably never find out.
Collaboration, open source tools, open solutions
Nevertheless, there was a large enough group of programmers and researchers from Germany, Poland, the Netherlands and various parts of Spain eager to immerse themselves deeply into a diverse list of topics. Already in the introduction we agreed to work on open tools and solutions, and quickly identified some areas in which open source tool support for text digitisation is still lacking (see below). Actually, one of the first things we did was to set up a local git repository, and people were pushing code samples, prototypes and interesting projects to share with the group during both days.
What’s the status of open source OCR?
Accordingly, Jesús Dominguez Muriel from Digibís (the company that also made http://www.digibis.com/dpla-europeana/) started an investigation into open source OCR tools and frameworks. He made a really detailed analysis of the status of open source OCR, which you can find here. Thanks a lot for that summary, Jesús! At the end of his presentation, Jesús also suggested an “algorithm wikipedia” – I guess something similar to RosettaCode but then specifically for OCR. This would indeed be very useful to share algorithms but also implementations and prevent reinventing (or reimplementing) the wheel. Something for our new OCRpedia, perhaps?
A method for assessing OCR quality based on ngrams
As it turned out on the second day, a very promising idea was to use ngrams for assessing the quality of an OCR’ed text, without the need for ground truth. Well, in fact you do still need some correct text to create the ngram model, but one can use texts from e.g. Project Gutenberg or aspell for that. Two groups started to work on this: while Willem Jan Faber from the KB experimented with a simple Python script for that purpose, the group of Rafael Carrasco, Sebastian Kirch and Tomasz Parkola decided to implement this as a new feature in the Java ocrevalUAtion tool (check the work-in-progress “wip” branch).
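To give a rough idea of how such a check could work, here is a minimal Python sketch. It is neither Willem Jan’s script nor the ocrevalUAtion code; the function names (`char_ngrams`, `build_model`, `ngram_score`), the choice of character trigrams and the scoring rule are all assumptions made for illustration. The score is simply the share of trigrams in the OCR’ed text that also occur in a known-good reference text.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text.

    The text is lower-cased and runs of whitespace are collapsed first,
    so that line breaks in the OCR output do not create spurious n-grams.
    """
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_model(reference_text, n=3):
    """Count the n-grams of a known-good text (e.g. a Project Gutenberg book)."""
    return Counter(char_ngrams(reference_text, n))


def ngram_score(ocr_text, model, n=3):
    """Fraction of n-grams in the OCR text that also occur in the model.

    Values close to 1.0 suggest plausible text; low values suggest many
    character errors. Returns 0.0 for texts shorter than n characters.
    """
    grams = char_ngrams(ocr_text, n)
    if not grams:
        return 0.0
    known = sum(1 for gram in grams if gram in model)
    return known / len(grams)
```

With a model built from “the quick brown fox jumps over the lazy dog”, a clean line such as “the quick brown fox” scores higher than a damaged one such as “tbe qnick bvown f0x”. In practice the reference text would have to be large and in the same language and period as the OCR’ed material.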
Jesús in the front, Rafael, Sebastian and Tomasz discussing ngrams in the back.
Aligning text and segmentation results
Another very promising development was started by Antonio Corbi from the University of Alicante. He worked on software to align plain text and segmentation results. The idea is to first identify all the lines in a document, segment them into words and eventually individual characters, and then align the character outlines with the text in the ground truth. This would allow (among other things) creating a large corpus of training material for an OCR classifier based on the more than 50,000 images with ground truth produced in the IMPACT Project, for which correct text is available, but segmentation could only be done on the level of regions. Another great feature of Antonio’s tool is that while he uses D as a programming language, he also makes use of GTK, which has the nice effect that his tool works not only on the desktop, but also as a web application in a browser.
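The alignment step can be illustrated with a small Python sketch. This is not Antonio’s tool (which is written in D), and it makes a simplifying assumption: each segmented glyph already carries a provisional recognised character, so that the glyph sequence can be aligned with the ground truth using the standard library’s `difflib`. The function name `align_glyphs` and the data layout are hypothetical.

```python
from difflib import SequenceMatcher


def align_glyphs(glyphs, ground_truth):
    """Pair segmented glyph boxes with ground-truth characters.

    glyphs: list of (provisional_char, box) tuples in reading order.
    Returns a list of (ground_truth_char, box) pairs for every glyph
    that could be aligned one-to-one with a ground-truth character.
    """
    guessed = "".join(char for char, _ in glyphs)
    matcher = SequenceMatcher(None, guessed, ground_truth, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal' runs align one-to-one. 'replace' runs of equal length
        # are treated as plain misrecognitions and aligned the same way.
        # Insertions, deletions and uneven replacements are skipped,
        # because the glyph-to-character mapping is ambiguous there.
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for k in range(i2 - i1):
                pairs.append((ground_truth[j1 + k], glyphs[i1 + k][1]))
    return pairs
```

Each resulting pair links a glyph outline to its correct character, which is the kind of labelled sample an OCR classifier could be trained on.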
OCR is complicated, but don’t worry – we’re on it!
Gustavo Candela works for the Biblioteca Virtual Miguel de Cervantes, the largest Digital Library in the Spanish speaking world. Usually he is busy with Linked Data and things like FRBR, so he was happy to expand his knowledge and learn about the various processes involved in OCR and what tools and standards are commonly used. His findings: there is a lot more complexity involved in OCR than appears at first sight. And again, for some problems it would be good to have more open source tool support.
In fact, at the same time as the hackathon, the ‘Mining Digital Repositories’ conference was going on at the KB in The Hague, where the problem of bad OCR was discussed from a scholarly perspective. And there, too, the need for more open technologies and methods was apparent.
Open source border detection
One of the many technologies for text digitisation that are available in the IMPACT Centre of Competence for image pre-processing is Border Removal. This technique is typically applied to remove black borders in a digital image that have been captured while scanning a document. The borders don’t contain any information, yet they take up expensive storage space, so removing the borders without removing any other relevant information from a scanned document page is a desirable thing to do. However, there is no simple open source tool or implementation for doing that at the moment. So Daniel Torregrosa from the University of Alicante started to research the topic. After some quick experiments with tools like imagemagick and unpaper, he eventually decided to work on his own algorithm. You can find the source here. Besides, he probably earns the award for the best slide in a presentation…showing us two black pixels on a white background!
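To make the problem a bit more concrete, here is a deliberately naive Python sketch of border removal. It is not Daniel’s algorithm and not the IMPACT tool: it simply strips rows and columns from each edge as long as they are almost entirely dark. The function name `crop_black_border`, the representation of the image as a list of rows of grayscale values, and both threshold values are assumptions for illustration.

```python
def crop_black_border(image, dark_threshold=50, dark_fraction=0.9):
    """Strip mostly-dark rows and columns from the edges of a page image.

    image: list of equal-length rows of grayscale values (0 = black,
    255 = white). A row or column counts as border when at least
    dark_fraction of its pixels are darker than dark_threshold.
    Returns a new, cropped list of rows ([] if everything is border).
    """
    def mostly_dark(values):
        dark = sum(1 for value in values if value < dark_threshold)
        return dark >= dark_fraction * len(values)

    # Trim border rows from the top and the bottom.
    top, bottom = 0, len(image)
    while top < bottom and mostly_dark(image[top]):
        top += 1
    while bottom > top and mostly_dark(image[bottom - 1]):
        bottom -= 1
    rows = image[top:bottom]
    if not rows:
        return []

    # Trim border columns from the left and the right.
    left, right = 0, len(rows[0])
    while left < right and mostly_dark([row[left] for row in rows]):
        left += 1
    while right > left and mostly_dark([row[right - 1] for row in rows]):
        right -= 1
    return [row[left:right] for row in rows]
```

Real scans are much harder than this: borders are skewed, noisy and uneven, and dark illustrations close to the page edge must not be cut off, which is why a dedicated algorithm is worth researching.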
A great venue
All in all, I think we can really be quite happy with these results. And indeed the University of Alicante also did a great job hosting us – there was an excellent internet connection available via cable and wifi, plenty of space and tables to discuss in groups and we were distant enough from the classrooms not to be disturbed by the students or vice versa. Also at any time there was excellent and light Spanish food – Gazpacho, Couscous with vegetables, assorted Montaditos, fresh fruit…nowadays you won’t make hackers happy with just pizza anymore! Of course there were also ice-cooled drinks and hot coffee, and rumours spread that there were also some (alcohol-free?) beers in the cooler, but (un)fortunately there is no more documentary evidence of that…
To be continued!
If you want to try out any of the software yourself, just visit our github and have a go! Make sure to also take a look at the videos that were made with participants Jesús, Sebastian and Tomasz, explaining their intentions and expectations for the hackathon. And at the next hackathon, maybe we can welcome you too amongst the participants?
People & knowledge are the keys to breaking down the walls between daily operations and digital preservation (DP) within our organisations. DP is not a technical issue, but information technology must be embraced as a core feature of the digital library. Such were some of the conclusions of the seminar organised by the SCAPE project/Open Planets Foundation at the Dutch National Library (KB) and National Archives (NA) on Wednesday 2 April. – Report & photographs by Inge Angevaare, visualisations by Elco van Staveren
Newcomer questions some current practices
Menno Rasch was appointed Head of Operations at the Dutch KB 6 months ago – but ‘I still feel like a newcomer in digital preservation.’ His division includes the Collection Care department which is responsible for DP. But there are close working relationships with the Research and IT departments in the Innovation Division. Rasch’s presentation about embedding DP in business practices in the KB posed some provocative questions:
- We have a tendency to cover up our mistakes and failures rather than expose them and discuss them in order to learn as a community, which is what pilots do. The platform is there, the Atlas of Digital Damages set up by the KB’s Barbara Sierman, but it is being underused. Of course lots of data are protected by copyright or privacy regulations, but there surely must be some way to anonymise the data.
- In libraries and archives, we still look upon IT as ‘the guys that make tools for us’. ‘But IT = the digital library.’
- We need to become more pragmatic. Implementing the OAIS standard is a lot of work – perhaps it is better to take this one step at a time.
- ‘If you don’t do it now, you won’t do it a year from now.’
- ‘Any software we build is temporary – so keep the data, not the software.’
- Most metadata are reproducible – so why not store them in a separate database and put only the most essential preservation metadata in the OAIS information package? That way we can continue improving the metadata. Of course these must be backed up too (an annual snapshot?), but may tolerate a less expensive storage regime than the objects.
- About developments at the KB: ‘To replace our old DIAS system, we are now developing software to handle all of our digital objects – which is an enormous challenge.’
Digital collections and the Titanic
Zoltan Szatucsket from the Hungarian National Archives used the Titanic for his presentation’s metaphor – without necessarily implying that we are headed for the proverbial iceberg, he added. Although, … ‘many elements from the Titanic story can illustrate how we think’:
- Titanic received many warnings about ice formations, and yet it was sailing at full speed when disaster struck.
- Our ship – the organisation – is quite conservative. It wants to deal with digital records in the same way it deals with paper records. And at the Hungarian National Archives IT and archivist staff are in the same department, which does not work because they do not speak each other’s language.
- The captain must acquire new competences. He must learn to manage staff, funding, technology, equipment, etc. We need processes rather than tools.
- The crew is in trouble too. Their education has not adapted to digital practices. Underfunding in the sector is a big issue. Strangely enough, staff working with medieval resources were much quicker to adopt digital practices than those working with contemporary material. They seem to want to put off any action until legal transfer to the archives actually occurs (15-20 years).
- Echoing Menno Rasch’s presentation, Szatucsket asked the rhetorical question: ‘Why do we not learn from our mistakes?’ A few months after Titanic, another ship went down in similar circumstances.
- Without proper metadata, objects are lost forever.
- Last but not least: we have learned that digital preservation is not a technical challenge. We need to create a complete environment in which to preserve.
OPF: trust, confidence & communication
Ed Fay was appointed director of the Open Planets Foundation (OPF) only six weeks ago. But he presented a clear vision of how the OPF should function within the community, right in the middle, as a steward of tools, a champion of open communications, trust & confidence, a broker between commercial and non-commercial interests.
Fay also shared some of his experiences in his former job at the London School of Economics.
So, what works, what doesn’t?
The first round-table discussion was introduced by Bjarne Anderson of the Statsbiblioteket Aarhus (DK). He sketched his institution’s experiences in embedding digital preservation.
He mentioned the recently introduced SCRUM-based methodology as really having helped to move things along – it is an agile way of working which allows for flexibility. The concept of ‘user stories’ helps to make staff think about the ‘why’. Menno Rasch (KB) agreed: ‘SCRUM works especially well if you are not certain where to go. It is a step-by-step methodology.’
Some other lessons learned at Aarhus:
- The responsibility for digital preservation cannot be with the developers implementing the technical solutions
- The responsibility needs to be close to ‘the library’
- Don’t split the analogue and digital library entirely – the two have quite a lot in common
- IT development and research are necessary activities to keep up with a changing landscape of technology
- Changing the organisation a few times over the years helped us educate the staff by bringing traditional collection/library staff close to IT for a period of time.
And here is how Elco van Staveren visualised the group discussion in real time:
Some highlights from the discussion:
- Embedding digital preservation is about people
- It really requires open communication channels.
- A hierarchical organisation and/or an organisation with silos only builds up the wall. Engaged leadership is called for. And result-oriented incentives for staff rather than hierarchical incentives.
- Embedding digital preservation in the organisation requires a vision that is shared by all.
- Clear responsibilities must be defined.
- Move the budgets to where the challenges are.
- The organisation’s size may be a relevant factor in deciding how to organise DP. In large organisations, the wheels move slowly (no. of staff in the Hungarian National Archives 700; British Library 1,500; Austrian National Library 400; KB Netherlands 300; London School of Economics 120; Statsbiblioteket Aarhus 200).
- Most organisations favour bringing analogue and digital together as much as possible.
- When it comes to IT experts and librarians/archivists learning each other’s languages, it was suggested that maybe hard IT staff need not get too deeply involved in library issues – in fact, some IT staff might consider it bad for their careers. Software developers, however, do need to get involved in library/archive affairs.
- Management must also be taught the language of the digital library and digital preservation.
(Continued in Breaking down walls in digital preservation, part 2)
Throughout recent weeks, rumors spread at KB National Library of the Netherlands that there would be a party of programmers coming to the library to participate in a so-called “hackathon”. In the beginning, especially the IT department was rather curious: will we have to expect port scans being done from within the National Library’s network? Do we need to apply special security measures? Fortunately, none of that was necessary.
A “hackathon” is nothing to be afraid of, normally. On the contrary: the informal gatherings of software developers to work collaboratively on creating and improving new or existing software tools and/or data have emerged as a prominent pattern in recent years – in particular the hack4Europe series of hack days that is organized by Europeana has shown that this model can also be successfully applied in the context of cultural heritage digitization.
After that was sorted, a network switch with static IP addresses was deployed by the facilities department of the KB, thereby ensuring that participants of the event had a fast and robust internet connection at all times and allowing access to the public parts of the internet and the restricted research infrastructure of the KB at the same time – which received immediate praise from the hackers. Well done, KB!
So when the software developers from Austria, England, France, Poland, Spain and the Netherlands gathered at the KB last Thursday, everyone already knew they were indeed here to collaboratively work on one of the European projects the KB is involved in: the Succeed project. The project had called in software developers from all over Europe to participate in the 1st Succeed hackathon to work on interoperability of tools and workflows for text digitization.
There was a good mix of people from the digitization as well as digital preservation communities, with some additional Taverna expertise tossed in. While about half of the participants had participated in either Planets, IMPACT or SCAPE, the other half of them were new to the field and eager to learn about the outcomes of these projects and how Succeed will address them.
And so after some introduction followed by coffee and fruit, the 15 participants immersed themselves straight away in the various topics that were suggested prior to the event as needing attention. And indeed, the results that were presented by the various groups after 1.5 days (but only 8 hours of effective working time) were pretty impressive…
The developers from INL were able to integrate some of the servlets they created in IMPACT and Namescape with the interoperability-framework – although some bugs were also uncovered while doing so. They will be fixed asap, rest assured! Also, with the help of the PSNC digital libraries team, Bob and Jesse were able to create a small training set for Tesseract, outperforming the standard dictionary despite some problems that were found in training Tesseract version 3.02. Fortunately it was possible to apply the training to version 3.01 and then run the generated classifier in Tesseract version 3.02, which is the current stable(?) release.
Even better: the colleagues from Poznań (who have a track record of successful participation at hackathons) had already done some training with Tesseract earlier and developed some supporting tools for it. Piotr quickly created a tool description for the “cutouts” tool that automatically creates binarized clippings of characters from a source image. On the second day another feature of the cutouts application was added: creating an artificial image suitable for training Tesseract from the binarized character clippings. Time ran out while the two operations were being wrapped in a Taverna workflow, but given that only a little work remained, we look forward to seeing the Taverna workflow for Tesseract training become available shortly! Certainly this is also of interest to the eMOP project in the US, in which the KB is a partner as well.
Meanwhile, another colleague from Poznań was investigating the process of creating packages for Debian-based Linux operating systems from existing (open source) tools. And despite using a laptop with OS X Mountain Lion, Tomasz managed to present a valid Debian package (including even icon and man page) – kudos! Certainly the help of Carl from the Open Planets Foundation was also partly to blame for that. Next steps will include creating a change log straight off GitHub. To be continued!
Another group attending the event was the team from LITIS lab at the University of Rouen. Thierry demonstrated the newest PLaIR tools such as the newspaper segmenter capable of automatically separating articles in scanned newspaper images. The PLaIR tools use GEDI as the encoding format, so some work was immediately invested by David to also support the PAGE format, the predominant format for document encoding used in the IMPACT tools, thereby in principle establishing interoperability between IMPACT and PLaIR applications. In addition, since the PLaIR tools are mostly already available as web services, Philippine started creating Taverna workflows for these methods. We look forward to complementing the existing IMPACT workflows with those additional modules from PLaIR!
All this was done without requiring any help from the PRImA group at the University of Salford, Greater Manchester, who are maintaining the PAGE format and a number of tools to support it. So with some free time on his hands, Christian from PRImA instead had a deeper look at Taverna and the PAGE serialization of the recently released open source OCR evaluation tool from the University of Alicante, the technical lead of the Centre of Competence, and found it to be working quite fine. Good to finally have an open source community tool for OCR evaluation with support for PAGE – and more features shall be added soon: we’re thinking word accuracy rate, bag-of-words evaluation and more – send us your feature requests (or even better: pull requests).
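For readers unfamiliar with the two measures mentioned above, here is a minimal sketch of what a word accuracy rate and a bag-of-words evaluation could look like. It is not the implementation of the Alicante tool: the function names, the whitespace tokenisation and the exact definitions are assumptions made for illustration.

```python
from collections import Counter


def word_accuracy(reference: str, ocr_output: str) -> float:
    """1 minus the word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), ocr_output.split()
    if not ref:
        return 1.0 if not hyp else 0.0
    # Dynamic-programming edit distance computed over words, row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return max(0.0, 1.0 - prev[-1] / len(ref))


def bag_of_words_recall(reference: str, ocr_output: str) -> float:
    """Share of reference words found in the OCR output, ignoring word order."""
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(ocr_output.split())
    total = sum(ref_counts.values())
    if total == 0:
        return 1.0
    # Counter intersection keeps the minimum count of each shared word.
    matched = sum((ref_counts & hyp_counts).values())
    return matched / total
```

The order-sensitive measure penalises misplaced text (for example a wrong reading order after layout analysis), while the bag-of-words measure only asks whether the words were recognised at all, which is closer to what matters for full-text search.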
We were particularly glad also that some developers beyond the usual MLA community suspects have found the way to the KB on those 2 days: a team from the Leiden University Medical Centre was also attending, keen on learning how they could use the T2-Client for their purposes. Initially slowed down by some issues encountered in deploying Taverna 2 Server on a Windows machine (don’t do it!), eventually Reinout and Eelke were able to resolve it simply by using Linux instead. We hope a further collaboration of Dutch Taverna users will arise from this!
Besides all the exciting new tools and features it was good to also see some others getting their hands dirty with (essential) engineering tasks – work progressed well on several issues from the interoperability-framework’s issue tracker: support for output directories is close to being fully implemented thanks to Willem Jan, and a good start was made on future MTOM support. Also Quique from the Centre of Competence was able to improve the integration between IMPACT services and the website Demonstrator Platform.
Without the help of experienced developers Carl from the Open Planets Foundation and Sven from the Austrian National Library (who had just conducted a training event for the SCAPE project earlier in the same week in London, and quickly decided to cross the channel for yet one more workshop), this would not have been so easily possible. While Carl was helping out everywhere at once, Sven found some time to fit in a Taverna training session after lunch on Friday, which was hugely appreciated by the audience.
After seeing all the powerful capabilities of Taverna in combination with the interoperability-framework web services and scripts in a live demo, no one needed further reassurance that it was well worth spending the time to integrate this technology and work with the interoperability-framework and its various components.
Everyone said they really enjoyed the event and found plenty of valuable things that they had learned and wanted to continue working with. So watch out for the next Succeed hackathon in sunny Alicante next year!
Digital preservation practitioners from Portico and from the National Library of The Netherlands (KB) organized a workshop on “Preservation at Scale” as part of iPres2013. This workshop aimed to articulate and, if possible, to address the practical problems institutions encounter as they collect, curate, preserve, and make content accessible at Internet scale.
Preservation at scale has entailed continual development of new infrastructure. In addition to preservation of digital documents and publications, data archives are collecting a vast amount of content which must be ingested, stored and preserved. Whether we have to deal with nuclear physics materials, social science datasets, audio and video content, or e-books and e-journals, the amount of data to be preserved is growing at a tremendous pace.
The presenters at this workshop each spoke from the experience of organizations in the digital preservation space that are wrestling with the issues introduced by large scale preservation. Each of these organizations has experienced annual increases in throughput of content, which they have had to meet, not just with technical adaptations (increases in hardware and software processing power), but often also with organizational re-definition, along with new organizational structures, processes, training, and staff development.
There were a number of broad categories addressed by the workshop speakers and participants:
- Technological adaptations
- Institutional adaptations
- Quality assurance at scale and across scale
- The scale of the long tail
- Economies and diseconomies of scale
Many of the organizations represented at this workshop have gone through one or more cycles of technological expansion, adaptation, and platform migration to manage the current scale of incoming content, to take advantage of new advances in both hardware and software, or to respond to changes in institutional policy with respect to commercial vendors or suppliers.
These include both optimizations and large-scale platform migrations at the Koninklijke Bibliotheek, Harvard University Library, the Data Conservancy at Johns Hopkins University, and Portico, as well as the development by the PLANETS and SCAPE projects of frameworks, tools and test beds for implementing computing-intensive digital preservation processes such as the large-scale ingestion, characterization, and migration of large (multi-terabyte) and complex data sets.
A common challenge was reaching the limits of previous-generation architectures (whether those limits are those of capacity or of the capability to handle new digital object types), with the consequent need to make large-scale migrations both of content and of metadata.
For many of the institutions represented at this workshop, the increasing scale of digital collections has resulted in fundamental changes to those institutions themselves, including changes to an institution’s own definition of its mission and core activities. For these institutions, a difference in degree has meant a difference in kind.
For example, the Koninklijke Bibliotheek, the British Library, and Harvard University Library have all made digital preservation a library level mandate. This shift from relegating the preservation of digital content to an organizational sub-unit to ensuring that digital preservation is an organization-wide endeavor is challenging, as it requires changing the mindsets of many in each organization. It has meant reallocation of resources from other activities. It has necessitated strategic planning and budgeting for long-term sustainability of digital assets, including digital preservation tools and frameworks – a fundamental shift from one-time, project-based funding. It has meant making choices; we cannot do everything. It has meant comprehensive review of organizational structures and procedures, and has entailed equally comprehensive training and development of new skill sets for new functions.
Quality Assurance at Scale and Across Scales
A challenge to scaling up the acquisition and ingest of content is the necessity for quality assurance of that content. Often institutions are far downstream from the creators of content. This brings along many uncertainties and quality issues. There was much discussion of how institutions define just what is “good enough,” and how those decisions are reflected in the architecture of their systems. Some organizations have decided to compromise on ingest requirements as they have scaled up, while other organizations have remained quite strict about the cleanliness of content entering their archives. As the amount of unpreserved digital content continues to grow, this question of “what is sufficient” will persist as a challenge, as will the challenge of moving QA capabilities further upstream, closer to the actual producers of data.
The Scale of the Long Tail
As more and more content is both digitized and born digital, institutions are finding they must scale for increases in both resource access requests and expectations for completeness of collections.
The number of e-journals in the world that are not preserved was a recurrent theme. The exact number of journals that are not being preserved is unknown, but some facts are:
- 79% of the 100,000 serials with ISSN are not known to be preserved anywhere. It is not known how many serials that do not have ISSNs are being preserved.
- In 2012, Cornell and Columbia University Libraries (2CUL) estimated that about 85% of e-serial content is unpreserved.
This digital “dark matter” is dwarfed in scope by existing and anticipated scientific and other research data, including that generated by sensor networks and by rich multimedia content.
Economies and Diseconomies of Scale
Perhaps the most important question raised at this workshop was whether we as a community are really at scale yet. Can we yet leverage true economies of scale? David Rosenthal noted that as we centralize more and more preserved content in fewer hands, we will be able to better leverage economies of scale, but we will also be increasing the risk of a single point of failure.
The consensus of the group seemed to be that, as a whole, the digital preservation community is not yet truly at scale. However, the organizations in the room have moved beyond a project mentality and into a service oriented mentality, and are actively seeking ways to avoid wasteful duplication of effort, and to engage in active cooperation and collaboration.
Workshop presentations and notes on each presentation are available at: https://drive.google.com/folderview?id=0B1X7I2IVBtwzcGVhWUF0TmJIUms&usp=sharing
The KB participates in the Europeana Newspapers project, which started in February 2012. The project will enrich 18 million pages of digitised newspapers from all over Europe with Optical Character Recognition (OCR), Optical Layout Recognition (OLR) and Named Entity Recognition (NER) and deliver them to Europeana. The project consortium consists of 18 partners from all over Europe: some will provide (technical) support, while others will provide their digitised newspapers. The KB has two roles: we will not only deliver 2 million of our newspaper pages to Europeana, but we will also enrich our own newspapers and those of other partners with NER.
In recent months, the project has welcomed 11 new associated partners. To make sure they can benefit as much as possible from the experiences of the project partners, the University Library of Belgrade and LIBER jointly organised a workshop on refinement and aggregation on 13 and 14 June. Here, the KB (Clemens Neudecker and I) presented the work that is currently being done to make sure that we will have Named Entities for several partners. To make sure that the work that is being done in the project also benefits our direct colleagues, we were joined by someone from our Digitisation department.
The workshop started with a warm welcome in Belgrade by the director of the library, Prof. Aleksandar Jerkov. After a short introduction to the project by the project leader Hans-Jörg Lieder from the State Library Berlin, Clemens Neudecker from the KB presented the refinement process of the project. All presentations will be shared on the project’s Slideshare account. The refinement of the newspapers has already started and is being done by the University of Innsbruck and the company CCS in Hamburg. However, it was still a big surprise when Hans-Jörg Lieder announced a present for the director of the University Library Belgrade: the first batch of their processed newspapers!
The day continued with an introduction to the importance of evaluation of OCR and OLR and a demonstration of the tools used for this by Stefan Pletschacher and Christian Clausner from the University of Salford. This sparked some interesting discussions in the break-out sessions on methods of evaluation in the libraries digitising their collections. For example, do you tell your service provider what you will be checking when you receive a batch? You could argue that the service provider would then only fix what you check. On the other hand, if that is what you need to reach your goal, it would save a lot of time and rejected batches.
After a short getting-to-know-each-other session the whole workshop party moved to the Nikola Tesla Museum nearby where we were introduced to their newspaper clippings project. All newspaper clippings collected by Nikola Tesla are now being digitised for publication on the museum’s website. A nice tour through the museum followed with several demonstrations (don’t worry, no one was electrocuted) and the day was concluded with a dinner in the bohemian quarter.
The second day of the workshop was dedicated solely to refinement. I kicked off the day with the question ‘What is a named entity?’. This sounds easy, but can provide you with some dilemmas as well. For example, a dog’s name is a name, but do you want it to be tagged as a NE? And what do you do with a title such as Romeo and Juliet? Consistency is key in this and as long as you keep your goal in mind while training your software you should end up with the results you are looking for.
Claus Gravenhorst followed me with his presentation on OLR at CCS using docWorks, with which they will process 2 million pages. It was then again our turn with a hands-on session about the tools we’re using, which are also available on GitHub. The last session of the workshop was a collaboration between Claus Gravenhorst from CCS and Günter Mühlberger from the University of Innsbruck who gave us a nice insight into their tools and the considerations made when working with digitised newspapers. For example, how many categories would you need to tag every article?
All in all, it was a very successful workshop and I hope that all participants enjoyed it as much as I have. I at least am happy to have spoken to so many interesting people with new experiences from other digitisation projects. There is still much to learn from each other and projects like Europeana Newspapers contribute towards a good exchange of knowledge between libraries to ensure our users get the best experience when browsing through the rich digital collections.
The SurfAcademy, a program set up to encourage knowledge exchange between higher education institutions in the Netherlands, organised a seminar on MOOCs, Massive Open Online Courses, on 26 February. Several Dutch institutions have started with MOOCs on various platforms and subjects, so the special interest group Open Educational Resources (OER) of Surf thought it was time to share experiences and open up the discussion for institutions that wish to jump on this fast moving train.
As the National Library of the Netherlands, the Koninklijke Bibliotheek does not normally provide education, but we do work together with the Dutch universities (of applied sciences) and we are happy to share knowledge with our colleagues and users. Also, as one of the founding members of the IMPACT Centre of Competence in text digitisation, we were asked to think about how we can best share the knowledge that was gathered in the four-year research project IMPACT. Perhaps a MOOC would be a good idea?
The afternoon had an ambitious program, and it was filled with experiences and interesting observations. I thought the most interesting parts of the afternoon were the presentations of the universities that are currently working with MOOCs in the Netherlands: Leiden University, presented by Marja Verstelle; the University of Amsterdam, presented by Frank Benneker; and the Technical University Delft, whose work on its MOOC was presented by Willem van Valkenburg.
It was interesting to see the different choices each institution made for its own implementation of a MOOC. Leiden chose to work with Coursera and TU Delft joined EdX, while Amsterdam built its own platform (forever beta) with a private partner in only two months and for just 20k euro. Each has its own reasons for these choices, such as flexibility (Amsterdam), openness (Delft) or ease (Leiden). Amsterdam is the only university that has already started its MOOC, with great success (4800 participants in the first week); Leiden plans to start in May 2013 and Delft follows in September.
Another interesting presentation was the one by Timo Kos, of both Khan Academy and Capgemini Consulting. He shared the results of two projects he did on OER, including MOOCs. He showed us that MOOCs are not a technical hype: they use no new technologies, but merely combine existing ones for a new purpose. MOOCs can, however, be regarded as a disruptive innovation, although, as he said in the panel discussion at the end of the day, we do not have to fear that real-life universities will be pushed out by MOOCs.
All in all, I thought it was a very educative day with lots of food for thought. Most presentations are unfortunately in Dutch, but can be found on the website of the Surf Academy, where you will also find the videos made during the seminar. The English presentations have been embedded or linked to in this post.
Some of the questions and insights I took home with me:
- Leiden and Amsterdam chose to create shorter videos for their MOOCs, while Delft will record regular classes. When do you choose which approach?
- Do you want to use a platform of your own or will you sign up with one of the existing ones? (Examples: Coursera, EdX, Udacity, canvas.net)
- Coursera takes 80-90% of the money made in a MOOC and they sell their users’ data to third parties. (I have to say that I did not fact-check this one!)
- Do you want to get involved in the world of MOOCs as a non-top-50 university or even as a non-educational institute? The BL will do so, by joining FutureLearn.
- PR of your MOOC is very important, especially if you use your own platform. However, getting a news item on the Dutch 8 o’clock news will probably mean one server is not enough for the first class.
- The success of a MOOC also depends on the reputation of your institution.
- Do students feel they are studying at an institute/university or at, e.g., Coursera?
- Using a MOOC towards your own degree is possible when you take the exam in/with a certified testing centre, such as Pearson or ProctorU.
- If you plan to go into online education, when do you consider it a MOOC and when is it simply an online course?