KB Research

Research at the National Library of the Netherlands

Tag: centres of competence

Succeed project rated ‘Excellent’ by European Commission

Author: Silvia Ponzoda
This post is a summary. The original article is available at: http://www.digitisation.eu/blog/european-commission-rated-excellent-succeed-project-results/

The Succeed project has recently been rated ‘Excellent’ by the European Commission. The final evaluation of the Succeed project took place on 19 February 2015 at the University of Alicante, during a meeting of the committee of experts appointed by the European Commission (EC) with the Succeed consortium members. The meeting was chaired by Cristina Maier, Succeed Project Officer from the European Commission.
Succeed has been funded by the European Union to promote the take up and validation of research results in mass digitisation, with a focus on textual content. For a description of the project and the consortium, see our earlier post Succeed project launched.

The outputs produced by Succeed during the project life span (January 2013-December 2014) are listed below.

Succeed technical workshop on the interoperability of digitisation platforms

Succeed Interoperability Workshop
2 October 2014, National Library of the Netherlands, The Hague

Speaking the same language is one thing, understanding what the other is saying is another… – Carl Wilson at the Succeed technical workshop on the interoperability of digitisation platforms.

Interoperability is a term widely used to describe the ability to make systems and organisations work together (inter-operate). However, interoperability is not just about the technical requirements for information exchange. In a broader definition, it also includes social, political, and organisational factors that affect system-to-system performance, and it is related to questions of (commercial) power and market dominance (see http://en.wikipedia.org/wiki/Interoperability).

On 2 October 2014, the Succeed project organised a technical workshop on the interoperability of digitisation platforms at the National Library of the Netherlands in The Hague. Nineteen researchers, librarians, and computer scientists from several European countries participated in the workshop (see SUCCEED Interoperability Workshop_Participants). In preparation for the workshop, the Succeed project team asked participants to fill out a questionnaire on the topic of interoperability. The questionnaire was filled out by 12 participants; the results were presented during the workshop. The programme included a number of presentations and several interactive sessions aimed at reaching a shared view on what interoperability is about, what the main issues and barriers are, and how we should approach them.

The main goals of the workshop were:

  1. Establishing a baseline for interoperability based on the questionnaire and presentations of the participants
  2. Formulating a common statement on the value of interoperability
  3. Defining the ideal situation with regard to interoperability
  4. Identifying the most important barriers
  5. Formulating an agenda

1. Baseline

Presentation by Carl Wilson

To establish a baseline (what is interoperability and what is its current status in relation to digitisation platforms), our programme included a number of presentations. We invited Carl Wilson of the Open Preservation Foundation (previously the Open Planets Foundation) to give the opening speech. He set the scene by sharing a number of historical examples (in IT and beyond) of interoperability issues. Carl made clear that interoperability in IT has many dimensions:

  1. Technical dimensions
    Within the technical domain, two types of interoperability can be discerned:
      • Syntactic interoperability (aligning metadata formats): “speaking the same language”
      • Semantic interoperability: “understanding each other”
  2. Organizational/Political dimensions
  3. Legal (IPR) dimensions
  4. Financial dimensions

When approaching interoperability issues, it might help to take into account these basic rules:

  • Simplicity
  • Standards
  • Clarity
  • Test early (automated testing, virtualisation)
  • Test often

Finally, Carl stressed that the importance of interoperability will further increase with the rise of the Internet of Things, as it involves more frequent information exchange between more and more devices.

The Succeed Interoperability platform

After Carl Wilson’s introductory speech, Enrique Molla from the University of Alicante (UA is project leader of the Succeed project) presented the Succeed Interoperability framework, which allows users to test and combine a number of digitisation tools. The tools are made available as web services by a number of different providers, which allows the user to try them out online without having to install any of them locally. The Succeed project ran into a number of interoperability-related issues when developing the platform. For instance, the web services come from a number of different suppliers, some of whom are not maintaining their services. Moreover, the providers of the web services often have commercial interests, which means that they impose limits such as a maximum number of users or of pages tested through the tools.

Presentations by participants

After the demonstration of the Succeed Interoperability platform, the floor was open for the other participants, many of whom had prepared a presentation about their own project and their experience with issues of interoperability.

Bert Lemmens presented the first results of the Preforma Pre-Commercial Procurement project (running January 2014 to December 2017). A survey performed by the project made clear that (technically) open formats are in many cases not the same as libre/ open source formats. Moreover, even when standard formats are used across different projects, they are often implemented in multiple ways. And finally, when a project or institution has found their technically appropriate format, they may often find that limited support is available on how to adopt the format.

Gustavo Candela Romero gave an overview of the services provided by the Biblioteca Virtual Miguel de Cervantes (BVMC). The BVMC developed its service-oriented architecture with the purpose of facilitating online access to Hispanic culture. The BVMC offers its data via OAI-PMH, allowing other institutions or researchers to harvest its content. Moreover, the BVMC is working towards publishing its resources in RDF and making them available through a SPARQL endpoint.
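To give an idea of what harvesting over OAI-PMH involves, here is a minimal Python sketch. It builds a standard ListRecords request and reads the Dublin Core titles from a response. The base URL and the sample response are made up for illustration; they are not the BVMC's actual endpoint or data.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used for Dublin Core elements inside OAI-PMH records.
NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical response, shortened to a single record for illustration.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Don Quijote de la Mancha</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL of an OAI-PMH ListRecords request."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Return the text of every dc:title element in an OAI-PMH response."""
    root = ET.fromstring(response_xml.strip())
    return [title.text for title in root.iterfind(".//dc:title", NAMESPACES)]


# "https://example.org/oai" is a placeholder, not a real repository.
print(list_records_url("https://example.org/oai"))
print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would also fetch the URL and follow the resumption tokens that repositories use to split large result sets; both are left out here to keep the sketch short.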

Alastair Dunning and Pavel Kats explained how Europeana and The European Library are working towards a shared storage system for aggregators with shared tools for the ingestion and mapping process. This will have practical and financial benefits, as shared tools will reduce workflow complexity, are easier to sustain and, finally, cheaper.

Clara Martínez Cantón presented the work of the Digital Humanities Innovation Lab (LINHD), the research centre on Digital Humanities at the National Distance Education University (UNED) in Spain. The LINHD encourages researchers to make use of Linked Data. Clara showed the advantages of using Linked Data in a number of research projects related to metrical repertoires. In these projects, a number of interoperability issues (such as a variety of structures of the data, different systems used, and variation in the levels of access) were by-passed by making use of a Linked Data Model.

Marc Kemps-Snijders made clear how the Meertens Institute strives to make collections and technological advances available to the research community and the general public by providing technical support and developing applications. Moreover, the Meertens Institute is involved in a number of projects related to interoperability, such as Nederlab and CLARIN.

Menzo Windhouwer further elaborated on the projects deployed by CLARIN (Common Language Resources and Technology Infrastructure). CLARIN is a European collaborative effort to create, coordinate and make language resources and technology available and readily usable. CLARIN is involved in setting technical standards and creating recommendations for specific topics. CLARIN has initiated the Component MetaData Infrastructure (CMDI), which is an integrated semantic layer to achieve semantic interoperability and overcome the differences between different metadata structures.

Presentation of responses to the Succeed questionnaire and overview of issues mentioned

To wrap up the first part of the programme, and to present an overview of the experiences and issues described by the participants, Rafael Carrasco from the University of Alicante presented the results of the Succeed online questionnaire (see also below).

Most institutions which filled out the questionnaire made clear that they are already addressing interoperability issues. They are mainly focusing on technical aspects, such as the normalization of resources or data and the creation of an interoperable architecture and interface. The motives for striving for interoperability were threefold: there is a clear demand by users; interoperability means an improved quality of service; and interoperability through cooperation with partner institutions brings many benefits to the institutions themselves. The most important benefits mentioned were: to create a single point of access (i.e., a better service to users), and to reduce the cost of software maintenance.

Tomasz Parkola and Sieta Neuerburg proceeded by recapitulating the issues raised in the presentations. Clearly, all issues mentioned by participants could be placed in one of the dimensions introduced by Carl Wilson, i.e. Technical, Organizational/Political, Financial, or Legal.

2. What is the value of interoperability?

Having established our baseline of the current status of interoperability, the afternoon programme of the workshop included a number of interactive sessions, which were led by Irene Haslinger of the National Library of the Netherlands. To start off, we asked the participants to write down their notion of the value of interoperability.

The following topics were brought up:

  • Increased synergy
  • More efficient/ effective allocation of resources
  • Cost reduction
  • Improved usability
  • Improved data accessibility

3. How would you define the ideal situation with regard to interoperability?

After defining the value of interoperability, the participants were asked to describe their ‘ideal situation’.

The participants mainly mentioned their technical ideals, such as:

  • Real time/ reliable access to data providers
  • Incentives for data publishing for researchers
  • Improved (meta)data quality
  • Use of standards
  • Ideal data model and/ or flexibility in data models
  • Only one exchange protocol
  • Automated transformation mechanism
  • Unlimited computing capacity
  • All tools are “plug and play” and as simple as possible
  • Visualization analysis

Furthermore, a number of organizational ideals were brought up:

  • The right skills reside in the right place/ with the right people
  • Brokers (machines & humans) help to achieve interoperability

4. Identifying existing barriers

After describing the ‘ideal world’, we asked the participants to go back to reality and identify the most important barriers which – in their view – stop us from achieving the interoperability ideals described above.

In his presentation of the responses to the questionnaire, Rafael Carrasco had already identified the four issues considered to be the most important barriers for the implementation of interoperability:

  • Insufficient expertise by users
  • Insufficient documentation
  • The need to maintain and/ or adapt third party software or webservices
  • Cost of implementation

The following barriers were added by the participants:

Technical issues (in order of relevance)

  • Pace of technological developments/ evolution
  • Legacy systems
  • Persistence; permanent access to data
  • Stabilizing standards

Organizational/ Political issues (in order of relevance)

  • Communication and knowledge management
  • Lack of 21st century skills
  • No willingness to share knowledge
  • “Not invented here”-syndrome
  • Establishment of trust
  • Bridging the innovation gap; responsibility as well as robustness of tools
  • Conflicts of interest between all stakeholders (e.g. different standards)
  • Decision making/ prioritizing
  • Current (EU) funding system hinders interoperability rather than helping it (funding should support interoperability between rather than within projects)

Financial issues (in order of relevance)

  • Return on investment
  • Resources
  • Time
  • Commercial interests often go against interoperability

Legal issues

  • Issues related to Intellectual Property Rights

5. Formulate an agenda: Who should address these issues?

Having identified the most important issues and barriers, we concluded the workshop with an open discussion centred on the question: who should address these issues?

In the responses to the questionnaire, the participants had identified three main groups:

  • Standardization bodies
  • The research community
  • Software developers

During the discussion, the participants added some more concrete examples:

  • Centres of Competence established by the European Commission should support standardization bodies, both by influencing the agenda (facilitating resources) and by helping institutions to find the right experts for their interoperability issues (and vice versa)
  • Governmental institutions, including universities and other educational institutions, should strive to improve education in “21st century skills”, to improve users’ understanding of technical issues

At the end of our workshop, we concluded that, to achieve a real impact on the implementation of interoperability, there needs to be a demand from the side of the users, while the institutions and software developers need to be facilitated both organizationally and financially. Most probably, European centres of competence, such as Impact, have a role to play in this field. This is also most relevant in relation to the Succeed project. One of the project deliverables will be a Roadmap for funding Centres of Competence in work programmes. The role of Centres of Competence in relation to interoperability is one of the topics discussed in this document. As such, the results of the Succeed workshop on interoperability will be used as input for this roadmap.

We would like to thank all participants for their contribution during the workshop and look forward to working with you on interoperability issues in the future!

More pictures on Flickr

National Library of the Netherlands participates in Digitisation Days, Madrid, 19-20 May

On 19 and 20 May, the National Library of the Netherlands (KB) attended the Digitisation Days, which were held at the Biblioteca Nacional in Madrid. The conference was supported by the European Commission, and organised by the Support Action Centre of Competence in Digitisation (Succeed) project and the IMPACT Centre of Competence (IMPACT CoC) with the cooperation of the Biblioteca Nacional de España.

For the National Library, being a collection holder, the Succeed awards ceremony was one of the highlights of the conference, because it showed the application of technology to actual collections. The Succeed awards aim to recognise successful digitisation programmes in the field of historical texts, especially those using the latest technology.

Two prizes went to the Hill Museum and Manuscript Library and the Centre d’Études Supérieures de la Renaissance, while two Commendations of Merit were awarded to the London Metropolitan Archives/University College London and to Tecnilógica.

In her role as member of the IMPACT CoC executive board, the KB’s Head of Research, Hildelies Balk, took part in the ceremony and awarded the Commendation of Merit to the London Metropolitan Archives/University College London for their Great Parchment Book project. A short video about the project is available at http://www.youtube.com/watch?v=WDD2cVT7PeU

Moreover, on 20 May the KB hosted an interesting and fruitful round table workshop on the future of research and funding in digitisation and the possible roles of Centres of Competence. Some 30 librarians and researchers joined this workshop and discussed the following topics:

  • What research is needed to further the development of the Digital Library?
  • How can Centres of Competence assist your research or development?
  • In digitisation, are we ready to move the focus from quantity to quality?
  • What enrichments, e.g. in Named Entity Recognition, Linked Data services, or crowdsourcing for OCR correction, would be most beneficial for digitisation?
  • What’s your take on Labs and Virtual Research Environments?
  • What would you like to do in these types of research settings?
  • What do you expect to get out of them?

The preliminary outcomes of the workshop show that the main goal for institutions is to give users unrestricted access to data. During the workshop, the participants discussed the many-layered aspects of these three topics, i.e. ‘users’, ‘access’, and ‘data’. Moreover, the participants gave their view on the following questions in relation to these topics:

  • What stops us from making progress?
  • What helps us to make progress?
  • And what role could CoCs play in this?

The outcomes of the workshop have been documented and will be used as a starting point for the roadmap to further development of digitisation and the digital library, which will be produced within the Succeed project. This roadmap will serve to support the European Commission in preparing the 2014–2020 Work Programme for Research and Innovation.

 

Working together to improve text digitisation techniques

2nd Succeed hackathon at the University of Alicante

https://www.flickr.com/photos/116354723@N02/13757270124/

Is there anyone still out there who thinks a hackathon is a malicious break-in? Far from it. It is the best way for developers and researchers to get together and work on new tools and innovations. The 2nd developers workshop / hackathon organised on 10-11 April by the Succeed Project was a case in point: bringing together people to work on new ideas and new inspiration for better OCR. The event was held in the “Claude Shannon” aula of the Department of Software and Computing Systems (DLSI) of the University of Alicante, Spain. Claude Shannon was a famous mathematician and engineer and is also known as the “father of information theory”. So it seems like a good place to have a hackathon!

Clemens explains what a hackathon is and what we hope to achieve with it for Succeed.

As we did last year, we provided a wiki up front with some information about possible topics to work on, as well as a number of tools and data that participants could experiment with before and during the event. Unfortunately there was an unexpectedly high number of no-shows this time – we try to keep these events free and open to everyone, but may have to think about charging at least a no-show fee in the future, as places are usually limited. Or did those hackers simply have to stay home to fix the Heartbleed bug on their servers? We will probably never find out.

Collaboration, open source tools, open solutions

Nevertheless, there was a large enough group of programmers and researchers from Germany, Poland, the Netherlands and various parts of Spain eager to immerse themselves deeply into a diverse list of topics. Already in the introduction we agreed to work on open tools and solutions, and quickly identified some areas in which open source tool support for text digitisation is still lacking (see below). Actually, one of the first things we did was to set up a local git repository, and people were pushing code samples, prototypes and interesting projects to share with the group during both days.

https://www.flickr.com/photos/116354723@N02/13775777003/

What’s the status of open source OCR?

Accordingly, Jesús Dominguez Muriel from Digibís (the company that also made http://www.digibis.com/dpla-europeana/) started an investigation into open source OCR tools and frameworks. He made a really detailed analysis of the status of open source OCR, which you can find here. Thanks a lot for that summary, Jesús! At the end of his presentation, Jesús also suggested an “algorithm wikipedia” – I guess something similar to Rosetta Code, but specifically for OCR. This would indeed be very useful for sharing algorithms as well as implementations, and for preventing people from reinventing (or reimplementing) the wheel. Something for our new OCRpedia, perhaps?

A method for assessing OCR quality based on ngrams

As it turned out on the second day, a very promising idea was to use ngrams for assessing the quality of an OCR’ed text, without the need for ground truth. Well, in fact you do still need some correct text to create the ngram model, but one can use texts from e.g. Project Gutenberg or aspell for that. Two groups started to work on this: while Willem Jan Faber from the KB experimented with a simple Python script for that purpose, the group of Rafael Carrasco, Sebastian Kirch and Tomasz Parkola decided to implement this as a new feature in the Java ocrevalUAtion tool (check the work-in-progress “wip” branch).
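To make the idea concrete, here is a minimal Python sketch. It is not the script written at the hackathon nor the ocrevalUAtion code. It assumes character trigrams and a very simple score: the share of trigrams in the OCR output that also occur in some known-good reference text. The function names and the tiny reference sentence are purely illustrative.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character ngrams of a lower-cased text."""
    text = " ".join(text.lower().split())  # collapse runs of whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_model(reference_text, n=3):
    """Count the ngrams that occur in known-good text."""
    return Counter(char_ngrams(reference_text, n))


def quality_score(ocr_text, model, n=3):
    """Fraction of ngrams in ocr_text that also occur in the model (0.0 to 1.0)."""
    grams = char_ngrams(ocr_text, n)
    if not grams:
        return 0.0
    known = sum(1 for gram in grams if gram in model)
    return known / len(grams)


# In practice the reference would be a large corpus, e.g. from Project Gutenberg.
reference = "the quick brown fox jumps over the lazy dog and the dog sleeps"
model = build_model(reference)

clean = quality_score("the lazy dog jumps over the fox", model)
noisy = quality_score("tbe 1azy d0g jurnps ovcr thc f0x", model)
print(clean, noisy)  # the clean line scores higher than the noisy one
```

With a realistic reference corpus, a low score suggests that a page contains many character sequences that are unusual for the language, which is a hint (not proof) of poor OCR.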

https://www.flickr.com/photos/116354723@N02/13775774723/

Jesús in the front, Rafael, Sebastian and Tomasz discussing ngrams in the back.

Aligning text and segmentation results

Another very promising development was started by Antonio Corbi from the University of Alicante. He worked on software to align plain text and segmentation results. The idea is to first identify all the lines in a document, segment them into words and eventually individual characters, and then align the character outlines with the text in the ground truth. This would allow (among other things) creating a large corpus of training material for an OCR classifier based on the more than 50,000 images with ground truth produced in the IMPACT Project, for which correct text is available, but segmentation could only be done on the level of regions. Another great feature of Antonio’s tool is that while he uses D as a programming language, he also makes use of GTK, which has the nice effect that his tool works not only on the desktop, but also as a web application in a browser.
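The simplest case of such an alignment can be sketched in a few lines of Python. This is not Antonio’s tool (which is written in D); it only illustrates the principle, under the assumption that the segmentation of one line is given as a list of words, each a list of character boxes (x, y, width, height), and that it matches the ground truth exactly in word and character counts.

```python
def align_line(word_boxes, ground_truth_line):
    """Pair each segmented character box with its ground-truth character.

    word_boxes: list of words, each word a list of (x, y, width, height) boxes.
    Returns a list of (character, box) pairs, or None if the segmentation
    and the text disagree on the number of words or characters.
    """
    words = ground_truth_line.split()
    if len(words) != len(word_boxes):
        return None
    aligned = []
    for word, boxes in zip(words, word_boxes):
        if len(word) != len(boxes):
            return None
        aligned.extend(zip(word, boxes))
    return aligned


# Hypothetical segmentation of a line containing a 2-letter and a 3-letter word.
boxes = [
    [(0, 0, 5, 8), (6, 0, 5, 8)],
    [(20, 0, 5, 8), (26, 0, 5, 8), (32, 0, 5, 8)],
]
print(align_line(boxes, "to err"))
print(align_line(boxes, "to be"))  # None: the second word has only 2 letters
```

Each resulting pair is a labelled character image, which is exactly the kind of training material an OCR classifier needs. Lines where the counts do not match would need a more tolerant alignment than this sketch provides.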

OCR is complicated, but don’t worry – we’re on it!

Gustavo Candela works for the Biblioteca Virtual Miguel de Cervantes, the largest digital library in the Spanish-speaking world. Usually he is busy with Linked Data and things like FRBR, so he was happy to expand his knowledge and learn about the various processes involved in OCR and what tools and standards are commonly used. His findings: there is a lot more complexity involved in OCR than appears at first sight. And again, for some problems it would be good to have more open source tool support.

In fact, at the same time as the hackathon, the ‘Mining Digital Repositories’ conference was taking place at the KB in The Hague, where the problem of bad OCR was discussed from a scholarly perspective. There, too, the need for more open technologies and methods was apparent (see tweet 454528200572682241).

Open source border detection

One of the many technologies for text digitisation that are available in the IMPACT Centre of Competence for image pre-processing is Border Removal. This technique is typically applied to remove black borders that have been captured in a digital image while scanning a document. The borders don’t contain any information, yet they take up expensive storage space, so removing the borders without removing any other relevant information from a scanned document page is a desirable thing to do. However, there is no simple open source tool or implementation for doing that at the moment. So Daniel Torregrosa from the University of Alicante started to research the topic. After some quick experiments with tools like ImageMagick and unpaper, he eventually decided to work on his own algorithm. You can find the source here. Besides, he probably deserves the award for the best slide in a presentation…showing us two black pixels on a white background!
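For readers who wonder what border removal boils down to, here is a naive Python sketch. It is not Daniel’s algorithm. It assumes a greyscale image given as a list of rows of values from 0 (black) to 255 (white), and it simply trims rows and columns at the edges in which almost all pixels are dark. The two thresholds are arbitrary choices for illustration.

```python
def is_border(line, dark_threshold=50, dark_fraction=0.9):
    """A row or column counts as border if almost all of its pixels are dark."""
    dark = sum(1 for value in line if value < dark_threshold)
    return dark >= dark_fraction * len(line)


def crop_black_border(image):
    """Return the image without the dark rows and columns at its edges.

    image: list of rows, each a list of grey values (0 = black, 255 = white).
    """
    rows = [r for r, row in enumerate(image) if not is_border(row)]
    if not rows:
        return []  # the whole image is dark
    top, bottom = rows[0], rows[-1]
    kept_rows = image[top:bottom + 1]
    columns = list(zip(*kept_rows))  # transpose to inspect the columns
    cols = [c for c, column in enumerate(columns) if not is_border(column)]
    if not cols:
        return []
    left, right = cols[0], cols[-1]
    return [list(row[left:right + 1]) for row in kept_rows]


# A tiny "scan": a white page with one black text pixel, surrounded by black.
scan = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 255, 255, 255, 0],
    [0, 0, 255, 0, 255, 0],
    [0, 0, 255, 255, 255, 0],
    [0, 0, 0, 0, 0, 0],
]
for row in crop_black_border(scan):
    print(row)
```

Only the outermost dark rows and columns are trimmed, so dark content inside the page is kept. Real scans are harder: borders are noisy, skewed and rarely perfectly black, which is presumably why a dedicated algorithm was needed.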

A great venue

All in all, I think we can really be quite happy with these results. And indeed the University of Alicante also did a great job hosting us – there was an excellent internet connection available via cable and wifi, plenty of space and tables to discuss in groups and we were distant enough from the classrooms not to be disturbed by the students or vice versa. Also at any time there was excellent and light Spanish food – Gazpacho, Couscous with vegetables, assorted Montaditos, fresh fruit…nowadays you won’t make hackers happy with just pizza anymore! Of course there were also ice-cooled drinks and hot coffee, and rumours spread that there were also some (alcohol-free?) beers in the cooler, but (un)fortunately there is no more documentary evidence of that…

To be continued!

If you want to try out any of the software yourself, just visit our GitHub and have a go! Make sure to also take a look at the videos that were made with participants Jesús, Sebastian and Tomasz, explaining their intentions and expectations for the hackathon. And at the next hackathon, maybe we can welcome you too amongst the participants?

Succeed Project launched

Author: Clemens Neudecker
Originally posted on: http://www.openplanetsfoundation.org/blogs/2013-02-05-succeed-project-launched

The kick-off meeting of the Succeed project (http://www.succeed-project.eu) took place on Friday 1 February in Paris.

Succeed is a project coordinated by the Universidad de Alicante and supported by the European Commission with a contribution of €1.8 million.


The core objective of Succeed is to promote the take-up of the research results generated by technological companies and research centres in Europe in a strategic field for Europe: digitisation and preservation of its cultural heritage.


Succeed will foster the take-up of the most recent tools and techniques by libraries, museums and archives through the organisation of meetings of experts in digitisation, competitions to evaluate techniques, technical conferences to broadcast results and through the maintenance of an online platform for the demonstration and evaluation of tools.


Succeed will contribute in this way to the coordination of efforts for the digitisation of cultural heritage and to the standardisation of procedures. It will also propose measures to the European Union to foster the dissemination of European knowledge through centres of competence in digitisation, such as Open Planets Foundation, PrestoCentre, APARSEN, 3D-COFORM Virtual Competence Centre, and V-MusT.net.


In addition to the University of Alicante, the consortium includes the following European institutions: the National Library of the Netherlands, the Dutch Institute of Lexicology, the Fraunhofer Gesellschaft, the Poznań Supercomputing Centre, the University of Salford, the Foundation Biblioteca Virtual Miguel de Cervantes Saavedra, the French National Library and the British Library.


For additional information, please contact Rafael Carrasco (Universidad de Alicante) or send an email to succeed@ua.es.


© 2018 KB Research
