Extrablatt! Final Report Europeana Newspapers published!

Reblogged from http://www.europeana-newspapers.eu/final-report/ 

All things must come to an end eventually – even the Europeana Newspapers project. The good news is that in every end, there is also a new beginning. But more on this later.

After 38 months of hard but also very fun and rewarding work with our network, the project officially came to a close in March 2015. But as usual with such endeavours, many activities continue beyond the lifetime of the project, such as reporting and reviewing, disseminating and spreading the results, and fostering take-up and new initiatives around the use and exploitation of said outcomes.

What’s happening with our digitised newspapers?

The KB has about 10 million digitised newspaper pages, ranging from 1650 to 1995. We negotiated the rights to make these pages available for research, and over the past years more and more research projects have made use of them. However, we thought that many of these projects might be interested in knowing what others are doing, and we wanted to provide a networking opportunity for them to share their results. This is why we organised a newspapers symposium focusing on the digitised newspapers of the KB, which was a great success!

Prof. dr. Huub Wijfjes (RUG/UvA) showing word clouds used in his research.

Succeed project rated ‘Excellent’ by European Commission

Author: Silvia Ponzoda
This post is a summary. The original article is available at: http://www.digitisation.eu/blog/european-commission-rated-excellent-succeed-project-results/

The Succeed project has recently been rated ‘Excellent’ by the European Commission. The final evaluation of the Succeed project took place on 19 February 2015 at the University of Alicante, during a meeting of the committee of experts appointed by the European Commission (EC) with the Succeed consortium members. The meeting was chaired by Cristina Maier, Succeed Project Officer at the European Commission.
Succeed has been funded by the European Union to promote the take-up and validation of research results in mass digitisation, with a focus on textual content. For a description of the project and the consortium, see our earlier post Succeed project launched.

The outputs produced by Succeed during the project life span (January 2013-December 2014) are listed below.

Take-up of tools within the Succeed project: Implementation of the INL Lexicon Service in Delpher.nl

Author: Judith Rog

Delpher

Delpher is a joint initiative of the Meertens Institute for research and documentation of Dutch language and culture, the university libraries of Amsterdam (UvA), Groningen, Leiden and Utrecht, and the National Library of the Netherlands to bring together otherwise fragmented access to digitised historical text corpora.

Delpher currently contains over 90,000 books, over 1 million newspapers comprising more than 10 million pages, over 1.5 million pages from periodicals, and 1.5 million ANP news bulletins, all of them full-text searchable. New content will continually be added in the coming years.

Succeed technical workshop on the interoperability of digitisation platforms

Succeed Interoperability Workshop
2 October 2014, National Library of the Netherlands, The Hague

Speaking the same language is one thing, understanding what the other is saying is another… – Carl Wilson at the Succeed technical workshop on the interoperability of digitisation platforms.

Interoperability is a term widely used to describe the ability of systems and organisations to work together (inter-operate). However, interoperability is not just about the technical requirements for information exchange. In a broader definition, it also includes the social, political, and organisational factors that impact system-to-system performance, and it is related to questions of (commercial) power and market dominance (see http://en.wikipedia.org/wiki/Interoperability).

On 2 October 2014, the Succeed project organised a technical workshop on the interoperability of digitisation platforms at the National Library of the Netherlands in The Hague. Nineteen researchers, librarians, and computer scientists from several European countries participated in the workshop (see SUCCEED Interoperability Workshop_Participants). In preparation for the workshop, the Succeed project team asked participants to fill out a questionnaire containing several questions on the topic of interoperability. The questionnaire was filled out by 12 participants; the results were presented during the workshop. The programme included a number of presentations and several interactive sessions to arrive at a shared view on what interoperability is about, what the main issues and barriers are, and how we should approach them.

The main goals of the workshop were:

  1. Establishing a baseline for interoperability based on the questionnaire and presentations of the participants
  2. Formulating a common statement on the value of interoperability
  3. Defining the ideal situation with regard to interoperability
  4. Identifying the most important barriers
  5. Formulating an agenda

1. Baseline

Presentation by Carl Wilson

To establish a baseline (what is interoperability and what is its current status in relation to digitisation platforms), our programme included a number of presentations. We invited Carl Wilson of the Open Preservation Foundation (previously the Open Planets Foundation) to give the opening speech. He set the scene by sharing a number of historical examples (in IT and beyond) of interoperability issues. Carl made clear that interoperability in IT has many dimensions:

  1. Technical dimensions
    Within the technical domain, two types of interoperability can be discerned: syntactic interoperability (aligning metadata formats), i.e. “speaking the same language”, and semantic interoperability, i.e. “understanding each other”.
  2. Organizational/ Political dimensions
  3. Legal (IPR) dimensions
  4. Financial dimensions

When approaching interoperability issues, it might help to take into account these basic rules:

  • Simplicity
  • Standards
  • Clarity
  • Test early (automated testing, virtualisation)
  • Test often

Finally, Carl stressed that the importance of interoperability will further increase with the rise of the Internet of Things, as it involves more frequent information exchange between more and more devices.

The Succeed Interoperability platform

After Carl Wilson’s introductory speech, Enrique Molla from the University of Alicante (UA is the project leader of the Succeed project) presented the Succeed Interoperability framework, which allows users to test and combine a number of digitisation tools. The tools are made available as web services by a number of different providers, which allows users to try them out online without having to install any of the tools locally. The Succeed project ran into a number of interoperability-related issues when developing the platform. For instance, the web services come from a number of different suppliers, some of whom do not maintain their services. Moreover, the providers of the web services often have commercial interests, which means that they impose limits such as a maximum number of users or of pages tested through the tools.

Presentations by participants

After the demonstration of the Succeed Interoperability platform, the floor was open for the other participants, many of whom had prepared a presentation about their own project and their experience with issues of interoperability.

Bert Lemmens presented the first results of the Preforma Pre-Commercial Procurement project (running January 2014 to December 2017). A survey performed by the project made clear that (technically) open formats are in many cases not the same as libre/ open source formats. Moreover, even when standard formats are used across different projects, they are often implemented in multiple ways. And finally, when a project or institution has found their technically appropriate format, they may often find that limited support is available on how to adopt the format.

Gustavo Candela Romero gave an overview of the services provided by the Biblioteca Virtual Miguel de Cervantes (BVMC). The BVMC developed their service-oriented architecture with the purpose of facilitating online access to Hispanic culture. The BVMC offers their data via OAI-PMH, allowing other institutions and researchers to harvest their content. Moreover, the BVMC is working towards publishing their resources as RDF and making them available through a SPARQL endpoint.
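To give a feel for what harvesting such an OAI-PMH interface looks like in practice, here is a minimal Python sketch that uses only the standard OAI-PMH protocol; the endpoint URL is a placeholder rather than the BVMC’s documented address.

```python
# Minimal sketch: harvesting Dublin Core records from an OAI-PMH endpoint.
# The endpoint URL is a placeholder; any OAI-PMH repository works the same way.
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def list_titles(base_url, metadata_prefix="oai_dc"):
    """Yield record titles from the first page of a ListRecords response."""
    response = requests.get(base_url, params={
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
    })
    response.raise_for_status()
    root = ET.fromstring(response.content)
    for record in root.iter(OAI + "record"):
        title = record.find(".//" + DC + "title")
        if title is not None:
            yield title.text

if __name__ == "__main__":
    for title in list_titles("https://example.org/oai"):  # placeholder endpoint
        print(title)
```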

Alastair Dunning and Pavel Kats explained how Europeana and The European Library are working towards a shared storage system for aggregators, with shared tools for the ingestion and mapping process. This will have practical and financial benefits, as shared tools reduce workflow complexity, are easier to sustain and are ultimately cheaper.

Clara Martínez Cantón presented the work of the Digital Humanities Innovation Lab (LINHD), the research centre on Digital Humanities at the National Distance Education University (UNED) in Spain. The LINHD encourages researchers to make use of Linked Data. Clara showed the advantages of using Linked Data in a number of research projects related to metrical repertoires. In these projects, a number of interoperability issues (such as differing data structures, different systems used, and variation in levels of access) were bypassed by making use of a Linked Data model.

Marc Kemps-Snijders made clear how the Meertens Institute strives to make collections and technological advances available to the research community and the general public by providing technical support and developing applications. Moreover, the Meertens Institute is involved in a number of projects related to interoperability, such as Nederlab and CLARIN.

Menzo Windhouwer further elaborated on the projects deployed by CLARIN (Common Language Resources and Technology Infrastructure). CLARIN is a European collaborative effort to create, coordinate and make language resources and technology available and readily useable. CLARIN is involved in setting technical standards and creating recommendations for specific topics. CLARIN has initiated the Component MetaData Infrastructure (CMDI), which is an integrated semantic layer to achieve semantic interoperability and overcome the differences between different metadata structures.

Presentation of responses to the Succeed questionnaire and overview of issues mentioned

To wrap up the first part of the programme, and to present an overview of the experiences and issues described by the participants, Rafael Carrasco from the University of Alicante presented the results of the Succeed online questionnaire (see also below).

Most institutions which filled out the questionnaire made clear that they are already addressing interoperability issues. They are mainly focusing on technical aspects, such as the normalization of resources or data and the creation of an interoperable architecture and interface. The motives for striving for interoperability were threefold: there is a clear demand by users; interoperability means an improved quality of service; and interoperability through cooperation with partner institutions brings many benefits to the institutions themselves. The most important benefits mentioned were: to create a single point of access (i.e., a better service to users), and to reduce the cost of software maintenance.

Tomasz Parkola and Sieta Neuerburg proceeded by recapping the issues raised in the presentations. Clearly, all issues mentioned by participants could be placed in one of the dimensions introduced by Carl Wilson, i.e. Technical, Organizational/ Political, Financial, or Legal.

2. What is the value of interoperability?

Having established our baseline of the current status of interoperability, the afternoon programme of the workshop included a number of interactive sessions, led by Irene Haslinger of the National Library of the Netherlands. To start off, we asked the participants to write down their notion of the value of interoperability.

The following topics were brought up:

  • Increased synergy
  • More efficient/ effective allocation of resources
  • Cost reduction
  • Improved usability
  • Improved data accessibility

3. How would you define the ideal situation with regard to interoperability?

After defining the value of interoperability, the participants were asked to describe their ‘ideal situation’.

The participants mainly mentioned their technical ideals, such as:

  • Real time/ reliable access to data providers
  • Incentives for data publishing for researchers
  • Improved (meta)data quality
  • Use of standards
  • Ideal data model and/ or flexibility in data models
  • Only one exchange protocol
  • Automated transformation mechanism
  • Unlimited computing capacity
  • All tools are “plug and play” and as simple as possible
  • Visualization analysis

Furthermore, a number of organizational ideals were brought up:

  • The right skills reside in the right place/ with the right people
  • Brokers (machines & humans) help to achieve interoperability

4. Identifying existing barriers

After describing the ‘ideal world’, we asked the participants to go back to reality and identify the most important barriers which – in their view – stop us from achieving the interoperability ideals described above.

In his presentation of the responses to the questionnaire, Rafael Carrasco had already identified the four issues considered to be the most important barriers for the implementation of interoperability:

  • Insufficient expertise by users
  • Insufficient documentation
  • The need to maintain and/ or adapt third party software or webservices
  • Cost of implementation

The following barriers were added by the participants:

Technical issues (in order of relevance)

  • Pace of technological developments/ evolution
  • Legacy systems
  • Persistence; permanent access to data
  • Stabilizing standards

Organizational/ Political issues (in order of relevance)

  • Communication and knowledge management
  • Lack of 21st century skills
  • No willingness to share knowledge
  • “Not invented here”-syndrome
  • Establishment of trust
  • Bridging the innovation gap; responsibility as well as robustness of tools
  • Conflicts of interest between all stakeholders (e.g. different standards)
  • Decision making/ prioritizing
  • Current (EU) funding system hinders interoperability rather than helping it (funding should support interoperability between rather than within projects)

Financial issues (in order of relevance)

  • Return on investment
  • Resources
  • Time
  • Commercial interests often go against interoperability

Legal issues

  • Issues related to Intellectual Property Rights

5. Formulate an agenda: Who should address these issues?

Having identified the most important issues and barriers, we concluded the workshop with an open discussion centred on the question: who should address these issues?

In the responses to the questionnaire, the participants had identified three main groups:

  • Standardization bodies
  • The research community
  • Software developers

During the discussion, the participants added some more concrete examples:

  • Centres of Competence established by the European Commission should support standardization bodies both by influencing the agenda (facilitating resources) and by helping institutions to find the right experts for their interoperability issues (and vice versa)
  • Governmental institutions, including universities and other educational institutions, should strive to improve education in “21st century skills”, to improve users’ understanding of technical issues

At the end of our workshop, we concluded that, to achieve a real impact on the implementation of interoperability, there needs to be demand from the side of the users, while the institutions and software developers need to be facilitated both organizationally and financially. Most probably, European Centres of Competence, such as IMPACT, have a role to play in this field. This is also most relevant in relation to the Succeed project: one of the project deliverables will be a Roadmap for funding Centres of Competence in work programmes. The role of Centres of Competence in relation to interoperability is one of the topics discussed in this document. As such, the results of the Succeed workshop on interoperability will be used as input for this roadmap.

We would like to thank all participants for their contribution during the workshop and look forward to working with you on interoperability issues in the future!

More pictures on Flickr

National Library of the Netherlands participates in Digitisation Days, Madrid, 19-20 May

On 19 and 20 May, the National Library of the Netherlands (KB) visited the Digitisation Days which were held at the Biblioteca Nacional in Madrid. The conference was supported by the European Commission, and organised by the Support Action Centre of Competence in Digitisation (Succeed) project  and the IMPACT Centre of Competence (IMPACT CoC) with the cooperation of Biblioteca Nacional de España.

For the National Library, being a collection holder, the Succeed awards ceremony was one of the highlights of the conference, because it showed the application of technology to actual collections. The Succeed awards aim to recognise successful digitisation programmes in the field of historical texts, especially those using the latest technology.

Two prizes went to the Hill Museum and Manuscript Library and the Centre d’Études Supérieures de la Renaissance, while two Commendations of Merit were awarded to the London Metropolitan Archives/ University College London  and to Tecnilógica.

In her role as a member of the IMPACT CoC executive board, the KB’s Head of Research, Hildelies Balk, took part in the ceremony and awarded the Commendation of Merit to the London Metropolitan Archives/ University College London for their Great Parchment Book project. You will find a short video about the project here.

Moreover, on 20 May the KB hosted an interesting and fruitful round-table workshop on the future of research and funding in digitisation and the possible roles of Centres of Competence. Some 30 librarians and researchers joined this workshop and discussed the following topics:

  • What research is needed to further the development of the Digital Library?
  • How can Centres of Competence assist your research or development?
  • In digitisation, are we ready to move the focus from quantity to quality?
  • What enrichments, e.g. in Named Entity Recognition, Linked Data services, or crowdsourcing for OCR correction, would be most beneficial for digitisation?
  • What’s your take on Labs and Virtual Research Environments?
  • What would you like to do in these types of research settings?
  • What do you expect to get out of them?

The preliminary outcomes of the workshop show that the main goal for institutions is to give users unrestricted access to data. During the workshop, the participants discussed the many-layered aspects of these three topics, i.e. ‘users’, ‘access’, and ‘data’. Moreover, the participants gave their view on the following questions in relation to these topics:

  • What stops us from making progress?
  • What helps us to make progress?
  • And what role could CoCs play in this?

The outcomes of the workshop have been documented and will be used as a starting point for the roadmap to further development of digitisation and the digital library, which will be produced within the Succeed project. This roadmap will serve to support the European Commission in preparing the 2014–2020 Work Programme for Research and Innovation.

 

‘We learn so much from each other’ – Hildelies Balk about the Digitisation Days (19-20 May)

The Digitisation Days will take place in Madrid on 19-20 May. What can you expect from them and why should you go? In order to get answers to these questions, we interviewed Hildelies Balk of the National Library of the Netherlands (KB), who is also a member of the executive board of the organising institution, the IMPACT Centre of Competence (IMPACT CoC). – Interview and photo by Inge Angevaare

Hildelies Balk in the National Library’s Reading Rooms

The Digitisation Days will be of interest to …?

‘Anyone who is working with digitised historical texts. These are often difficult to use because the software cannot decipher damaged originals or illegible characters. For example:

example OCR historical text

‘The software used to ‘read’ this (Dutch) text produces the following result:

VVt Venetien den 1.Junij, Anno 1618.
DJgn i f paffato te S’ aö’Jifeert mo?üen/bah
.)etgi’uotbciraetail)i.r/JtmelchontDecht
te / sbnbe bele btr felbrr geiufttceert baer bnber
eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu
enbeeemgljen bifet Cbeiiupcen berbonbru befe

‘The Dutch National Library and many other libraries are striving to make these types of historical text more usable to researchers by enhancing the quality of the OCR (optical character recognition). Since 2008, we have been involved in European projects set up to improve the usability of OCR’d texts – preferably automatically. The IMPACT Centre of Competence as well as the Digitisation Days are quite unique in that they bring together three interest groups:

  • institutions with digitised collections (libraries, archives, museums)
  • researchers working on means to improve access to digitised text (image recognition, pattern recognition, language technology)
  • companies providing products and services in the field of digitisation and OCR.

‘Representatives of all of these groups will be taking part in the Digitisation Days and they offer participants a complete overview of the state of the art in document analysis, language technology and post-correction of OCR.’

What are the most important benefits from the Centre of Competence and the Digitisation Days, in your opinion?

‘The IMPACT Centre of Competence assists heritage institutions in taking important decisions. We evaluate available tools and report about them. Evaluation software of good quality is available as well. We also provide institutions with guidance and advice in digitisation issues by answering questions such as: what would be the best tools and methods for this particular institution? What quality can you expect from a solution? And what will it cost?’

‘The Digitisation Days offer a perfect opportunity for heritage institutions to get together and share experience and knowledge on issues such as: how to embed digitisation in your institution? How to deal with providers? Also: how do we start up new projects? Where do we find funding? On the second day, those who are interested are invited to join a workshop on the topic of the research agenda for digitisation. What should be the focus for the coming years? Should we focus on quantity or quality? How can we help shape European plans and budgets?’

Now that you mention Europe: IMPACT, IMPACT Centre of Competence, SUCCEED – the announcement of the Digitisation Days is packed with acronyms. Can you give us a bit of help here?

‘IMPACT was the first European research project aimed at improving access to historical texts. It started in 2008, at the initiative of, among others, the Dutch KB. When the project ended, a number of IMPACT partners set up the IMPACT Centre of Competence to ensure that the project results would be supported and developed. The Centre is not a project, but a standing organisation.’

‘Succeed is another European project and, by definition, temporary. The objectives are in line with those of the IMPACT CoC, and the project involves some of the same partners. The aim is to raise awareness of the results of European projects related to the digital library and to stimulate their implementation. Before the CoC, it was not uncommon for prototypes to be left as they were after completion of a project, so the investments did not pay off.’

Will you really turn theory into practice?

‘Yes, most definitely! It is our prime focus for the conference. This is why we instituted the Succeed awards which will be handed out during the Digitisation Days; the Succeed awards recognise the best implementations of innovative technologies. The board has recently announced the winners.’

What do you personally look forward to most during the Digitisation Days?

‘To meeting everybody, to bringing together all these different parties. Colleagues from other institutions, researchers – this is exactly the right kind of meeting for generating exciting ideas and solutions.’

Working together to improve text digitisation techniques

2nd Succeed hackathon at the University of Alicante

Is there anyone still out there who thinks a hackathon is a malicious break-in? Far from it. It is the best way for developers and researchers to get together and work on new tools and innovations. The 2nd developers workshop / hackathon organised on 10-11 April by the Succeed project was a case in point: bringing people together to work on new ideas and new inspiration for better OCR. The event was held in the “Claude Shannon” lecture hall of the Department of Software and Computing Systems (DLSI) of the University of Alicante, Spain. Claude Shannon was a famous mathematician and engineer, also known as the “father of information theory”. So it seems like a good place to have a hackathon!

Clemens explains what a hackathon is and what we hope to achieve with it for Succeed.

As last year, we again provided a wiki up front with some information about possible topics to work on, as well as a number of tools and data that participants could experiment with before and during the event. Unfortunately there was an unexpectedly high number of no-shows this time – we try to keep these events free and open to everyone, but may have to think about charging at least a no-show fee in the future, as places are usually limited. Or did those hackers simply have to stay home to fix the Heartbleed bug on their servers? We will probably never find out.

Collaboration, open source tools, open solutions

Nevertheless, there was a large enough group of programmers and researchers from Germany, Poland, the Netherlands and various parts of Spain, eager to immerse themselves in a diverse list of topics. Already in the introduction we agreed to work on open tools and solutions, and we quickly identified some areas in which open source tool support for text digitisation is still lacking (see below). In fact, one of the first things we did was to set up a local git repository, and people were pushing code samples, prototypes and interesting projects to share with the group during both days.

What’s the status of open source OCR?

Accordingly, Jesús Dominguez Muriel from Digibís (the company that also made http://www.digibis.com/dpla-europeana/) started an investigation into open source OCR tools and frameworks. He made a really detailed analysis of the status of open source OCR, which you can find here. Thanks a lot for that summary, Jesús! At the end of his presentation, Jesús also suggested an “algorithm wikipedia” – I guess something similar to RosettaCode, but specifically for OCR. This would indeed be very useful for sharing algorithms as well as implementations, and for preventing reinventing (or reimplementing) the wheel. Something for our new OCRpedia, perhaps?
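As a quick illustration of how little code it takes to run an open source OCR engine, here is a minimal sketch based on Tesseract via the pytesseract wrapper; Tesseract is just one example engine (the full analysis is linked above), and the file name and language are placeholders.

```python
# Minimal sketch: running the open source Tesseract OCR engine on a scanned
# page via the pytesseract wrapper. Requires a local Tesseract installation
# plus `pip install pytesseract pillow`. File name and language are placeholders.
from PIL import Image
import pytesseract

def ocr_page(image_path, lang="spa"):
    """Return the plain-text OCR result for one page image."""
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)

if __name__ == "__main__":
    print(ocr_page("scanned_page.png"))
```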

A method for assessing OCR quality based on ngrams

As it turned out on the second day, a very promising idea was to use ngrams for assessing the quality of an OCR’ed text, without the need for ground truth. Well, in fact you do still need some correct text to create the ngram model, but one can use texts from e.g. Project Gutenberg or aspell for that. Two groups started to work on this: while Willem Jan Faber from the KB experimented with a simple Python script for this purpose, the group of Rafael Carrasco, Sebastian Kirch and Tomasz Parkola decided to implement it as a new feature in the Java ocrevalUAtion tool (check the work-in-progress “wip” branch).

Jesús in the front, Rafael, Sebastian and Tomasz discussing ngrams in the back.
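The actual hackathon code lives in the repositories mentioned above; as a rough sketch of the underlying idea (not that code), one can score OCR output by how many of its character ngrams also occur in a model built from clean reference text:

```python
# Minimal sketch: estimating OCR quality without ground truth by checking how
# many character trigrams of the OCR'ed text occur in a model built from clean
# reference text (e.g. downloaded from Project Gutenberg).
def char_ngrams(text, n=3):
    text = " ".join(text.lower().split())  # normalise whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_model(reference_text, n=3):
    """Set of character n-grams observed in clean reference text."""
    return set(char_ngrams(reference_text, n))

def ocr_quality(ocr_text, model, n=3):
    """Fraction of the OCR text's n-grams known to the model (0..1)."""
    grams = char_ngrams(ocr_text, n)
    if not grams:
        return 0.0
    return sum(g in model for g in grams) / len(grams)

if __name__ == "__main__":
    model = build_model("the quick brown fox jumps over the lazy dog " * 50)
    print(ocr_quality("the quick brown fox", model))   # high score
    print(ocr_quality("tbe qvick brovvn f0x", model))  # lower score
```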

Aligning text and segmentation results

Another very promising development was started by Antonio Corbi from the University of Alicante. He worked on software to align plain text and segmentation results. The idea is to first identify all the lines in a document, segment them into words and eventually individual characters, and then align the character outlines with the text in the ground truth. This would allow (among other things) the creation of a large corpus of training material for an OCR classifier, based on the more than 50,000 images with ground truth produced in the IMPACT project, for which correct text is available but segmentation could only be done at the level of regions. Another great feature of Antonio’s tool is that while he uses D as a programming language, he also makes use of GTK, which has the nice effect that his tool works not only on the desktop but also as a web application in a browser.
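As a much simplified sketch of the alignment step (not Antonio’s actual tool, which works down to character outlines), the following pairs segmented word boxes carrying OCR text with the words of a ground-truth line:

```python
# Minimal sketch: pairing segmented word boxes that carry OCR text with the
# words of a ground-truth line, using difflib to align the two word sequences.
# Boxes whose text matches a ground-truth word could then serve as training
# material for an OCR classifier.
import difflib

def align_boxes_to_ground_truth(boxes, ground_truth_line):
    """boxes: list of (ocr_word, (x, y, w, h)); returns (gt_word, box) pairs."""
    ocr_words = [word for word, _ in boxes]
    gt_words = ground_truth_line.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=gt_words, autojunk=False)
    pairs = []
    for a_start, b_start, size in matcher.get_matching_blocks():
        for offset in range(size):
            pairs.append((gt_words[b_start + offset], boxes[a_start + offset][1]))
    return pairs

if __name__ == "__main__":
    boxes = [("VVt", (10, 5, 30, 12)), ("Venetien", (45, 5, 70, 12)), ("dcn", (120, 5, 25, 12))]
    print(align_boxes_to_ground_truth(boxes, "VVt Venetien den 1.Junij"))
```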

OCR is complicated, but don’t worry – we’re on it!

Gustavo Candela works for the Biblioteca Virtual Miguel de Cervantes, the largest digital library in the Spanish-speaking world. Usually he is busy with Linked Data and things like FRBR, so he was happy to expand his knowledge and learn about the various processes involved in OCR and which tools and standards are commonly used. His findings: there is a lot more complexity involved in OCR than appears at first sight. And again, for some problems it would be good to have more open source tool support.

In fact, at the same time as the hackathon, the ‘Mining Digital Repositories‘ conference was taking place at the KB in The Hague, where the problem of bad OCR was discussed from a scholarly perspective. There, too, the need for more open technologies and methods was apparent.

Open source border detection

One of the many technologies for text digitisation that are available in the IMPACT Centre of Competence for image pre-processing is Border Removal. This technique is typically applied to remove black borders in a digital image that have been captured while scanning a document. The borders don’t contain any information, yet they take up expensive storage space, so removing the borders without removing any other relevant information from a scanned document page is a desirable thing to do. However, there is no simple open source tool or implementation for doing that at the moment. So Daniel Torregrosa from the University of Alicante started to research the topic. After some quick experiments with tools like imagemagick and unpaper, he eventually decided to work on his own algorithm. You can find the source here. Besides, he probably earns the award for the best slide in a presentation…showing us two black pixels on a white background!
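As a rough sketch of the basic idea (not Daniel’s implementation, and far from production quality), one can binarise the scan and crop away outer rows and columns that are almost entirely black:

```python
# Minimal sketch of naive black-border removal: binarise the scan, then crop
# away outer rows/columns that are almost entirely black. Real border removal
# must also handle skew, noise and partial borders. Requires `pip install pillow numpy`.
import numpy as np
from PIL import Image

def remove_black_border(path, black_threshold=64, black_ratio=0.85):
    img = Image.open(path).convert("L")          # greyscale
    pixels = np.asarray(img) < black_threshold   # True where pixel is "black"

    def is_border_line(line):
        return line.size > 0 and line.mean() > black_ratio  # mostly black

    top, bottom = 0, pixels.shape[0]
    left, right = 0, pixels.shape[1]
    while top < bottom and is_border_line(pixels[top]):
        top += 1
    while bottom > top and is_border_line(pixels[bottom - 1]):
        bottom -= 1
    while left < right and is_border_line(pixels[top:bottom, left]):
        left += 1
    while right > left and is_border_line(pixels[top:bottom, right - 1]):
        right -= 1
    return img.crop((left, top, right, bottom))

if __name__ == "__main__":
    remove_black_border("scanned_page.png").save("scanned_page_cropped.png")
```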

A great venue

All in all, I think we can really be quite happy with these results. And indeed the University of Alicante also did a great job hosting us – there was an excellent internet connection available via cable and wifi, plenty of space and tables to discuss in groups and we were distant enough from the classrooms not to be disturbed by the students or vice versa. Also at any time there was excellent and light Spanish food – Gazpacho, Couscous with vegetables, assorted Montaditos, fresh fruit…nowadays you won’t make hackers happy with just pizza anymore! Of course there were also ice-cooled drinks and hot coffee, and rumours spread that there were also some (alcohol-free?) beers in the cooler, but (un)fortunately there is no more documentary evidence of that…

To be continued!

If you want to try out any of the software yourself, just visit our GitHub and have a go! Make sure to also take a look at the videos that were made with participants Jesús, Sebastian and Tomasz, explaining their intentions and expectations for the hackathon. And at the next hackathon, maybe we can welcome you too amongst the participants?

Named entity recognition for digitised historical newspapers

The refinement partners in the Europeana Newspapers project will produce the astonishing amount of 10 million pages of full-text from historical newspapers from all over Europe. What could be done to further enrich that full-text?

The KB National Library of the Netherlands has been investigating named entity recognition (NER) and linked data technologies for a while now in projects such as IMPACT and STITCH+, and we felt it was about time to approach this on a production scale. So we decided to produce (open source) software, trained models as well as raw training data for NER software applications specifically for digitised historical newspapers as part of the project.

What is named entity recognition (NER)?

Named entity recognition is the process of identifying and classifying entities such as persons, locations and organisations in full-text in order to enhance searchability. There are basically two types of approach: a statistical and a rule-based one. Rule-based systems rely mostly on grammar rules defined by linguists, while statistical systems require large amounts of manually produced training data that they can learn from. While both approaches have their benefits and drawbacks, we decided to go for a statistical tool, the CRFNER system from Stanford University. In comparison, this software proved to be the most reliable, and it is supported by an active user community. Stanford University has an online demo where you can try it out: http://nlp.stanford.edu:8080/ner/.

Example of Wikipedia article for Albert Einstein, tagged with the Stanford NER tool
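For a feel of how the Stanford tagger is typically invoked, here is a minimal sketch using NLTK’s wrapper around the Stanford NER jar; the model and jar paths are placeholders, and a Dutch, French or German model would be plugged in the same way.

```python
# Minimal sketch: tagging a sentence with a Stanford NER (CRF) model via NLTK's
# wrapper. Requires a Java runtime plus the Stanford NER download; the model
# and jar paths below are placeholders.
from nltk.tag.stanford import StanfordNERTagger

MODEL = "classifiers/english.all.3class.distsim.crf.ser.gz"  # placeholder path
JAR = "stanford-ner.jar"                                      # placeholder path

tagger = StanfordNERTagger(MODEL, JAR, encoding="utf-8")
tokens = "Albert Einstein was born in Ulm".split()
print(tagger.tag(tokens))
# e.g. [('Albert', 'PERSON'), ('Einstein', 'PERSON'), ..., ('Ulm', 'LOCATION')]
```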

Requirements & challenges

There are some particular requirements and challenges when applying these techniques to digital historical newspapers. Since full-text for more than 10 million pages will be produced in the project, one requirement for our NER tool was that it should be able to process large amounts of text in a rather short time. This is possible with the Stanford tool, which as of version 1.2.8 is “thread-safe”, i.e. it can run in parallel on a multi-core machine. Another requirement was to preserve the information about where on a page a named entity has been detected, based on coordinates. This is particularly important for newspapers: instead of having to go through all the articles on a newspaper page to find the named entity, it can be highlighted so that one can easily spot it even on very dense pages.
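To illustrate the coordinate requirement, here is a minimal sketch of reading words together with their positions from ALTO output (the OCR format we receive, see the workflow below); the namespace URI assumes ALTO v2 and the file name is a placeholder.

```python
# Minimal sketch: pulling words plus their page coordinates out of an ALTO OCR
# file, so that recognised entities can later be highlighted on the page image.
# The namespace URI differs between ALTO versions; v2 is assumed here.
import xml.etree.ElementTree as ET

ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v2#}"

def words_with_coordinates(alto_file):
    """Yield (word, hpos, vpos, width, height) tuples from an ALTO file."""
    root = ET.parse(alto_file).getroot()
    for string in root.iter(ALTO_NS + "String"):
        yield (
            string.get("CONTENT"),
            float(string.get("HPOS")),
            float(string.get("VPOS")),
            float(string.get("WIDTH")),
            float(string.get("HEIGHT")),
        )

if __name__ == "__main__":
    for word, x, y, w, h in words_with_coordinates("newspaper_page.alto.xml"):
        print(word, x, y, w, h)
```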

Then there are also challenges of course – mainly due to the quality of the OCR and the historical spelling that is found in many of these old newspapers. In the course of 2014 we will thus collaborate with the Dutch Institute for Lexicology (INL), who have produced modules which can be used in a pre-processing step before the Stanford system and that can to some extent mitigate problems caused by low quality of the full-text or the appearance of historical spelling variants.

The Europeana Newspapers NER workflow

For Europeana Newspapers, we decided to focus on three languages: Dutch, French and German. The content in these three languages makes up about half of the newspaper pages that will become available through Europeana Newspapers. For the French materials we cooperate with LIP6-ACASA, for Dutch again with INL. The workflow goes like this:

  1. We receive OCR results in ALTO format (or METS/MPEG21-DIDL containers)
  2. We process the OCR with our NER software to derive a pre-tagged corpus
  3. We upload the pre-tagged corpus into an online Attestation Tool (provided by INL)
  4. Within the Attestation Tool, the libraries make corrections and add tags until we arrive at a “gold corpus”, i.e. all named entities on the pages have been manually marked
  5. We train our NER software based on the gold corpus derived in step (4)
  6. We process the OCR again with our NER software trained on the gold corpus
  7. We repeat steps (2) – (6) until the tagging results no longer improve

Screenshot of the NER Attestation Tool

Preliminary results

Named entity recognition is typically evaluated by means of precision, recall and F-measure. Precision gives an account of how many of the named entities that the software found are in fact named entities of the correct type, while recall states how many of the total number of named entities present have been detected by the software. The F-measure then combines both scores into a single value between 0 and 1 (in the commonly used F1 variant, the harmonic mean of precision and recall). Here are our (preliminary) results for Dutch so far:

Dutch          Persons    Locations    Organizations
Precision      0.940      0.950        0.942
Recall         0.588      0.760        0.559
F-measure      0.689      0.838        0.671

These figures have been derived from a k-fold cross-evaluation of 25 out of 100 manually tagged pages of Dutch newspapers from the KB. The results confirm the fact that the Stanford system tends to be a bit “conservative”, i.e. it has a somewhat lower recall for the benefit of higher precision, which is also what we wanted.
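For reference, the standard way these scores are computed from per-entity-type counts can be sketched as follows (toy counts, not the project’s evaluation data):

```python
# Minimal sketch of the standard definitions behind the evaluation: counts of
# true positives, false positives and false negatives per entity type give
# precision, recall and the F1 score (harmonic mean of the two).
def precision_recall_f1(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    # Toy counts, purely illustrative.
    print(precision_recall_f1(true_positives=47, false_positives=3, false_negatives=33))
```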

Conclusion and outlook

Within this final year of the project we are looking forward to seeing how far we can still boost these figures by adopting the extra modules from INL, and what results we can achieve on the French and German newspapers. We will also investigate software for linking the named entities to additional online resource descriptions and authority files such as DBpedia or VIAF to create Linked Data. The crucial question will be how well we can disambiguate the named entities and find a correct match in these resources. Besides, if there is time, we would also like to experiment with NER in other languages, such as Serbian or Latvian. And, if all goes well, you might already hear more about this at the upcoming IFLA newspapers conference “Digital transformation and the changing role of news media in the 21st Century“.
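As a simple illustration of what such linking could look like (not the project’s eventual implementation), the sketch below asks DBpedia’s public SPARQL endpoint for resources whose label matches a recognised entity; picking the right candidate, i.e. disambiguation, is the hard part and is not handled here.

```python
# Minimal sketch: looking up a recognised entity in DBpedia via its public
# SPARQL endpoint and returning candidate resource URIs. Disambiguation between
# candidates is not handled. Requires `pip install requests`.
import requests

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

def dbpedia_candidates(entity, lang="nl", limit=5):
    # Note: naive string interpolation; entities containing quotes would need escaping.
    query = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?resource WHERE {
            ?resource rdfs:label "%s"@%s .
        } LIMIT %d
    """ % (entity, lang, limit)
    response = requests.get(SPARQL_ENDPOINT,
                            params={"query": query,
                                    "format": "application/sparql-results+json"})
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [b["resource"]["value"] for b in bindings]

if __name__ == "__main__":
    print(dbpedia_candidates("Albert Einstein", lang="nl"))
```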
