The KB has about 10 million digitised newspaper pages, ranging from 1650 until 1995. We negotiated rights to make these pages available for research and this has happened more and more over the past years. However, we thought that many of these projects might be interested in knowing what others are doing and we wanted to provide a networking opportunity for them to share their results. This is why we organised a newspapers symposium focusing on the digitised newspapers of the KB, which was a great success!
Author: Silvia Ponzoda
This post is a summary. The original article is available at: http://www.digitisation.eu/blog/european-commission-rated-excellent-succeed-project-results/
The Succeed project has recently been rated ‘Excellent ‘ by the European Commission. The final evaluation of the Succeed project took place on19th of February 2015, at the University of Alicante, during a meeting of the committee of experts appointed by the European Commission (EC) with the Succeed consortium members. The meeting was chaired by Cristina Maier, Succeed Project Officer from the European Commission.
Succeed has been funded by the European Union to promote the take up and validation of research results in mass digitisation, with a focus on textual content. For a description of the project and the consortium, see our earlier post Succeed project launched.
The outputs produced by Succeed during the project life span (January 2013-December 2014) are listed below.
Author: Judith Rog
Delpher is a joint initiative of the Meertens Institute of Research and documentation of Dutch language and culture, the university libraries of Amsterdam (UvA), Groningen, Leiden and Utrecht, and the National Library of the Netherlands, to bring together the otherwise fragmented access to digitized historical text corpora.
Delpher currently contains over 90.000 books, over 1 million newspapers, containing more than 10 million pages, over 1.5 million pages from periodicals, and 1.5 million ANP news bulletins that are all full text searchable. New content will continually be added in the coming years.
The Digitisation Days will take place in Madrid on 19-20 May. What can you expect from them and why should you go? In order to get answers to these questions we interviewed Hildelies Balk of the National Library of the Netherlands (KB), who is also a member of the executive board of the organizing insitution, the IMPACT Centre of Competence (IMPACT CoC). – Interview and photo by Inge Angevaare (see below for Dutch version)
The Digitisation Days will be of interest to …?
‘Anyone who is working with digitised historical texts. These are often difficult to use because the software cannot decipher damaged originals or illegible characters. For example:
‘The software used to ‘read’ this (Dutch) text produces the following result:
VVt Venetien den 1.Junij, Anno 1618.
DJgn i f paffato te S’ aö’Jifeert mo?üen/bah
te / sbnbe bele btr felbrr geiufttceert baer bnber
eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu
enbeeemgljen bifet Cbeiiupcen berbonbru befe
‘The Dutch National Library and many other libraries are striving to make these types of historical text more usable to researchers by enhancing the quality of the OCR (optical character recognition). Since 2008, we have been involved in European projects set up to improve the usability of OCR’d texts – preferably automatically. The IMPACT Centre of Competence as well as the Digitisation Days are quite unique in that they bring together three interest groups:
- institutions with digitised collections (libraries, archives, museums)
- researchers working on means to improve access to digitised text (image recognition, pattern recognition, language technology)
- companies providing products and services in the field of digitisation and OCR.
‘Representatives of all of these groups will be taking part in the Digitisation Days and they offer participants a complete overview of the state of the art in document analysis, language technology and post-correction of OCR.’
What are the most important benefits from the Centre of Competence and the Digitisation Days, in your opinion?
‘The IMPACT Centre of Competence assists heritage institutions in taking important decisions. We evaluate available tools and report about them. Evaluation software of good quality is available as well. We also provide institutions with guidance and advice in digitisation issues by answering questions such as: what would be the best tools and methods for this particular institution? What quality can you expect from a solution? And what will it cost?’
‘The Digitisation Days offer a perfect opportunity for heritage institutions to get together and share experience and knowledge on issues such as: how to embed digitisation in your institution? How to deal with providers? Also: how do we start up new projects? Where do we find funding? On the second day, those who are interested are invited to join a workshop on the topic of the research agenda for digitisation. What should be the focus for the coming years? Should we focus on quantity or quality? How can we help shapeEuropean plans and budgets?’
Now that you mention Europe: IMPACT, IMPACT Centre of Competence, SUCCEED – the announcement of the Digitisation Days is packed with acronyms. Can you give us a bit of help here??
‘IMPACT was the first European research project aimed at improving access to historical texts. It started in 2008, at the initiative of, among others, the Dutch KB. When the project ended, a number of IMPACT partners set up the IMPACT Centre of Competence to ensure that the project results would be supported and developed. The Centre is not a project, but a standing organisation.’
‘Succeed is another European project, and, by definition, temporary. The objectives are in line with the IMPACT CoC, and the project involves some of the same partners. The aim is raise awareness about the results of European projects related to the digital library and to stimulate implementation. Before the CoC, it was not uncommon for prototypes to be left as they were after completion of a project. Thus the investments did not pay off.’
Will you really turn theory into practice?
‘Yes, most definitely! It is our prime focus for the conference. This is why we instituted the Succeed awards which will be handed out during the Digitisation Days; the Succeed awards recognise the best implementations of innovative technologies. The board has recently announced the winners.’
What do you personally look forward to most during the Digitisation Days?
‘To meeting everybody, to bringing together all these different parties. Colleagues from other institutions, researchers – this is exactly the right kind of meeting for generating exciting ideas and solutions.’
Op 19-20 mei worden in Madrid de Digitisation Days gehouden. Wat valt er te beleven en waarom zou je erheen gaan? We vroegen het Hildelies Balk van de Koninklijke Bibliotheek, die voorzitter is van het bestuur van de organisator, het IMPACT Centre of Competence (IMPACT CoC). – interview en foto Inge Angevaare
Voor wie zijn de Digitisation Days interessant?
‘Voor iedereen die te maken heeft met gedigitaliseerde, historische teksten. Die zijn vaak moeilijk bruikbaar omdat de leessoftware veel fouten maakt. Dat komt bij voorbeeld omdat het originele drukwerk zelf al slecht was, of omdat de drukletter slecht leesbaar is:
‘De software die de plaatjes moet omzetten in leesbare tekst maakt daarvan:
VVt Venetien den 1.Junij, Anno 1618.
DJgn i f paffato te S’ aö’Jifeert mo?üen/bah
te / sbnbe bele btr felbrr geiufttceert baer bnber
eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu
enbeeemgljen bifet Cbeiiupcen berbonbru befe
‘De KB en andere bibliotheken willen dit soort teksten in bruikbare vorm aanbieden aan wetenschappers. Dus zoeken we al sinds 2008 in Europees verband naar methoden om de teksten te verbeteren, liefst automatisch. Het unieke aan het IMPACT Centre of Competence én van de Digitisation Days is dat daar drie belangengroepen bij elkaar komen die elkaar versterken:
- instellingen met collecties die gedigitaliseerd zijn (bibliotheken, archieven, musea)
- onderzoekers die methoden ontwikkelen om gedigitaliseerde tekst te verbeteren (beeldherkenning en – verbetering, patroonherkenning, taaltechnologie)
- leveranciers van producten en diensten voor digitalisering en OCR (optical character recognition).
‘Door de aanwezigheid van al deze mensen krijgt de bezoeker in twee dagen tijd een compleet overzicht van wat er momenteel allemaal mogelijk is – op het gebied van documentanalyse, taaltechnologie en post-correctie van OCR.’
Wat zie jij als het grootste nut van het Centre of Competence en de Digitisation Days?
‘Het IMPACT Centre of Competence helpt erfgoedinstellingen belangrijke beslissingen te nemen. We evalueren bestaande tools en publiceren daarover. Er is zelfs heel goede evaluatiesoftware. En we leveren begeleiding; als een instelling wil gaan digitaliseren kunnen wij ze van advies dienen. Wat zijn de beste tools en methoden in hun specifieke geval? Wat voor kwaliteit mag je verwachten? Wat gaat het kosten?’
‘De Digitisation Days zijn een perfecte manier voor erfgoedinstellingen om elkaar te ontmoeten, uitgebreid ervaringen en kennis te delen. Bijvoorbeeld: Hoe ga je om met leveranciers? Hoe geef je digitalisering een plek in je organisatie? Maar ook: hoe zetten we nieuwe projecten op? Hoe vinden we geldstromen? Op de tweede dag is er een workshop waarin we met belangstellenden gaan praten over de onderzoeksagenda voor digitalisering. Waar moeten we de nadruk op leggen? Meer kwantiteit of meer kwaliteit? Hoe kunnen we de plannen en budgetten van Europa beïnvloeden?’
Nu je het over Europa hebt: IMPACT, IMPACT Centre of Competence, SUCCEED – de aankondiging van de Digitisation Days staat vol met afkortingen. Kun je een beetje orde scheppen in die chaos?
‘IMPACT was het eerste Europese onderzoeksproject voor verbetering van toegang tot historische teksten dat mede op initiatief van de KB in 2008 is gestart. Toen het project afgelopen was, hebben een aantal IMPACT-partners de handen ineengeslagen om ervoor te zorgen dat de resultaten van het project onderhouden en verder ontwikkeld zouden worden. Dat is het IMPACT Centre of Competence. Geen project, maar een staande organisatie.’
‘Succeed is weer een Europees project en dus tijdelijk. De doelstellingen liggen helemaal in lijn met het IMPACT CoC, en daarom zijn er deels dezelfde partners bij betrokken. Doel is om te zorgen dat eindresultaten van Europese projecten op het gebied van de digitale bibliotheek goed onder de aandacht worden gebracht zodat ze gebruikt gaan worden in de praktijk. In het verleden bleven prototypes nog wel eens op de plank liggen. Dat is zonde van de investering.’
Wordt de stap van theorie naar praktijk echt gezet?
‘Jazeker! Die willen we juist alle aandacht geven. Daarom reiken we tijdens de Digitisation Days de Succeed awards uit – prijzen voor de beste toepassingen van innovatieve oplossingen. De jury heeft onlangs de kandidaten en de winnaars bekend gemaakt.’
Waar verheug jijzelf je het meest op tijdens de Digitisation Days?
‘Op de ontmoeting, het bij elkaar brengen van al die belanghebbenden. Collega’s van andere instellingen, de onderzoekers – juist uit de ontmoeting komen vaak spannende ideeën en oplossingen voort.’
2nd Succeed hackathon at the University of Alicante
Is there any one still out there who thinks a hackathon is a malicious break-in? Far from it. It is the best way for developers and researchers to get together and work on new tools and innovations. The 2nd developers workshop / hackathon organised on 10-11 April by the Succeed Project was a case in point: bringing together people to work on new ideas and new inspiration for better OCR. The event was held in the “Claude Shannon” aula of the Department of Software and Computing Systems (DLSI) of the University of Alicante, Spain. Claude Shannon was a famous mathematician and engineer and is also known as the “father of information theory”. So it seems like a good place to have a hackathon!
Clemens explains what a hackathon is and what we hope to achieve with it for Succeed.
Same as last year, we again provided a wiki upfront with some information about possible topics to work on, as well as a number of tools and data that participants could experiment with before and during the event. Unfortunately there was an unexpectedly high number of no-shows this time – we try to keep these events free and open to everyone, but may have to think about charging at least a no-show fee in the future, as places are usually limited. Or did those hackers simply have to stay home to fix the heartbleed bug on their servers? We will probably never find out.
Collaboration, open source tools, open solutions
Nevertheless, there was a large enough group of programmers and researchers from Germany, Poland, the Netherlands and various parts of Spain eager to immerse themselves deeply into a diverse list of topics. Already in the introduction we agreed to work on open tools and solutions, and quickly identified some areas in which open source tool support for text digitisation is still lacking (see below). Actually, one of the first things we did was to set up a local git repository, and people were pushing code samples, prototypes and interesting projects to share with the group during both days.
What’s the status of open source OCR?
Accordingly, Jesús Dominguez Muriel from Digibís (the company that also made http://www.digibis.com/dpla-europeana/) started an investigation into open source OCR tools and frameworks. He made a really detailed analysis of the status of open source OCR, which you can find here. Thanks a lot for that summary, Jesús! At the end of his presentation, Jesús also suggested an “algorithm wikipedia” – I guess something similar to RosettaCode but then specifically for OCR. This would indeed be very useful to share algorithms but also implementations and prevent reinventing (or reimplementing) the wheel. Something for our new OCRpedia, perhaps?
A method for assessing OCR quality based on ngrams
As turned out on the second day, a very promising idea seemed to be using ngrams for assessing the quality of an OCR’ed text, without the need for ground truth. Well, in fact you do still need some correct text to create the ngram model, but one can use texts from e.g. Project Gutenberg or aspell for that. Two groups started to work on this: while Willem Jan Faber from the KB experimented with a simple Python script for that purpose, the group of Rafael Carrasco, Sebastian Kirch and Tomasz Parkola decided to implement this as a new feature in the Java ocrevalUAtion tool (check the work-in-progress “wip” branch).
Jesús in the front, Rafael, Sebastian and Tomasz discussing ngrams in the back.
Aligning text and segmentation results
Another very promising development was started by Antonio Corbi from the University of Alicante. He worked on a software to align plain text and segmentation results. The idea is to first identify all the lines in a document, segment them into words and eventually individual charcaters, and then align the character outlines with the text in the ground truth. This would allow (among other things) creating a large corpus of training material for an OCR classifier based on the more than 50,000 images with ground truth produced in the IMPACT Project, for which correct text is available, but segmentation could only be done on the level of regions. Another great feature of Antonio’s tool is that while he uses D as a programming language, he also makes use of GTK, which has the nice effect that his tool does not only work on the desktop, but also as a web application in a browser.
OCR is complicated, but don’t worry – we’re on it!
Gustavo Candela works for the Biblioteca Virtual Miguel de Cervantes, the largest Digital Library in the Spanish speaking world. Usually he is busy with Linked Data and things like FRBR, so he was happy to expand his knowledge and learn about the various processes involved in OCR and what tools and standards are commonly used. His findings: there is a lot more complexity involved in OCR than appears at first sight. And again, for some problems it would be good to have more open source tool support.
In fact, at the same time as the hackathon, at the KB in The Hague, the ‘Mining Digital Repositories‘ conference was going on where the problem of bad OCR was discussed from a scholarly perspective. And also there, the need for more open technologies and methods was apparent:[tweet 454528200572682241 hide_thread=’true’]
Open source border detection
One of the many technologies for text digitisation that are available in the IMPACT Centre of Competence for image pre-processing is Border Removal. This technique is typically applied to remove black borders in a digital image that have been captured while scanning a document. The borders don’t contain any information, yet they take up expensive storage space, so removing the borders without removing any other relevant information from a scanned document page is a desirable thing to do. However, there is no simple open source tool or implementation for doing that at the moment. So Daniel Torregrosa from the University of Alicante started to research the topic. After some quick experiments with tools like imagemagick and unpaper, he eventually decided to work on his own algorithm. You can find the source here. Besides, he probably earns the award for the best slide in a presentation…showing us two black pixels on a white background!
A great venue
All in all, I think we can really be quite happy with these results. And indeed the University of Alicante also did a great job hosting us – there was an excellent internet connection available via cable and wifi, plenty of space and tables to discuss in groups and we were distant enough from the classrooms not to be disturbed by the students or vice versa. Also at any time there was excellent and light Spanish food – Gazpacho, Couscous with vegetables, assorted Montaditos, fresh fruit…nowadays you won’t make hackers happy with just pizza anymore! Of course there were also ice-cooled drinks and hot coffee, and rumours spread that there were also some (alcohol-free?) beers in the cooler, but (un)fortunately there is no more documentary evidence of that…
To be continued!
If you want to try out any of the software yourself, just visit our github and have go! Make sure to also take a look at the videos that were made with participants Jesús, Sebastian and Tomasz, explaining their intentions and expectations for the hackathon. And at the next hackathon, maybe we can welcome you too amongst the participants?
Libraries want to understand the researchers who use their digital collections and researchers want to understand the nature of these collections better. The seminar ‘Mining digital repositories’ brought them together at the Dutch Koninklijke Bibliotheek (KB) on 10-11 April, 2014, to discuss both the good and the bad of working with digitised collections – especially newspapers. And to look ahead at what a ‘digital utopia’ might look like. One easy point to agree on: it would be a world with less restrictive copyright laws. And a world where digital ‘portals’ are transformed into ‘platforms’ where researchers can freely ‘tinker’ with the digital data. – Report & photographs by Inge Angevaare, KB.
Libraries and researchers: a changing relationship
‘A lot has changed in recent years,’ Arjan van Hessen of the University of Twente and the CLARIN project told me. ‘Ten years ago someone might have suggested that perhaps we should talk to the KB. Now we are practically in bed together.’
But each relationship has its difficult moments. Researchers are not happy when they discover gaps in the data on offer, such as missing issues or volumes of newspapers. Or incomprehensible transcriptions of texts because of inadequate OCR (optical character recognition). Conference organisers Toine Pieters and Jaap Verheul (University of Utrecht) invited Hans-Jorg Lieder of the Berlin State Library to explain why he ‘could not give researchers everything everywhere today’.
Lieder & Thomas: ‘Digitising newspapers is difficult’
Both Deborah Thomas of the Library of Congress and Hans-Jorg Lieder stressed how complicated it is to digitise historical newspapers. ‘OCR does not recognise the layout in columns, or the “continued on page 5”. Plus the originals are often in a bad state – brittle and sometimes torn paper, or they are bound in such a way that text is lost in the middle. And there are all these different fonts, e.g., Gothic script in German, and the well-known long-s/f confusion.’ Lieder provided the ultimate proof of how difficult digitising newspapers is: ‘Google only digitises books, they don’t touch newspapers.’
Another thing researchers should be aware of: ‘Texts are liquid things. Libraries enrich and annotate texts, versions may differ.’ Libraries do their best to connect and cluster collections of newspapers (e.g., in the Europeana Newspapers), but ‘the truth of the matter is that most newspapers collections are still analogue; at this moment we have only bits and pieces in digital form, and there is a lot of bad OCR.’ There is no question that libraries are working on improving the situation, but funding is always a problem. And the choices to be made with bad OCR are sometimes difficult: Should we manually correct it all, or maybe retype it, or maybe even wait a couple of years for OCR technology to improve?’
Researchers: how to mine for meaning
Researchers themselves are debating how they can fit these new digital resources into their academic work. Obviously, being able to search millions of newspaper pages from different countries in a matter of days opens up a lot of new research possibilities. Conference organisers Toine Pieters and Jaap Verheul (University of Utrecht) are both involved in the HERA Translantis project which is taking a break from traditional ‘national’ historical research by looking at transnational influences of so-called ‘reference cultures’:
In the 17th century the Dutch Republic was such a reference culture. In the 20th century the United States developed into a reference culture and Translantis digs deep into the digital newspaper archives of the Netherlands, the UK, Belgium and Germany to try and find out how the United States is depicted in public discourse:
Joris van Eijnatten introduced another transnational HERA project, ASYMENC, which is exploring cultural aspects of European identity with digital humanities methodologies.
All of this sounds straightforward enough, but researchers themselves have yet to develop a scholarly culture around the new resources:
- What type of research questions do the digital collections allow? Are these new questions or just old questions to be researched in a new way?
- What is scientific ‘proof’ if the collections you mine have big gaps and faulty OCR?
- How to interpret the findings? You can search words and combinations of words in digital repositories, but how can you assess what the words mean? Meanings change over time. Also: how can you distinguish between irony and seriousness?
- How do you know that a repository is trustworthy?
- How to deal with language barriers in transnational research? Mere translations of concepts do not reflect the sentiment behind the words.
- How can we analyse what newspapers do not discuss (also known as the ‘Voldemort’ phenomenon)?
- How sustainable is digital content? Long-term storage of digital objects is uncertain and expensive. (Microfilms are much easier to keep, but then again, they do not allow for text mining …)
- How do available tools influence research questions?
- Researchers need a better understanding of text mining per se.
Some humanities scholars have yet to be convinced of the need to go digital
Rens Bod, Director of the Dutch Centre for Digital Humanities enthusiastically presented his ideas about the value of parsing (analysing parts of speech) for uncovering deep patterns in digital repositories. If you want to know more: Bod recently published a book about it.
But in the context of this blog, his remarks about the lack of big data awareness and competencies among many humanities scholars, including young students, was perhaps more striking. The University of Amsterdam offers a crash course in working with digital data to bridge the gap. The one-week, free course, deals with all aspects of working with data, from ‘gathering data’ to ‘cooking data’.
As the scholarly dimensions of working with big data are not this blogger’s expertise, I will not delve into these further but gladly refer you to an article Toine Pieters and Jaap Verheul are writing about the scholarly outcomes of the conference [I will insert a link when it becomes available].
More data providers: the ‘bad’ guys in the room
It was the commercial data providers in the room themselves that spoke of ‘bad guys’ or ‘bogey man’ – an image both Ray Abruzzi of Cengage Learning/Gale and Elaine Collins of DC Thomson Family History were hoping to at least soften a bit. Both companies provide huge quantities of digitised material. And, yes, they are in it for the money, which would account for their bogeyman image. But, they both stressed, everybody benefits from their efforts:
Cengage Learning is putting 25-30 million pages online annually. Thomson is digitising 750 million (!) newspaper & periodical pages for the British Library. Collins: ‘We take the risk, we do all the work, in exchange for certain rights.’ If you want to access the archive, you have to pay.
In and of itself, this is quite understandable. Public funding just doesn’t cut it when you are talking billions of pages. Both the KB’s Hans Jansen and Rens Bod (U. of Amsterdam) stressed the need for public/private partnerships in digitisation projects.
Elaine Collins readily admitted that researchers ‘are not our most lucrative stakeholders’; that most of Thomson’s revenue comes from genealogists and the general public. So why not give digital humanities scholars free access to their resources for research purposes, if need be under the strictest conditions that the information does not go anywhere else? Both Abruzzi and Collins admitted that such restricted access is difficult to organise. ‘And once the data are out there, our entire investment is gone.’
Libraries to mediate access?
Perhaps, Ray Abruzzi allowed, access to certain types of data, e.g., metadata, could be allowed under certain conditions, but, he stressed, individual scholars who apply to Cengage for access do not stand a chance. Their requests for data are far too varied for Cengage to have any kind of business proposition. And there is the trust issue. Abruzzi recommended that researchers turn to libraries to mediate access to certain content. If libraries give certain guarantees, then perhaps …
What do researchers want from libraries?
More data, of course, including more contemporary data (… ah, but copyright …)
And better quality OCR, please.
What if libraries have to choose between quality and quantity? That is when things get tricky, because the answer would depend on the researcher you question. Some may choose quantity, others quality.
Should libraries build tools for analysing content? The researchers in the room seemed to agree that libraries should concentrate on data rather than tools. Tools are very temporary, and researchers often need to build the tools around their specific research questions.
But it would be nice if libraries started allowing users to upload enrichments to the content, such as better OCR transcriptions and/or metadata.
And there is one more urgent request: that libraries become more transparent in what is in their collections and what is not. And be more open about the quality of the OCR in the collections. Take, e.g., the new Dutch national search service Delpher. A great project, but scholars must know exactly what’s in it and what’s not for their findings to have any meaning. And for scientific validity they must be able to reconstruct such information in retrospect. So a full historical overview of what is being added at what time would be a valuable addition to Delpher. (I shall personally communicate this request to the Delpher people, who are, I may add, working very hard to implement user requests).
New to the library: labs for researchers
Deborah Thomas of the Library of Congress made no bones about her organisation’s strategy towards researchers: We put out the content, and you do with it whatever you want. In addition to API’s (Application Protocol Interfaces), the Library is also allowing for downloads of bulk content. The basic content is available free of charge, but additional metadata levels may come at a price.
The British Library (BL) is taking a more active approach. The BL’s James Baker explained how the BL is trying to bridge the gap between researchers and content by providing special labs for researchers. As I (unfortunately!) missed that parallel session, let me mention the KB’s own efforts to set up a KB lab where researchers are invited to experiment with KB data making use of open source tools. The lab is still in its ‘pre-beta phase’ as Hildelies Balk of the KB explained. If you want the full story, by all means attend the Digital Humanities Benelux Conference in the Hague on 12-13 June, where Steven Claeyssens and Clemens Neudecker of the KB are scheduled to launch the beta-version of the platform. Here is a sneak preview of the lab, a scansion machine built by KB Data Services in collaboration with phonologist Marc van Oostendorp (audio in Dutch):
Europeana: the aggregator
“Portals are for visiting; platforms are for building on.”
Another effort by libraries to facilitate transnational research is the aggregation of their content in Europeana, especially Europeana Newspapers. For the time being the metadata are being aggregated, but in Alistair Dunning‘s vision, Europeana will grow from an end-user portal into a data brain, a cloud platform that will include the content and allow for metadata enrichment:
Dunning also indicated that Europeana might develop brokerage services to clear content for non-commercial purposes. In a recent interview Toine Pieters said that researchers would welcome Europeana to take such a role, ‘because individual researchers should not be bothered with all these access/copyright issues.’ In the United States, the Library of Congress is not contemplating a move in that direction, Deborah Thomas told her audience. ‘It is not our mission to negotiate with publishers.’ And recent ‘Mickey Mouse’ legislation, said to have been inspired by Disney interests, seems to be leading to less rather than more access.
Dreaming of digital utopias
What would a digital utopia look like for the conference attendees? Jaap Verheul invited his guests to dream of what they would do if they were granted, say, €100 million to spend as they pleased.
Deborah Thomas of the Library of Congress would put her money into partnerships with commercial companies to digitise more material, especially the post-1922 stuff (less restrictive copyright laws being part and parcel of the dream). And she would build facilities for uploading enrichments to the data.
James Baker of the British Library would put his money into the labs for researchers.
Researcher Julia Noordegraaf of the University of Amsterdam (heritage and digital culture) would rather put the money towards improving OCR quality.
Joris van Eijnatten’s dream took the Europeana plans a few steps further. His dream would be of a ‘Globiana 5.0’ – a worldwide, transnational repository filled with material in standardised formats, connected to bilingual and multilingual dictionaries and researched by a network of multilingual, big data-savvy researchers. In this context, he suggested that ‘Google-like companies might not be such a bad thing’ in terms of sustainability and standardisation.
At the end of the two-day workshop, perhaps not all of the ambitious agenda had been covered. But, then again, nobody had expected that.
The trick is for providers and researchers to keep talking and conquer this ‘unruly’ Wild West of digital humanities bit by bit, step by step.
And, by all means, allow researchers to ‘tinker’ with the data. Verheul: ‘There is a certain serendipity in working with big data that allows for playfulness.’
- Workshop programme and most slides are available at http://translantis.nl/, tab Conferences, Mining digital repositories (there does not seem to be a more specific link to the proper page).
- James Baker’s (British Library) miniblog
- Twitter hash tag #digrep14
- Arjan van Hessen’s blog at CLARIN
- A recent interview with Toine Pieters at the Europeana Newspapers site
The refinement partners in the Europeana Newspapers project will produce the astonishing amount of 10 million pages of full-text from historical newspapers from all over Europe. What could be done to further enrich that full-text?
The KB National Library of the Netherlands has been investigating named entity recognition (NER) and linked data technologies for a while now in projects such as IMPACT and STITCH+, and we felt it was about time to approach this on a production scale. So we decided to produce (open source) software, trained models as well as raw training data for NER software applications specifically for digitised historical newspapers as part of the project.
What is named entity recognition (NER)?
Named entity recognition is the process of identifying and classifying entities such as persons, locations and organisations in the full-text in order to enhance searchability. There are basically two types of approaches, a statistical and a rule based one. Rule based systems rely mostly on grammar rules defined by linguists, while statistical systems require large amounts of manually produced training data that they can learn from. While both approaches have their benefits and drawbacks, we decided to go for a statistical tool, the CRF–NER system from Stanford University. In comparison, this software proved to be the most reliable, and it is supported by an active user community. Stanford University has an online demo where you can try it out: http://nlp.stanford.edu:8080/ner/.
Requirements & challenges
There are some particular requirements and challenges when applying these techniques to digital historical newspapers. Since full-text for more than 10 million pages will be produced in the project, one requirement for our NER tool was that it should be able to process large amounts of texts in a rather short time. This is possible with the Stanford tool, which as of version 1.2.8 is “thread-safe”, i.e. it can run in parallel on a multi-core machine. Another requirement was to preserve the information about where on a page a named entity has been detected – based on coordinates. This is particularly important for newspapers: instead of having to go through all the articles on a newspaper page to find the named entity, it can be highlighted so that one can easily spot it even on very dense pages.
Then there are also challenges of course – mainly due to the quality of the OCR and the historical spelling that is found in many of these old newspapers. In the course of 2014 we will thus collaborate with the Dutch Institute for Lexicology (INL), who have produced modules which can be used in a pre-processing step before the Stanford system and that can to some extent mitigate problems caused by low quality of the full-text or the appearance of historical spelling variants.
The Europeana Newspapers NER workflow
For Europeana Newspapers, we decided to focus on three languages: Dutch, French and German. The content in these three languages makes up for about half of the newspaper pages that will become available through Europeana Newspapers. For the French materials, we cooperate with LIP6-ACASA, for Dutch again with INL. The workflow goes like this:
- We receive OCR results in ALTO format (or METS/MPEG21-DIDL containers)
- We process the OCR with our NER software to derive a pre-tagged corpus
- We upload the pre-tagged corpus into an online Attestation Tool (provided by INL)
- Within the Attestation Tool, the libraries make corrections and add tags until we arrive at a “gold corpus”, i.e. all named entities on the pages have been manually marked
- We train our NER software based on the gold corpus derived in step (4)
- We process the OCR again with our NER software trained on the gold corpus
- We repeat steps (2) – (6) until the results of the tagging won’t improve any further
Named entity recognition is typically evaluated by means of Precision/Recall and F-measure. Precision gives an account of how many of the named entities that the software found are in fact named entities of the correct type, while Recall states how many of the total amount of named entities present have been detected by the software. The F-measure then combines both scores into a weighted average between 0 – 1. Here are our (preliminary) results for Dutch so far:
These figures have been derived from a k-fold cross-evaluation of 25 out of 100 manually tagged pages of Dutch newspapers from the KB. The results confirm the fact that the Stanford system tends to be a bit “conservative”, i.e. it has a somewhat lower recall for the benefit of higher precision, which is also what we wanted.
Conclusion and outlook
Within this final year of the project we are looking forward to see in how far we can still boost these figures by adopting the extra modules from INL, and what results we can achieve on the French and German newspapers. We will also investigate software for linking the named entities to additional online resource descriptions and authority files such as DBPedia or VIAF to create Linked Data. The crucial question will be how well we can disambiguate the named entities and find a correct match in these resources. Besides, if there is time, we would also want to experiment with NER in other languages, such as Serbian or Latvian. And, if all goes well, you might already hear more about this at the upcoming IFLA newspapers conference “Digital transformation and the changing role of news media in the 21st Century“.
- Get the source code of the NER tool for Europeana Newspapers: https://github.com/KBNLresearch/europeananp-ner/
- Get models and training data for the NER tool for Europeana Newspapers:
- The Stanford University NER tool:
- The INL NER tool:
This year in November, it has been exactly 10 years that I have been more or less involved with digital libraries and OCR. In fact, my first encounter with OCR even predates the digital library: during my student days, one of my fellow students was blind, and I was helping him out with his studies by scanning and OCR-ing the papers he needed, so their contents could be read out to him using Text2Speech software or printed on a braille display. Looking back, OCR technology has evolved significantly in many areas since then. Projects like MetaE and IMPACT have greatly improved the capabilities of OCR technology to recognize historical fonts, and open source tools such as Google’s Tesseract or those offered by the IMPACT Centre of Competence are getting closer and closer to the functionalities and success rates offered by commercial products.
Accordingly, I would like to take this opportunity to present you some thoughts and recommendations that I’ve derived from my personal experience of 10+ years with OCR processing.
A final caveat: while this is a very interesting discussion, I will not say a single word here about whether to perform OCR as an in-house activity or via out-sourcing. My general assumption is that below considerations can provide useful information for both scenarios.
1. Know your material
The more you know about the material / collection you are aiming to OCR, the better. Some characteristics are essential for the configuration of the OCR, like e.g. the language of a document and the fonts (Antiqua, Gothic, Cyrillic, etc.) present. While such information is typically not available in library catalogues, sending documents in French language to an OCR engine configured to recognize English will yield equally poor results as trying to OCR a Gothic typeface with Antiqua settings.
Fortunately there are some helpful tools available – e.g. Apache Tika can detect the language of a document quite reliably. You may consider running such or similar characterization software in a pre-processing step to gather additional information about the content for a more fine-grained configuration of the OCR software.
Some more features in the running text the presence and frequency of which could influence your OCR setup are: tables and illustrations, paragraphs with rotated text, handwritten annotations, foldouts.
2. Capture high quality – INPUT
Once you are ready to proceed to the image capture step it is important to think about how to set this up. While recent experiments have shown that (on simple documents) there is no apparent loss in recognition quality from using e.g. compressed JPEG images for OCR, my recommendation still remains to scan with the highest optical resolution (typically 300 or 400 ppi) and store the result in an uncompressed format like TIFF or PNG (or even the RAW data directly from the scanner).
While this may result in huge files and storage costs (btw, did you know that the cost per GB of hard drive space drop by 48% every year?), keep in mind that any form of post-processing or compression does essentially reduce the amount of information available in the image for subsequent processing – and it turns out that OCR engines are becoming more and more sophisticated in using this information (e.g. colour) to improve recognition. However, once gone, this information can never be retrieved again without rescanning. If you binarize (=convert to black-and-white) your images immediately after scanning, you won’t be able to leverage the benefits of the next-generation OCR system that requires greyscale or colour documents.
It may also be worthwhile mentioning that while this has never been made very explicit, the classifiers in many OCR engines are optimized for an optical resolution of 300 ppi, and deliver the best recognition rates with documents in that particular resolution. Only in the case of very small characters (as e.g. found on large newspaper pages) can it make sense to scale the image up to 600 ppi for better OCR results.
3. Capture high quality – OUTPUT
OCR is still a costly process – from preparation to execution, costs can easily amount to between .5 up to .50 € per page. Thus you want to make sure that you derive the most possible value from it. Don’t be satisfied with plain text only! Nowadays some form of XML with (at least) basic structuring and most importantly positional information on the level of blocks / regions, or even better line and word or sometimes even glyph level, should always be available after OCR. ALTO is one commonly used standard for representing such information in an XML format, but also TEI or other XML-based formats can be a good choice.
Not only does the coordinate information enable greatly enhanced search and display of search results (hit term highlighting), there are also many further application scenarios such as the automated generation of table of contents, the production of eBooks, the presentation on mobile devices etc. that rely heavily on structural and layout information being available from OCR processing.
4. Manage expectations
No matter how modern and in pristine condition your documents are, or whether you use the most advanced scanning equipment and highly configured OCR software, it is quite unrealistic to expect anything more than 90 – 95 % word accuracy from automatic processing. Most of the times though you will be happy to even come anywhere near that range.
Note that most commercial OCR engines calculate error rates based on characters and not words. This can be very misleading, since users will want to search for words. Given there are only 30 errors across a single page with 3000 characters, the character error rate (30/3000, 0,01%) seems exceptionally low. But now assume the 3000 characters boil down to only around 600 words – and the 30 erroneous characters are well distributed across different words. We arrive at an actually much higher (5x) error rate (30/600, 0,05%). To make things worse, OCR engines typically report a “confidence score” in the output. This however only means that the software believes with a certain threshold to have recognized a character or word correctly or incorrectly. These “assumptions”, despite conservative, are unfortunately often found not to be true. That is why the only possible way to derive absolutely reliable OCR accuracy scores is by the use of ground truth-driven evaluation, which is expensive and cumbersome to perform.
Obviously all of this has implications on the quality of any service based on the OCR result. These issues must be made transparent to the organization, and should in all cases also be communicated to the end user.
5. Exploit full text to the fullest
Once you derive full text from OCR processing, it can be the first stepping stone for a wide array of further enhancements of your digital collection. Life does not stop with (even good) OCR results!
Full text gives you the ability to exploit a multitude of tools for natural language processing (NLP) on the content. Named entity recognition, topic modelling, sentiment analysis, keyword extraction etc. are just a few of the possibilities to further refine and enrich the full text.
6. Tailor the workflow
The enemy of large-scale automated processing, it can nevertheless often be worthwhile investing some more time and tailor the OCR processing flow to the characteristics of the source material. There are highly specialized modules and engines for particular pre- and post-processing tasks, and integrating these with your workflow for a very particular subset of a collection can often yield surprising improvements in the quality of the result.
7. Use all available resources
One of the important findings of the IMPACT project was that the use of additional language technologies can boost OCR recognition by an amount than cannot realistically be expected from even major breakthroughs in pattern recognition algorithms. Especially in dealing with historical material there is a lot of spelling variation, and it gets extremely difficult for the OCR software to correctly detect these old words. Making the OCR software aware of historical spelling by supplying it with a historical dictionary or word list can deliver dramatic improvements here. In addition, new technologies can detect valid historical spelling variants and distinguish them from common OCR errors. This makes it much quicker and easier to correct those OCR mistakes while retaining the proper historical word forms (i.e. no normalization is applied).
8. Try out different solutions
There is a surprisingly large number of OCR software available, both freely and commercially. The Succeed project compiled information about all OCR and related software tools in a huge database that you can search here.
Also quite useful in this are the IMPACT Framework and Demonstrator Platform – these tools allow you to test different solutions for OCR and related tasks online, or even combine distinct tools into comprehensive document recognition workflows and compare those using samples of the material you have to process.
9. Consult experts
All over the world people are applying, researching and sometimes re-inventing OCR technology. The IMPACT Centre of Competence provides a great entry point to that community. eMOP is another large OCR project currently run in the US. Consult with the community to find out about others who may have done projects similar to yours in the past and who can share findings or even technology.
Finally, consider visiting one of the main conferences in the field, such as ICDAR or ICPR and look at the relevant journal publications by IAPR etc. There is also a large community of OCR and pattern recognition experts in the Biosciences, e.g. in iDigBio. Hackathons like for example the ones organized by Succeed can provide you with hands-on experience with the tools and technologies being available for OCR.
10. Consider post-correction
When all other things fail and you just can’t obtain the desired accuracy using automated processing methods, post-correction is often the only possible way to increase the quality of the text to a level suitable for scientific study and text mining. There are many solutions offered to adopt OCR post-correction, from simple-to-use crowdsourcing efforts to rather specialized tools for experts. Gamification of OCR correction has also been explored by some. And as a side effect you may also learn to interact more closely with your users and understand their needs.
With this I hope to have given you some points to take into consideration when planning your next OCR project and wish you much success in doing so. If you would like to comment on any of the points mentioned or maybe share your personal experience with an OCR project, we would be very happy to hear from you!
As was posted earlier on this blog, the KB participates in the European project Europeana Newspapers. In this project, we are working together with 17 other institutions (libraries, technical partners and networking partners) to make 18 million European newspapers pages available via Europeana on title level. Next to this, The European Library is working on a specifically built portal to also make the newspapers available as full-text. However, many of the libraries do not have OCR for their newspapers yet, which is why the project is working together with the University of Innsbruck, CCS Content Conversion Specialists GmbH from Hamburg and the KB to enrich these pages with OCR, Optical Layout Recognition (OLR), and Named Entity Recognition (NER).
In June, the project had a workshop on refinement, but it was now time to discuss aggregation and presentation. This workshop took place in Amsterdam on 16 September, during The European Library Annual Event. There was a good group of people, not only from the project partners and the associated partners, but also from outside the consortium. After the project, TEL hopes to be able to also offer these institutions a chance to send in their newspapers for Europeana, so we were very happy to have them join us.
The workshop kicked off with an introduction from Marieke Willems of LIBER and Hans-Joerg Lieder of the Berlin State Library.. They were followed by Markus Muhr from TEL, who introduced the aggregation plan and the schedule for the project partners. With so many partners, it can be quite difficult to find a schedule that works well, to ensure everyone has their material sent in on time. After the aggregation, TEL will then have to do some work on the metadata to convert it to the Europeana Data Model. Markus was followed by a presentation from Channa Veldhuijsen from the KB, who unfortunately, could not be there in person. However, her elaborate presentation on usability testing provided some good insights on how to get your website to be the best it can be and how to find out what your users really think when they are browsing your site.
It was then time for Alastair Dunning from TEL to showcase the portal that they have been preparing for Europeana Newspapers. Unfortunately, the wifi connection was not up to so many visitors and only some people could follow his presentation along on their own devices. However, there were some valuable feedback points which TEL will use to improve the portal. Unfortunately, the portal is not yet available from outside, so people who missed the presentation need to wait a bit longer to be able to see and browse the European newspapers.
But what we do already can see, are some websites of partners that have already been online for some time. It was very interesting to see the different choices each partner made to showcase their collection. We heard from people from the British Library, the National and University Library of Iceland, the National and University Library of Slovenia, the National Library of Luxembourg and the National Library of the Czech Republic.
The day ended with a lovely presentation by Dean Birkett of Europeana, who, partly with Channa’s notes, went to all the previously presented websites and offered comments on how to improve them. The videos he used in his talk are available on Youtube. His key points were:
- Make the type size large: 16px is the recommended size.
- Be careful of colours. Some online newspapers sites use red to highlight important information but red is normally associated with warning signals and errors in the user’s mind.
- Use words to indicate language choices (eg. ‘english’, ‘français’) not flags. The Spanish flag won’t necessarily be interpreted to mean ‘click here for spanish’ if the user is from Mexico.
- Cut down on unnecessary text. Make it easy for users to skim (eg. though the use of bullet points).
All in all, it was a very useful afternoon in which I learned a lot about what users want from a website. If you want to see more, all presentations can be found at the Slideshare account of Europeana Newspapers or join us at one of the following events:
- Workshop on Newspapers in Europe and the Digital Agenda. British Library, London. September 29-30th, 2014.
- National Information Days.
- National Library of Austria. March 25-26th, 2014.
- National Library of France. April 3rd, 2014.
- British Library. June 9th, 2014.