OCR improvement: helping and hindering researchers

Author: Tineke Koster

As I am writing this, volunteers are rekeying our 17th-century newspaper articles. Optical character recognition of the Gothic typeface in use at the time has yielded poor results, making this part of our digital collection nearly inaccessible for full-text search. The Meertens Institute, which has an excellent track record in crowdsourcing, has developed the editor (in Dutch). Together with them we are working towards a full update of all newspaper issues from 1618 to 1700 that are available on our website Delpher.

Great news and, for some researchers, an eagerly awaited development. A bright future beckons in which our digital text corpus is 100% correct, just waiting to be mined for dynamic phenomena and paradigm shifts.

But we have to realize that without the proper precautions, correcting digital texts may also hinder researchers in their work. How so? These texts may have been used (browsed, mined, cited, etc.) by researchers in their earlier form. The improvement or enrichment may have consequences for the reproducibility of their research results.

For all researchers, the need to reproduce research results is growing, driven by new guidelines and new legislation. There is also a specific group of researchers that needs sustained access to older versions of digital texts. The need is highest for research whose goal is to develop an algorithm and to assess its quality relative to previous versions of the same algorithm or to other algorithms. Without sustained access to older versions, these researchers cannot do their work.

Is it our role to provide this access? I hope to explain in a later blog post (soon!) how the National Library of the Netherlands is thinking about this issue. Meanwhile, I would be very interested to hear your experiences. How is this subject discussed in your organization? Does your organization have a policy in place to deal with this?

10 Tips for making your OCR project succeed

(reblogged from http://www.digitisation.eu/community/blog/article/article/10-tips-for-making-your-ocr-project-succeed/)

This November marks exactly 10 years that I have been more or less involved with digital libraries and OCR. In fact, my first encounter with OCR even predates the digital library: during my student days, one of my fellow students was blind, and I helped him with his studies by scanning and OCR-ing the papers he needed, so their contents could be read out to him using text-to-speech software or rendered on a braille display. Looking back, OCR technology has evolved significantly in many areas since then. Projects like MetaE and IMPACT have greatly improved the capabilities of OCR technology to recognize historical fonts, and open source tools such as Google’s Tesseract or those offered by the IMPACT Centre of Competence are getting closer and closer to the functionality and success rates offered by commercial products.

Accordingly, I would like to take this opportunity to present some thoughts and recommendations derived from my personal experience of 10+ years with OCR processing.

A final caveat: while it is a very interesting discussion, I will not say a single word here about whether to perform OCR in-house or via outsourcing. My general assumption is that the considerations below can provide useful information for both scenarios.

1.    Know your material

The more you know about the material / collection you are aiming to OCR, the better. Some characteristics are essential for configuring the OCR, such as the language of a document and the fonts present (Antiqua, Gothic, Cyrillic, etc.). Such information is typically not available in library catalogues, yet it matters: sending French documents to an OCR engine configured to recognize English will yield results just as poor as trying to OCR a Gothic typeface with Antiqua settings.

Fortunately there are some helpful tools available – e.g. Apache Tika can detect the language of a document quite reliably. You may consider running this or similar characterization software in a pre-processing step to gather additional information about the content for a more fine-grained configuration of the OCR software.
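
As an illustration, here is a minimal sketch of such a pre-processing step in Python, using the langdetect package rather than Apache Tika itself; the file name and the mapping to OCR language settings are made-up examples.

    # Minimal sketch of a language-detection pre-processing step.
    # "sample_page.txt" is a hypothetical file containing some text from the document.
    from langdetect import detect

    with open("sample_page.txt", encoding="utf-8") as f:
        text = f.read()

    iso_code = detect(text)                      # e.g. 'fr', 'nl', 'en'
    print(f"detected language: {iso_code}")

    # Hypothetical mapping from ISO code to the language setting of an OCR engine.
    ocr_language = {"fr": "fra", "nl": "nld", "en": "eng"}.get(iso_code, "eng")
    print(f"configure the OCR engine with: {ocr_language}")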

Other features in the running text whose presence and frequency could influence your OCR setup include tables and illustrations, paragraphs with rotated text, handwritten annotations, and foldouts.

2.    Capture high quality – INPUT

Once you are ready to proceed to the image capture step, it is important to think about how to set it up. While recent experiments have shown that (on simple documents) there is no apparent loss in recognition quality from using e.g. compressed JPEG images for OCR, my recommendation remains to scan at the highest optical resolution (typically 300 or 400 ppi) and store the result in an uncompressed format like TIFF or PNG (or even the RAW data directly from the scanner).

While this may result in huge files and storage costs (by the way, did you know that the cost per GB of hard drive space drops by 48% every year?), keep in mind that any form of post-processing or compression essentially reduces the amount of information available in the image for subsequent processing – and it turns out that OCR engines are becoming more and more sophisticated in using this information (e.g. colour) to improve recognition. Once gone, this information can never be retrieved again without rescanning. If you binarize (= convert to black-and-white) your images immediately after scanning, you won’t be able to leverage the benefits of a next-generation OCR system that requires greyscale or colour input.

It may also be worth mentioning that, although this has never been made very explicit, the classifiers in many OCR engines are optimized for an optical resolution of 300 ppi and deliver the best recognition rates with documents at that particular resolution. Only in the case of very small characters (as found, for example, on large newspaper pages) can it make sense to scale the image up to 600 ppi for better OCR results.
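
A minimal sketch, assuming Pillow and a hypothetical file name, of checking the stored resolution of a scan and upscaling a small-print page to an effective 600 ppi before OCR; whether this actually helps should be verified on your own material.

    # Minimal sketch: inspect a scan's resolution and upscale small print to 600 ppi.
    # "page_0001.tif" is a hypothetical file name.
    from PIL import Image

    img = Image.open("page_0001.tif")
    dpi = round(float(img.info.get("dpi", (300, 300))[0]))   # fall back to 300 if unrecorded
    print(f"stored resolution: {dpi} ppi, pixel size: {img.size}")

    if dpi == 300:
        # Doubling the pixel dimensions gives an effective 600 ppi for very small type.
        upscaled = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)
        upscaled.save("page_0001_600ppi.tif", dpi=(600, 600))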

3.    Capture high quality – OUTPUT

OCR is still a costly process – from preparation to execution, costs can easily amount to between 0.05 € and 0.50 € per page. Thus you want to make sure that you derive the most possible value from it. Don’t be satisfied with plain text only! Nowadays some form of XML with (at least) basic structuring and, most importantly, positional information at the level of blocks / regions – or better still at line, word or sometimes even glyph level – should always be available after OCR. ALTO is one commonly used standard for representing such information in an XML format, but TEI or other XML-based formats can also be a good choice.

Not only does the coordinate information enable greatly enhanced search and display of search results (hit-term highlighting), there are also many further application scenarios – such as the automated generation of tables of contents, the production of eBooks, or the presentation on mobile devices – that rely heavily on structural and layout information being available from OCR processing.
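
As an illustration, here is a minimal sketch of reading word-level coordinates from an ALTO file with Python's standard library; the file name and the search term are made-up, and the attribute names follow the ALTO schema (CONTENT, HPOS, VPOS, WIDTH, HEIGHT).

    # Minimal sketch: extract word text and pixel coordinates from an ALTO file,
    # e.g. to draw hit-term highlighting boxes on the page image.
    import xml.etree.ElementTree as ET

    def alto_words(path):
        """Yield (text, hpos, vpos, width, height) for every String element."""
        for elem in ET.parse(path).iter():
            # Match the String element regardless of the ALTO namespace version.
            if elem.tag == "String" or elem.tag.endswith("}String"):
                yield (
                    elem.get("CONTENT", ""),
                    int(float(elem.get("HPOS", 0))),
                    int(float(elem.get("VPOS", 0))),
                    int(float(elem.get("WIDTH", 0))),
                    int(float(elem.get("HEIGHT", 0))),
                )

    for word, x, y, w, h in alto_words("page_0001.xml"):      # hypothetical file name
        if word.lower() == "amsterdam":                        # example search term
            print(f"hit at x={x}, y={y}, width={w}, height={h}")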

4.    Manage expectations

No matter how modern and pristine your documents are, or how advanced your scanning equipment and how carefully configured your OCR software, it is quite unrealistic to expect more than 90–95% word accuracy from automatic processing. Most of the time you will be happy to come anywhere near that range.

Note that most commercial OCR engines calculate error rates based on characters, not words. This can be very misleading, since users will want to search for words. Suppose there are 30 errors on a single page of 3,000 characters: the character error rate (30/3,000 = 1%) seems very low. But now assume those 3,000 characters boil down to only around 600 words, and that the 30 erroneous characters are spread across different words. We arrive at a much higher (5×) word error rate (30/600 = 5%). To make things worse, OCR engines typically report a “confidence score” in the output. This only means that the software believes, above a certain threshold, that it has recognized a character or word correctly or incorrectly. These estimates, conservative as they are, unfortunately often turn out not to be true. That is why the only way to derive truly reliable OCR accuracy scores is ground-truth-driven evaluation, which is expensive and cumbersome to perform.
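
The arithmetic from the example above, as a small sketch (the figures are the illustrative ones from the text, assuming every erroneous character falls in a different word):

    # Minimal sketch: the same 30 errors look very different when counted
    # per character versus per word.
    errors = 30
    characters_on_page = 3000
    words_on_page = 600            # assuming each error hits a different word

    cer = errors / characters_on_page
    wer = errors / words_on_page

    print(f"character error rate: {cer:.1%}")   # 1.0%
    print(f"word error rate:      {wer:.1%}")   # 5.0%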

Obviously all of this has implications for the quality of any service based on the OCR results. These issues must be made transparent to the organization, and should in all cases also be communicated to the end user.

5.    Exploit full text to the fullest

Once you derive full text from OCR processing, it can be the first stepping stone for a wide array of further enhancements of your digital collection. Life does not stop with (even good) OCR results!

Full text gives you the ability to exploit a multitude of tools for natural language processing (NLP) on the content. Named entity recognition, topic modelling, sentiment analysis, keyword extraction etc. are just a few of the possibilities to further refine and enrich the full text.
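
For instance, here is a minimal sketch of named entity recognition on OCR output with spaCy; the model name and the sample sentence are assumptions (for Dutch newspaper text a model such as nl_core_news_sm could be used instead).

    # Minimal sketch: named entity recognition on OCR'd text with spaCy.
    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")

    ocr_text = "The State Library Berlin and the British Library joined the Europeana Newspapers project."
    doc = nlp(ocr_text)

    for ent in doc.ents:
        print(ent.text, ent.label_)             # e.g. "the British Library" ORG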

6.    Tailor the workflow

Although tailoring is the enemy of large-scale automated processing, it can nevertheless often be worthwhile to invest some more time and adapt the OCR processing flow to the characteristics of the source material. There are highly specialized modules and engines for particular pre- and post-processing tasks, and integrating these into your workflow for a very particular subset of a collection can often yield surprising improvements in the quality of the result.

7.    Use all available resources

One of the important findings of the IMPACT project was that the use of additional language technologies can boost OCR recognition by an amount that cannot realistically be expected from even major breakthroughs in pattern recognition algorithms. Especially when dealing with historical material there is a lot of spelling variation, and it becomes extremely difficult for the OCR software to correctly recognize these old word forms. Making the OCR software aware of historical spelling by supplying it with a historical dictionary or word list can deliver dramatic improvements here. In addition, new technologies can detect valid historical spelling variants and distinguish them from common OCR errors. This makes it much quicker and easier to correct those OCR mistakes while retaining the proper historical word forms (i.e. no normalization is applied).
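
A minimal sketch of the idea (not the IMPACT tooling itself): checking OCR tokens against a historical word list to separate valid old spellings from likely OCR errors. The lexicon entries and tokens are made-up examples; a real word list would contain many thousands of attested forms.

    # Minimal sketch: distinguish valid historical spellings from probable OCR errors.
    import difflib

    historical_lexicon = {"schip", "scheepen", "vaeren", "coninck"}   # made-up entries

    def classify(token):
        """Label a token as a historical form, a probable OCR error, or unknown."""
        word = token.lower()
        if word in historical_lexicon:
            return "valid historical form"
        close = difflib.get_close_matches(word, historical_lexicon, n=1, cutoff=0.8)
        if close:
            return f"probable OCR error for '{close[0]}'"
        return "unknown"

    for token in ["coninck", "con1nck", "amsterdam"]:
        print(token, "->", classify(token))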

8.    Try out different solutions

There is a surprisingly large amount of OCR software available, both free and commercial. The Succeed project compiled information about OCR and related software tools in a huge database that you can search here.

Also quite useful in this respect are the IMPACT Framework and Demonstrator Platform – these tools allow you to test different solutions for OCR and related tasks online, or even to combine distinct tools into comprehensive document recognition workflows and compare them using samples of the material you have to process.

9.    Consult experts

All over the world people are applying, researching and sometimes re-inventing OCR technology. The IMPACT Centre of Competence provides a great entry point to that community. eMOP is another large OCR project, currently running in the US. Consult the community to find out about others who may have done projects similar to yours in the past and who can share findings or even technology.

Finally, consider visiting one of the main conferences in the field, such as ICDAR or ICPR, and look at the relevant journal publications by IAPR etc. There is also a large community of OCR and pattern recognition experts in the biosciences, e.g. around iDigBio. Hackathons, like for example the ones organized by Succeed, can provide you with hands-on experience with the tools and technologies available for OCR.

10.    Consider post-correction

When all else fails and you just can’t obtain the desired accuracy using automated processing methods, post-correction is often the only way to raise the quality of the text to a level suitable for scientific study and text mining. There are many solutions on offer for OCR post-correction, from simple-to-use crowdsourcing platforms to rather specialized tools for experts. Gamification of OCR correction has also been explored by some. And as a side effect you may also learn to interact more closely with your users and understand their needs.
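
As a very simple illustration of the rule-based end of that spectrum, here is a minimal sketch that replaces character confusions common in OCR output; the confusion table is a made-up example, and real projects would derive such patterns from ground truth or from user corrections.

    # Minimal sketch: rule-based post-correction of frequent OCR confusions.
    import re

    CONFUSIONS = [
        (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),   # digit 1 inside a word is usually 'l'
        (re.compile(r"(?<=[a-z])0(?=[a-z])"), "o"),   # digit 0 inside a word is usually 'o'
    ]

    def post_correct(text):
        """Apply each confusion pattern to an OCR'd string."""
        for pattern, replacement in CONFUSIONS:
            text = pattern.sub(replacement, text)
        return text

    print(post_correct("The qua1ity of the digitised c0llection matters."))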

With this I hope to have given you some points to take into consideration when planning your next OCR project, and I wish you much success with it. If you would like to comment on any of the points mentioned, or perhaps share your personal experience with an OCR project, we would be very happy to hear from you!

Presenting European Historic Newspapers Online

As was posted earlier on this blog, the KB participates in the European project Europeana Newspapers. In this project, we are working together with 17 other institutions (libraries, technical partners and networking partners) to make 18 million European newspaper pages available via Europeana at title level. In addition, The European Library is working on a purpose-built portal to make the newspapers available as full text as well. However, many of the libraries do not yet have OCR for their newspapers, which is why the project is working with the University of Innsbruck, CCS Content Conversion Specialists GmbH from Hamburg and the KB to enrich these pages with OCR, Optical Layout Recognition (OLR) and Named Entity Recognition (NER).

Hans-Jörg Lieder of the Berlin State Library presents the Europeana Newspapers Project at our September 2013 workshop in Amsterdam.

In June, the project had a workshop on refinement, but it was now time to discuss aggregation and presentation. This workshop took place in Amsterdam on 16 September, during The European Library Annual Event. There was a good group of people, not only from the project partners and the associated partners, but also from outside the consortium. After the project, TEL hopes to be able to also offer these institutions a chance to send in their newspapers for Europeana, so we were very happy to have them join us.

The workshop kicked off with an introduction by Marieke Willems of LIBER and Hans-Jörg Lieder of the Berlin State Library. They were followed by Markus Muhr from TEL, who introduced the aggregation plan and the schedule for the project partners. With so many partners, it can be quite difficult to find a schedule that works well and ensures everyone sends in their material on time. After the aggregation, TEL will then have to do some work on the metadata to convert it to the Europeana Data Model. Markus was followed by a presentation from Channa Veldhuijsen of the KB, who unfortunately could not be there in person. However, her elaborate presentation on usability testing provided some good insights into how to make your website the best it can be and how to find out what your users really think when they are browsing your site.

It was then time for Alastair Dunning from TEL to showcase the portal they have been preparing for Europeana Newspapers. Unfortunately, the wifi connection could not cope with so many visitors, and only some people could follow his presentation on their own devices. Nevertheless, there were some valuable feedback points which TEL will use to improve the portal. The portal is not yet publicly available, so people who missed the presentation will need to wait a bit longer to see and browse the European newspapers.

But what we can already see are some partner websites that have been online for some time. It was very interesting to see the different choices each partner made in showcasing their collection. We heard from people from the British Library, the National and University Library of Iceland, the National and University Library of Slovenia, the National Library of Luxembourg and the National Library of the Czech Republic.

Yves Mauer from the National Library of Luxembourg presenting their newspaper portal

The day ended with a lovely presentation by Dean Birkett of Europeana, who, partly using Channa’s notes, went through all the previously presented websites and offered comments on how to improve them. The videos he used in his talk are available on YouTube. His key points were:

  1. Make the type size large: 16px is the recommended size.
  2. Be careful of colours. Some online newspaper sites use red to highlight important information, but red is normally associated with warning signals and errors in the user’s mind.
  3. Use words to indicate language choices (e.g. ‘English’, ‘français’), not flags. The Spanish flag won’t necessarily be interpreted to mean ‘click here for Spanish’ if the user is from Mexico.
  4. Cut down on unnecessary text. Make it easy for users to skim (e.g. through the use of bullet points).

All in all, it was a very useful afternoon in which I learned a lot about what users want from a website. If you want to see more, all presentations can be found on the Slideshare account of Europeana Newspapers, or you can join us at one of the following events:

  • Workshop on Newspapers in Europe and the Digital Agenda. British Library, London. September 29-30th, 2014.
  • National Information Days.
    • National Library of Austria. March 25-26th, 2014.
    • National Library of France. April 3rd, 2014.
    • British Library. June 9th, 2014.

Europeana Newspapers Refinement & Aggregation Workshop

The KB participates in the Europeana Newspapers project, which started in February 2012. The project will enrich 18 million pages of digitised newspapers from all over Europe with Optical Character Recognition (OCR), Optical Layout Recognition (OLR) and Named Entity Recognition (NER) and deliver them to Europeana. The project consortium consists of 18 partners from all over Europe: some will provide (technical) support, while others will provide their digitised newspapers. The KB has two roles: we will not only deliver 2 million of our newspaper pages to Europeana, but we will also enrich our own newspapers and those of other partners with NER.

Europeana Newspapers Workshop in Belgrade

In the last few months, the project has welcomed 11 new associated partners. To make sure they can benefit as much as possible from the experiences of the project partners, the University Library of Belgrade and LIBER jointly organised a workshop on refinement and aggregation on 13 and 14 June. Here, the KB (Clemens Neudecker and I) presented the work currently being done to ensure that we will have named entities for several partners. So that the work being done in the project also benefits our direct colleagues, we were joined by someone from our Digitisation department.

The workshop started with a warm welcome in Belgrade by the director of the library, Prof. Aleksandar Jerkov. After a short introduction to the project by the project leader, Hans-Jörg Lieder from the State Library Berlin, Clemens Neudecker from the KB presented the refinement process of the project. All presentations will be shared on the project’s Slideshare account. The refinement of the newspapers has already started and is being done by the University of Innsbruck and the company CCS in Hamburg. Still, it came as a big surprise when Hans-Jörg Lieder announced a present for the director of the University Library Belgrade: the first batch of their processed newspapers!

Giving a gift of 200,000 digitised and refined newspapers to our Belgrade hosts

The day continued with an introduction to the importance of evaluating OCR and OLR, and a demonstration of the tools used for this, by Stefan Pletschacher and Christian Clausner from the University of Salford. This sparked some interesting discussions in the break-out sessions on evaluation methods in the libraries digitising their collections. For example, do you tell your service provider what you will be checking when you receive a batch? You could argue that the service provider would then only fix what you check. On the other hand, if that is what you need to reach your goal, it would save a lot of time and many rejected batches.

After a short getting-to-know-each-other session, the whole workshop party moved to the nearby Nikola Tesla Museum, where we were introduced to their newspaper clippings project. All newspaper clippings collected by Nikola Tesla are now being digitised for publication on the museum’s website. A nice tour through the museum followed, with several demonstrations (don’t worry, no one was electrocuted), and the day was concluded with a dinner in the bohemian quarter.

Breakout groups at the Belgrade Workshop

The second day of the workshop was dedicated solely to refinement. I kicked off the day with the question ‘What is a named entity?’. This sounds easy, but it can present you with some dilemmas as well. For example, a dog’s name is a name, but do you want it to be tagged as an NE? And what do you do with a title such as Romeo and Juliet? Consistency is key here, and as long as you keep your goal in mind while training your software, you should end up with the results you are looking for.

Claus Gravenhorst followed me with his presentation on OLR at CCS using docWorks, with which they will process 2 million pages. It was then again our turn with a hands-on session about the tools we’re using, which are also available on GitHub. The last session of the workshop was a collaboration between Claus Gravenhorst from CCS and Günter Mühlberger from the University of Innsbruck, who gave us a nice insight into their tools and the considerations involved when working with digitised newspapers. For example, how many categories would you need to tag every article?

Group photo from the Europeana Newspapers workshop in Belgrade

All in all, it was a very successful workshop and I hope that all participants enjoyed it as much as I did. I, for one, am happy to have spoken to so many interesting people and to have heard about new experiences from other digitisation projects. There is still much to learn from each other, and projects like Europeana Newspapers contribute to a good exchange of knowledge between libraries, ensuring our users get the best experience when browsing through these rich digital collections.