Working together to improve text digitisation techniques

2nd Succeed hackathon at the University of Alicante

Is there anyone still out there who thinks a hackathon is a malicious break-in? Far from it. It is the best way for developers and researchers to get together and work on new tools and innovations. The 2nd developers workshop / hackathon organised on 10-11 April by the Succeed Project was a case in point: bringing together people to work on new ideas and new inspiration for better OCR. The event was held in the “Claude Shannon” aula of the Department of Software and Computing Systems (DLSI) of the University of Alicante, Spain. Claude Shannon was a famous mathematician and engineer and is also known as the “father of information theory”. So it seems like a good place to have a hackathon!

Clemens explains what a hackathon is and what we hope to achieve with it for Succeed.

As last year, we provided a wiki up front with some information about possible topics to work on, as well as a number of tools and data that participants could experiment with before and during the event. Unfortunately there was an unexpectedly high number of no-shows this time – we try to keep these events free and open to everyone, but may have to think about charging at least a no-show fee in the future, as places are usually limited. Or did those hackers simply have to stay home to fix the Heartbleed bug on their servers? We will probably never find out.

Collaboration, open source tools, open solutions

Nevertheless, there was a large enough group of programmers and researchers from Germany, Poland, the Netherlands and various parts of Spain eager to immerse themselves deeply into a diverse list of topics. Already in the introduction we agreed to work on open tools and solutions, and quickly identified some areas in which open source tool support for text digitisation is still lacking (see below). Actually, one of the first things we did was to set up a local git repository, and people were pushing code samples, prototypes and interesting projects to share with the group during both days.

What’s the status of open source OCR?

Accordingly, Jesús Dominguez Muriel from Digibís (the company that also made http://www.digibis.com/dpla-europeana/) started an investigation into open source OCR tools and frameworks. He made a really detailed analysis of the status of open source OCR, which you can find here. Thanks a lot for that summary, Jesús! At the end of his presentation, Jesús also suggested an “algorithm wikipedia” – I guess something similar to RosettaCode, but specifically for OCR. This would indeed be very useful for sharing algorithms as well as implementations, and would prevent reinventing (or reimplementing) the wheel. Something for our new OCRpedia, perhaps?

A method for assessing OCR quality based on ngrams

As it turned out on the second day, a very promising idea was to use ngrams for assessing the quality of an OCR’ed text, without the need for ground truth. Well, in fact you do still need some correct text to create the ngram model, but one can use texts from e.g. Project Gutenberg or aspell for that. Two groups started to work on this: while Willem Jan Faber from the KB experimented with a simple Python script for that purpose, the group of Rafael Carrasco, Sebastian Kirch and Tomasz Parkola decided to implement this as a new feature in the Java ocrevalUAtion tool (check the work-in-progress “wip” branch).
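
To give an impression of the idea, here is a minimal sketch in Python. It is not the script written at the hackathon, nor the ocrevalUAtion implementation: the function names, the choice of character trigrams, the add-one smoothing and the tiny reference text are all assumptions made for illustration. The model is built from text that is known to be correct, and an OCR’ed text is then scored by how familiar its ngrams look.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a lowercased text."""
    text = " ".join(text.lower().split())  # normalise whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_model(reference_text, n=3):
    """Count n-grams in text that is known to be (mostly) correct."""
    return Counter(char_ngrams(reference_text, n))


def quality_score(ocr_text, model, n=3):
    """Average log-probability per n-gram, using add-one smoothing.

    Values closer to zero mean the text looks more like the reference.
    """
    grams = char_ngrams(ocr_text, n)
    if not grams:
        return float("-inf")
    total = sum(model.values())
    vocab = len(model) + 1  # the extra slot reserves mass for unseen n-grams
    log_prob = sum(math.log((model[g] + 1) / (total + vocab)) for g in grams)
    return log_prob / len(grams)


# Toy reference text (an assumption); in practice one would use a large
# corpus, e.g. texts from Project Gutenberg.
reference = "the quick brown fox jumps over the lazy dog " * 20
model = train_model(reference)

clean = quality_score("the lazy dog jumps over the quick fox", model)
noisy = quality_score("tne 1azy d0g jurnps ovcr tbe qu1ck f0x", model)
print(clean > noisy)
```

The score is only meaningful relative to other texts scored with the same model, so in practice it would be used to rank pages or to flag those that fall below a threshold chosen on sample data.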

Jesús in the front, Rafael, Sebastian and Tomasz discussing ngrams in the back.

Aligning text and segmentation results

Another very promising development was started by Antonio Corbi from the University of Alicante. He worked on a tool to align plain text and segmentation results. The idea is to first identify all the lines in a document, segment them into words and eventually individual characters, and then align the character outlines with the text in the ground truth. This would allow (among other things) creating a large corpus of training material for an OCR classifier based on the more than 50,000 images with ground truth produced in the IMPACT Project, for which correct text is available, but segmentation could only be done on the level of regions. Another great feature of Antonio’s tool is that while he uses D as a programming language, he also makes use of GTK, which has the nice effect that his tool not only works on the desktop, but also as a web application in a browser.

OCR is complicated, but don’t worry – we’re on it!

Gustavo Candela works for the Biblioteca Virtual Miguel de Cervantes, the largest Digital Library in the Spanish speaking world. Usually he is busy with Linked Data and things like FRBR, so he was happy to expand his knowledge and learn about the various processes involved in OCR and what tools and standards are commonly used. His findings: there is a lot more complexity involved in OCR than appears at first sight. And again, for some problems it would be good to have more open source tool support.

In fact, at the same time as the hackathon, the ‘Mining Digital Repositories’ conference was going on at the KB in The Hague, where the problem of bad OCR was discussed from a scholarly perspective. There, too, the need for more open technologies and methods was apparent.

Open source border detection

One of the many technologies for text digitisation that are available in the IMPACT Centre of Competence for image pre-processing is Border Removal. This technique is typically applied to remove black borders in a digital image that have been captured while scanning a document. The borders don’t contain any information, yet they take up expensive storage space, so removing the borders without removing any other relevant information from a scanned document page is a desirable thing to do. However, there is no simple open source tool or implementation for doing that at the moment. So Daniel Torregrosa from the University of Alicante started to research the topic. After some quick experiments with tools like ImageMagick and unpaper, he eventually decided to work on his own algorithm. You can find the source here. Besides, he probably earns the award for the best slide in a presentation…showing us two black pixels on a white background!
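
To illustrate the basic problem, here is a deliberately naive sketch in Python. It is not Daniel’s algorithm: the function names, the thresholds and the representation of the page as a list of rows of grey values (0 = black, 255 = white) are assumptions for illustration only. It simply trims rows and columns from each edge for as long as they are almost entirely dark.

```python
def mostly_dark(pixels, dark_below=64, min_fraction=0.9):
    """True if at least min_fraction of the grey values are below dark_below."""
    pixels = list(pixels)
    dark = sum(1 for p in pixels if p < dark_below)
    return dark >= min_fraction * len(pixels)


def content_box(image):
    """Return (top, bottom, left, right) after trimming dark edges.

    bottom and right are exclusive, so the box can be used for slicing.
    """
    top, bottom = 0, len(image)
    left, right = 0, len(image[0])
    while top < bottom and mostly_dark(image[top]):
        top += 1
    while bottom > top and mostly_dark(image[bottom - 1]):
        bottom -= 1
    while left < right and mostly_dark(row[left] for row in image[top:bottom]):
        left += 1
    while right > left and mostly_dark(row[right - 1] for row in image[top:bottom]):
        right -= 1
    return top, bottom, left, right


def crop(image, box):
    """Cut the image down to the given box."""
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]


W, B = 255, 0
# A tiny "scan": a black border around a white page with two black pixels.
page = [
    [B, B, B, B, B, B, B, B],
    [B, W, W, W, W, W, B, B],
    [B, W, B, W, W, W, B, B],
    [B, W, W, W, B, W, B, B],
    [B, W, W, W, W, W, B, B],
    [B, B, B, B, B, B, B, B],
]
box = content_box(page)
cropped = crop(page, box)
print(box, len(cropped), len(cropped[0]))
```

Real scans are much harder than this toy case: borders are rarely perfectly black or straight, pages are skewed, and dark illustrations near the edge must not be cut off, which is presumably why a dedicated algorithm is needed.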

A great venue

All in all, I think we can really be quite happy with these results. And indeed the University of Alicante also did a great job hosting us – there was an excellent internet connection available via cable and wifi, plenty of space and tables to discuss in groups and we were distant enough from the classrooms not to be disturbed by the students or vice versa. Also at any time there was excellent and light Spanish food – Gazpacho, Couscous with vegetables, assorted Montaditos, fresh fruit…nowadays you won’t make hackers happy with just pizza anymore! Of course there were also ice-cooled drinks and hot coffee, and rumours spread that there were also some (alcohol-free?) beers in the cooler, but (un)fortunately there is no more documentary evidence of that…

To be continued!

If you want to try out any of the software yourself, just visit our github and have a go! Make sure to also take a look at the videos that were made with participants Jesús, Sebastian and Tomasz, explaining their intentions and expectations for the hackathon. And at the next hackathon, maybe we can welcome you too amongst the participants?

Named entity recognition for digitised historical newspapers

The refinement partners in the Europeana Newspapers project will produce the astonishing amount of 10 million pages of full-text from historical newspapers from all over Europe. What could be done to further enrich that full-text?

The KB National Library of the Netherlands has been investigating named entity recognition (NER) and linked data technologies for a while now in projects such as IMPACT and STITCH+, and we felt it was about time to approach this on a production scale. So we decided to produce (open source) software, trained models as well as raw training data for NER software applications specifically for digitised historical newspapers as part of the project.

What is named entity recognition (NER)?

Named entity recognition is the process of identifying and classifying entities such as persons, locations and organisations in the full-text in order to enhance searchability. There are basically two types of approaches, a statistical and a rule based one. Rule based systems rely mostly on grammar rules defined by linguists, while statistical systems require large amounts of manually produced training data that they can learn from. While both approaches have their benefits and drawbacks, we decided to go for a statistical tool, the CRFNER system from Stanford University. In comparison, this software proved to be the most reliable, and it is supported by an active user community. Stanford University has an online demo where you can try it out: http://nlp.stanford.edu:8080/ner/.

Example of Wikipedia article for Albert Einstein, tagged with the Stanford NER tool

Requirements & challenges

There are some particular requirements and challenges when applying these techniques to digital historical newspapers. Since full-text for more than 10 million pages will be produced in the project, one requirement for our NER tool was that it should be able to process large amounts of text in a rather short time. This is possible with the Stanford tool, which as of version 1.2.8 is “thread-safe”, i.e. it can run in parallel on a multi-core machine. Another requirement was to preserve the information about where on a page a named entity has been detected – based on coordinates. This is particularly important for newspapers: instead of having to go through all the articles on a newspaper page to find the named entity, it can be highlighted so that one can easily spot it even on very dense pages.

Then there are also challenges of course – mainly due to the quality of the OCR and the historical spelling that is found in many of these old newspapers. In the course of 2014 we will thus collaborate with the Dutch Institute for Lexicology (INL), who have produced modules which can be used in a pre-processing step before the Stanford system and that can to some extent mitigate problems caused by low quality of the full-text or the appearance of historical spelling variants.

The Europeana Newspapers NER workflow

For Europeana Newspapers, we decided to focus on three languages: Dutch, French and German. The content in these three languages makes up for about half of the newspaper pages that will become available through Europeana Newspapers. For the French materials, we cooperate with LIP6-ACASA, for Dutch again with INL. The workflow goes like this:

  1. We receive OCR results in ALTO format (or METS/MPEG21-DIDL containers)
  2. We process the OCR with our NER software to derive a pre-tagged corpus
  3. We upload the pre-tagged corpus into an online Attestation Tool (provided by INL)
  4. Within the Attestation Tool, the libraries make corrections and add tags until we arrive at a “gold corpus”, i.e. all named entities on the pages have been manually marked
  5. We train our NER software based on the gold corpus derived in step (4)
  6. We process the OCR again with our NER software trained on the gold corpus
  7. We repeat steps (2) – (6) until the results of the tagging won’t improve any further

    Screenshot of the NER Attestation Tool

Preliminary results

Named entity recognition is typically evaluated by means of Precision, Recall and F-measure. Precision gives an account of how many of the named entities that the software found are in fact named entities of the correct type, while Recall states how many of the total number of named entities present have been detected by the software. The F-measure then combines both scores into a single value between 0 and 1, typically their harmonic mean. Here are our (preliminary) results for Dutch so far:

Dutch        Persons   Locations   Organizations
Precision    0.940     0.950       0.942
Recall       0.588     0.760       0.559
F-measure    0.689     0.838       0.671

These figures have been derived from a k-fold cross-evaluation of 25 out of 100 manually tagged pages of Dutch newspapers from the KB. The results confirm the fact that the Stanford system tends to be a bit “conservative”, i.e. it has a somewhat lower recall for the benefit of higher precision, which is also what we wanted.
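
For readers who want to see how these measures are computed, here is a small sketch in Python. The counts in the example are hypothetical and are not taken from our evaluation; the function name is also just an assumption for illustration. Keep in mind that when scores are averaged over several folds, the averaged F-measure need not equal the F-measure computed from the averaged precision and recall.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and the balanced F-measure (F1).

    true_positives:  entities found by the software with the correct type
    false_positives: entities reported by the software that are wrong
    false_negatives: entities present in the text that the software missed
    """
    found = true_positives + false_positives
    present = true_positives + false_negatives
    precision = true_positives / found if found else 0.0
    recall = true_positives / present if present else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 100 entities on the pages, 64 reported, 60 of them correct.
p, r, f = precision_recall_f1(true_positives=60, false_positives=4, false_negatives=40)
print(round(p, 3), round(r, 3), round(f, 3))
```

A “conservative” tagger in the sense described above is one that produces few false positives at the price of more false negatives, which is the pattern of high precision and lower recall.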

Conclusion and outlook

Within this final year of the project we are looking forward to seeing how far we can still boost these figures by adopting the extra modules from INL, and what results we can achieve on the French and German newspapers. We will also investigate software for linking the named entities to additional online resource descriptions and authority files such as DBPedia or VIAF to create Linked Data. The crucial question will be how well we can disambiguate the named entities and find a correct match in these resources. Besides, if there is time, we would also want to experiment with NER in other languages, such as Serbian or Latvian. And, if all goes well, you might already hear more about this at the upcoming IFLA newspapers conference “Digital transformation and the changing role of news media in the 21st Century“.

The Elephant Returns to the Library…with a Pig!

Hadoop Driven Digital Preservation Hackathon in Vienna

Organised by: SCAPE project and Open Planets Foundation

By Clemens Neudecker & René van der Ark

These days, libraries are no longer exclusively collecting physical books and publications, but are also investing in digitisation on a massive scale. At the same time they are harvesting born-digital publications and websites alike. The problem of how to preserve all that digital information in the (very) long run has received a lot of attention. So called “preservation risks” pose a severe threat to the long-term availability of these digital assets. To name but a few: bitrot, format obsolescence and lack of open tools and frameworks.

To tackle these problems now and in the future, beefing up server performance and storage space is no longer a viable option. The growth of data is simply too fast to keep up. Therefore, in recent years, computer scientists tend to opt for scaling out, as opposed to scaling up. This is where buzzwords like the cloud and big data come in. To demystify: instead of scaling up with expensive hardware, scale out by setting up a cluster of cheap machines and doing distributed parallel calculations on them. Scaling out is now often the solution for fast processing of big data, but the principle might just as well be applied to safe (redundant) storage.

The European research project SCAPE (SCAlable Preservation Environments) has been set up in order to help the GLAM community lift their preservation technology to the big data needs of the 21st century digital library. One of the key ideas in SCAPE is to leverage big data technologies like Hadoop, and to apply them in order to scale out the preservation tools and technologies currently in use.

The hackathon on “Hadoop Driven Digital Preservation” at the Austrian National Library (ONB) in Vienna therefore provided a great opportunity for the KB to further its understanding of Hadoop and all its applications. Especially because Hadoop guru Jimmy Lin from the University of Maryland joined us not only as keynote speaker, but also as teacher and co-hacker. Jimmy Lin has past experience working for Twitter and Cloudera and shared many insights into using Hadoop on a production scale, several orders of magnitude greater than what libraries are currently struggling with. One of his recent projects was to implement a web archive browser on Hadoop’s HBase called WarcBase. A great initiative which might just turn into the next generation Wayback Machine.

Besides, Vienna in December is always worth a visit!

At the event

The event started out with an introduction to the two use cases that were provided upfront by the organisers:

1) Web-Archiving: File Format Identification/Characterisation

2) Digital Books: Quality Assurance, text mining (OCR Quality)

However, participants were free to dive into either of these issues, continue developing their own projects or just investigate completely fresh ideas. So it was not a big surprise when soon after Jimmy’s first presentation introducing Pig as an alternative to writing MapReduce jobs “for lazy people”, many of the participants decided to work on creating small Pig scripts for various preservation related tools.

The nice thing about Pig is that it somewhat resembles common query languages like SQL. This makes it quite readable for most IT-savvy people. It is also extensible with custom functions, which can be implemented in Java. Writing some of these user defined functions (UDFs) is what we decided to focus on.

This event distinguished itself by the great amount of collaboration. As code reuse was greatly encouraged, we decided to fork Jimmy Lin’s WarcBase project on github and extend it with UDFs for language detection and MIME-type detection using Apache Tika. The UDFs we wrote were then in turn used by many of the other participants’ projects.

We used the rest of the time to get more familiar with writing Pig scripts to apply to actual ARC/WARC files. While unfortunately there was a lack of publicly available ARC/WARC files for testing our MIME-type and language detection UDFs, we were lucky that colleague Per Møldrup-Dalum from the SB in Aarhus had a cluster and a large collection from the Danish web archive available for us to test these on:

HadoopVersion     PigVersion        UserId   StartedAt   FinishedAt
2.0.0-cdh4.0.1    0.9.2-cdh4.0.1    scape    14:19:53    14:22:22

JobId:    job_201308301115_0151

Maps    Reduces
172     19

MaxMapTime    MinMapTime    AvgMapTime
106           18            77

MaxReduceTime    MinReduceTime    AvgReduceTime
19               17               19

Alias              Feature
a,b,c,d,e,f,raw    GROUP_BY,COMBINER

Input(s):

Successfully read 547093 records (3985472612 bytes) from:
“hdfs://zone1.isilon.sblokalnet/user/scape/arc-files/97-9-2005*”

Output(s):

Successfully stored 90 records (1613 bytes) in:
“hdfs://zone1.isilon.sblokalnet/user/scape/clemens-rene-1”

Counters:

Total records written : 90
Total bytes written : 1613
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Honestly, we were a bit surprised ourselves when we got these numbers from Per – did our simple Pig script of 7 lines really just process the almost 550,000 ARC/WARC records in only 2:29 minutes? Indeed it did!
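
The Pig script itself is not reproduced here, but the job statistics above show that it used a GROUP_BY with a combiner, i.e. it grouped records and aggregated them per group. As a rough illustration of that pattern, here is a sketch in plain Python. The record layout (URL plus MIME type), the function names and the sample data are assumptions; on Hadoop the “map” step would run in parallel on many machines, each working on its own part of the input.

```python
from collections import Counter
from itertools import islice


def chunks(iterable, size):
    """Split the input into lists of at most `size` records."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def map_chunk(records):
    """'Map' (with combiner): count MIME types within one chunk of records."""
    return Counter(mime for _url, mime in records)


def reduce_counts(partial_counts):
    """'Reduce': merge the per-chunk counts into one overall total."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


# Hypothetical records as (URL, detected MIME type) pairs.
records = [
    ("http://example.org/", "text/html"),
    ("http://example.org/logo.png", "image/png"),
    ("http://example.org/about", "text/html"),
    ("http://example.org/paper.pdf", "application/pdf"),
    ("http://example.org/contact", "text/html"),
]
totals = reduce_counts(map_chunk(chunk) for chunk in chunks(records, 2))
print(totals.most_common())
```

Because each chunk is counted independently and only the small per-chunk totals have to be merged, this kind of job scales out well, which fits the short run time reported above.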

To learn more about the various results and outcomes of the event, make sure to check the blog post by Sven Schlarb. It must be mentioned that the colleagues from the ONB and OPF did an outstanding job in terms of event preparation – next to two real-life use cases, they also provided a virtual machine with a pseudo-distributed Hadoop and some more helpful tools from the Hadoop ecosystem that could be used for experimentation. In fact, there was even a cluster with some data ready to execute MapReduce jobs against and really test out how well they would scale. Thanks again!

Outlook

While the KB has been a member of PLANETS, a predecessor to SCAPE, as well as a member of the OPF and the SCAPE project, so far we have only had little time to experiment with Hadoop in our library.

Currently we are looking into using Hadoop for migrating around 150 TB of TIF images from the Metamorfoze Programme to the JPEG2000 format. Following the example of the British Library, we started experimenting with our own implementation of a TIF → JP2 workflow using Hadoop. Will Palmer from the British Library Digital Preservation team has already successfully built such a workflow and published it on github.

However, the TIF → JP2 migration is just about the hardest scenario to optimally implement on Hadoop. The encoding algorithm is very complex and would actually need to be rewritten entirely for parallel processing to make use of the power of Hadoop. Nevertheless, we believe that Hadoop has serious potential – so the KB is also investigating some more use cases for Hadoop, currently in at least three different areas:

1. Webarchiving

The KB is one of the partners in the WebART project, where together with researchers from the UvA and the CWI new tools and methods to maximize the web archive’s utility for research are being created. Hadoop and HBase are also amongst the applications used here. Together with CWI and UvA the KB hopes to start up a new project for establishing an instance of WarcBase running on top of a scalable HBase cluster – this would really enable new, scalable ways of researching the Dutch web history. Colleague Thaer Sammar from CWI also participated in the hackathon, and the results of his efforts were quite convincing. Also, Jimmy is again one of the collaborators – we keep our fingers crossed!

2. Content enrichment

In the Europeana Newspapers project, the KB is currently creating a framework for named entity recognition (NER) on historical newspapers from all over Europe. Around 10 million pages of full-text will be created in the project, and the KB will provide named entity recognition software for materials in Dutch, German and French. It is expected that at least 2 million pages will be processed with NER, but the total collection of digitised newspapers at the KB is 8.5 million pages. Plus there are other collections (books, journals, radio bulletins) for which OCR exists. The KB aims to have all the entities in its digital collections detected, disambiguated and linked within the next 5 years. Given that all of this is text, and the software for NER is in Java, it would be interesting to see how far Hadoop could be used to scale out the processing of all this data.

3. Business processes

From the organization, we are aware of a few particular scalability issues with some of the business processes, such as:

  • Generating business reports quickly: e.g. counting all the KB’s newspapers per publisher is now a painstakingly slow and somewhat unreliable process.
  • Acting as a data provider: harvesting the newspapers’ metadata sequentially takes a week now, and will only take longer when the collection is expanded. Harvesting records in parallel (16 requests / second) created serious stress on the current middleware and even came close to crashing it.
  • Parallelizing (pre-) ingest processes, like metadata validation, file characterization and checksum validation.

Finally, we have also looked at HDFS for scalable, but durable storage. However, for preservation purposes storing files on the Hadoop cluster might introduce some risk, because the files would be partitioned and scattered across the cluster. Then again, the concept of scaling out can still be useful, for example to:

  • prevent bitrot: by replicating files and using agents which repair corrupt replicas on a regular basis;
  • apply file migrations: here processing data on one machine in sequential batches is again not viable in the long run.
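
The first of these ideas can be sketched in a few lines of Python. This is only an illustration of the principle, not a description of how HDFS or any preservation system at the KB works: the function names, the dictionary of replicas and the node names are assumptions. Each replica is checked against a checksum, and corrupt copies are overwritten with a healthy one.

```python
import hashlib
from collections import Counter


def sha256(data):
    """Hex digest used as the fixity checksum of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def repair_replicas(replicas, expected_digest=None):
    """Replace corrupt replicas with a copy of a healthy one.

    replicas: dict mapping a (hypothetical) storage node name to the bytes
    stored there. expected_digest: checksum recorded at ingest; if it is
    missing, fall back to the digest most replicas agree on.
    Returns the list of node names that were repaired.
    """
    digests = {node: sha256(data) for node, data in replicas.items()}
    if expected_digest is None:
        expected_digest = Counter(digests.values()).most_common(1)[0][0]
    healthy = [node for node, digest in digests.items() if digest == expected_digest]
    if not healthy:
        raise ValueError("no healthy replica left to repair from")
    good_copy = replicas[healthy[0]]
    repaired = [node for node in replicas if digests[node] != expected_digest]
    for node in repaired:
        replicas[node] = good_copy
    return repaired


original = b"TIFF image bytes ..."
replicas = {
    "node-a": original,
    "node-b": b"TIFF image bytez ...",  # one flipped character simulates bitrot
    "node-c": original,
}
repaired = repair_replicas(replicas, expected_digest=sha256(original))
print(repaired)
```

Relying on a checksum recorded at ingest is safer than a majority vote, since the vote fails as soon as most replicas are damaged in the same way.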

As you see, while there is no Hadoop cluster running in the KB (yet?), it seems there are sufficient ideas and use cases for continuing to work with these technologies, and to build up expertise with Hadoop, HBase etc. We are also very interested in exchanging ideas and use cases with other libraries that are already using Hadoop in production. Last but not least, Hadoop Summit Europe will be held in Amsterdam again next year – so, we’ll see you there perhaps?

10 Tips for making your OCR project succeed

(reblogged from http://www.digitisation.eu/community/blog/article/article/10-tips-for-making-your-ocr-project-succeed/)

This year in November, it has been exactly 10 years that I have been more or less involved with digital libraries and OCR. In fact, my first encounter with OCR even predates the digital library: during my student days, one of my fellow students was blind, and I was helping him out with his studies by scanning and OCR-ing the papers he needed, so their contents could be read out to him using Text2Speech software or printed on a braille display. Looking back, OCR technology has evolved significantly in many areas since then. Projects like MetaE and IMPACT have greatly improved the capabilities of OCR technology to recognize historical fonts, and open source tools such as Google’s Tesseract or those offered by the IMPACT Centre of Competence are getting closer and closer to the functionalities and success rates offered by commercial products.

Accordingly, I would like to take this opportunity to present you some thoughts and recommendations that I’ve derived from my personal experience of 10+ years with OCR processing.

A final caveat: while this is a very interesting discussion, I will not say a single word here about whether to perform OCR as an in-house activity or via out-sourcing. My general assumption is that the considerations below can provide useful information for both scenarios.

1.    Know your material

The more you know about the material / collection you are aiming to OCR, the better. Some characteristics are essential for the configuration of the OCR, e.g. the language of a document and the fonts (Antiqua, Gothic, Cyrillic, etc.) present. Such information is typically not available in library catalogues, yet sending documents in French to an OCR engine configured to recognize English will yield results just as poor as trying to OCR a Gothic typeface with Antiqua settings.

Fortunately there are some helpful tools available – e.g. Apache Tika can detect the language of a document quite reliably. You may consider running such or similar characterization software in a pre-processing step to gather additional information about the content for a more fine-grained configuration of the OCR software.

Other features whose presence and frequency could influence your OCR setup are: tables and illustrations, paragraphs with rotated text, handwritten annotations, foldouts.

2.    Capture high quality – INPUT

Once you are ready to proceed to the image capture step it is important to think about how to set this up. While recent experiments have shown that (on simple documents) there is no apparent loss in recognition quality from using e.g. compressed JPEG images for OCR, my recommendation still remains to scan with the highest optical resolution (typically 300 or 400 ppi) and store the result in an uncompressed or losslessly compressed format like TIFF or PNG (or even the RAW data directly from the scanner).

While this may result in huge files and storage costs (btw, did you know that the cost per GB of hard drive space drops by 48% every year?), keep in mind that any form of post-processing or lossy compression essentially reduces the amount of information available in the image for subsequent processing – and it turns out that OCR engines are becoming more and more sophisticated in using this information (e.g. colour) to improve recognition. However, once gone, this information can never be retrieved again without rescanning. If you binarize (=convert to black-and-white) your images immediately after scanning, you won’t be able to leverage the benefits of the next-generation OCR system that requires greyscale or colour documents.

It may also be worthwhile mentioning that while this has never been made very explicit, the classifiers in many OCR engines are optimized for an optical resolution of 300 ppi, and deliver the best recognition rates with documents in that particular resolution. Only in the case of very small characters (as e.g. found on large newspaper pages) can it make sense to scale the image up to 600 ppi for better OCR results.

3.    Capture high quality – OUTPUT

OCR is still a costly process – from preparation to execution, costs can easily amount to between .5 up to .50 € per page. Thus you want to make sure that you derive the most possible value from it. Don’t be satisfied with plain text only! Nowadays some form of XML with (at least) basic structuring and most importantly positional information on the level of blocks / regions, or even better line and word or sometimes even glyph level, should always be available after OCR. ALTO is one commonly used standard for representing such information in an XML format, but also TEI or other XML-based formats can be a good choice.

Not only does the coordinate information enable greatly enhanced search and display of search results (hit term highlighting), there are also many further application scenarios such as the automated generation of table of contents, the production of eBooks, the presentation on mobile devices etc. that rely heavily on structural and layout information being available from OCR processing.
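
To show what such positional information makes possible, here is a minimal Python sketch that looks up the coordinates of a search term for hit highlighting. The sample is a heavily simplified ALTO-like fragment made up for this example; real ALTO files declare an XML namespace and contain much more detail, and the function name and return format are assumptions. The element and attribute names used (String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) follow the ALTO standard.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample for illustration only.
ALTO_SAMPLE = """<alto>
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1">
            <String CONTENT="Albert" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Einstein" HPOS="300" VPOS="200" WIDTH="230" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def find_hits(alto_xml, term):
    """Return (x, y, width, height) boxes of all words matching the term."""
    root = ET.fromstring(alto_xml)
    hits = []
    for string in root.iter("String"):
        if string.get("CONTENT", "").lower() == term.lower():
            box = tuple(
                int(float(string.get(name)))
                for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
            hits.append(box)
    return hits


hits = find_hits(ALTO_SAMPLE, "einstein")
print(hits)
```

A viewer can draw a rectangle at each returned box on top of the page image, which is exactly what plain text output cannot offer.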

4.    Manage expectations

No matter how modern and in pristine condition your documents are, or whether you use the most advanced scanning equipment and highly configured OCR software, it is quite unrealistic to expect anything more than 90 – 95 % word accuracy from automatic processing. Most of the time though you will be happy to even come anywhere near that range.

Note that most commercial OCR engines calculate error rates based on characters and not words. This can be very misleading, since users will want to search for words. Given there are only 30 errors across a single page with 3000 characters, the character error rate (30/3000 = 1%) seems exceptionally low. But now assume the 3000 characters boil down to only around 600 words – and the 30 erroneous characters are well distributed across different words. We arrive at an actually much higher (5x) error rate (30/600 = 5%). To make things worse, OCR engines typically report a “confidence score” in the output. This however only means that the software believes, with a certain level of confidence, that it has recognized a character or word correctly or incorrectly. These “assumptions”, although conservative, are unfortunately often found not to be true. That is why the only possible way to derive absolutely reliable OCR accuracy scores is by the use of ground truth-driven evaluation, which is expensive and cumbersome to perform.
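
The arithmetic behind the character versus word comparison can be written down as a short Python sketch. The variable and function names are just for illustration, and the word-level figure is the worst case in which every erroneous character falls into a different word, as assumed in the example above.

```python
def error_rate(errors, units):
    """Fraction of units (characters or words) that are affected by errors."""
    return errors / units


characters = 3000
words = 600
errors = 30

char_error_rate = error_rate(errors, characters)  # 30 wrong characters out of 3000
word_error_rate = error_rate(errors, words)       # worst case: 30 wrong words out of 600
factor = characters / words                       # how much worse the word-level view is

print(char_error_rate, word_error_rate, factor)
```

In other words, one wrong character in a hundred can translate into one unsearchable word in twenty, so it always pays to ask which unit a reported accuracy figure refers to.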

Obviously all of this has implications on the quality of any service based on the OCR result. These issues must be made transparent to the organization, and should in all cases also be communicated to the end user.

5.    Exploit full text to the fullest

Once you derive full text from OCR processing, it can be the first stepping stone for a wide array of further enhancements of your digital collection. Life does not stop with (even good) OCR results!

Full text gives you the ability to exploit a multitude of tools for natural language processing (NLP) on the content. Named entity recognition, topic modelling, sentiment analysis, keyword extraction etc. are just a few of the possibilities to further refine and enrich the full text.

6.    Tailor the workflow

Although tailoring is the enemy of large-scale automated processing, it can nevertheless often be worthwhile to invest some more time and tailor the OCR processing flow to the characteristics of the source material. There are highly specialized modules and engines for particular pre- and post-processing tasks, and integrating these with your workflow for a very particular subset of a collection can often yield surprising improvements in the quality of the result.

7.    Use all available resources

One of the important findings of the IMPACT project was that the use of additional language technologies can boost OCR recognition by an amount that cannot realistically be expected from even major breakthroughs in pattern recognition algorithms. Especially in dealing with historical material there is a lot of spelling variation, and it gets extremely difficult for the OCR software to correctly detect these old words. Making the OCR software aware of historical spelling by supplying it with a historical dictionary or word list can deliver dramatic improvements here. In addition, new technologies can detect valid historical spelling variants and distinguish them from common OCR errors. This makes it much quicker and easier to correct those OCR mistakes while retaining the proper historical word forms (i.e. no normalization is applied).

8.    Try out different solutions

There is a surprisingly large number of OCR software packages available, both free and commercial. The Succeed project compiled information about all OCR and related software tools in a huge database that you can search here.

Also quite useful in this are the IMPACT Framework and Demonstrator Platform – these tools allow you to test different solutions for OCR and related tasks online, or even combine distinct tools into comprehensive document recognition workflows and compare those using samples of the material you have to process.

9.    Consult experts

All over the world people are applying, researching and sometimes re-inventing OCR technology. The IMPACT Centre of Competence provides a great entry point to that community. eMOP is another large OCR project currently run in the US. Consult with the community to find out about others who may have done projects similar to yours in the past and who can share findings or even technology.

Finally, consider visiting one of the main conferences in the field, such as ICDAR or ICPR, and look at the relevant journal publications by IAPR etc. There is also a large community of OCR and pattern recognition experts in the Biosciences, e.g. in iDigBio. Hackathons like the ones organized by Succeed can provide you with hands-on experience with the tools and technologies available for OCR.

10.    Consider post-correction

When all other things fail and you just can’t obtain the desired accuracy using automated processing methods, post-correction is often the only possible way to increase the quality of the text to a level suitable for scientific study and text mining. Many solutions for OCR post-correction are on offer, from simple-to-use crowdsourcing efforts to rather specialized tools for experts. Gamification of OCR correction has also been explored by some. And as a side effect you may also learn to interact more closely with your users and understand their needs.

With this I hope to have given you some points to take into consideration when planning your next OCR project and wish you much success in doing so. If you would like to comment on any of the points mentioned or maybe share your personal experience with an OCR project, we would be very happy to hear from you!

1st Succeed hackathon @ KB

Throughout recent weeks, rumors spread at KB National Library of the Netherlands that there would be a party of programmers coming to the library to participate in a so-called “hackathon”. In the beginning, especially the IT department was rather curious: will we have to expect port scans being done from within the National Library’s network? Do we need to apply special security measures? Fortunately, none of that was necessary.

A “hackathon” is nothing to be afraid of, normally. On the contrary: the informal gatherings of software developers to work collaboratively on creating and improving new or existing software tools and/or data have emerged as a prominent pattern in recent years – in particular the hack4Europe series of hack days that is organized by Europeana has shown that this model can also be successfully applied in the context of cultural heritage digitization.

After that was sorted, a network switch with static IP addresses was deployed by the facilities department of the KB, thereby ensuring that participants of the event had a fast and robust internet connection at all times and allowing access to the public parts of the internet and the restricted research infrastructure of the KB at the same time – which received immediate praise from the hackers. Well done, KB!

So when the software developers from Austria, England, France, Poland, Spain and the Netherlands gathered at the KB last Thursday, everyone already knew they were indeed here to collaboratively work on one of the European projects the KB is involved in: the Succeed project. The project had called in software developers from all over Europe to participate in the 1st Succeed hackathon to work on interoperability of tools and workflows for text digitization.

There was a good mix of people from the digitization as well as digital preservation communities, with some additional Taverna expertise tossed in. While about half of the participants had participated in either Planets, IMPACT or SCAPE, the other half of them were new to the field and eager to learn about the outcomes of these projects and how Succeed will address them.

And so after some introduction followed by coffee and fruit, the 15 participants immersed themselves straight away in the various topics that were suggested prior to the event as needing attention. And indeed, the results that were presented by the various groups after 1.5 days (but only 8 hours of effective working time) were pretty impressive…

Hackers at work @ KB Succeed hackathon

The developers from INL were able to integrate some of the servlets they created in IMPACT and Namescape with the interoperability-framework – although some bugs were also uncovered while doing so. They will be fixed asap, rest assured! Also, with the help of the PSNC digital libraries team, Bob and Jesse were able to create a small training set for Tesseract, outperforming the standard dictionary despite some problems that were found in training Tesseract version 3.02. Fortunately it was possible to apply the training to version 3.0 and then run the generated classifier in Tesseract version 3.02, which is the current stable(?) release.

Even better: the colleagues from Poznań (who have a track record of successful participation at hackathons) had already done some training with Tesseract earlier and developed some supporting tools for it. Piotr quickly created a tool description for the “cutouts” tool that automatically creates binarized clippings of characters from a source image. On the second day another feature of the cutouts application was added: creating an artificial image suitable for training Tesseract from the binarized character clippings. Time eventually ran out while wrapping the two operations in a Taverna workflow, but given that only little work remained, we look forward to seeing the Taverna workflow for Tesseract training become available shortly! Certainly this is also of interest to the eMOP project in the US, in which the KB is a partner as well.

Meanwhile, another colleague from Poznań was investigating the process of creating packages for Debian-based Linux operating systems from existing (open source) tools. And despite using a laptop with OSX Mountain Lion, Tomasz managed to present a valid Debian package (including even icon and man page) – kudos! Certainly the help of Carl from the Open Planets Foundation was also partly to blame for that… Next steps will include creating a change log straight off GitHub. To be continued!

Two colleagues from PSNC-dl working on a Tesseract training workflow

Another group attending the event was the team from LITIS lab at the University of Rouen. Thierry demonstrated the newest PLaIR tools such as the newspaper segmenter capable of automatically separating articles in scanned newspaper images. The PLaIR tools use GEDI as the encoding format, so some work was immediately invested by David to also support the PAGE format, the predominant format for document encoding used in the IMPACT tools, thereby in principle establishing interoperability between IMPACT and PLaIR applications. In addition, since the PLaIR tools are mostly already available as web services, Philippine started creating Taverna workflows for these methods. We look forward to complementing the existing IMPACT workflows with those additional modules from PLaIR!

Screenshot of the PLaIR system for post-correction of newspaper OCR

All this was done without requiring any help from the PRImA group at the University of Salford, Greater Manchester, who are maintaining the PAGE format and a number of tools to support it. So with some free time on his hands, Christian from PRImA instead had a deeper look at Taverna and the PAGE serialization of the recently released open source OCR evaluation tool from the University of Alicante, the technical lead of the Centre of Competence, and found it to be working quite fine. Good to finally have an open source community tool for OCR evaluation with support for PAGE – and more features shall be added soon: we’re thinking word accuracy rate, bag-of-words evaluation and more – send us your feature requests (or even better: pull requests).
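For readers wondering what the two planned measures amount to, here is a rough sketch in Python. It is not code from the Alicante evaluation tool: the function names and the simple whitespace tokenization are assumptions for illustration, and a real tool would also have to deal with punctuation, case and reading order.

```python
from collections import Counter


def word_edit_distance(reference, hypothesis):
    """Levenshtein distance computed over two lists of words."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_accuracy(ground_truth, ocr_text):
    """1 - (word edit distance / number of ground-truth words); order matters."""
    reference, hypothesis = ground_truth.split(), ocr_text.split()
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return 1.0 - word_edit_distance(reference, hypothesis) / len(reference)


def bag_of_words_accuracy(ground_truth, ocr_text):
    """Share of ground-truth words also present in the OCR text, ignoring order."""
    reference = Counter(ground_truth.split())
    hypothesis = Counter(ocr_text.split())
    total = sum(reference.values())
    if total == 0:
        return 1.0
    return sum((reference & hypothesis).values()) / total


if __name__ == "__main__":
    truth = "the quick brown fox"
    print(word_accuracy(truth, "the qnick brown fox"))
    print(bag_of_words_accuracy(truth, "fox brown quick the"))
```

The bag-of-words variant is the more forgiving one: it does not penalize a wrong reading order, which matters for multi-column material such as newspapers where segmentation errors shuffle otherwise correctly recognized words. Note that the order-sensitive word accuracy can drop below zero when the OCR output contains many spurious extra words.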

We were particularly glad also that some developers beyond the usual MLA community suspects have found the way to the KB on those 2 days: a team from the Leiden University Medical Centre was also attending, keen on learning how they could use the T2-Client for their purposes. Initially slowed down by some issues encountered in deploying Taverna 2 Server on a Windows machine (don’t do it!), eventually Reinout and Eelke were able to resolve it simply by using Linux instead. We hope a further collaboration of Dutch Taverna users will arise from this!

Besides all the exciting new tools and features it was good to also see some others getting their hands dirty with (essential) engineering tasks – work progressed well on several issues from the interoperability-framework’s issue tracker: support for output directories is close to being fully implemented thanks to Willem Jan, and a good start was made on future MTOM support. Also Quique from the Centre of Competence was able to improve the integration between IMPACT services and the website Demonstrator Platform.

Without the help of experienced developers Carl from the Open Planets Foundation and Sven from the Austrian National Library (who had just conducted a training event for the SCAPE project earlier in the same week in London, and quickly decided to cross the channel for yet one more workshop), this would not have been so easily possible. While Carl was helping out everywhere at once, Sven found some time to fit in a Taverna training session after lunch on Friday, which was hugely appreciated by the audience.

Sven Schlarb from the Austrian National Library delivering Taverna training

After seeing all the powerful capabilities of Taverna in combination with the interoperability-framework web services and scripts in a live demo, no one needed further reassurance that it was well worth spending the time to integrate this technology and work with the interoperability-framework and its various components.

Everyone said they really enjoyed the event and found plenty of valuable things that they had learned and wanted to continue working with. So watch out for the next Succeed hackathon in sunny Alicante next year!

KB joins the leading Big Data conference in Europe!

On March 20-21, Hadoop Summit 2013, the leading big data conference, made its first ever appearance on European soil. The Beurs van Berlage in Amsterdam provided a splendid venue for the gathering of about 500 international participants interested in the newest trends around Big Data and Hadoop. The main hosts Hortonworks and Yahoo did an excellent job in putting together an exciting programme with two days full of enticing sessions divided into four distinct tracks: Applied Hadoop, Operating Hadoop, Hadoop Futures and Integrating Hadoop.

Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

The open-source Hadoop software framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from single servers to thousands of machines.
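For those who have never seen what such a "simple programming model" looks like, here is a toy word count that imitates the map, shuffle and reduce steps in plain Python on a single machine. It deliberately uses no Hadoop API at all; the function names are made up for illustration, and on a real cluster the framework would take care of distributing the input and grouping the keys.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum up the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle_phase(pairs).items())


if __name__ == "__main__":
    print(word_count(["the cat", "The dog"]))
```

Because each mapper call only looks at one line and each reducer call only looks at one key, both phases can be spread across as many machines as are available, which is where the scalability comes from.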

In his keynote, Hortonworks VP Shaun Connolly pointed out that more than half the world’s data will already be processed using Hadoop in 2015! Further on, there were keynotes by 451 Research Director Matt Aslett (What is the point of Hadoop?), Hortonworks founder and CEO Eric Baldeschwieler (Hadoop Now, Next and Beyond) and a live panel that discussed Real-World insight into Hadoop in the Enterprise.

Vendor area at Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

Many interesting talks followed on the use of and benefit derived from Hadoop at companies like Facebook, Twitter, Ebay, LinkedIn and the like, as well as on exciting upcoming technologies further enriching the Hadoop ecosystem such as Apache projects Drill, Ambari or the next-generation MapReduce implementation YARN.

The Koninklijke Bibliotheek and the Austrian National Library jointly presented their recent experiences with Hadoop in the SCAPE project. Clemens Neudecker and Sven Schlarb spoke about the potential of integrating Hadoop into digital libraries in their talk “The Elephant in the Library” (video: coming soon).


In the SCAPE project, partners are experimenting with integrating Hadoop into library workflows for different large-scale data processing scenarios related to web archiving, file format migration or analytics – you can find out more about the Hadoop related activities in SCAPE here: http://www.scape-project.eu/news/scape-hadoop.

After two very successful days the Hadoop Summit concluded and participants agreed there needs to be another one next year – likely again to be held in the amazing city of Amsterdam!

Find out more about Hadoop Summit 2013 in Amsterdam:

Web:             http://hadoopsummit.org/amsterdam/
Facebook:    https://www.facebook.com/HadoopSummit
Pictures:      http://www.flickr.com/photos/timoelliott/
Tweets:       https://twitter.com/search/?q=hadoopsummit
Slides:          http://www.slideshare.net/Hadoop_Summit/
Videos:        http://www.youtube.com/user/HadoopSummit/videos
Blogs:           http://hortonworks.com/blog/hadoop-summit-2013-amsterdam-its-a-wrap/
                     http://www.sentric.ch/blog/hello-europe-hadoop-has-landed
                     http://janbruecher.blogspot.nl/2013/03/2013-hadoop-summit-day-1.html
                     http://janbruecher.blogspot.nl/2013/03/2013-hadoop-summit-day-2.html

IMPACT across the pond

Large amounts of historical books and documents are continuously being brought online through the many mass digitisation projects in libraries, museums and archives around the globe. While the availability of digital facsimiles already made these historical collections much more accessible, the key to unlock their full potential for scholarly research is making these documents fully searchable and editable – and this is still a largely problematic process.

From 2007 to 2012 the Koninklijke Bibliotheek coordinated the large-scale integrating project IMPACT – Improving Access to Text that explored different approaches to innovate OCR technology and significantly lowered the barriers that stand in the way of the mass digitisation of the European cultural heritage. The project concluded in June 2012 and led to the conception of the IMPACT Centre of Competence in Digitisation.

Texas A&M University campus, home of the “Aggies”

The Early Modern OCR Project (eMOP) is a new project established by the Initiative for Digital Humanities, Media and Culture at Texas A&M University with funding from the Andrew W. Mellon Foundation that will run from October 2012 through September 2014. The eMOP project draws upon the experiences and solutions from IMPACT to create technical resources for improving OCR for early modern English texts from Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) in order to make them available to scholars through the Advanced Research Consortium (ARC). The integration of post-correction and collation tools will enable scholars of the early modern period to exploit the more than 300,000 documents to their full potential. Already the eMOP Zotero library is the place to find anything you ever wanted to know about OCR and related technologies.

eMOP is using the Aletheia tool from IMPACT partner PRImA to create ground truth for the historical texts

MELCamp 2013 now provided a good opportunity to gather some of the technical collaborators on the eMOP project, like Clemens Neudecker from the Koninklijke Bibliotheek and Nick Laiacona from Performant Software, for a meeting in College Station, Texas with the eMOP team at the IDHMC. Over the course of 25 – 28 March lively discussions evolved around finding the ideal setup for training the open-source OCR engine Tesseract to recognise English from the early modern period, fixing line segmentation in Gamera (thanks to Bruce Robertson), the creation of word frequency lists for historical English, and the question of how to combine all the various processing steps in a simple-to-use workflow using the Taverna workflow system.

A tour of Cushing Memorial Library and Archives with its rich collection of early prints and the official repository for George R.R. Martin’s writings wrapped up a nice and inspiring week in sunny Texas – to be continued!

Find out more about the Early Modern OCR project:

Web:                http://emop.tamu.edu/
Wiki:                http://emopwiki.tamu.edu/index.php/Main_Page
Video:              http://idhmc.tamu.edu/projects/Mellon/why.html
Blog:                http://emop.tamu.edu/blog