Large numbers of historical books and documents are continuously being brought online through the many mass digitisation projects in libraries, museums and archives around the globe. While the availability of digital facsimiles has already made these historical collections much more accessible, the key to unlocking their full potential for scholarly research is making the documents fully searchable and editable, and this is still a largely problematic process.
From 2007 to 2012 the Koninklijke Bibliotheek coordinated the large-scale integrating project IMPACT (Improving Access to Text), which explored different approaches to innovating OCR technology and significantly lowered the barriers that stand in the way of the mass digitisation of European cultural heritage. The project concluded in June 2012 and led to the conception of the IMPACT Centre of Competence in Digitisation.
Texas A&M University campus, home of the “Aggies”
The Early Modern OCR Project (eMOP) is a new project established by the Initiative for Digital Humanities, Media and Culture (IDHMC) at Texas A&M University with funding from the Andrew W. Mellon Foundation. It will run from October 2012 through September 2014. eMOP draws upon the experiences and solutions from IMPACT to create technical resources for improving OCR for early modern English texts from Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO), in order to make them available to scholars through the Advanced Research Consortium (ARC). The integration of post-correction and collation tools will enable scholars of the early modern period to exploit the more than 300,000 documents to their full potential. The eMOP Zotero library is already the place to find anything you ever wanted to know about OCR and related technologies.
eMOP is using the Aletheia tool from IMPACT partner PRImA to create ground truth for the historical texts
MELCamp 2013 provided a good opportunity to gather some of the technical collaborators on the eMOP project, including Clemens Neudecker from the Koninklijke Bibliotheek and Nick Laiacona from Performant Software, for a meeting in College Station, Texas, with the eMOP team at the IDHMC. From 25 to 28 March, lively discussions revolved around finding the ideal setup for training the open-source OCR engine Tesseract to recognise early modern English, fixing line segmentation in Gamera (thanks to Bruce Robertson), creating word frequency lists for historical English, and combining the various processing steps into a simple-to-use workflow with the Taverna workflow system.
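To give a rough idea of what building a word frequency list involves, here is a minimal Python sketch. It is not the eMOP team's actual tooling: the function name `word_frequencies` and the sample lines are made up for illustration, and it assumes the transcriptions are already available as plain Unicode strings.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count lower-cased word forms across an iterable of text strings.

    Hypothetical helper for illustration only; real historical corpora
    would need more careful tokenisation (hyphenation, abbreviations, etc.).
    """
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of Unicode letters, so historical
        # characters such as the long s (ſ) stay inside a word.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts


# Made-up sample lines with early modern spellings.
sample = ["Loue is not loue", "Which alters when it alteration findes"]
freqs = word_frequencies(sample)
for word, count in freqs.most_common(3):
    print(word, count)
```

Keeping historical spellings such as "loue" as they are, rather than normalising them to modern forms, is the point of such a list: the counts should reflect the word forms that actually appear on the page.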
A tour of Cushing Memorial Library and Archives, with its rich collection of early prints and its role as the official repository for George R.R. Martin’s writings, wrapped up a nice and inspiring week in sunny Texas – to be continued!
Find out more about the Early Modern OCR Project:
Web: http://emop.tamu.edu/
Wiki: http://emopwiki.tamu.edu/index.php/Main_Page
Video: http://idhmc.tamu.edu/projects/Mellon/why.html
Blog: http://emop.tamu.edu/blog