OCR improvement: helping and hindering researchers

Author: Tineke Koster

As I am writing this, volunteers are rekeying our 17th-century newspaper articles. Optical character recognition of the gothic typeface in use at the time has yielded poor results, making this part of our digital collection nearly inaccessible for full-text search. The Meertens Institute, which has an excellent track record when it comes to crowdsourcing, has developed the editor (Dutch). Together with them we are working towards a full update of all newspaper issues from 1618 to 1700 that are available on our website Delpher.

Great news and, for some researchers, an eagerly awaited development. A bright future beckons in which our digital text corpus is 100% correct, just waiting to be mined for dynamic phenomena and paradigm shifts.

But we have to realize that without the proper precautions, correcting digital texts may also hinder researchers in their work. How so? These texts may have been used (browsed, mined, cited, etc.) by researchers in their earlier form. The improvement or enrichment may have consequences for the reproducibility of their research results.

For all researchers the need to reproduce research results is growing, with new guidelines due to new laws. There is also a specific group of researchers that need sustained access to older versions of digital text. The need is highest for research where the goal is to develop an algorithm and to assess its quality relative to previous versions of the same algorithm or to other algorithms. Without sustained access to older versions, these people cannot do their work.

Is it our role to provide this access? I hope to explain how the National Library of the Netherlands is thinking about this issue in a later blog post (soon!). Meanwhile, I would be very interested to hear your experiences. How is this subject discussed in your organization? Does your organization have a policy in place to deal with this?

9 thoughts on “OCR improvement: helping and hindering researchers”

  1. We are not at all there yet, but we plan on storing text change deltas all the way back to the original OCR. As these will be time stamped, making it possible to specify a date in the API-request for a record should solve the problem of reproducibility.

    • Thank you for sharing this. Your situation should be quite similar to ours, so I would like to know a little bit more.
      Do you store only changes to (OCR) text? What about changes to images? Or metadata?
      And when storing deltas, is your purpose reproducibility of research results specifically, or do you have other reasons as well?

  2. Interesting issue, and not one I’d previously considered.

    I discussed it with my wife, who is closer to academic researchers than I am, and she made the following points.

    Have you asked your researchers how important this capability is to them? You could then decide whether the value justifies the cost. It may be that the researchers have taken local copies of the corpus precisely to ensure reproducibility anyway.

    You could also take a snapshot of the corpus before beginning the rekeying project. This would take care of the reproducibility of past research. Current research where reproducibility was of concern could use the snapshot until the rekeying project was complete. This would avoid the cost of deltas.

  3. I am a researcher working with older texts and I wouldn’t want to have the text “corrected” or “improved” since the data has to be as faithful as possible–what appears to be a mistake or a typo to an encoder might not be one at all for a researcher and you never know what a researcher will want to do with any given text. So, no tampering would be ideal. Even better, if you encode the text in XML, it is easy to have the best of both worlds by encoding the corrections or standardized spellings together with the original unamended text.

    • I think there’s a misunderstanding here. Rekeying (i.e., transcription) is being carried out precisely in order to make currently inaccurate data as faithful as possible to the original texts.

      • Oops! Right, my mistake. That’s indeed quite a different issue. In that case, I think that it might be a good idea to have some kind of fine-grained version control system, maybe something à la Wikipedia or even software like Git or CVS. There are several alternatives that are free and fairly easy to implement depending on your current system. At worst, only a handful of users will benefit from a couple of days of a technician’s work, but who knows what future users might want to do with the data?

  4. Our situation at Old Bailey Online is not quite like yours – we make periodic corrections to errors in the data on the site but they are generally very minor. Nonetheless, sometimes we’ve done more substantial upgrades that could have more impact. What we do is to use version numbers like those of software releases, and since about 2008 we’ve kept an archive on the site – akin to a Changelog – that outlines exactly when previous changes took place and makes a particular note of any substantial amendments. We don’t make superseded versions publicly accessible but we keep copies (on a backed-up server!), and we’d supply the files if anyone needed them.

    http://www.oldbaileyonline.org/static/Whats-new-archive.jsp
    http://www.oldbaileyonline.org/static/Whats-new.jsp
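
The time-stamped deltas proposed in the first comment can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a description of any library's actual system: the names TextRecord, make_delta, apply_delta, correct and as_of are hypothetical, the sample strings are invented, and a real service would also have to decide what to do about images and metadata, as the reply to that comment points out.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime
from difflib import SequenceMatcher


def make_delta(old: str, new: str) -> list[tuple[int, int, str]]:
    """Describe how to turn `old` into `new` as (start, end, replacement) edits.

    The offsets refer to positions in `old`; unchanged stretches are not stored.
    """
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_delta(old: str, delta: list[tuple[int, int, str]]) -> str:
    """Apply the edits from right to left so that earlier offsets stay valid."""
    text = old
    for start, end, replacement in reversed(delta):
        text = text[:start] + replacement + text[end:]
    return text


@dataclass
class TextRecord:
    """Hypothetical record: the original OCR text plus time-stamped deltas."""

    original: str  # the first OCR output, never overwritten
    timestamps: list[datetime] = field(default_factory=list)
    deltas: list[list[tuple[int, int, str]]] = field(default_factory=list)

    def as_of(self, when: datetime) -> str:
        """Rebuild the text as it read at `when` (corrections up to and including it)."""
        count = bisect_right(self.timestamps, when)
        text = self.original
        for delta in self.deltas[:count]:
            text = apply_delta(text, delta)
        return text

    def current(self) -> str:
        """Return the text with every recorded correction applied."""
        return self.as_of(datetime.max)

    def correct(self, new_text: str, when: datetime) -> None:
        """Record a correction as a delta against the current text."""
        if self.timestamps and when <= self.timestamps[-1]:
            raise ValueError("corrections must be recorded in chronological order")
        self.deltas.append(make_delta(self.current(), new_text))
        self.timestamps.append(when)


# Invented sample text: an OCR misreading of the long s, corrected later.
record = TextRecord(original="Nieuwe Tijdinghe uyt Amfterdam")
record.correct("Nieuwe Tijdinghe uyt Amsterdam", datetime(2015, 3, 1))

print(record.as_of(datetime(2015, 1, 1)))  # the text a researcher saw before the correction
print(record.current())                    # the text as it reads now
```

A researcher who cites the date on which the corpus was consulted could then ask for the text as of that date. The snapshot approach from the second comment is the special case in which only the original is kept and no deltas are recorded.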
