Author: Tineke Koster
As I am writing this, volunteers are rekeying our 17th-century newspaper articles. Optical character recognition of the gothic typeface in use at the time has yielded poor results, making this part of our digital collection nearly inaccessible for full-text search. The Meertens Institute, which has an excellent track record when it comes to crowdsourcing, has developed the editor (Dutch). Together with them we are working towards a full update of all newspaper issues from 1618 to 1700 that are available on our website Delpher.
Great news and, for some researchers, an eagerly awaited development. A bright future beckons in which our digital text corpus is 100% correct, just waiting to be mined for dynamic phenomena and paradigm shifts.
But we have to realize that without the proper precautions, correcting digital texts may also hinder researchers in their work. How so? These texts may have been used (browsed, mined, cited, etc.) by researchers in their earlier form. The improvement or enrichment may have consequences for the reproducibility of their research results.
For all researchers, the need to reproduce research results is growing, with new guidelines following from new laws. There is also a specific group of researchers who need sustained access to older versions of digital text. The need is greatest for research whose goal is to develop an algorithm and to assess its quality relative to previous versions of the same algorithm or to other algorithms, since such comparisons are only meaningful when every run uses the same text. Without sustained access to older versions, these researchers cannot do their work.
Is it our role to provide this access? I hope to explain in a later blog post (soon!) how the National Library of the Netherlands is thinking about this issue. Meanwhile, I would be very interested to hear about your experiences. How is this subject discussed in your organization? Does your organization have a policy in place to deal with this?