Take-up of tools within the Succeed project: Implementation of the INL Lexicon Service in Delpher.nl

Author: Judith Rog

Delpher

Delpher is a joint initiative of the Meertens Institute (research and documentation of Dutch language and culture), the university libraries of Amsterdam (UvA), Groningen, Leiden and Utrecht, and the National Library of the Netherlands (KB). It brings together the otherwise fragmented access to digitized historical text corpora.

Delpher currently contains over 90,000 books, over 1 million newspapers comprising more than 10 million pages, over 1.5 million pages from periodicals, and 1.5 million ANP news bulletins, all full-text searchable. New content will be added continually in the coming years.


The Research Data Alliance in Amsterdam and the KB

Prof. C. Borgman. Image: Inge Angevaare, KB-NL

Our colleagues from DANS organized the 4th Plenary Meeting of the Research Data Alliance (RDA: research data sharing without barriers) in Amsterdam, held over the past three days. I was there representing the KB, one of the few national libraries present. That national libraries have "research data" is a concept that needs some explanation. There are repositories that collect data sets resulting from research, often underpinning an article; DANS and 3TU are good examples. But there are also repositories that hold "collections" to facilitate research, such as sensor data, astronomical data and climate data. This is similar to what the KB offers researchers: a vast amount of digitized historical texts and a (restricted-access) web archive. Researchers already use these sets, as the WebART project shows. With the growing attention for digital scholarship, or e-Humanities, we can expect more use. And to complete the circle, the results of research done on KB collections might end up as a publication in the KB and a data set at DANS. An NCDD working group on Enhanced Publications is looking into ways to present both outputs smoothly, as an integral entity, to the user. In short, there are good reasons for libraries to be at RDA!

The opening of the conference featured several speakers from the European Commission. Both Robert-Jan Smits (Director-General, DG Research) and Neelie Kroes, Vice-President of the European Commission (via video), stressed that the Commission expects RDA to contribute to the growing importance of sharing and preserving research data, as open access to research data is a key message in the Horizon 2020 Programme. With a new cohort of EU politicians, some canvassing work will be necessary to convince them of the ins and outs of this and of the role of RDA.

Prof. dr. Barend Mons of Leiden University, founder of the "FAIR data" initiative, was asked to give his views on the matter. FAIR data means data that are Findable, Accessible, Interoperable and Re-usable, for both humans and computers. With the motto "Bringing data to Broadway" he pleaded for professionalism in data publishing: a good infrastructure for data, a rewarding system for researchers (data should have the same "status" as a publication), and real data stewardship. Difficulties in hiring and keeping competent data scientists, for example, are a barrier. Are publishers ready for data publishing, or will the data end up in a black hole? Despite the trend of putting data centre stage, he believes there will always be "a narrative" explaining the findings (read: articles, books). To improve professional data stewardship, he pleaded for reserving 5% of research budgets to achieve the goals of FAIR data.

Prof. Christine Borgman of UCLA gave an interesting talk in which she criticized some assumptions related to research data. Take data sharing: it is not common practice in every discipline, and (again) as long as researchers are not rewarded for it, it will not happen. The emphasis on data might not be fair: publications are not simply "containers for data" but arguments, supported by the data. The carefully designed conventions of publications (for example the order in which authors appear) do not yet exist for data sets. More of this will be described in her new book, to be published by the end of the year.

The rest of the work was done in a variety of Interest Groups (IGs) and Working Groups (WGs). The KB participates in the activities on Certification of Digital Repositories and on Data Publishing (about workflows in data publishing and about costs for data centres, both of interest for our (inter)national e-Depot). All information is available from the RDA website. At the final meeting an interesting announcement was made: in December a follow-up to the Riding the Wave report will be published, with the working title Harvesting the Data.
Knowing the immense impact the Riding the Wave report had, this is something to look forward to. The Research Data Alliance started as a small group and now has over 2,500 members and a large range of Interest Groups and Working Groups. The time has come to streamline its activities, to integrate the results, and to think about the sustainability of RDA itself. The results of this process will be discussed at the next Plenary Meeting in San Diego, 9-11 March 2015.

How to maximise usage of digital collections

Libraries want to understand the researchers who use their digital collections and researchers want to understand the nature of these collections better. The seminar ‘Mining digital repositories’ brought them together at the Dutch Koninklijke Bibliotheek (KB) on 10-11 April, 2014, to discuss both the good and the bad of working with digitised collections – especially newspapers. And to look ahead at what a ‘digital utopia’ might look like. One easy point to agree on: it would be a world with less restrictive copyright laws. And a world where digital ‘portals’ are transformed into ‘platforms’ where researchers can freely ‘tinker’ with the digital data. – Report & photographs by Inge Angevaare, KB.

Mining Digital Repositories Conference 2014

Hans-Jorg Lieder of the Berlin State Library (front left) is given an especially warm welcome by conference chair Toine Pieters (Utrecht), ‘because he was the only guy in Germany who would share his data with us in the Biland project.’

Libraries and researchers: a changing relationship

‘A lot has changed in recent years,’ Arjan van Hessen of the University of Twente and the CLARIN project told me. ‘Ten years ago someone might have suggested that perhaps we should talk to the KB. Now we are practically in bed together.’

But each relationship has its difficult moments. Researchers are not happy when they discover gaps in the data on offer, such as missing issues or volumes of newspapers. Or incomprehensible transcriptions of texts because of inadequate OCR (optical character recognition). Conference organisers Toine Pieters and Jaap Verheul (University of Utrecht) invited Hans-Jorg Lieder of the Berlin State Library to explain why he ‘could not give researchers everything everywhere today’.

Lieder & Thomas: ‘Digitising newspapers is difficult’

Both Deborah Thomas of the Library of Congress and Hans-Jorg Lieder stressed how complicated it is to digitise historical newspapers. ‘OCR does not recognise the layout in columns, or the “continued on page 5”. Plus the originals are often in a bad state – brittle and sometimes torn paper, or they are bound in such a way that text is lost in the middle. And there are all these different fonts, e.g., Gothic script in German, and the well-known long-s/f confusion.’ Lieder provided the ultimate proof of how difficult digitising newspapers is: ‘Google only digitises books, they don’t touch newspapers.’

Mining Digital Repositories Damaged Newspapers

Thomas: ‘The stuff we are digitising is often damaged’

Another thing researchers should be aware of: 'Texts are liquid things. Libraries enrich and annotate texts, and versions may differ.' Libraries do their best to connect and cluster collections of newspapers (e.g., in Europeana Newspapers), but 'the truth of the matter is that most newspaper collections are still analogue; at this moment we have only bits and pieces in digital form, and there is a lot of bad OCR.' There is no question that libraries are working on improving the situation, but funding is always a problem. And the choices to be made with bad OCR are sometimes difficult: 'Should we manually correct it all, or maybe retype it, or maybe even wait a couple of years for OCR technology to improve?'

Mining Digital Repositories Conference Claeyssens Van Hessen Kenter

Librarians and researchers discuss what is possible and what not. From the left, Steven Claeyssens, KB Data Services, Arjan van Hessen, CLARIN, and Tom Kenter, Translantis.

Researchers: how to mine for meaning

Researchers themselves are debating how they can fit these new digital resources into their academic work. Obviously, being able to search millions of newspaper pages from different countries in a matter of days opens up a lot of new research possibilities. Conference organisers Toine Pieters and Jaap Verheul (University of Utrecht) are both involved in the HERA Translantis project, which breaks with traditional 'national' historical research by looking at the transnational influences of so-called 'reference cultures':

Mining digital repositories - Definition of reference cultures

Definition of Reference Cultures in the Translantis project which mines digital newspaper collections

In the 17th century the Dutch Republic was such a reference culture. In the 20th century the United States developed into a reference culture and Translantis digs deep into the digital newspaper archives of the Netherlands, the UK, Belgium and Germany to try and find out how the United States is depicted in public discourse:

Mining Digital Repositories Jaap Verheul Translantis

Jaap Verheul (Translantis) shows how the US is depicted in Dutch newspapers

Joris van Eijnatten introduced another transnational HERA project, ASYMENC, which is exploring cultural aspects of European identity with digital humanities methodologies.

All of this sounds straightforward enough, but researchers themselves have yet to develop a scholarly culture around the new resources:

  • What type of research questions do the digital collections allow? Are these new questions or just old questions to be researched in a new way?
  • What is scientific ‘proof’ if the collections you mine have big gaps and faulty OCR?
  • How to interpret the findings? You can search words and combinations of words in digital repositories, but how can you assess what the words mean? Meanings change over time. Also: how can you distinguish between irony and seriousness?
  • How do you know that a repository is trustworthy?
  • How to deal with language barriers in transnational research? Mere translations of concepts do not reflect the sentiment behind the words.
  • How can we analyse what newspapers do not discuss (also known as the ‘Voldemort’ phenomenon)?
  • How sustainable is digital content? Long-term storage of digital objects is uncertain and expensive. (Microfilms are much easier to keep, but then again, they do not allow for text mining …)
  • How do available tools influence research questions?
  • Researchers need a better understanding of text mining per se.

Some humanities scholars have yet to be convinced of the need to go digital

Rens Bod, Director of the Dutch Centre for Digital Humanities enthusiastically presented his ideas about the value of parsing (analysing parts of speech) for uncovering deep patterns in digital repositories. If you want to know more: Bod recently published a book about it.

Rens Bod

Professor Rens Bod: ‘At the University of Amsterdam we offer a free course in working with digital data.’

But in the context of this blog, his remarks about the lack of big data awareness and competencies among many humanities scholars, including young students, were perhaps more striking. The University of Amsterdam offers a crash course in working with digital data to bridge the gap. The free, one-week course deals with all aspects of working with data, from 'gathering data' to 'cooking data'.

As the scholarly dimensions of working with big data are not this blogger’s expertise, I will not delve into these further but gladly refer you to an article Toine Pieters and Jaap Verheul are writing about the scholarly outcomes of the conference [I will insert a link when it becomes available].

Mining Digital Repositories Jaap Verheul Toine Pieters

Conference hosts Jaap Verheul (left) and Toine Pieters taking analogue notes for their article on Mining Digital Repositories. And just in case you wonder: the meeting rooms are probably the last rooms in the KB to be migrated to Windows 7

More data providers: the ‘bad’ guys in the room

It was the commercial data providers in the room who themselves spoke of 'bad guys' and 'bogeymen', an image that both Ray Abruzzi of Cengage Learning/Gale and Elaine Collins of DC Thomson Family History hoped to soften at least a bit. Both companies provide huge quantities of digitised material. And, yes, they are in it for the money, which would account for their bogeyman image. But, they both stressed, everybody benefits from their efforts:

Value proposition of DC Thomson Family History


Cengage Learning is putting 25-30 million pages online annually. Thomson is digitising 750 million (!) newspaper & periodical pages for the British Library. Collins: ‘We take the risk, we do all the work, in exchange for certain rights.’ If you want to access the archive, you have to pay.

In and of itself, this is quite understandable. Public funding just doesn’t cut it when you are talking billions of pages. Both the KB’s Hans Jansen and Rens Bod (U. of Amsterdam) stressed the need for public/private partnerships in digitisation projects.

And yet.

Elaine Collins readily admitted that researchers ‘are not our most lucrative stakeholders’; that most of Thomson’s revenue comes from genealogists and the general public. So why not give digital humanities scholars free access to their resources for research purposes, if need be under the strictest conditions that the information does not go anywhere else? Both Abruzzi and Collins admitted that such restricted access is difficult to organise. ‘And once the data are out there, our entire investment is gone.’

Libraries to mediate access?

Perhaps, Ray Abruzzi allowed, access to certain types of data, e.g., metadata, could be allowed under certain conditions, but, he stressed, individual scholars who apply to Cengage for access do not stand a chance. Their requests for data are far too varied for Cengage to have any kind of business proposition. And there is the trust issue. Abruzzi recommended that researchers turn to libraries to mediate access to certain content. If libraries give certain guarantees, then perhaps …

Mining Digital Repositories Toine Pieters

You think OCR is difficult to read? Try human handwriting!

What do researchers want from libraries?

More data, of course, including more contemporary data (… ah, but copyright …)

And better quality OCR, please.

What if libraries have to choose between quality and quantity? That is when things get tricky, because the answer depends on which researcher you ask. Some may choose quantity, others quality.

Should libraries build tools for analysing content? The researchers in the room seemed to agree that libraries should concentrate on data rather than tools. Tools are very temporary, and researchers often need to build the tools around their specific research questions.

But it would be nice if libraries started allowing users to upload enrichments to the content, such as better OCR transcriptions and/or metadata.

Mining Digital Repositories 2014

Researchers and libraries discussing what is desirable and what is possible. In the front row, from the left, Irene Haslinger (KB), Julia Noordegraaf (U. of Amsterdam), Toine Pieters (Utrecht), Hans Jansen (KB); further down the front row James Baker (British Library) and Ulrich Tiedau (UCL). Behind the table Jaap Verheul (Utrecht) and Deborah Thomas (Library of Congress).

And there is one more urgent request: that libraries become more transparent in what is in their collections and what is not. And be more open about the quality of the OCR in the collections. Take, e.g., the new Dutch national search service Delpher. A great project, but scholars must know exactly what’s in it and what’s not for their findings to have any meaning. And for scientific validity they must be able to reconstruct such information in retrospect. So a full historical overview of what is being added at what time would be a valuable addition to Delpher. (I shall personally communicate this request to the Delpher people, who are, I may add, working very hard to implement user requests).

American newspapers

Deborah Thomas of the US Library of Congress: ‘This digital age is a bit like the American Wild West. It is a frontier with lots of opportunities and hopes for striking it rich. And maybe it is a bit unruly.’

New to the library: labs for researchers

Deborah Thomas of the Library of Congress made no bones about her organisation's strategy towards researchers: 'We put out the content, and you do with it whatever you want.' In addition to APIs (Application Programming Interfaces), the Library also allows bulk downloads of content. The basic content is available free of charge, but additional metadata levels may come at a price.
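As an illustration of the kind of API access described here, below is a minimal sketch of a query against the Library of Congress's Chronicling America newspaper search API. The endpoint and parameter names are taken from that service's public documentation, not from the conference itself, so treat them as assumptions:

```python
from urllib.parse import urlencode

# Chronicling America (Library of Congress) exposes a simple JSON search API
# over its digitised newspaper pages. Endpoint and parameter names below are
# assumptions based on the service's public documentation.
BASE = "https://chroniclingamerica.loc.gov/search/pages/results/"

def build_query(term: str, year_start: int, year_end: int, page: int = 1) -> str:
    """Return a search URL for full-text newspaper pages mentioning `term`."""
    params = {
        "andtext": term,            # full-text search term
        "date1": year_start,        # start year of the date range
        "date2": year_end,          # end year of the date range
        "dateFilterType": "yearRange",
        "format": "json",           # machine-readable results
        "page": page,               # result page number
    }
    return BASE + "?" + urlencode(params)

url = build_query("influenza", 1918, 1919)
print(url)
```

Fetching the resulting URL (e.g., with `urllib.request`) returns JSON containing a hit count and page-level results, including OCR text snippets, which is exactly the kind of bulk, self-service access Thomas describes.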

The British Library (BL) is taking a more active approach. The BL's James Baker explained how the BL is trying to bridge the gap between researchers and content by providing special labs for researchers. As I (unfortunately!) missed that parallel session, let me mention the KB's own efforts to set up a KB lab where researchers are invited to experiment with KB data using open source tools. The lab is still in its 'pre-beta phase', as Hildelies Balk of the KB explained. If you want the full story, by all means attend the Digital Humanities Benelux Conference in The Hague on 12-13 June, where Steven Claeyssens and Clemens Neudecker of the KB are scheduled to launch the beta version of the platform. Here is a sneak preview of the lab: a scansion machine built by KB Data Services in collaboration with phonologist Marc van Oostendorp (audio in Dutch):

Europeana: the aggregator

“Portals are for visiting; platforms are for building on.”

Another effort by libraries to facilitate transnational research is the aggregation of their content in Europeana, especially Europeana Newspapers. For the time being the metadata are being aggregated, but in Alistair Dunning's vision, Europeana will grow from an end-user portal into a data brain, a cloud platform that will include the content and allow for metadata enrichment:


Alistair Dunning: ‘Europeana must grow into a data brain to bring disparate data sets together.’

Dunning's vision of Europeana in the future

Dunning’s vision of Europeana 3.0

Dunning also indicated that Europeana might develop brokerage services to clear content for non-commercial purposes. In a recent interview Toine Pieters said that researchers would welcome Europeana to take such a role, ‘because individual researchers should not be bothered with all these access/copyright issues.’ In the United States, the Library of Congress is not contemplating a move in that direction, Deborah Thomas told her audience. ‘It is not our mission to negotiate with publishers.’ And recent ‘Mickey Mouse’ legislation, said to have been inspired by Disney interests, seems to be leading to less rather than more access.

Dreaming of digital utopias

What would a digital utopia look like for the conference attendees? Jaap Verheul invited his guests to dream of what they would do if they were granted, say, €100 million to spend as they pleased.

Deborah Thomas of the Library of Congress would put her money into partnerships with commercial companies to digitise more material, especially the post-1922 stuff (less restrictive copyright laws being part and parcel of the dream). And she would build facilities for uploading enrichments to the data.

James Baker of the British Library would put his money into the labs for researchers.

Researcher Julia Noordegraaf of the University of Amsterdam (heritage and digital culture) would rather put the money towards improving OCR quality.

Joris van Eijnatten’s dream took the Europeana plans a few steps further. His dream would be of a ‘Globiana 5.0’ – a worldwide, transnational repository filled with material in standardised formats, connected to bilingual and multilingual dictionaries and researched by a network of multilingual, big data-savvy researchers. In this context, he suggested that ‘Google-like companies might not be such a bad thing’ in terms of sustainability and standardisation.

Joris van Eijnatten

Joris van Eijnatten: ‘Perhaps – and this is a personal observation – Google-like companies are not such a bad thing after all in terms of sustainability and standardisation of formats.’

At the end of the two-day workshop, perhaps not all of the ambitious agenda had been covered. But, then again, nobody had expected that.

Agenda for Mining Digital Repositories 2014

Mining Digital Repositories 2014 – the ambitious agenda

The trick is for providers and researchers to keep talking and conquer this ‘unruly’ Wild West of digital humanities bit by bit, step by step.

And, by all means, allow researchers to ‘tinker’ with the data. Verheul: ‘There is a certain serendipity in working with big data that allows for playfulness.’


Breaking down walls in digital preservation (Part 2)

Here is part 2 of the digital preservation seminar which identified ways to break down walls between research & development and daily operations in libraries and archives (continued from Breaking down walls in digital preservation, part 1). The seminar was organised by SCAPE and the Open Planets Foundation in The Hague on 2 April 2014. – Report & photographs by Inge Angevaare, visualisations by Elco van Staveren

Ross King picture wall between daily operations and research and development in digital preservation

Ross King of the Austrian Institute of Technology (and of OPF) kicking off the afternoon session by singlehandedly attacking the wall between daily operations and R&D

Experts meet managers

Ross King of the Austrian Institute of Technology described the features of the (technical) SCAPE project which intends to help institutions build preservation environments which are scalable – to bigger files, to more heterogeneous files, to a large volume of files to be processed. King was the one who identified the wall that exists between daily operations in the digital library and research & development (in digital preservation):

The wall between Production and R&D

The Wall between Production & R&D as identified by Ross King

Zoltán Szatucsek of the Hungarian National Archives shared his experiences with one of the SCAPE tools from a manager's point of view: 'Even trying out the Matchbox tool from the SCAPE project was too expensive for us.' King admitted that the Matchbox case had not yet been entirely successful. 'But our goal remains to deliver tools that can be downloaded and used in practice.'

Maureen Pennock of the British Library sketched her organisation's journey to embed digital preservation [link to slides to follow]. Her own digital preservation department (now at 6 FTE) was moved around a few times before it was nested in the Collection Care department, which was then merged with Collection Management. 'We are now where we should be: in the middle of the Collections department and right next to the Document Processing department. And we work closely with IT, strategy development, procurement/licensing, and collection security and risk management.'

British Library strategy for digital preservation

The British Library’s strategy calls for further embedding of digital preservation, without taking the formal step of certification

Pennock elaborated on the strategic priorities mentioned above (see slides) by noting that the British Library has chosen not to strive for formal certification within the European Framework (unlike, e.g., the Dutch KB). Instead, the BL intends to hold bi-annual audits to measure progress. The BL intends to ensure that ‘all staff working with digital content understand preservation issues associated with it.’ Questioned by the Dutch KB’s Hildelies Balk, Pennock confirmed that the teaching materials the BL is preparing could well be shared with the wider digital preservation community. Here is Pennock’s concluding comment:


Digital preservation is like a bicycle – one size doesn’t fit everyone … but everybody still recognises the bicycle

Marcin Werla from the Poznań Supercomputing and Networking Centre (PSNC) provided an overview of the infrastructure PSNC provides for research institutions and for cultural heritage institutions. It is a distributed network based on Poland's fast (20 Gb) optical network:

PSNC network for digital libraries and archives

The PSNC network includes facilities for long-term preservation

Interestingly, the network mostly serves smaller institutions. The Polish National Library and Archives have built their own systems.

Werla stressed that proper quality control at the production stage is difficult because of the bureaucratic Polish public procurement system.

Heiko Tjalsma of the Dutch research data archive DANS pitched the 4C project which was established to  ‘create a better understanding of digital curation costs through collaboration.’

Heiko Tjalsma about the 4C Project to get a grip on digital curation costs

Tjalsma: ‘We can only get a better idea of what digital curation costs by collaborating and sharing data’

At the moment there are several cost models available in the community (see, e.g., earlier posts), but they are difficult to compare. The 4C project intends to a) establish an international curation cost exchange framework, and b) build a Cost Concept Model – which will define what to include in the model and what to exclude.

The need for a clearer picture of curation costs is undisputed, but, Tjalsma added, 'it is very difficult to gather detailed data, even from colleagues.' Our organisations are reluctant to make their financial data available, and both 'time' and 'scale' make matters more difficult. The only way forward seems to be anonymisation of data, and for that to work, the project must attract as many participants as possible. So: please register at http://www.4cproject.eu/community-resources/stakeholder-participation – and participate.

Building bridges between expert and manager

The last part of the day was devoted to building bridges between experts and managers. Dirk von Suchodeletz of the University of Freiburg introduced the session with a topic that is often considered an 'expert-only' topic: emulation.

Dirk von Suchodeletz

Dirk von Suchodeletz: ‘The EaaS project intends to make emulation available for a wider audience by providing it as a service.’

The emulation technique has been around for a while, and it is considered one of the few preservation methods available for very complex digital objects, but take-up by the community has been slow because it is seen as too complex for non-experts. The Emulation as a Service (EaaS) project intends to bridge the gap to practical implementation by taking many of the technical worries away from memory institutions. A demo of Emulation as a Service is available for OPF members. Von Suchodeletz encouraged his audience to have a look at it, because the service can only be made to work if many memory institutions decide to participate.

Seminar round table Managing Digital Preservation

Getting ready for the last roundtable discussion about the relationship between experts and managers

How R&D and the library business relate

‘What inspired the EaaS project,’ Hildelies Balk (KB) wanted to know from von Suchodeletz, ‘was it your own interest or was there some business requirement to be met?’ Von Suchodeletz admitted that it was his own research interest that kicked off the project; business requirements entered the picture later.

Birgit Henriksen of the Royal Library, Denmark: ‘We desperately need emulation to preserve the games in our collection, but because it is such a niche, funding is hard to come by.’ Jacqueline Slats of the Dutch National Archives echoed this observation: ‘The NA and the KB together developed the emulation tool Dioscuri, but because there was no business demand, development was halted. We may pick it up again as soon as we start receiving interactive material for preservation.’

This is what happened next, as visualised by Elco van Staveren:

Some highlights from the discussions:

  • Timing is of the essence. Obviously, R&D is always ahead of operations, but if it is too far ahead, funding will be difficult. Following user needs is no good either, because then R&D becomes mere procurement. Are there any cases of proper just-in-time development? Barbara Sierman of the KB suggested Jpylyzer (translation of Jpylyzer for managers) – the need arose for quality control in a massive TIFF-to-JPEG 2000 migration at the KB, intended to cut costs, and R&D delivered.
  • Another successful implementation: the PRONOM registry. The National Archives had a clear business case for developing it. On the other hand, the GDFR technical registry did not tick the boxes of timeliness, impetus and context.
  • For experts and managers to work well together, managers must start accepting a certain amount of failure. We are breaking new ground in digital preservation; failures are inevitable. Can we make managers understand that even failures make us stronger, because the organisation gains a lot of experience and knowledge? And what is an acceptable failure rate? Henriksen suggested that managing expectations can do the trick: 'Do not expect perfection.'

    Seminar managing digital preservation panel members

    Some of the panel members (from left to right) Maureen Pennock (British Library), Hildelies Balk (KB), Mies Langelaar (Rotterdam Municipal Archives), Barbara Sierman (KB) and Mette van Essen (Dutch National Archives)

  • We need a new set of metrics to define success in the ever changing digital world.
  • Positioning the R&D department within Collections can help make collaboration between the two more effective (Andersen, Pennock). Henriksen: ‘At the Danish Royal Library we have started involving both R&D and collections staff in scoping projects.’
  • And then again … von Suchodeletz suggested that sometimes a loose coupling between R&D and business can be more effective, because staff in operations can get too bogged down by daily worries.
  • Sometimes breaking down the wall is just too much to ask, suggested van Essen. We may have to decide to jump the wall instead, at least for the time being.
  • Bridge builders can be key to making projects succeed, staff members who speak both the languages of operations and of R&D. Balk and Pennock stressed that everybody in the organisation should know about the basics of digital preservation.
  • Underneath all of the organisation’s doings must lie a clear common vision to inspire individual actions, projects and collaboration.
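The Jpylyzer case mentioned in the discussion is concrete enough to sketch. Jpylyzer emits an XML validity report per JPEG 2000 file, and a migration workflow can gate on that report's validity flag. The sketch below parses an illustrative, simplified stand-in for such a report; the real tool's XML schema is richer, so treat the element names here as assumptions:

```python
import xml.etree.ElementTree as ET

# Illustrative, simplified stand-in for a jpylyzer validity report.
# The real jpylyzer output is more detailed; this shape is an assumption
# made only to demonstrate the quality-control gate described in the text.
SAMPLE_REPORT = """
<jpylyzer>
  <fileInfo><fileName>page_0042.jp2</fileName></fileInfo>
  <isValidJP2>True</isValidJP2>
</jpylyzer>
"""

def jp2_is_valid(report_xml: str) -> bool:
    """Return True if the report's validity flag marks the JP2 as well-formed."""
    root = ET.fromstring(report_xml)
    flag = root.findtext("isValidJP2", default="False")
    return flag.strip() == "True"

# In a mass TIFF-to-JPEG 2000 migration, files failing this gate would be
# re-converted rather than accepted into the archive.
print(jp2_is_valid(SAMPLE_REPORT))  # → True
```

This is the just-in-time pattern the panel credits to the KB's R&D: a small, automatable check inserted into a production workflow, rather than a standalone research tool.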

In conclusion: participants agreed that this seminar had been a fruitful counterweight to technical hackathons in digital preservation. More seminars may follow. If you participated (or read these blogs), please use the commentary box for any corrections and/or follow-up.

‘In an ever changing digital world, we must allow for projects to fail – even failures bring us lots of knowledge.’

 

Roles and responsibilities in guaranteeing permanent access to the records of science – at the Conference for Academic Publishers (APE) 2014

On Tuesday 28 and Wednesday 29 January the annual Conference for Academic Publishers Europe was held in Berlin. The title of the conference: Redefining the Scientific Record. – Report by Marcel Ras (NCDD) and Barbara Sierman (KB)

Dutch politics set on “golden road” to Open Access

During the first day the focus was on Open Access, starting with a presentation on the subject by the Dutch State Secretary for Education, Culture and Science. In his presentation, called “Going for Gold”, Sander Dekker outlined his policy with regard to the practice of providing open access to research publications and how that practice will continue to evolve. Open access is “a moral obligation”, according to Dekker: access to scientific knowledge is for everyone. It promotes knowledge sharing and knowledge circulation and is essential for the further development of society.


“Golden road” open access supporter and State Secretary Sander Dekker (right) during a recent visit to the KB – photo KB/Jacqueline van der Kort

Open access means having electronic access to research publications, articles and books (free of charge). This is an international issue. Every year, approximately two million articles appear in 25,000 journals that are published worldwide. The Netherlands account for some 33,000 articles annually. Having unrestricted access to research results can help disseminate knowledge, move science forward, promote innovation and solve the problems that society faces.

The first steps towards open access were taken twenty years ago, when researchers began sharing their publications with one another on the Internet. In the past ten years, various stakeholders in the Netherlands have been working towards creating an open access system. A wide variety of rules, agreements and options for open access publishing have emerged in the research community. The situation is confusing for authors, readers and publishers alike, and the stakeholders would like this confusion to be resolved as quickly as possible.

The Dutch Government will provide direction so that the stakeholders know what to expect and are able to make arrangements with one another. It will promote “golden” open access: publication in journals that make research articles available online free of charge. The State Secretary’s aim is to fully implement the golden road to open access within ten years, in other words by 2024. In order to achieve this, at least 60 per cent of all articles will have to be available in open access journals within five years. A fundamental changeover will only be possible if we cooperate and coordinate with other countries.

Further reading: http://www.government.nl/issues/science/documents-and-publications/parliamentary-documents/2014/01/21/open-access-to-publications.html or http://www.rijksoverheid.nl/ministeries/ocw/nieuws/2013/11/15/over-10-jaar-moeten-alle-wetenschappelijke-publicaties-gratis-online-beschikbaar-zijn.html

Do researchers even want Open Access?

The two other keynote speakers, David Black and Wolfram Koch, presented their concerns about the transition from the current publishing model to open access. Researchers are increasingly using subject repositories for sharing their knowledge, and there is an urgent need for a higher level of organization and for standards in this field. But who will take the lead? Nor must we forget the systems for quality assurance and peer review: these are under pressure as enormous quantities of articles are being published and peer review increasingly takes place after publication. Open access should lower the barriers for access to research for users, but what about the barriers for scholars publishing their research? Koch stated that the traditional model worked fine for researchers and that they don’t want to change. However, there do not seem to be any figures to support this assertion.

It is interesting to note that digital preservation was mentioned in one way or another in almost all presentations on the first day of APE. The vocabulary differed, but it is acknowledged as an important topic. Accessibility of scientific publications for the long term is a necessity, regardless of the publishing model.

KB and NCDD workshop on roles and responsibilities

On the second day of the conference the focus was on innovation (the future of the article, dotcoms) and on preservation!

The National Library of The Netherlands (KB) and the Dutch Coalition for Digital Preservation (NCDD) organized a session on preservation of scientific output: “Roles and responsibilities in guaranteeing permanent access to the scholarly record”. The session was chaired by Marcel Ras, program manager for the NCDD.

The trend towards e-only access for scholarly information is advancing at a rapid pace, as is the volume of data that is ‘born digital’ and has no print counterpart. As for scholarly publications, half of all serial publications will be online-only by 2016. For researchers and students there is a huge benefit, as they now have online access to journal articles to read and download, anywhere, any time – and they are making use of it to an increasing extent. The downside, however, is a growing dependency on access to digital information: without permanent access to information, scholarly activities are no longer possible. For libraries there are many benefits associated with publishing and accessing academic journals online. E-only access has the potential to save the academic sector a considerable amount of money. Library staff resources required to process printed materials can be reduced significantly, and libraries also potentially save money on the management and storage of, and end-user access to, print journals. Suppliers, moreover, are willing to provide discounts for e-only access.

Publishers may not share post-cancellation and preservation concerns

However, there are concerns that what is now available in digital form may not always remain available, due to rapid technological developments or organisational changes within the publishing industry; these concerns, and questions about post-cancellation access to paid-for content, are key barriers to institutions making the move to e-only. There is a danger that e-journals become “ephemeral” unless we take active steps to preserve the bits and bytes that increasingly represent our collective knowledge. We are all familiar with examples of hardware becoming obsolete: 8 inch and 5.25 inch floppy discs, Betamax video tapes, and probably soon CD-ROMs. Software, too, is not immune to obsolescence.

In addition to this threat of technical obsolescence there is the changing role of libraries. Libraries have in the past assumed preservation responsibility for the resources they collect, while publishers have supplied the resources libraries need. This well-understood division of labour does not work in a digital environment and especially so when dealing with e-journals. Libraries buy licenses to enable their users to gain network access to a publisher’s server. The only original copy of an issue of an e-journal is not on the shelves of a library, but tends to be held by the publisher. But long-term preservation of that original copy is crucial for the library and research communities, and not so much for the publisher.

Can third-party solutions ensure safe custody?

So we may need new models and sometimes organizations to ensure safe custody of these objects for future generations. A number of initiatives have emerged in an effort to address these concerns. Research and development efforts in digital preservation issues have matured. Tools and services are being developed to help plan and perform digital preservation activities. Furthermore third-party organizations and archiving solutions are being established to help the academic community preserve publications and to advance research in sustainable ways. These trusted parties can be addressed by users when strict conditions (trigger events or post-cancellation) are met. In addition, publishers are adapting to changing library requirements, participating in the different archiving schemes and increasingly providing options for post-cancellation access.

In this session the problem was presented from the different viewpoints of the stakeholders in this game, focussing on the roles and responsibilities of the stakeholders.

Neil Beagrie explained the problem in depth, in a technical, organisational and financial sense. He highlighted the distinction between perpetual access and digital preservation. In the case of perpetual access, an organisation has a license or subscription for an e-journal and either the publisher discontinues the journal or the organisation stops its subscription – keeping e-journals available in this case is called “post-cancellation” access. This situation differs from long-term preservation, where the e-journal is in general preserved for users whether they ever subscribed or not. Several initiatives for the latter situation were mentioned, as well as the benefits that organisations like LOCKSS, CLOCKSS, Portico and the e-Depot of the KB bring to publishers. More details about his vision can be read in the DPC Tech Watch report Preservation, Trust and Continuing Access to e-Journals. (Presentation: APE2014_Beagrie)

Susan Reilly of the Association of European Research Libraries (LIBER) sketched the changing role of research libraries. It is essential that the scholarly record is preserved, which encompasses e-journal articles, research data, e-books, digitized cultural heritage and dynamic web content. Libraries are a major player in this field and can be seen as an intermediary between publishers and researchers. (Presentation: APE2014_Reilly)

Eefke Smit of the International Association of Scientific, Technical and Medical Publishers (STM) explained to the audience why digital preservation is especially important in the playing field of STM publishers. Many services are available, but more collaboration is needed. The APARSEN project is focusing on aspects like trust, persistent identifiers and cost models, but a wide range of challenges remains to be solved, as traditional publication models will continue to change, from text and documents to “multi-versioned, multi-sourced and multi-media”. (Presentation: APE2014_Smit)

As Peter Burnhill from EDINA, University of Edinburgh, explained, continued access to the scholarly record is under threat as libraries are no longer the custodians of the scholarly record in e-journals. As he phrased it nicely: libraries no longer have e-collections, but only e-connections. His KEEPERS Registry is a global registry of e-journal archiving that offers an overview of who is preserving what. Organisations like LOCKSS, CLOCKSS, the e-Depot, the Chinese National Science Library and, recently, the US Library of Congress submit their holding information to the KEEPERS Registry. Burnhill emphasized, however, that the registry covers only a small percentage of existing e-journals (currently about 19% of the e-journals with an ISSN assigned). More support for the preserving libraries, and more collaboration with publishers, is needed to preserve the e-journals of smaller publishers and improve coverage. (Presentation: APE2014_Burnhill)

(Reblogged with slight changes from http://www.ncdd.nl/blog/?p=3467)

Named entity recognition for digitised historical newspapers

The refinement partners in the Europeana Newspapers project will produce the astonishing amount of 10 million pages of full-text from historical newspapers from all over Europe. What could be done to further enrich that full-text?

The KB National Library of the Netherlands has been investigating named entity recognition (NER) and linked data technologies for a while now in projects such as IMPACT and STITCH+, and we felt it was about time to approach this on a production scale. So we decided to produce (open source) software, trained models as well as raw training data for NER software applications specifically for digitised historical newspapers as part of the project.

What is named entity recognition (NER)?

Named entity recognition is the process of identifying and classifying entities such as persons, locations and organisations in full-text in order to enhance searchability. There are basically two types of approaches: a statistical and a rule-based one. Rule-based systems rely mostly on grammar rules defined by linguists, while statistical systems require large amounts of manually produced training data that they can learn from. While both approaches have their benefits and drawbacks, we decided to go for a statistical tool, the CRFNER system from Stanford University. In our comparison, this software proved to be the most reliable, and it is supported by an active user community. Stanford University has an online demo where you can try it out: http://nlp.stanford.edu:8080/ner/.
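To make the rule-based/statistical distinction concrete, here is a toy rule-based tagger in Python. It is purely illustrative and not part of our workflow: it marks capitalised words following a title as a person, whereas a statistical system such as Stanford NER learns patterns like this (and far subtler ones) from annotated training data.

```python
import re

# Toy rule-based "NER": tag capitalised words that follow a title as a
# PERSON. Real rule-based systems use far richer grammars; statistical
# systems learn such patterns from annotated corpora instead.
TITLES = r"(?:Mr|Mrs|Dr|Prof)\.?"

def toy_person_tagger(text):
    """Return (name, offset) pairs for capitalised names after a title."""
    pattern = re.compile(TITLES + r"\s+((?:[A-Z][a-z]+\s?)+)")
    return [(m.group(1).strip(), m.start(1)) for m in pattern.finditer(text)]

print(toy_person_tagger("A lecture by Prof. Albert Einstein in Leiden."))
```

A rule like this breaks down quickly on historical spelling and OCR errors, which is one reason we preferred the trainable statistical approach.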


Example of Wikipedia article for Albert Einstein, tagged with the Stanford NER tool

Requirements & challenges

There are some particular requirements and challenges when applying these techniques to digital historical newspapers. Since full-text for more than 10 million pages will be produced in the project, one requirement for our NER tool was that it should be able to process large amounts of texts in a rather short time. This is possible with the Stanford tool, which as of version 1.2.8 is “thread-safe”, i.e. it can run in parallel on a multi-core machine. Another requirement was to preserve the information about where on a page a named entity has been detected – based on coordinates. This is particularly important for newspapers: instead of having to go through all the articles on a newspaper page to find the named entity, it can be highlighted so that one can easily spot it even on very dense pages.
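As a sketch of what preserving coordinates involves: ALTO files record a bounding box for every recognised word, so once an entity is found in the text its position on the page can be looked up. The fragment below is a simplified, namespace-free ALTO snippet; the attribute names (CONTENT, HPOS, VPOS, WIDTH, HEIGHT) follow the ALTO schema, while the rest is illustrative.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free ALTO fragment; CONTENT/HPOS/VPOS/WIDTH/HEIGHT
# are the attribute names the ALTO schema uses for word coordinates.
alto = """<TextLine>
  <String CONTENT="Albert" HPOS="120" VPOS="340" WIDTH="62" HEIGHT="18"/>
  <String CONTENT="Einstein" HPOS="190" VPOS="340" WIDTH="80" HEIGHT="18"/>
</TextLine>"""

def word_boxes(alto_xml):
    """Map each word to its (x, y, width, height) bounding box in pixels."""
    root = ET.fromstring(alto_xml)
    return {
        s.get("CONTENT"): tuple(int(s.get(a)) for a in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        for s in root.iter("String")
    }

# A recognised entity such as "Einstein" can now be highlighted on the page:
print(word_boxes(alto)["Einstein"])
```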

Then there are also challenges of course – mainly due to the quality of the OCR and the historical spelling that is found in many of these old newspapers. In the course of 2014 we will thus collaborate with the Dutch Institute for Lexicology (INL), who have produced modules which can be used in a pre-processing step before the Stanford system and that can to some extent mitigate problems caused by low quality of the full-text or the appearance of historical spelling variants.

The Europeana Newspapers NER workflow

For Europeana Newspapers, we decided to focus on three languages: Dutch, French and German. The content in these three languages makes up about half of the newspaper pages that will become available through Europeana Newspapers. For the French materials we cooperate with LIP6-ACASA, for Dutch again with INL. The workflow goes like this:

  1. We receive OCR results in ALTO format (or METS/MPEG21-DIDL containers)
  2. We process the OCR with our NER software to derive a pre-tagged corpus
  3. We upload the pre-tagged corpus into an online Attestation Tool (provided by INL)
  4. Within the Attestation Tool, the libraries make corrections and add tags until we arrive at a “gold corpus”, i.e. all named entities on the pages have been manually marked
  5. We train our NER software based on the gold corpus derived in step (4)
  6. We process the OCR again with our NER software trained on the gold corpus
  7. We repeat steps (2) – (6) until the results of the tagging no longer improve
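The steps above can be sketched as an iterative loop. The four callables here are hypothetical stand-ins: tag() for the Stanford tagger, correct() for the manual work in the INL Attestation Tool, train() for model training, and f_measure() for evaluation against the gold corpus.

```python
# Sketch of the iterative train/tag loop in steps 2-7; all function
# parameters are hypothetical stand-ins for the real tools.

def bootstrap(pages, tag, correct, train, f_measure, min_gain=0.001):
    """Repeat tag/correct/train rounds until the F-measure stops improving."""
    model, best = None, 0.0
    while True:
        pre_tagged = tag(pages, model)   # steps 2 and 6
        gold = correct(pre_tagged)       # steps 3-4: manual correction
        model = train(gold)              # step 5
        score = f_measure(model, gold)
        if score - best < min_gain:      # step 7: no further improvement
            return model, score
        best = score
```

In practice the "manual correction" step is by far the most expensive part of each round, which is why the loop terminates as soon as the gain becomes marginal.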


    Screenshot of the NER Attestation Tool

Preliminary results

Named entity recognition is typically evaluated by means of Precision, Recall and F-measure. Precision gives an account of how many of the named entities that the software found are in fact named entities of the correct type, while Recall states how many of the total number of named entities present have been detected by the software. The F-measure then combines both scores into a weighted harmonic mean between 0 and 1. Here are our (preliminary) results for Dutch so far:

| Dutch     | Persons | Locations | Organizations |
|-----------|---------|-----------|---------------|
| Precision | 0.940   | 0.950     | 0.942         |
| Recall    | 0.588   | 0.760     | 0.559         |
| F-measure | 0.689   | 0.838     | 0.671         |
These figures have been derived from a k-fold cross-evaluation of 25 out of 100 manually tagged pages of Dutch newspapers from the KB. The results confirm that the Stanford system tends to be a bit “conservative”: it has a somewhat lower recall for the benefit of higher precision, which is also what we wanted.
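For reference, the balanced F-measure is the harmonic mean of precision and recall. A minimal sketch with illustrative values (not our evaluation code, and unrelated to the cross-validation figures above):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A precise but conservative tagger is penalised for its lower recall:
print(round(f_measure(0.9, 0.6), 2))
```

Because the harmonic mean is dominated by the lower of the two scores, a conservative tagger like ours pays for its modest recall even when precision is high.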

Conclusion and outlook

Within this final year of the project we look forward to seeing how far we can still boost these figures by adopting the extra modules from INL, and what results we can achieve on the French and German newspapers. We will also investigate software for linking the named entities to additional online resource descriptions and authority files such as DBpedia or VIAF to create Linked Data. The crucial question will be how well we can disambiguate the named entities and find a correct match in these resources. And if there is time, we would also like to experiment with NER in other languages, such as Serbian or Latvian. If all goes well, you might already hear more about this at the upcoming IFLA newspapers conference “Digital transformation and the changing role of news media in the 21st Century“.


What if we do, in fact, know best?: A Response to the OCLC Report on DH and Research Libraries ← dh lib

A great answer by Dot Porter to the OCLC report: What if we do, in fact, know best?: A Response to the OCLC Report on DH and Research Libraries.

Dot Porter’s response can be reinforced by a quote from DH ‘silverback’ Andrew Prescott, from his influential essay An Electric Current of the Imagination: What the Digital Humanities Are and What They Might Become in the Journal of Digital Humanities (http://journalofdigitalhumanities.org/1-2/):

“digital humanities […]  has often developed from libraries and information services and it is frequently seen as a support service. One of the things that I am proudest of in my career is the way in which I have moved between being a curator, an academic, and a librarian. Museums, galleries, libraries, and archives are just as important to cultural health as universities. Indeed, I have found my time as a curator and librarian consistently far more intellectually exciting and challenging than being an academic.”

 

KB director in This Week in Libraries TWIL #103: The European Library

Watch KB director general Bas Savenije on This Week in Libraries, talking about how the KB, a national library, will integrate its infrastructure with that of public libraries. There is also good stuff on why Europe needs to work together in The European Library: ‘by working together on developing tools and services you can all share, you free up efforts in your library for other services you can offer your users!’

 

KB joins the leading Big Data conference in Europe!

On March 20-21, Hadoop Summit 2013, the leading big data conference, made its first ever appearance on European soil. The Beurs van Berlage in Amsterdam provided a splendid venue for the gathering of about 500 international participants interested in the newest trends around Big Data and Hadoop. The main hosts Hortonworks and Yahoo did an excellent job in putting together an exciting programme with two days full of enticing sessions divided by four distinct tracks: Applied Hadoop, Operating Hadoop, Hadoop Futures and Integrating Hadoop.

Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

The open-source Hadoop software framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from single servers to thousands of machines.
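The "simple programming models" in question are map and reduce. The toy word count below mimics the two phases in plain Python; it is only an illustration of the model, not Hadoop code, where both phases would be distributed across the cluster.

```python
from collections import Counter
from itertools import chain

# Toy illustration of the MapReduce model that Hadoop implements:
# map_step() emits (key, value) pairs and reduce_step() aggregates
# values per key. On a real cluster both phases run in parallel
# across many machines.

def map_step(line):
    return [(word.lower(), 1) for word in line.split()]

def reduce_step(pairs):
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big plans", "big clusters"]
print(reduce_step(chain.from_iterable(map_step(line) for line in lines)))
```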

In his keynote, Hortonworks VP Shaun Connolly pointed out that by 2015 more than half the world’s data is expected to be processed using Hadoop! Further on, there were keynotes by 451 Research Director Matt Aslett (What is the point of Hadoop?), Hortonworks founder and CEO Eric Baldeschwieler (Hadoop Now, Next and Beyond) and a live panel that discussed Real-World insight into Hadoop in the Enterprise.

Vendor area at Hadoop Summit 2013, © http://www.flickr.com/photos/timoelliott/

Many interesting talks followed on the use of, and benefit derived from, Hadoop at companies like Facebook, Twitter, eBay, LinkedIn and the like, as well as on exciting upcoming technologies further enriching the Hadoop ecosystem, such as the Apache projects Drill and Ambari or the next-generation MapReduce implementation YARN.

The Koninklijke Bibliotheek and the Austrian National Library jointly presented their recent experiences with Hadoop in the SCAPE project. Clemens Neudecker and Sven Schlarb spoke about the potential of integrating Hadoop into digital libraries in their talk “The Elephant in the Library” (video: coming soon).


In the SCAPE project, partners are experimenting with integrating Hadoop into library workflows for different large-scale data processing scenarios related to web archiving, file format migration and analytics – you can find out more about the Hadoop-related activities in SCAPE here:
http://www.scape-project.eu/news/scape-hadoop.

After two very successful days the Hadoop Summit concluded and participants agreed there needs to be another one next year – likely again to be held in the amazing city of Amsterdam!

Find out more about Hadoop Summit 2013 in Amsterdam:

Web:             http://hadoopsummit.org/amsterdam/
Facebook:    https://www.facebook.com/HadoopSummit
Pictures:      http://www.flickr.com/photos/timoelliott/
Tweets:       https://twitter.com/search/?q=hadoopsummit
Slides:          http://www.slideshare.net/Hadoop_Summit/
Videos:        http://www.youtube.com/user/HadoopSummit/videos
Blogs:           http://hortonworks.com/blog/hadoop-summit-2013-amsterdam-its-a-wrap/
                     http://www.sentric.ch/blog/hello-europe-hadoop-has-landed
                     http://janbruecher.blogspot.nl/2013/03/2013-hadoop-summit-day-1.html
                     http://janbruecher.blogspot.nl/2013/03/2013-hadoop-summit-day-2.html

IMPACT across the pond


Large amounts of historical books and documents are continuously being brought online through the many mass digitisation projects in libraries, museums and archives around the globe. While the availability of digital facsimiles has already made these historical collections much more accessible, the key to unlocking their full potential for scholarly research is making these documents fully searchable and editable – and this is still a largely problematic process.

From 2007 to 2012 the Koninklijke Bibliotheek coordinated the large-scale integrating project IMPACT – Improving Access to Text, which explored different approaches to innovating OCR technology and significantly lowered the barriers that stand in the way of the mass digitisation of the European cultural heritage. The project concluded in June 2012 and led to the conception of the IMPACT Centre of Competence in Digitisation.


Texas A&M University campus, home of the “Aggies”

The Early Modern OCR Project (eMOP) is a new project established by the Initiative for Digital Humanities, Media and Culture at Texas A&M University with funding from the Andrew W. Mellon Foundation that will run from October 2012 through September 2014. The eMOP project draws upon the experiences and solutions from IMPACT to create technical resources for improving OCR for early modern English texts from Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) in order to make them available to scholars through the Advanced Research Consortium (ARC). The integration of post-correction and collation tools will enable scholars of the early modern period to exploit the more than 300,000 documents to their full potential. Already, the eMOP Zotero library is the place to find anything you ever wanted to know about OCR and related technologies.


eMOP is using the Aletheia tool from IMPACT partner PRImA to create ground truth for the historical texts

MELCamp 2013 provided a good opportunity to gather some of the technical collaborators on the eMOP project, such as Clemens Neudecker from the Koninklijke Bibliotheek and Nick Laiacona from Performant Software, for a meeting with the eMOP team at the IDHMC in College Station, Texas. Over the course of 25 – 28 March, lively discussions revolved around finding the ideal setup for training the open-source OCR engine Tesseract to recognise English from the early modern period, fixing line segmentation in Gamera (thanks to Bruce Robertson), the creation of word frequency lists for historical English, and the question of how to combine all the various processing steps in a simple-to-use workflow using the Taverna workflow system.

A tour of Cushing Memorial Library and Archives with its rich collection of early prints and the official repository for George R.R. Martin’s writings wrapped up a nice and inspiring week in sunny Texas – to be continued!

Find out more about the Early Modern OCR project:

Web:                http://emop.tamu.edu/
Wiki:                http://emopwiki.tamu.edu/index.php/Main_Page
Video:              http://idhmc.tamu.edu/projects/Mellon/why.html
Blog:                http://emop.tamu.edu/blog