KB Research

Research at the National Library of the Netherlands

Month: March 2014

Research Data Alliance in Dublin


Research Data Alliance (RDA), Dublin, 26-28 March 2014 – by Barbara Sierman

Last week the third plenary meeting of the Research Data Alliance (RDA) was held in Dublin. This international organisation, based on voluntary participation, was founded in 2013. Research is changing now that "big data" is being used on a large scale across the disciplines, but this also brings challenges: sharing data, preserving the sources, how to cite, who gets the credit for data collections, how to link data to publications, and so on. The RDA focuses above all on making data sharing possible for scientific research, despite technical and social obstacles – also in the longer term. "The RDA vision is researchers and innovators openly sharing data across technologies, disciplines, and countries to address the grand challenges of society."

The RDA has working groups (with tasks that must be completed within 18 months and that deliver concrete contributions to the implementation of this vision) and so-called "interest groups" (IGs), in which particular topics are explored in depth. An interest group can eventually lead to a working group.

Why does the KB participate in the Research Data Alliance? First, as a national library we also have "big data" to offer researchers, with our digitised collections and our web archive. It is important that the humanities take part in these discussions as well, given their role in scholarly communication. At present the discussions are largely driven by disciplines such as the earth sciences, life sciences, health, agriculture, space and climate, which have a much longer tradition of working with big data.

In addition, our collections increasingly contain publications that cite datasets as the source of the underlying research. Think, for example, of the International e-Depot, but also of the publications of the Dutch universities (DARE) and of Dutch scholarly publishers. These datasets are sometimes bundled with the publications (so-called "enhanced publications"). Scholarly publishers are looking for solutions to either preserve these datasets themselves or, for example, refer to central data repositories. It is important to follow these developments in order to build good (future) services around our long-term preserved collections.

From the Research department I participate in two groups: Publishing Data (IG) and Certification of Digital Repositories (IG).

The Publishing Data Interest Group (formally an IG, but with an 18-month plan) consists of four strands:

  • Costs: the costs of a data repository and how to arrange its long-term funding (the KB actively participates here)
  • Bibliometrics: how a dataset is valued within the scholarly community (citation indexes, etc.)
  • Services: how to link from publications to datasets, between datasets, and so on. The KB is not involved here, but these developments matter for the concept of "enhanced publications", which the NCDD is also working on (more about that in May)
  • Workflows: what are the current ways in which data are published and how can these be improved, taking all stakeholders into account.

The days in Dublin were mainly intended to inform colleagues within the RDA about the activities carried out and the plans ahead, to see where there is overlap with other working groups and interest groups, and to agree on the upcoming activities within the Publishing Data group.

The same applied to the Certification of Digital Repositories IG. If we really want to be able to rely on data repositories and on the long-term accessibility of the data they hold, some form of control mechanism is needed: the certification of repositories as "trustworthy repositories". Our colleagues at DANS are the driving force here; after all, they originally developed the Data Seal of Approval. The KB is involved in this group because of my work on certification and on the development of the standard ISO 16363 Audit and Certification of Trustworthy Digital Repositories, about which I gave a presentation.

More information about the WGs and IGs in the RDA can be found on the Research Data Alliance website.

Roles and responsibilities in guaranteeing permanent access to the records of science – at the Conference for Academic Publishers (APE) 2014

On Tuesday 28 and Wednesday 29 January the annual Conference for Academic Publishers Europe was held in Berlin. The title of the conference: Redefining the Scientific Record. – Report by Marcel Ras (NCDD) and Barbara Sierman (KB)

Dutch politics set on “golden road” to Open Access

During the first day the focus was on Open Access, starting with a presentation by the Dutch State Secretary for Education, Culture and Science. In his presentation, called "Going for Gold", Sander Dekker outlined his policy with regard to providing open access to research publications and how that practice will continue to evolve. Open access is "a moral obligation" according to Sander Dekker: access to scientific knowledge is for everyone. It promotes knowledge sharing and knowledge circulation and is essential for the further development of society.

OA "gold road" supporter and State Secretary Sander Dekker (right) during a recent visit to the K

“Golden road” open access supporter and State Secretary Sander Dekker (right) during a recent visit to the KB – photo KB/Jacqueline van der Kort

Open access means having electronic access to research publications, articles and books, free of charge. This is an international issue. Every year, approximately two million articles appear in the 25,000 journals that are published worldwide. The Netherlands accounts for some 33,000 articles annually. Unrestricted access to research results can help disseminate knowledge, move science forward, promote innovation and solve the problems that society faces.

The first steps towards open access were taken twenty years ago, when researchers began sharing their publications with one another on the Internet. In the past ten years, various stakeholders in the Netherlands have been working towards creating an open access system. A wide variety of rules, agreements and options for open access publishing have emerged in the research community. The situation is confusing for authors, readers and publishers alike, and the stakeholders would like this confusion to be resolved as quickly as possible.

The Dutch Government will provide direction so that the stakeholders know what to expect and are able to make arrangements with one another. It will promote "golden" open access: publication in journals that make research articles available online free of charge. The State Secretary's aim is to fully implement the golden road to open access within ten years, in other words by 2024. To achieve this, at least 60 per cent of all articles will have to be available in open access journals within five years. Such a fundamental changeover will only be possible through cooperation and coordination with other countries.

Further reading: http://www.government.nl/issues/science/documents-and-publications/parliamentary-documents/2014/01/21/open-access-to-publications.html or http://www.rijksoverheid.nl/ministeries/ocw/nieuws/2013/11/15/over-10-jaar-moeten-alle-wetenschappelijke-publicaties-gratis-online-beschikbaar-zijn.html

Do researchers even want Open Access?

The two other keynote speakers, David Black and Wolfram Koch, presented their concerns about the transition from the current publishing model to open access. Researchers are increasingly using subject repositories to share their knowledge, and there is an urgent need for a higher level of organisation and for standards in this field. But who will take the lead? Nor should we forget the systems for quality assurance and peer review: these are under pressure as enormous quantities of articles are being published and peer review increasingly takes place after publication. Open access should lower the barriers for users who want to access research, but what about the barriers for scholars publishing their research? Koch stated that the traditional model works fine for researchers and that they do not want to change; however, there do not seem to be any figures to support this assertion.

It is interesting to note that digital preservation was mentioned in one way or another in almost all presentations on the first day of APE. The vocabulary differed, but it is acknowledged as an important topic. Accessibility of scientific publications for the long term is a necessity, regardless of the publishing model.

KB and NCDD workshop on roles and responsibilities

On the second day of the conference the focus was on innovation (the future of the article, dotcoms) and on preservation!

The National Library of The Netherlands (KB) and the Dutch Coalition for Digital Preservation (NCDD) organized a session on preservation of scientific output: "Roles and responsibilities in guaranteeing permanent access to the scholarly record". The session was chaired by Marcel Ras, program manager for the NCDD.

The trend towards e-only access for scholarly information is increasing at a rapid pace, as is the volume of data which is 'born digital' and has no print counterpart. As for scholarly publications, half of all serial publications will be online-only by 2016. For researchers and students there is a huge benefit, as they now have online access to journal articles to read and download, anywhere, any time, and they are making use of this to an increasing extent. The downside, however, is an increasing dependency on access to digital information: without permanent access to information, scholarly activities are no longer possible. For libraries there are many benefits associated with publishing and accessing academic journals online. E-only access has the potential to save the academic sector a considerable amount of money. Library staff resources required to process printed materials can be reduced significantly, libraries also potentially save money on the management and storage of, and end-user access to, print journals, and suppliers are willing to provide discounts for e-only access.

Publishers may not share post-cancellation and preservation concerns

However, there are concerns that what is now available in digital form may not always remain available, due to rapid technological developments or organisational changes within the publishing industry. These concerns, and questions about post-cancellation access to paid-for content, are key barriers to institutions making the move to e-only. There is a danger that e-journals become "ephemeral" unless we take active steps to preserve the bits and bytes that increasingly represent our collective knowledge. We are all familiar with examples of hardware becoming obsolete: 8-inch and 5.25-inch floppy discs, Betamax video tapes, and probably soon CD-ROMs. Software, too, is not immune to obsolescence.

In addition to this threat of technical obsolescence there is the changing role of libraries. Libraries have in the past assumed preservation responsibility for the resources they collect, while publishers have supplied the resources libraries need. This well-understood division of labour does not work in a digital environment, especially when dealing with e-journals. Libraries buy licences to enable their users to gain network access to a publisher's server. The only original copy of an issue of an e-journal is not on the shelves of a library, but tends to be held by the publisher. Yet long-term preservation of that original copy is crucial for the library and research communities, and much less so for the publisher.

Can third-party solutions ensure safe custody?

So we may need new models, and sometimes new organisations, to ensure safe custody of these objects for future generations. A number of initiatives have emerged to address these concerns. Research and development efforts in digital preservation have matured, and tools and services are being developed to help plan and perform digital preservation activities. Furthermore, third-party organisations and archiving solutions are being established to help the academic community preserve publications and to advance research in sustainable ways. These trusted parties can be called upon by users when strict conditions (trigger events or post-cancellation) are met. In addition, publishers are adapting to changing library requirements, participating in the different archiving schemes and increasingly providing options for post-cancellation access.

In this session the problem was presented from the different viewpoints of the stakeholders in this game, focussing on the roles and responsibilities of the stakeholders.

Neil Beagrie explained the problem in depth, in technical, organisational and financial terms. He highlighted the distinction between perpetual access and digital preservation. In the case of perpetual access, an organisation has a licence or subscription for an e-journal and either the publisher discontinues the journal or the organisation stops its subscription; keeping e-journals available in this case is called "post-cancellation" access. This differs from long-term preservation, where the e-journal is in general preserved for users whether they ever subscribed or not. Several initiatives for the latter situation were mentioned, as well as the benefits that organisations like LOCKSS, CLOCKSS, Portico and the e-Depot of the KB bring to publishers. More details about his vision can be read in the DPC Tech Watch report Preservation, Trust and Continuing Access to e-Journals. (Presentation: APE2014_Beagrie)

Susan Reilly of the Association of European Research Libraries (LIBER) sketched the changing role of research libraries. It is essential that the scholarly record is preserved, which encompasses e-journal articles, research data, e-books, digitized cultural heritage and dynamic web content. Libraries are a major player in this field and can be seen as an intermediary between publishers and researchers. (Presentation: APE2014_Reilly)

Eefke Smit of the International Association of Scientific, Technical and Medical Publishers (STM) explained to the audience why digital preservation is especially important in the playing field of STM publishers. Many services are available, but more collaboration is needed. The APARSEN project is focusing on some aspects, such as trust, persistent identifiers and cost models, but a wide range of challenges remains as traditional publication models continue to change, from text and documents to "multi-versioned, multi-sourced and multi-media". (Presentation: APE2014_Smit)

As Peter Burnhill from EDINA, University of Edinburgh, explained, continued access to the scholarly record is under threat as libraries are no longer the custodians of the scholarly record in e-journals. As he phrased it nicely: libraries no longer have e-collections, only e-connections. The KEEPERS Registry, which he presented, is a global registry of e-journal archiving and offers an overview of who is preserving what. Organisations like LOCKSS, CLOCKSS, the e-Depot, the Chinese National Science Library and, recently, the US Library of Congress submit their holdings information to the KEEPERS Registry. Useful as this is, it was also emphasized that the registry only covers a small percentage of existing e-journals (currently about 19% of the e-journals with an ISSN assigned). More support for the preserving libraries and more collaboration with publishers is needed to preserve the e-journals of smaller publishers and improve coverage. (Presentation: APE2014_Burnhill)

(Reblogged with slight changes from http://www.ncdd.nl/blog/?p=3467)

Named entity recognition for digitised historical newspapers

The refinement partners in the Europeana Newspapers project will produce an astonishing 10 million pages of full-text from historical newspapers from all over Europe. What could be done to further enrich that full-text?

The KB National Library of the Netherlands has been investigating named entity recognition (NER) and linked data technologies for a while now in projects such as IMPACT and STITCH+, and we felt it was about time to approach this on a production scale. So we decided to produce (open source) software, trained models as well as raw training data for NER software applications specifically for digitised historical newspapers as part of the project.

What is named entity recognition (NER)?

Named entity recognition is the process of identifying and classifying entities such as persons, locations and organisations in the full-text in order to enhance searchability. There are basically two types of approach: a statistical and a rule-based one. Rule-based systems rely mostly on grammar rules defined by linguists, while statistical systems require large amounts of manually produced training data from which they can learn. Both approaches have their benefits and drawbacks, but we decided to go for a statistical tool, the CRFNER system from Stanford University. In our comparison, this software proved to be the most reliable, and it is supported by an active user community. Stanford University has an online demo where you can try it out: http://nlp.stanford.edu:8080/ner/.
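To give an idea of how such a statistical tagger is typically invoked, here is a minimal Java sketch against the Stanford NER API; the model file and the example sentence are only placeholders, not the models we train in the project:

    import edu.stanford.nlp.ie.AbstractSequenceClassifier;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class NerDemo {
        public static void main(String[] args) throws Exception {
            // Load a serialized CRF model; the standard English 3-class model that
            // ships with Stanford NER stands in here for the Dutch, French and
            // German models trained in the project.
            AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                    "classifiers/english.all.3class.distsim.crf.ser.gz");

            // Tag a sentence; recognised entities come back wrapped in inline XML
            // tags such as <PERSON>Albert Einstein</PERSON> and <LOCATION>Ulm</LOCATION>.
            String tagged = classifier.classifyWithInlineXML(
                    "Albert Einstein was born in Ulm and later worked in Princeton.");
            System.out.println(tagged);
        }
    }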


Example of Wikipedia article for Albert Einstein, tagged with the Stanford NER tool

Requirements & challenges

There are some particular requirements and challenges when applying these techniques to digital historical newspapers. Since full-text for more than 10 million pages will be produced in the project, one requirement for our NER tool was that it should be able to process large amounts of texts in a rather short time. This is possible with the Stanford tool,  which as of version 1.2.8 is “thread-safe”, i.e. it can run in parallel on a multi-core machine. Another requirement was to preserve the information about where on a page a named entity has been detected – based on coordinates. This is particularly important for newspapers: instead of having to go through all the articles on a newspaper page to find the named entity, it can be highlighted so that one can easily spot it even on very dense pages.
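As a rough illustration of the parallel set-up (a sketch only: the model path, input directory and plain-text input are assumptions for this example, and the step of mapping results back to ALTO word coordinates is omitted), a single classifier instance can be shared across a pool of worker threads:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.*;
    import java.util.concurrent.*;
    import edu.stanford.nlp.ie.AbstractSequenceClassifier;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class ParallelNer {
        public static void main(String[] args) throws Exception {
            // One classifier instance can be shared by all worker threads,
            // since Stanford NER is thread-safe as of version 1.2.8.
            AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                    "models/nl-newspapers.ser.gz");   // hypothetical model file

            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            List<Future<String>> results = new ArrayList<>();

            // Each task tags the plain text extracted from one newspaper page; in the
            // real workflow the results are tied back to word coordinates in the ALTO files.
            try (DirectoryStream<Path> pages =
                    Files.newDirectoryStream(Paths.get("pages"), "*.txt")) {
                for (Path page : pages) {
                    results.add(pool.submit(() -> classifier.classifyWithInlineXML(
                            new String(Files.readAllBytes(page), StandardCharsets.UTF_8))));
                }
            }
            for (Future<String> tagged : results) {
                System.out.println(tagged.get());
            }
            pool.shutdown();
        }
    }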

Then there are also challenges, of course, mainly due to the quality of the OCR and the historical spelling found in many of these old newspapers. In the course of 2014 we will therefore collaborate with the Dutch Institute for Lexicology (INL), which has produced modules that can be used in a pre-processing step before the Stanford system and that can to some extent mitigate problems caused by low-quality full-text or by historical spelling variants.

The Europeana Newspapers NER workflow

For Europeana Newspapers, we decided to focus on three languages: Dutch, French and German. The content in these three languages makes up about half of the newspaper pages that will become available through Europeana Newspapers. For the French materials we cooperate with LIP6-ACASA, for Dutch again with INL. The workflow goes like this (a sketch of the training step follows the list):

  1. We receive OCR results in ALTO format (or METS/MPEG21-DIDL containers)
  2. We process the OCR with our NER software to derive a pre-tagged corpus
  3. We upload the pre-tagged corpus into an online Attestation Tool (provided by INL)
  4. Within the Attestation Tool, the libraries make corrections and add tags until we arrive at a “gold corpus”, i.e. all named entities on the pages have been manually marked
  5. We train our NER software based on the gold corpus derived in step (4)
  6. We process the OCR again with our NER software trained on the gold corpus
  7. We repeat steps (2) – (6) until the results of the tagging no longer improve
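As an impression of what step 5 can look like in practice, here is a sketch based on the documented Stanford NER training interface; the file names and the simple two-column token/label format are assumptions about how a gold corpus might be exported from the Attestation Tool:

    import java.util.Properties;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class TrainNer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Tab-separated training file: one token per line, with its label
            // (e.g. PER, LOC, ORG or O) in the second column.
            props.setProperty("trainFile", "gold-corpus-nl.tsv");
            props.setProperty("map", "word=0,answer=1");
            props.setProperty("serializeTo", "models/nl-newspapers.ser.gz");

            CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
            crf.train();                                            // trains on trainFile
            crf.serializeClassifier(props.getProperty("serializeTo"));
        }
    }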


    Screenshot of the NER Attestation Tool

Preliminary results

Named entity recognition is typically evaluated by means of Precision, Recall and F-measure. Precision indicates how many of the named entities that the software found are in fact named entities of the correct type, while Recall indicates how many of all the named entities present in the text were actually detected by the software. The F-measure combines both scores into a single value between 0 and 1; for the commonly used F1 it is the harmonic mean of Precision and Recall. Here are our (preliminary) results for Dutch so far:
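In formulas, using the standard definitions (TP, FP and FN are the numbers of true positives, false positives and false negatives for a given entity type):

    P = \frac{TP}{TP + FP}, \qquad
    R = \frac{TP}{TP + FN}, \qquad
    F_1 = 2 \cdot \frac{P \cdot R}{P + R}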

Dutch          Persons    Locations    Organizations
Precision      0.940      0.950        0.942
Recall         0.588      0.760        0.559
F-measure      0.689      0.838        0.671

These figures have been derived from a k-fold cross-validation of 25 out of 100 manually tagged pages of Dutch newspapers from the KB. The results confirm that the Stanford system tends to be a bit "conservative": it has a somewhat lower recall for the benefit of higher precision, which is also what we wanted.

Conclusion and outlook

Within this final year of the project we look forward to seeing how far we can still boost these figures by adopting the extra modules from INL, and what results we can achieve on the French and German newspapers. We will also investigate software for linking the named entities to additional online resource descriptions and authority files such as DBpedia or VIAF to create Linked Data. The crucial question will be how well we can disambiguate the named entities and find a correct match in these resources. And, if there is time, we would also like to experiment with NER in other languages, such as Serbian or Latvian. If all goes well, you might already hear more about this at the upcoming IFLA newspapers conference "Digital transformation and the changing role of news media in the 21st Century".

