The Koninklijke Bibliotheek (KB), National Library of the Netherlands, is seeking proposals for its Researcher-in-residence program starting in 2017. This program offers early-career researchers the chance to work in the library with the Digital Humanities team and KB data. In return, we learn how researchers use the data of the KB. Together we will address your research question in a six-month project using the digital collections of the KB and computational techniques. The output of the project will be incorporated in the KB Research Lab and should ideally benefit a larger (scholarly) community.
I don’t live or work in the Netherlands. Can I apply?
Probably! Contact us at email@example.com and we’ll discuss your options.
I want to use my own dataset. Is that possible?
Sure! As long as you also use one of the datasets of the KB and it doesn’t limit the publication of the project end results.
I don’t know how to code. Is that a problem?
Not at all. We have skilled programmers who can help you with your project or we will try to find a match for you if you prefer someone else. This would mean submitting as a team and will cut the budget in half. Reach out to us to discuss the options.
This post is written by Dr. Jiyin He – Researcher-in-residence at the KB Research Lab from June – October 2014.
Being able to study primary sources is pivotal to the work of historians. Today’s mass digitisation of historical records such as books, newspapers, and pamphlets provides researchers with the opportunity to study an unprecedented amount of material without the need for physical access to archives. Access to this material is provided through search systems; however, the effectiveness of such systems seems to lag behind that of the major web search engines. Some of the things that make web search engines so effective are the redundancy of information, the fact that popular material is often considered relevant material, and the fact that the preferences of other users may be used to determine what you would find relevant. These properties do not hold, or are unavailable, for collections of historical material. For the past three months I have worked at the KB as a guest researcher. Together with Dr. Samuël Kruizinga, a historian, I explored how we can enhance the search system at the KB to address the search challenges of the historian. In this blog post, I will share our experience of working together, the system we have developed, as well as lessons learnt during this project.
At DH2013, we presented a poster to ask researchers what they need from a National Library. The responses varied from ‘Nothing, just give us your data’ to ‘We’d like to be fully supported with tools and services’, showing once again that different users have different requirements. In order to accommodate all groups of researchers, the Collections department of the KB, who ‘own’ the data, and the Research department, where tools and services are developed, combined efforts and spoke to scholars to discuss the best method of supporting their work. However, we noticed that it was still quite difficult to get a good idea of how they used our data and in what way our actions and decisions would benefit them. Also, it seemed that researchers were often not aware of what activities we undertake in this respect, which led to work being done twice.
The KB has about 10 million digitised newspaper pages, ranging from 1650 to 1995. We negotiated the rights to make these pages available for research, and this has happened more and more over the past years. However, we thought that many of these projects might be interested in knowing what others are doing, and we wanted to provide a networking opportunity for them to share their results. This is why we organised a newspapers symposium focusing on the digitised newspapers of the KB, which was a great success!
[A] topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents (Wikipedia).
Topic modelling is a very popular method in the Digital Humanities for discovering more about a large set of data, and it is also used by many researchers working on data of the KB. Unfortunately, not all topic modelling tools are easy to access, due to a lack of technical skills or a lack of access to the data, for example. The current guest researcher at the KB (Dr. Samuël Kruizinga) came across such problems while doing his research into the memory of the First World War in the KB newspapers. Not only was it difficult for him to select a corpus to work with, he was also unfamiliar with the go-to tool MALLET. Luckily, his university (Universiteit van Amsterdam) wanted to help and provided funds to organise a workshop, not only for him, but also for other academics interested in topic modelling.
Author: Tineke Koster
As I am writing this, volunteers are rekeying our 17th-century newspaper articles. Optical character recognition of the gothic typeface in use at the time has yielded poor results, making this part of our digital collection nearly inaccessible for full-text search. The Meertens Institute, which has an excellent track record when it comes to crowdsourcing, has developed the editor (in Dutch). Together with them we are working towards a full update of all newspaper issues from 1618 to 1700 that are available in our website Delpher.
Great news and, for some researchers, an eagerly awaited development. A bright future beckons in which our digital text corpus is 100% correct, just waiting to be mined for dynamic phenomena and paradigm shifts.
But we have to realize that without the proper precautions, correcting digital texts may also hinder researchers in their work. How so? These texts may have been used (browsed, mined, cited, etc.) by researchers in their earlier form. The improvement or enrichment may have consequences for the reproducibility of their research results.
For all researchers the need to reproduce research results is growing, driven by new guidelines and legislation. There is also a specific group of researchers that needs sustained access to older versions of digital texts. The need is greatest for research whose goal is to develop an algorithm and to assess its quality relative to previous versions of the same algorithm or to other algorithms. Without sustained access to older versions, these researchers cannot do their work.
Is it our role to provide this access? I hope to explain in a later blog post (soon!) how the National Library of the Netherlands is thinking about this issue. Meanwhile, I would be very interested to hear your experiences. How is this subject discussed in your organization? Does your organization have a policy in place to deal with it?
On Tuesday 28 and Wednesday 29 January the annual Conference for Academic Publishers Europe (APE) was held in Berlin, under the title “Redefining the Scientific Record”. – Report by Marcel Ras (NCDD) and Barbara Sierman (KB)
Dutch politics set on “golden road” to Open Access
During the first day the focus was on Open Access, starting with a presentation by the Dutch State Secretary for Education, Culture and Science on Open Access. In his presentation, called “Going for Gold”, Sander Dekker outlined his policy with regard to the practice of providing open access to research publications and how that practice will continue to evolve. Open access is “a moral obligation” according to Sander Dekker. Access to scientific knowledge is for everyone. It promotes knowledge sharing and knowledge circulation and is essential for the further development of society.
Open access means having electronic access to research publications, articles and books (free of charge). This is an international issue. Every year, approximately two million articles appear in 25,000 journals that are published worldwide. The Netherlands account for some 33,000 articles annually. Having unrestricted access to research results can help disseminate knowledge, move science forward, promote innovation and solve the problems that society faces.
The first steps towards open access were taken twenty years ago, when researchers began sharing their publications with one another on the Internet. In the past ten years, various stakeholders in the Netherlands have been working towards creating an open access system. A wide variety of rules, agreements and options for open access publishing have emerged in the research community. The situation is confusing for authors, readers and publishers alike, and the stakeholders would like this confusion to be resolved as quickly as possible.
The Dutch Government will provide direction so that the stakeholders know what to expect and are able to make arrangements with one another. It will promote “golden” open access: publication in journals that make research articles available online free of charge. The State Secretary’s aim is to fully implement the golden road to open access within ten years, in other words by 2024. In order to achieve this, at least 60 per cent of all articles will have to be available in open access journals within five years. A fundamental changeover will only be possible if we cooperate and coordinate with other countries.
Further reading: http://www.government.nl/issues/science/documents-and-publications/parliamentary-documents/2014/01/21/open-access-to-publications.html or http://www.rijksoverheid.nl/ministeries/ocw/nieuws/2013/11/15/over-10-jaar-moeten-alle-wetenschappelijke-publicaties-gratis-online-beschikbaar-zijn.html
Do researchers even want Open Access?
The two other keynote speakers, David Black and Wolfram Koch, presented their concerns about the transition from the current publishing model to open access. Researchers are increasingly using subject repositories for sharing their knowledge. There is an urgent need for a higher level of organization and for standards in this field. But who will take the lead? Also, we must not forget the systems for quality assurance and peer review. These are under pressure as enormous quantities of articles are being published and peer review tends to take place more and more after publication. Open access should lower the barriers to research for the users, but what about the barriers for scholars publishing their research? Koch stated that the traditional model worked fine for researchers and that they don’t want to change. However, there do not seem to be any figures to support this assertion.
It is interesting to note that in almost all presentations on the first day of APE digital preservation was mentioned one way or the other. The vocabulary was different, but it is acknowledged as an important topic. Accessibility of scientific publications for the long term is a necessity, regardless of the publishing model.
KB and NCDD workshop on roles and responsibilities
On the second day of the conference the focus was on innovation (the future of the article, dotcoms) and on preservation!
The National Library of The Netherlands (KB) and the Dutch Coalition for Digital Preservation (NCDD) organized a session on preservation of scientific output: “Roles and responsibilities in guaranteeing permanent access to the scholarly record”. The session was chaired by Marcel Ras, program manager for the NCDD.
The trend towards e-only access for scholarly information is growing at a rapid pace, as is the volume of data which is ‘born digital’ and has no print counterpart. As for scholarly publications, half of all serial publications will be online-only by 2016. For researchers and students there is a huge benefit, as they now have online access to journal articles to read and download, anywhere, any time. And they are making use of it to an increasing extent. However, the downside is an increasing dependency on access to digital information. Without permanent access to information, scholarly activities are no longer possible. For libraries there are many benefits associated with publishing and accessing academic journals online. E-only access has the potential to save the academic sector a considerable amount of money. Library staff resources required to process printed materials can be reduced significantly. Libraries also potentially save money in terms of the management and storage of, and end-user access to, print journals, while suppliers are willing to provide discounts for e-only access.
Publishers may not share post-cancellation and preservation concerns
However, there are concerns that what is now available in digital form may not always be available due to rapid technological developments or organisational developments within the publishing industry; these concerns, and questions about post-cancellation access to paid-for content, are key barriers to institutions making the move to e-only. There is a danger that e-journals become “ephemeral” unless we take active steps to preserve the bits and bytes that increasingly represent our collective knowledge. We are all familiar with examples of hardware becoming obsolete: 8-inch and 5.25-inch floppy discs, Betamax video tapes, and probably soon CD-ROMs. Software, too, is not immune to obsolescence.
In addition to this threat of technical obsolescence there is the changing role of libraries. Libraries have in the past assumed preservation responsibility for the resources they collect, while publishers have supplied the resources libraries need. This well-understood division of labour does not work in a digital environment and especially so when dealing with e-journals. Libraries buy licenses to enable their users to gain network access to a publisher’s server. The only original copy of an issue of an e-journal is not on the shelves of a library, but tends to be held by the publisher. But long-term preservation of that original copy is crucial for the library and research communities, and not so much for the publisher.
Can third-party solutions ensure safe custody?
So we may need new models and sometimes organizations to ensure safe custody of these objects for future generations. A number of initiatives have emerged in an effort to address these concerns. Research and development efforts in digital preservation issues have matured. Tools and services are being developed to help plan and perform digital preservation activities. Furthermore third-party organizations and archiving solutions are being established to help the academic community preserve publications and to advance research in sustainable ways. These trusted parties can be addressed by users when strict conditions (trigger events or post-cancellation) are met. In addition, publishers are adapting to changing library requirements, participating in the different archiving schemes and increasingly providing options for post-cancellation access.
In this session the problem was presented from the different viewpoints of the stakeholders in this game, focussing on the roles and responsibilities of the stakeholders.
Neil Beagrie explained the problem in depth, in its technical, organisational and financial dimensions. He highlighted the distinction between perpetual access and digital preservation. In the case of perpetual access, organisations have a license or subscription for an e-journal and either the publisher discontinues the journal or the organisation stops its subscription; keeping e-journals available in this case is called “post-cancellation” access. This situation differs from long-term preservation, where the e-journal in general is preserved for users whether they ever subscribed or not. Several initiatives for the latter situation were mentioned, as well as the benefits organisations like LOCKSS, CLOCKSS, Portico and the e-Depot of the KB bring to publishers. More details about his vision can be read in the DPC Tech Watch report Preservation, Trust and Continuing Access to e-Journals. (Presentation: APE2014_Beagrie)
Susan Reilly of the Association of European Research Libraries (LIBER) sketched the changing role of research libraries. It is essential that the scholarly record is preserved, which encompasses e-journal articles, research data, e-books, digitized cultural heritage and dynamic web content. Libraries are a major player in this field and can be seen as an intermediary between publishers and researchers. (Presentation: APE2014_Reilly)
Eefke Smit of the International Association of Scientific, Technical and Medical Publishers (STM) explained to the audience why digital preservation is especially important in the playing field of STM publishers. Many services are available, but more collaboration is needed. The APARSEN project is focusing on some aspects, such as trust, persistent identifiers and cost models, but a wide range of challenges remains to be solved as the traditional publication models continue to change, from text and documents to “multi-versioned, multi-sourced and multi-media”. (Presentation: APE2014_Smit)
As Peter Burnhill from EDINA, University of Edinburgh, explained, continued access to the scholarly record is under threat as libraries are no longer the custodians of the scholarly record in e-journals. As he phrased it nicely: libraries no longer have e-collections but only e-connections. His KEEPERS Registry is a global registry of e-journal archiving and offers an overview of who is preserving what. Organisations like LOCKSS, CLOCKSS, the e-Depot, the Chinese National Science Library and, recently, the US Library of Congress submit their holding information to the KEEPERS Registry. However welcome this is, it was also emphasized that the registry covers only a small percentage of existing e-journals (currently about 19% of the e-journals with an ISSN assigned). More support for the preserving libraries and more collaboration with publishers is needed to preserve the e-journals of smaller publishers and improve coverage. (Presentation: APE2014_Burnhill)
(Reblogged with slight changes from http://www.ncdd.nl/blog/?p=3467)
The Dutch data archive DANS invited two ‘great thinkers and doers’ (quote by Kevin Ashley on Twitter) in scholarly communications to do some out-of-the-box thinking about the future of scholarly communications – and the role of the digital archive in that picture. The joint efforts of DANS visiting fellows Herbert van de Sompel (Los Alamos) and Andrew Treloar (ANDS) made for a really informative and inspiring workshop on 20 January 2014 at DANS. Report & photographs by Inge Angevaare, KB Research
Life used to be so simple. Researchers would do their research and submit their results in the form of articles to scholarly journals. The journals would filter out the good stuff, print it, and distribute it. Libraries around the world would buy the journals, and any researcher wishing to build upon the published work could refer to it by simple citation. Years later and thousands of miles away, a simple citation would still bring you to an exact copy of the original work. Roosendaal and Geurts captured this system in four functions:
- Registration: allows claims of precedence for a scholarly finding (submission of manuscript)
- Certification: establishes validity of claim (peer review, and post-publication commentary)
- Awareness: allows actors in the system to remain aware of new claims (discovery services)
- Archiving: preserves the scholarly record (libraries for print; publishers and special archives like LOCKSS, Portico and the KB for e-journals).
- (A last function, that of academic recognition and rewards, was not discussed during this workshop.)
So far so good.
But then we went digital. And we created the world-wide web. And nothing was the same ever again.
Future scholarly communications: diffuse and ever-changing
Van de Sompel and Treloar went online to discover some pointers to what the future might look like – and found that the future is already here, ‘just not evenly distributed’. In other words: one discipline is moving into the digital reality at a faster pace than another, and geographically there are many differences too. But van de Sompel and Treloar found many pointers to what is coming and grouped them in Roosendaal & Geurts’s functional framework:
- Registration is increasingly done on (discipline-specific) online platforms such as BioRxiv, ideacite (where one can register mere ‘ideas’!) and Github, a collaborative platform for software developers (also used by the KB research team).
Common characteristics include:
– Decoupling registration from certification
– Timestamping, versioning
– Registration of various types of objects
– Machines also function as creators and contributors.
(We’ll discuss below what these features mean for digital archiving)
- Certification is also moving to lots of online platforms, such as PubMed Commons, PubPeer, ZooUniverse and even Slideshare, where the number of views and downloads is an indication of the interest generated by the contents.
Common characteristics include:
– Peer-review is decoupled from the publication process
– Certification of various types of objects (not just text)
– Machines carry out some of the validating
– Social endorsement
- Awareness is facilitated by online platforms such as the Dutch ‘gateway to scholarly information’ NARCIS, myExperiment and a really advanced platform such as eLabNotebook RSS where malaria research is being documented as it happens and completely in the open.
Common characteristics include:
– Awareness for various types of objects (not just text)
– Real time awareness
– Awareness support targeted at machines
– Awareness through social media.
- Archiving is done by library consortia such as CLOCKSS, data archives such as DANS Easy, and, although it was not mentioned during the presentation, I may add our own KB e-Depot.
Common characteristics include:
– Archiving for various types of objects
– Distributed archives
– Archival consortia
– Audit for trustworthiness (see, e.g., the European Framework for Audit and Certification of Digital Repositories).
Here’s how van de Sompel and Treloar summarise the fundamental changes going on. (The fact that the arrows point both ways is, to my mind, slightly confusing. The changes are from left to right, not the other way around.)
Huge implications for digital libraries and archives
The above slide merits some study, because the implications for libraries and digital archives are huge. In the words of van de Sompel and Treloar:
From the ‘journal system’ we are moving towards what van de Sompel and Treloar call a ‘Web of Objects’, which is much more difficult to organise in terms of archiving, especially because the ‘objects’ now include ever-changing software and operating systems, as well as data which are not properly handled and thus prone to disappear. (Notice on a student café door: ‘If you have stolen my laptop, you may keep it if you just let me download my PhD thesis.’)
It’s like web archiving – ‘but we have to do better’
Van de Sompel and Treloar compared scholarly communications to websites – ever-changing content, lots of different objects (software, text, video, etc.), links that go all over the place. Plus, I may add, an enormous variety of producers on the internet. Van de Sompel and Treloar concluded: ‘We have to do better than present web-archiving methods if we are to preserve the scholarly record in any meaningful way.’
‘The web platforms that are increasingly used for scholarship (Wikis, GitHub, Twitter, WordPress, etc.) have desirable characteristics, such as versioning, timestamping and social embedding. Still, they record rather than archive: they are short-term, without guarantees, read/write and reflect the scholarly process, whereas archiving concerns longer terms, is trying to provide guarantees, is read-only and results in the scholarly record.’
The slide below sums it all up – and it is with this slide that van de Sompel and Treloar turned the discussion over to their audience of some 70 digital data experts, mostly from the Netherlands:
Group discussions about the digital archive of the future
So, what does all of this mean for digital libraries and digital archives? One afternoon obviously was not enough to analyse the situation in full, but here are some of the comments reported from the (rather informal) break-out sessions:
- One thing is certain: it is a playing field full of uncertainties. Velocity, variety and volume are the key characteristics of the emerging landscape. And everybody knows how difficult these are to manage.
- The ‘document-centred’ days, when only journal and book publications were rated as First Class Scholarly Objects, are over. Treloar suggested a move to a ‘researcher-centric’ approach, where First Class Objects include publications, data and software.
- To complicate matters: the scholarly record is not all digital – there are plenty of physical objects to deal with.
- How do we get stuff from the recording platforms to the archives? Van de Sompel suggested a combination of approaches. Some of it we may be able to harvest automatically. Some of it may come in because of rules and regulations. But Van de Sompel and Treloar both figured that rules and regulations would not be able to cover all of it. That is when Andrea Scharnhorst (workshop moderator, DANS) suggested that we will have to allow for a certain degree of serendipity (‘toeval’ in Dutch).
- Whatever libraries and archives do, time-stamped versioning will become an essential feature of any archival venture. This is the only way to ensure that scientists can adequately cite anything and verify any research (‘I used version X of software Y at time Z – which can be found in a fixed form in Archive D’).
- The archival community introduced the concept of persistent identifiers (PIDs) to manage the uncertainties of the web. But perhaps the concept’s usefulness will be limited to the archival stage. Should we distinguish between operational use cases and archival use cases?
- Lots of questions remain about roles and responsibilities in this new picture, and who is to pay for what. Looking at the Netherlands, the traditional distribution of tasks between the KB National Library (books, journals) and the data archives (research data) certainly merits discussion in the framework of the NCDD (Netherlands Organisation for Digital Preservation); the NCDD’s new programme manager, Marcel Ras, attended the workshop with interest.
- Who or what will filter the stuff that is worth keeping from the rest?
- Interoperability is key in this complex picture. And thus we will need standards and minimal requirements (as, e.g., in the Data Seal of Approval).
- Perhaps baffled by so much uncertainty in the big picture, some attendees suggested that we first concentrate on what we have now and/or are developing now, and at least get that right. In other words, let’s not forget that there are segments of the scientific landscape that are being covered even now. The rest of the scholarly communications landscape was characterised by Laurents Sesink (DANS) as ‘the Wild West’.
- What if the Internet fails? What if it succumbs to hacks and abuse? This possibility is not wholly unimaginable. But the workshop decided not to go there. At least not today.
In his concluding remarks Peter Doorn, Director of DANS, admitted that there had been doubts about organising this workshop. Even Herbert van de Sompel and Andrew Treloar asked themselves: ‘Do we know enough?’ Clearly, the answer is: no, we do not know what the future will bring. And that is maybe our biggest challenge: getting our minds to accept that we will never again ‘know enough’ at any time. While yet having to make decisions every day, every year, on where to go next. DANS is to be commended for creating a very open atmosphere and for allowing two great minds to help us identify at least some major trends to inspire our thinking.