KB Research

Research at the National Library of the Netherlands

Author: ingeangevaare

How to maximise usage of digital collections

Libraries want to understand the researchers who use their digital collections, and researchers want to understand the nature of these collections better. The seminar ‘Mining digital repositories’ brought them together at the Dutch Koninklijke Bibliotheek (KB) on 10-11 April 2014 to discuss both the good and the bad of working with digitised collections – especially newspapers – and to look ahead at what a ‘digital utopia’ might look like. One easy point to agree on: it would be a world with less restrictive copyright laws. And a world where digital ‘portals’ are transformed into ‘platforms’ where researchers can freely ‘tinker’ with the digital data. – Report & photographs by Inge Angevaare, KB.


Hans-Jörg Lieder of the Berlin State Library (front left) is given an especially warm welcome by conference chair Toine Pieters (Utrecht), ‘because he was the only guy in Germany who would share his data with us in the Biland project.’

Libraries and researchers: a changing relationship

‘A lot has changed in recent years,’ Arjan van Hessen of the University of Twente and the CLARIN project told me. ‘Ten years ago someone might have suggested that perhaps we should talk to the KB. Now we are practically in bed together.’

But each relationship has its difficult moments. Researchers are not happy when they discover gaps in the data on offer, such as missing issues or volumes of newspapers, or incomprehensible transcriptions of texts because of inadequate OCR (optical character recognition). Conference organisers Toine Pieters and Jaap Verheul (University of Utrecht) invited Hans-Jörg Lieder of the Berlin State Library to explain why he ‘could not give researchers everything everywhere today’.

Lieder & Thomas: ‘Digitising newspapers is difficult’

Both Deborah Thomas of the Library of Congress and Hans-Jörg Lieder stressed how complicated it is to digitise historical newspapers. ‘OCR does not recognise the layout in columns, or the “continued on page 5”. Plus the originals are often in a bad state – brittle and sometimes torn paper, or they are bound in such a way that text is lost in the middle. And there are all these different fonts, e.g., Gothic script in German, and the well-known long-s/f confusion.’ Lieder provided the ultimate proof of how difficult digitising newspapers is: ‘Google only digitises books, they don’t touch newspapers.’


Thomas: ‘The stuff we are digitising is often damaged’

Another thing researchers should be aware of: ‘Texts are liquid things. Libraries enrich and annotate texts, versions may differ.’ Libraries do their best to connect and cluster collections of newspapers (e.g., in the Europeana Newspapers project), but ‘the truth of the matter is that most newspaper collections are still analogue; at this moment we have only bits and pieces in digital form, and there is a lot of bad OCR.’ There is no question that libraries are working on improving the situation, but funding is always a problem. And the choices to be made with bad OCR are sometimes difficult: ‘Should we manually correct it all, or maybe retype it, or maybe even wait a couple of years for OCR technology to improve?’
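To make the cheapest of those options concrete, here is a toy sketch (my illustration, not an actual library workflow) of automatic post-correction for the long-s confusion mentioned above, where the historical long s (ſ) is misread by OCR as an ‘f’. Production systems use full historical lexicons and statistical models, but the principle is the same: generate candidate corrections and accept the one found in a dictionary.

```python
# Toy sketch of lexicon-based OCR post-correction for the long-s/f confusion.
# VOCABULARY stands in for a full historical lexicon.
VOCABULARY = {"society", "press", "estate", "newspaper"}

def candidate_corrections(token: str) -> set:
    """Generate variants of token with each 'f' optionally restored to 's'."""
    variants = {token}
    for i, ch in enumerate(token):
        if ch == "f":
            variants |= {v[:i] + "s" + v[i + 1:] for v in list(variants)}
    return variants

def correct(token: str) -> str:
    """Return a vocabulary word reachable by f->s substitutions, else the token."""
    if token.lower() in VOCABULARY:
        return token
    for variant in candidate_corrections(token.lower()):
        if variant in VOCABULARY:
            return variant
    return token

print([correct(t) for t in "the eftate of the prefs".split()])
# -> ['the', 'estate', 'of', 'the', 'press']
```

Even this toy version shows why full manual correction is so costly by comparison: the automatic approach is cheap but can only fix errors it has been told to expect.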


Librarians and researchers discuss what is possible and what not. From the left, Steven Claeyssens, KB Data Services, Arjan van Hessen, CLARIN, and Tom Kenter, Translantis.

Researchers: how to mine for meaning

Researchers themselves are debating how they can fit these new digital resources into their academic work. Obviously, being able to search millions of newspaper pages from different countries in a matter of days opens up a lot of new research possibilities. Conference organisers Toine Pieters and Jaap Verheul (University of Utrecht) are both involved in the HERA Translantis project, which breaks away from traditional ‘national’ historical research by looking at the transnational influence of so-called ‘reference cultures’:


Definition of Reference Cultures in the Translantis project which mines digital newspaper collections

In the 17th century the Dutch Republic was such a reference culture. In the 20th century the United States developed into a reference culture and Translantis digs deep into the digital newspaper archives of the Netherlands, the UK, Belgium and Germany to try and find out how the United States is depicted in public discourse:


Jaap Verheul (Translantis) shows how the US is depicted in Dutch newspapers

Joris van Eijnatten introduced another transnational HERA project, ASYMENC, which is exploring cultural aspects of European identity with digital humanities methodologies.

All of this sounds straightforward enough, but researchers themselves have yet to develop a scholarly culture around the new resources:

  • What type of research questions do the digital collections allow? Are these new questions or just old questions to be researched in a new way?
  • What is scientific ‘proof’ if the collections you mine have big gaps and faulty OCR?
  • How to interpret the findings? You can search words and combinations of words in digital repositories, but how can you assess what the words mean? Meanings change over time. Also: how can you distinguish between irony and seriousness?
  • How do you know that a repository is trustworthy?
  • How to deal with language barriers in transnational research? Mere translations of concepts do not reflect the sentiment behind the words.
  • How can we analyse what newspapers do not discuss (also known as the ‘Voldemort’ phenomenon)?
  • How sustainable is digital content? Long-term storage of digital objects is uncertain and expensive. (Microfilms are much easier to keep, but then again, they do not allow for text mining …)
  • How do available tools influence research questions?
  • Researchers need a better understanding of text mining per se.

Some humanities scholars have yet to be convinced of the need to go digital

Rens Bod, Director of the Dutch Centre for Digital Humanities, enthusiastically presented his ideas about the value of parsing (analysing parts of speech) for uncovering deep patterns in digital repositories. If you want to know more: Bod recently published a book about it.
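To give a flavour of what parsing adds, here is a minimal sketch using the open-source NLTK toolkit (my choice of tool for illustration, not necessarily Bod’s): once every word carries a part-of-speech tag, grammatical patterns can be counted and mined across a repository just like words can.

```python
# Minimal part-of-speech tagging sketch with NLTK (pip install nltk).
# Note: the model names below may vary between NLTK versions.
import nltk

nltk.download("punkt", quiet=True)                       # tokeniser model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

sentence = "The United States developed into a reference culture."
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# e.g. [('The', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('developed', 'VBD'), ...]
```

At repository scale, the frequencies of such tag patterns (say, which adjectives cluster around ‘America’ over the decades) become countable evidence, which is the point of parsing for deep patterns.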


Professor Rens Bod: ‘At the University of Amsterdam we offer a free course in working with digital data.’

But in the context of this blog, his remarks about the lack of big-data awareness and competencies among many humanities scholars, including young students, were perhaps more striking. The University of Amsterdam offers a crash course in working with digital data to bridge the gap. The free one-week course deals with all aspects of working with data, from ‘gathering data’ to ‘cooking data’.

As the scholarly dimensions of working with big data are not this blogger’s expertise, I will not delve into these further but gladly refer you to an article Toine Pieters and Jaap Verheul are writing about the scholarly outcomes of the conference [I will insert a link when it becomes available].


Conference hosts Jaap Verheul (left) and Toine Pieters taking analogue notes for their article on Mining Digital Repositories. And just in case you wonder: the meeting rooms are probably the last rooms in the KB to be migrated to Windows 7

More data providers: the ‘bad’ guys in the room

It was the commercial data providers in the room themselves who spoke of ‘bad guys’ and the ‘bogeyman’ – an image both Ray Abruzzi of Cengage Learning/Gale and Elaine Collins of DC Thomson Family History were hoping to at least soften a bit. Both companies provide huge quantities of digitised material. And, yes, they are in it for the money, which would account for their bogeyman image. But, they both stressed, everybody benefits from their efforts:


Value proposition of DC Thomson Family History

Cengage Learning is putting 25-30 million pages online annually. Thomson is digitising 750 million (!) newspaper & periodical pages for the British Library. Collins: ‘We take the risk, we do all the work, in exchange for certain rights.’ If you want to access the archive, you have to pay.

In and of itself, this is quite understandable. Public funding just doesn’t cut it when you are talking billions of pages. Both the KB’s Hans Jansen and Rens Bod (U. of Amsterdam) stressed the need for public/private partnerships in digitisation projects.

And yet.

Elaine Collins readily admitted that researchers ‘are not our most lucrative stakeholders’; that most of Thomson’s revenue comes from genealogists and the general public. So why not give digital humanities scholars free access to their resources for research purposes, if need be under the strictest conditions that the information does not go anywhere else? Both Abruzzi and Collins admitted that such restricted access is difficult to organise. ‘And once the data are out there, our entire investment is gone.’

Libraries to mediate access?

Perhaps, Ray Abruzzi allowed, access to certain types of data, e.g., metadata, could be allowed under certain conditions, but, he stressed, individual scholars who apply to Cengage for access do not stand a chance. Their requests for data are far too varied for Cengage to have any kind of business proposition. And there is the trust issue. Abruzzi recommended that researchers turn to libraries to mediate access to certain content. If libraries give certain guarantees, then perhaps …


You think OCR is difficult to read? Try human handwriting!

What do researchers want from libraries?

More data, of course, including more contemporary data (… ah, but copyright …)

And better quality OCR, please.

What if libraries have to choose between quality and quantity? That is when things get tricky, because the answer would depend on the researcher you question. Some may choose quantity, others quality.

Should libraries build tools for analysing content? The researchers in the room seemed to agree that libraries should concentrate on data rather than tools. Tools are very temporary, and researchers often need to build the tools around their specific research questions.

But it would be nice if libraries started allowing users to upload enrichments to the content, such as better OCR transcriptions and/or metadata.


Researchers and libraries discussing what is desirable and what is possible. In the front row, from the left, Irene Haslinger (KB), Julia Noordegraaf (U. of Amsterdam), Toine Pieters (Utrecht), Hans Jansen (KB); further down the front row James Baker (British Library) and Ulrich Tiedau (UCL). Behind the table Jaap Verheul (Utrecht) and Deborah Thomas (Library of Congress).

And there is one more urgent request: that libraries become more transparent in what is in their collections and what is not. And be more open about the quality of the OCR in the collections. Take, e.g., the new Dutch national search service Delpher. A great project, but scholars must know exactly what’s in it and what’s not for their findings to have any meaning. And for scientific validity they must be able to reconstruct such information in retrospect. So a full historical overview of what is being added at what time would be a valuable addition to Delpher. (I shall personally communicate this request to the Delpher people, who are, I may add, working very hard to implement user requests).

American newspapers

Deborah Thomas of the US Library of Congress: ‘This digital age is a bit like the American Wild West. It is a frontier with lots of opportunities and hopes for striking it rich. And maybe it is a bit unruly.’

New to the library: labs for researchers

Deborah Thomas of the Library of Congress made no bones about her organisation’s strategy towards researchers: we put out the content, and you do with it whatever you want. In addition to APIs (application programming interfaces), the Library is also allowing for downloads of bulk content. The basic content is available free of charge, but additional metadata levels may come at a price.
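As an illustration of that ‘here is the content, do what you want’ approach: the Library of Congress’s Chronicling America newspaper service exposes its digitised pages through a public JSON API. A hedged sketch follows (endpoint and field names as publicly documented at the time of writing; no API key required):

```python
# Hedged sketch: full-text search against the Chronicling America API
# (Library of Congress). Returns page-level hits including their raw OCR text.
import requests

resp = requests.get(
    "https://chroniclingamerica.loc.gov/search/pages/results/",
    params={"andtext": "emigration", "format": "json", "rows": 3},
    timeout=30,
)
resp.raise_for_status()
for page in resp.json()["items"]:
    # 'date' is YYYYMMDD; 'ocr_eng' holds the uncorrected OCR text
    print(page["date"], page["title"], "::", page["ocr_eng"][:60], "...")
```

Note that what comes back is the raw, uncorrected OCR discussed earlier in this report – the researcher, not the library, decides what to do about its quality.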

The British Library (BL) is taking a more active approach. The BL’s James Baker explained how the BL is trying to bridge the gap between researchers and content by providing special labs for researchers. As I (unfortunately!) missed that parallel session, let me mention the KB’s own efforts to set up a KB lab where researchers are invited to experiment with KB data making use of open-source tools. The lab is still in its ‘pre-beta phase’, as Hildelies Balk of the KB explained. If you want the full story, by all means attend the Digital Humanities Benelux Conference in The Hague on 12-13 June, where Steven Claeyssens and Clemens Neudecker of the KB are scheduled to launch the beta version of the platform. Here is a sneak preview of the lab: a scansion machine built by KB Data Services in collaboration with phonologist Marc van Oostendorp (audio in Dutch):

https://www.youtube.com/watch?v=FcTufco9P3A

Europeana: the aggregator

“Portals are for visiting; platforms are for building on.”

Another effort by libraries to facilitate transnational research is the aggregation of their content in Europeana, especially Europeana Newspapers. For the time being the metadata are being aggregated, but in Alistair Dunning’s vision, Europeana will grow from an end-user portal into a data brain, a cloud platform that will include the content and allow for metadata enrichment:


Alistair Dunning: ‘Europeana must grow into a data brain to bring disparate data sets together.’


Dunning’s vision of Europeana 3.0
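The portal-versus-platform distinction is easy to make concrete: a platform is something you can program against. Below is a hedged sketch against Europeana’s public Search API (v2; endpoint and the free-API-key requirement as documented at the time of writing, and field names may differ per record):

```python
# Hedged sketch: querying the aggregated Europeana metadata by machine
# rather than through the end-user portal.
import requests

resp = requests.get(
    "https://api.europeana.eu/record/v2/search.json",
    params={
        "wskey": "YOUR_API_KEY",  # placeholder - register with Europeana for a key
        "query": "newspapers",
        "rows": 5,
    },
    timeout=30,
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item.get("title"), "|", item.get("dataProvider"))
```

For now such queries return metadata only; in Dunning’s ‘data brain’ vision, the content itself would also be reachable this way.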

Dunning also indicated that Europeana might develop brokerage services to clear content for non-commercial purposes. In a recent interview Toine Pieters said that researchers would welcome Europeana to take such a role, ‘because individual researchers should not be bothered with all these access/copyright issues.’ In the United States, the Library of Congress is not contemplating a move in that direction, Deborah Thomas told her audience. ‘It is not our mission to negotiate with publishers.’ And recent ‘Mickey Mouse’ legislation, said to have been inspired by Disney interests, seems to be leading to less rather than more access.

Dreaming of digital utopias

What would a digital utopia look like for the conference attendees? Jaap Verheul invited his guests to dream of what they would do if they were granted, say, €100 million to spend as they pleased.

Deborah Thomas of the Library of Congress would put her money into partnerships with commercial companies to digitise more material, especially the post-1922 stuff (less restrictive copyright laws being part and parcel of the dream). And she would build facilities for uploading enrichments to the data.

James Baker of the British Library would put his money into the labs for researchers.

Researcher Julia Noordegraaf of the University of Amsterdam (heritage and digital culture) would rather put the money towards improving OCR quality.

Joris van Eijnatten’s dream took the Europeana plans a few steps further. His dream would be of a ‘Globiana 5.0’ – a worldwide, transnational repository filled with material in standardised formats, connected to bilingual and multilingual dictionaries and researched by a network of multilingual, big data-savvy researchers. In this context, he suggested that ‘Google-like companies might not be such a bad thing’ in terms of sustainability and standardisation.


Joris van Eijnatten: ‘Perhaps – and this is a personal observation – Google-like companies are not such a bad thing after all in terms of sustainability and standardisation of formats.’

At the end of the two-day workshop, perhaps not all of the ambitious agenda had been covered. But, then again, nobody had expected that.


Mining Digital Repositories 2014 – the ambitious agenda

The trick is for providers and researchers to keep talking and conquer this ‘unruly’ Wild West of digital humanities bit by bit, step by step.

And, by all means, allow researchers to ‘tinker’ with the data. Verheul: ‘There is a certain serendipity in working with big data that allows for playfulness.’


Breaking down walls in digital preservation (Part 2)

Here is part 2 of the digital preservation seminar which identified ways to break down walls between research & development and daily operations in libraries and archives (continued from Breaking down walls in digital preservation, part 1). The seminar was organised by SCAPE and the Open Planets Foundation in The Hague on 2 April 2014. – Report & photographs by Inge Angevaare, visualisations by Elco van Staveren


Ross King of the Austrian Institute of Technology (and of OPF) kicking off the afternoon session by singlehandedly attacking the wall between daily operations and R&D

Experts meet managers

Ross King of the Austrian Institute of Technology described the features of the (technical) SCAPE project, which intends to help institutions build preservation environments that are scalable – to bigger files, to more heterogeneous files, and to larger volumes of files to be processed. King was the one who identified the wall that exists between daily operations in the digital library and research & development (in digital preservation):


The Wall between Production & R&D as identified by Ross King

Zoltan Szatucsek of the Hungarian National Archives shared his experiences with one of the SCAPE tools from a manager’s point of view: ‘Even trying out the Matchbox tool from the SCAPE project was too expensive for us.’ King admitted that the Matchbox case had not yet been entirely successful. ‘But our goal remains to deliver tools that can be downloaded and used in practice.’

Maureen Pennock of the British Library sketched her organisation’s journey to embed digital preservation [link to slides to follow]. Her own digital preservation department (now at 6 FTE) was moved around a few times before it was nested in the Collection Care department, which was then merged with Collection management. ‘We are now where we should be: in the middle of the Collections department and right next to the Document processing department. And we work closely with IT, strategy development, procurement/licensing and collection security and risk management.’


The British Library’s strategy calls for further embedding of digital preservation, without taking the formal step of certification

Pennock elaborated on the strategic priorities mentioned above (see slides) by noting that the British Library has chosen not to strive for formal certification within the European Framework (unlike, e.g., the Dutch KB). Instead, the BL intends to hold bi-annual audits to measure progress. The BL intends to ensure that ‘all staff working with digital content understand preservation issues associated with it.’ Questioned by the Dutch KB’s Hildelies Balk, Pennock confirmed that the teaching materials the BL is preparing could well be shared with the wider digital preservation community. Here is Pennock’s concluding comment:


Digital preservation is like a bicycle – one size doesn’t fit everyone … but everybody still recognises the bicycle

Marcin Werla from the Poznań Supercomputing and Networking Centre (PSNC) provided an overview of the infrastructure PSNC is providing for research institutions and for cultural heritage institutions. It is a distributed network based on the fast (20 Gb/s) Polish optical network:


The PSNC network includes facilities for long-term preservation

Interestingly, the network mostly serves smaller institutions. The Polish National Library and Archives have built their own systems.

Werla stressed that proper quality control at the production stage is difficult because of the bureaucratic Polish public procurement system.

Heiko Tjalsma of the Dutch research data archive DANS pitched the 4C project, which was established to ‘create a better understanding of digital curation costs through collaboration.’


Tjalsma: ‘We can only get a better idea of what digital curation costs by collaborating and sharing data’

At the moment there are several cost models available in the community (see, e.g., earlier posts), but they are difficult to compare. The 4C project intends to a) establish an international curation cost exchange framework, and b) build a Cost Concept Model – which will define what to include in the model and what to exclude.

The need for a clearer picture of curation costs is undisputed, but, Tjalsma added, ‘it is very difficult to gather detailed data, even from colleagues.’ Our organisations are reluctant to make their financial data available. And both ‘time’ and ‘scale’ make matters more difficult. The only way to go seems to be anonymisation of data, and for that to work, the project must attract as many participants as possible. So: please register at http://www.4cproject.eu/community-resources/stakeholder-participation – and participate.

Building bridges between expert and manager

The last part of the day was devoted to building bridges between experts and managers. Dirk von Suchodoletz of the University of Freiburg introduced the session with a topic that is often considered an ‘expert-only’ topic: emulation.


Dirk von Suchodoletz: ‘The EaaS project intends to make emulation available for a wider audience by providing it as a service.’

The emulation technique has been around for a while, and it is considered one of the few preservation methods available for very complex digital objects – but take-up by the community has been slow, because emulation is seen as too complex for non-experts. The Emulation as a Service (EaaS) project intends to bridge the gap to practical implementation by taking away many of the technical worries from memory institutions. A demo of Emulation as a Service is available for OPF members. Von Suchodoletz encouraged his audience to have a look at it, because the service can only be made to work if many memory institutions decide to participate.


Getting ready for the last roundtable discussion about the relationship between experts and managers

How R&D and the library business relate

‘What inspired the EaaS project,’ Hildelies Balk (KB) wanted to know from von Suchodoletz, ‘was it your own interest or was there some business requirement to be met?’ Von Suchodoletz admitted that it was his own research interest that kicked off the project; business requirements entered the picture later.

Birgit Henriksen of the Royal Library, Denmark: ‘We desperately need emulation to preserve the games in our collection, but because it is such a niche, funding is hard to come by.’ Jacqueline Slats of the Dutch National Archives echoed this observation: ‘The NA and the KB together developed the emulation tool Dioscuri, but because there was no business demand, development was halted. We may pick it up again as soon as we start receiving interactive material for preservation.’

This is what happened next, as visualised by Elco van Staveren:

Some highlights from the discussions:

  • Timing is of the essence. Obviously, R&D is always ahead of operations, but if it is too far ahead, funding will be difficult. Following user needs is no good either, because then R&D becomes mere procurement. Are there any cases of proper just-in-time development? Barbara Sierman of the KB suggested Jpylyzer (translation of Jpylyzer for managers) – the need arose for quality control in a massive TIFF-to-JPEG 2000 migration at the KB intended to cut costs, and R&D delivered (a small sketch of jpylyzer in action follows after this list).
  • Another successful implementation: the PRONOM registry. The (UK) National Archives had a clear business case for developing it. On the other hand, the GDFR technical registry did not tick the boxes of timeliness, impetus and context.
  • For experts and managers to work well together, managers must start accepting a certain amount of failure. We are breaking new ground in digital preservation; failures are inevitable. Can we make managers understand that even failures make us stronger, because the organisation gains a lot of experience and knowledge? And what is an acceptable failure rate? Henriksen suggested that managing expectations can do the trick: ‘Do not expect perfection.’


    Some of the panel members (from left to right) Maureen Pennock (British Library), Hildelies Balk (KB), Mies Langelaar (Rotterdam Municipal Archives), Barbara Sierman (KB) and Mette van Essen (Dutch National Archives)

  • We need a new set of metrics to define success in the ever changing digital world.
  • Positioning the R&D department within Collections can help make collaboration between the two more effective (Andersen, Pennock). Henriksen: ‘At the Danish Royal Library we have started involving both R&D and collections staff in scoping projects.’
  • And then again … von Suchodoletz suggested that sometimes a loose coupling between R&D and business can be more effective, because staff in operations can get too bogged down by daily worries.
  • Sometimes breaking down the wall is just too much to ask, suggested van Essen. We may have to decide to jump the wall instead, at least for the time being.
  • Bridge builders can be key to making projects succeed, staff members who speak both the languages of operations and of R&D. Balk and Pennock stressed that everybody in the organisation should know about the basics of digital preservation.
  • Underneath all of the organisation’s doings must lie a clear common vision to inspire individual actions, projects and collaboration.
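As promised above: jpylyzer, the KB-developed open-source JP2 validator that came out of that just-in-time need, can be run from the command line or as a Python module. A minimal sketch following its documented Python API (exact output structure may vary between versions; the file name is hypothetical):

```python
# Minimal sketch: automated quality control of one JPEG 2000 file
# with jpylyzer (pip install jpylyzer).
from jpylyzer import jpylyzer

result = jpylyzer.checkOneFile("page_0001.jp2")   # hypothetical file name
# checkOneFile() returns an XML element; its isValid element holds the verdict
print("valid JP2:", result.findtext("isValid"))
```

Run over millions of migrated files, a check like this is what turns a cost-cutting migration from a leap of faith into a controlled operation.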

In conclusion: participants agreed that this seminar had been a fruitful counterweight to technical hackathons in digital preservation. More seminars may follow. If you participated (or read these blogs), please use the commentary box for any corrections and/or follow-up.

‘In an ever changing digital world, we must allow for projects to fail – even failures bring us lots of knowledge.’

 

Breaking down walls in digital preservation (Part 1)

People & knowledge are the keys to breaking down the walls between daily operations and digital preservation (DP) within our organisations. DP is not a technical issue, but information technology must be embraced as a core feature of the digital library. Such were some of the conclusions of the seminar organised by the SCAPE project/Open Planets Foundation at the Dutch National Library (KB) and National Archives (NA) on Wednesday 2 April. – Report & photographs by Inge Angevaare, visualisations by Elco van Staveren

Newcomer questions some current practices


Menno Rasch (KB): ‘Do correct me if I am wrong’

Menno Rasch was appointed Head of Operations at the Dutch KB six months ago – but ‘I still feel like a newcomer in digital preservation.’ His division includes the Collection Care department, which is responsible for DP, but there are close working relationships with the Research and IT departments in the Innovation Division. Rasch’s presentation about embedding DP in business practices at the KB made some provocative points:

  • We have a tendency to cover up our mistakes and failures rather than expose them and discuss them in order to learn as a community – which is what airline pilots do. The platform is there, the Atlas of Digital Damages set up by the KB’s Barbara Sierman, but it is being underused. Of course lots of data are protected by copyright or privacy regulations, but there surely must be some way to anonymise the data.
  • In libraries and archives, we still look upon IT as ‘the guys that make tools for us’. ‘But IT = the digital library.’
  • We need to become more pragmatic. Implementing the OAIS standard is a lot of work – perhaps it is better to take this one step at a time.
  • ‘If you don’t do it now, you won’t do it a year from now.’
  • ‘Any software we build is temporary – so keep the data, not the software.’
  • Most metadata are reproducible – so why not store them in a separate database and put only the most essential preservation metadata in the OAIS information package? That way we can continue improving the metadata. Of course these must be backed up too (an annual snapshot?), but may tolerate a less expensive storage regime than the objects.
  • About developments at the KB: ‘To replace our old DIAS system, we are now developing software to handle all of our digital objects – which is an enormous challenge.’

The SCAPE/OPF seminar on Managing Digital Preservation, 2 April 2014, The Hague

Digital collections and the Titanic

Zoltan Szatucsek from the Hungarian National Archives used the Titanic as the metaphor for his presentation – without necessarily implying that we are headed for the proverbial iceberg, he added. Although … ‘many elements from the Titanic story can illustrate how we think’:

  • Titanic received many warnings about ice formations, and yet it was sailing at full speed when disaster struck.
  • Our ship – the organisation – is quite conservative. It wants to deal with digital records in the same way it deals with paper records. And at the Hungarian National Archives IT staff and archivists are in the same department, which does not work because they do not speak each other’s language.


    Zoltan Szatucsek argued that putting IT staff and archivists together in the Hungarian National Archives caused ‘language’ problems; his Danish colleagues felt that in their case close proximity had rather helped improve communications

  • The captain must acquire new competences. He must learn to manage staff, funding, technology, equipment, etc. We need processes rather than tools.
  • The crew is in trouble too. Their education has not adapted to digital practices. Underfunding in the sector is a big issue. Strangely enough, staff working with medieval resources were much quicker to adopt digital practices than those working with contemporary material. The latter seem to want to put off any action until the legal transfer to the archives actually occurs (after 15-20 years).
  • Echoing Menno Rasch’s presentation, Szatucsek asked the rhetorical question: ‘Why do we not learn from our mistakes?’ A few months after the Titanic, another ship went down in similar circumstances.
  • Without proper metadata, objects are lost forever.
  • Last but not least: we have learned that digital preservation is not a technical challenge. We need to create a complete environment in which to preserve.

Is DP heading for the iceberg as well? Visualisation of Szatucsek’s presentation.

OPF: trust, confidence & communication

Ed Fay was appointed director of the Open Planets Foundation (OPF) only six weeks ago. But he presented a clear vision of how the OPF should function within the community, smack in the middle, as a steward of tools, a champion of open communications, trust & confidence, and a broker between commercial and non-commercial interests:


Ed Fay’s vision of the Open Planets Foundation’s role in the digital preservation community

Fay also shared some of his experiences in his former job at the London School of Economics:


Ed Fay illustrated how digital preservation was moved around a few times in the London School of Economics Library, until it found its present place in the Library division

So, what works, what doesn’t?

The first round-table discussion was introduced by Bjarne Andersen of the Statsbiblioteket Aarhus (DK). He sketched his institution’s experiences in embedding digital preservation.


Bjarne Andersen (right) conferring with Birgit Henriksen (Danish Royal Library, left) and Jan Dalsten Sorensen (Danish National Archives). ‘SCRUM has helped move things along’

He mentioned the recently introduced SCRUM-based methodology as really having helped to move things along – it is an agile way of working which allows for flexibility. The concept of ‘user stories’ helps to make staff think about the ‘why’. Menno Rasch (KB) agreed: ‘SCRUM works especially well if you are not certain where to go. It is a step-by-step methodology.’

Some other lessons learned at Aarhus:

  • The responsibility for digital preservation cannot be with the developers implementing the technical solutions
  • The responsibility needs to be close to ‘the library’
  • Don’t split the analogue and digital library entirely – the two have quite a lot in common
  • IT development and research are necessary activities to keep up with a changing landscape of technology
  • Changing the organisation a few times over the years helped us educate the staff by bringing traditional collection/library staff close to IT for a period of time.

Group discussion. From the left: Jan Dalsten Sorensen (DK), Ed Fay (OPF), Menno Rasch (KB), Marcin Werla (PL), Bjarne Andersen (DK), Elco van Staveren (KB, visualising the discussion), Hildelies Balk (KB) and Ross King (Austria)

And here is how Elco van Staveren visualised the group discussion in real time:

Some highlights from the discussion:

  • Embedding digital preservation is about people
  • It really requires open communication channels.
  • A hierarchical organisation and/or an organisation with silos only builds up the wall. Engaged leadership is called for. And result-oriented incentives for staff rather than hierarchical incentives.
  • Embedding digital preservation in the organisation requires a vision that is shared by all.
  • Clear responsibilities must be defined.
  • Move the budgets to where the challenges are.
  • The organisation’s size may be a relevant factor in deciding how to organise DP. In large organisations, the wheels move slowly (number of staff: Hungarian National Archives 700; British Library 1,500; Austrian National Library 400; KB Netherlands 300; London School of Economics 120; Statsbiblioteket Aarhus 200).
  • Most organisations favour bringing analogue and digital together as much as possible.
  • When it comes to IT experts and librarians/archivists learning each other’s languages, it was suggested that maybe hard-core IT staff need not get too deeply involved in library issues – in fact, some IT staff might consider it bad for their careers. Software developers, however, do need to get involved in library/archive affairs.
  • Management must also be taught the language of the digital library and digital preservation.

(Continued in Breaking down walls in digital preservation, part 2)

Seminar agenda and links to presentations

Keep Calm 'cause Titanic is Unsinkable

Roles and responsibilities in guaranteeing permanent access to the records of science – at the Conference for Academic Publishers (APE) 2014

On Tuesday 28 and Wednesday 29 January the annual Conference for Academic Publishers Europe was held in Berlin. The title of the conference: Redefining the Scientific Record. – Report by Marcel Ras (NCDD) and Barbara Sierman (KB)

Dutch politics set on “golden road” to Open Access

During the first day the focus was on Open Access, starting with a presentation by the Dutch State Secretary for Education, Culture and Science on Open Access. In his presentation, called “Going for Gold”, Sander Dekker outlined his policy with regard to the practice of providing open access to research publications and how that practice will continue to evolve. Open access is “a moral obligation”, according to Sander Dekker. Access to scientific knowledge is for everyone. It promotes knowledge sharing and knowledge circulation and is essential for the further development of society.

OA "gold road" supporter and State Secretary Sander Dekker (right) during a recent visit to the K

“Golden road” open access supporter and State Secretary Sander Dekker (right) during a recent visit to the KB – photo KB/Jacqueline van der Kort

Open access means having electronic access to research publications, articles and books (free of charge). This is an international issue. Every year, approximately two million articles appear in 25,000 journals that are published worldwide. The Netherlands account for some 33,000 articles annually. Having unrestricted access to research results can help disseminate knowledge, move science forward, promote innovation and solve the problems that society faces.

The first steps towards open access were taken twenty years ago, when researchers began sharing their publications with one another on the Internet. In the past ten years, various stakeholders in the Netherlands have been working towards creating an open access system. A wide variety of rules, agreements and options for open access publishing have emerged in the research community. The situation is confusing for authors, readers and publishers alike, and the stakeholders would like this confusion to be resolved as quickly as possible.

The Dutch Government will provide direction so that the stakeholders know what to expect and are able to make arrangements with one another. It will promote “golden” open access: publication in journals that make research articles available online free of charge. The State Secretary’s aim is to fully implement the golden road to open access within ten years, in other words by 2024. In order to achieve this, at least 60 per cent of all articles will have to be available in open access journals in five years’ time. A fundamental changeover will only be possible if we cooperate and coordinate with other countries.

Further reading: http://www.government.nl/issues/science/documents-and-publications/parliamentary-documents/2014/01/21/open-access-to-publications.html or http://www.rijksoverheid.nl/ministeries/ocw/nieuws/2013/11/15/over-10-jaar-moeten-alle-wetenschappelijke-publicaties-gratis-online-beschikbaar-zijn.html

Do researchers even want Open Access?

The two other keynote speakers, David Black and Wolfram Koch, presented their concerns about the transition from the current publishing model to open access. Researchers are increasingly using subject repositories for sharing their knowledge. There is an urgent need for a higher level of organization and for standards in this field. But who will take the lead? Also, we must not forget the systems for quality assurance and peer review. These are under pressure as enormous quantities of articles are being published and peer review tends to take place more and more after publication. Open access should lower the barriers to access for users, but what about the barriers for scholars publishing their research? Koch stated that the traditional model works fine for researchers: they don’t want to change. However, there do not seem to be any figures to support this assertion.

It is interesting to note that digital preservation was mentioned one way or another in almost all presentations on the first day of APE. The vocabulary differed, but it is acknowledged as an important topic. Accessibility of scientific publications for the long term is a necessity, regardless of the publishing model.

KB and NCDD workshop on roles and responsibilities

On the second day of the conference the focus was on innovation (the future of the article, dotcoms) and on preservation!

The National Library of The Netherlands (KB) and the Dutch Coalition for Digital Preservation (NCDD) organized a session on the preservation of scientific output: “Roles and responsibilities in guaranteeing permanent access to the scholarly record”. The session was chaired by Marcel Ras, program manager for the NCDD.

The trend towards e-only access for scholarly information is increasing at a rapid pace, as is the volume of data which is ‘born digital’ and has no print counterpart. As for scholarly publications, half of all serial publications will be online-only by 2016. For researchers and students there is a huge benefit, as they now have online access to journal articles to read and download, anywhere, any time. And they are making use of it to an increasing extent. However, the downside is that there is an increasing dependency on access to digital information. Without permanent access to information, scholarly activities are no longer possible. For libraries there are many benefits associated with publishing and accessing academic journals online. E-only access has the potential to save the academic sector a considerable amount of money. Library staff resources required to process printed materials can be reduced significantly. Libraries also potentially save money on the management and storage of print journals and on end-user access to them. And suppliers are willing to provide discounts for e-only access.

Publishers may not share post-cancellation and preservation concerns

However, there are concerns that what is now available in digital form may not always be available due to rapid technological developments or organisational developments within the publishing industry; these concerns, and questions about post-cancellation access to paid-for content, are key barriers to institutions making the move to e-only. There is a danger that e-journals become “ephemeral” unless we take active steps to preserve the bits and bytes that increasingly represent our collective knowledge. We are all familiar with examples of hardware becoming obsolete: 8-inch and 5.25-inch floppy discs, Betamax video tapes, and probably soon CD-ROMs. Software is not immune to obsolescence either.

In addition to this threat of technical obsolescence there is the changing role of libraries. Libraries have in the past assumed preservation responsibility for the resources they collect, while publishers have supplied the resources libraries need. This well-understood division of labour does not work in a digital environment and especially so when dealing with e-journals. Libraries buy licenses to enable their users to gain network access to a publisher’s server. The only original copy of an issue of an e-journal is not on the shelves of a library, but tends to be held by the publisher. But long-term preservation of that original copy is crucial for the library and research communities, and not so much for the publisher.

Can third-party solutions ensure safe custody?

So we may need new models and sometimes organizations to ensure safe custody of these objects for future generations. A number of initiatives have emerged in an effort to address these concerns. Research and development efforts in digital preservation issues have matured. Tools and services are being developed to help plan and perform digital preservation activities. Furthermore third-party organizations and archiving solutions are being established to help the academic community preserve publications and to advance research in sustainable ways. These trusted parties can be addressed by users when strict conditions (trigger events or post-cancellation) are met. In addition, publishers are adapting to changing library requirements, participating in the different archiving schemes and increasingly providing options for post-cancellation access.

In this session the problem was presented from the different viewpoints of the stakeholders in this game, focussing on the roles and responsibilities of the stakeholders.

Neil Beagrie explained the problem in depth, in a technical, organisational and financial sense. He highlighted the distinction between perpetual access and digital preservation. In the case of perpetual access, an organisation has a license or subscription for an e-journal and either the publisher discontinues the journal or the organisation stops its subscription – keeping e-journals available in this case is called “post-cancellation”. This situation differs from long-term preservation, where the e-journal in general is preserved for users whether they ever subscribed or not. Several initiatives for the latter situation were mentioned, as well as the benefits organisations like LOCKSS, CLOCKSS, Portico and the e-Depot of the KB bring to publishers. More details about his vision can be read in the DPC Tech Watch report Preservation, Trust and Continuing Access to e-Journals. (Presentation: APE2014_Beagrie)

Susan Reilly of the Association of European Research Libraries (LIBER) sketched the changing role of research libraries. It is essential that the scholarly record is preserved, and it encompasses e-journal articles, research data, e-books, digitized cultural heritage and dynamic web content. Libraries are a major player in this field and can be seen as an intermediary between publishers and researchers. (Presentation: APE2014_Reilly)

Eefke Smit of the International Association of Scientific, Technical and Medical Publishers (STM) explained to the audience why digital preservation is especially important in the playing field of STM publishers. Many services are available, but more collaboration is needed. The APARSEN project is focusing on some aspects, like trust, persistent identifiers and cost models, but there is still a wide range of challenges to be solved, as the traditional publication models will continually change: from text and documents to “multi-versioned, multi-sourced and multi-media”. (Presentation: APE2014_Smit)

As Peter Burnhill from EDINA, University of Edinburgh, explained, continued access to the scholarly record is under threat as libraries are no longer the custodians of the scholarly record in e-journals. As he phrased it nicely: libraries no longer have e-collections, only e-connections. His KEEPERS Registry is a global registry of e-journal archiving and offers an overview of who is preserving what. Organisations like LOCKSS, CLOCKSS, the e-Depot, the Chinese National Science Library and, recently, the US Library of Congress submit their holdings information to the KEEPERS Registry. However, it was also emphasized that the registry covers only a small percentage of existing e-journals (currently about 19% of the e-journals with an ISSN assigned). More support for the preserving libraries and more collaboration with publishers is needed to preserve the e-journals of smaller publishers and improve coverage. (Presentation: APE2014_Burnhill)

(Reblogged with slight changes from http://www.ncdd.nl/blog/?p=3467)

On-line scholarly communications: vd Sompel and Treloar sketch the future playing field of digital archives

The Dutch data archive DANS invited two ‘great thinkers and doers’ (quote by Kevin Ashley on Twitter) in scholarly communications to do some out-of-the-box thinking about the future of scholarly communications – and the role of the digital archive in that picture. The joint efforts of DANS visiting fellows Herbert van de Sompel (Los Alamos) and Andrew Treloar (ANDS) made for a really informative and inspiring workshop on 20 January 2014 at DANS. Report & photographs by Inge Angevaare, KB Research


Rembrandt’s 17th-century scholar Dr. Tulp overseeing Herbert van de Sompel outlining the research world of the 21st century (the painting is a copy …)

Life used to be so simple. Researchers would do their research and submit their results in the form of articles to scholarly journals. The journals would filter out the good stuff, print it, and distribute it. Libraries around the world would buy the journals and any researcher wishing to build upon the published work could refer to it by simple citation. Years later and thousands of miles away, a simple citation would still bring you to an exact copy of the original work.

Van de Sompel and Treloar [the link brings you to their workshop slides] quoted Roosendaal & Geurts (1998) in summing up the functions this ‘journal system’ effectively performed:

  • Registration: allows claims of precedence for a scholarly finding (submission of manuscript)
  • Certification: establishes validity of claim (peer review, and post-publication commentary)
  • Awareness: allows actors in the system to remain aware of new claims (discovery services)
  • Archiving: preserves the scholarly record (libraries for print; publishers and special archives like LOCKSS, Portico and the KB for e-journals).
  • (A last function, that of academic recognition and rewards, was not discussed during this workshop.)

So far so good.

But then we went digital. And we created the world-wide web. And nothing was the same ever again.


Andrew Treloar (at the back) captivating his audience

Future scholarly communications: diffuse and ever-changing

Van de Sompel and Treloar went online to discover some pointers to what the future might look like – and found that the future is already here, ‘just not evenly distributed’. In other words: one discipline is moving into the digital reality at a faster pace than another, and geographically there are many differences too. But van de Sompel and Treloar found many pointers to what is coming and grouped them in Roosendaal & Geurts’s functional framework:

  • Registration is increasingly done on (discipline-specific) online platforms such as bioRxiv, ideacite (where one can register mere ‘ideas’!) and GitHub, a collaborative platform for software developers (also used by the KB research team).
    Common characteristics include:
    – Decoupling registration from certification
    – Timestamping, versioning
    – Registration of various types of objects
    – Machines also function as creators and contributors.
    (We’ll discuss below what these features mean for digital archiving)
  • Certification is also moving to lots of online platforms, such as PubMed Commons, PubPeer, Zooniverse and even Slideshare, where the number of views and downloads is an indication of the interest generated by the contents.
    Common characteristics include:
    – Peer-review is decoupled from the publication process
    – Certification of various types of objects (not just text)
    – Machines carry out some of the validating
    – Social endorsement
  • Awareness is facilitated by online platforms such as the Dutch ‘gateway to scholarly information’ NARCIS, myExperiment and a really advanced platform such as eLabNotebook RSS where malaria research is being documented as it happens and completely in the open.
    Common characteristics include:
    – Awareness for various types of objects (not just text)
    – Real time awareness
    – Awareness support targeted at machines
    – Awareness through social media.
  • Archiving is done by library consortia such as CLOCKSS, data archives such as DANS EASY and – although not mentioned during the presentation, I may add – our own KB e-Depot.
    Common characteristics include:
    – Archiving for various types of objects
    – Distributed archives
    – Archival consortia
    – Audit for trustworthiness (see, e.g., the European Framework for Audit and Certification of Digital Repositories).

Very few seats remained unoccupied

Fundamental changes

Here’s how van de Sompel and Treloar summarise the fundamental changes going on. (The fact that the arrows point both ways is, to my mind, slightly confusing. The changes are from left to right, not the other way around.)

Slide by van de Sompel and Treloar summarising the fundamental changes in scholarly communication

Huge implications for digital libraries and archives

The above slide merits some study, because the implications for libraries and digital archives are huge. In the words of vd Sompel and Treloar:

Slide by van de Sompel and Treloar on the implications for digital libraries and archives

From the ‘journal system’ we are moving towards what van de Sompel and Treloar call a ‘Web of Objects’, which is much more difficult to organise in terms of archiving, especially because the ‘objects’ now include ever-changing software & operating systems, as well as data which are not properly handled and are thus prone to disappear (notice on a student cafe door: ‘If you have stolen my laptop, you may keep it if you just let me download my PhD thesis’).


Why archiving is more difficult in the Web of Objects (if print is too small, check out Slideshare original)

It’s like web archiving – ‘but we have to do better’

Van de Sompel and Treloar compared scholarly communications to websites – ever-changing content, lots of different objects (software, text, video, etc.), links that go all over the place. Plus, I may add, an enormous variety of producers on the internet. Van de Sompel and Treloar concluded: ‘We have to do better than present web-archiving methods if we are to preserve the scholarly record in any meaningful way.’


Two ‘great thinkers and doers’ confer – Herbert van de Sompel (left) and Andrew Treloar

‘The web platforms that are increasingly used for scholarship (Wikis, GitHub, Twitter, WordPress, etc.) have desirable characteristics, such as versioning, timestamping and social embedding. Still, they record rather than archive: they are short-term, without guarantees, read/write and reflect the scholarly process, whereas archiving concerns longer terms, is trying to provide guarantees, is read-only and results in the scholarly record.’

The slide below sums it all up – and it is with this slide that van de Sompel and Treloar turned the discussion over to their audience of some 70 digital data experts, mostly from the Netherlands:


A work in progress: the scholarly communications arena of the future

Group discussions about the digital archive of the future

So, what does all of this mean for digital libraries and digital archives? One afternoon obviously was not enough to analyse the situation in full, but here are some of the comments reported from the (rather informal) break-out sessions:

  • One thing is certain: it is a playing field full of uncertainties. Velocity, variety and volume are the key characteristics of the emerging landscape. And everybody knows how difficult these are to manage.
  • The ‘document-centred’ days, where only journal and book publications were rated as First Class Scholarly Objects are over. Treloar suggested a move to a ‘researcher-centric’ approach, where First Class Objects include publications and data and software.
  • To complicate matters: the scholarly record is not all digital – there are plenty of physical objects to deal with.
  • How do we get stuff from the recording platforms to the archives? Van de Sompel suggested a combination of approaches. Some of it we may be able to harvest automatically. Some of it may come in because of rules and regulations. But Van de Sompel and Treloar both figured that rules and regulations would not be able to cover all of it. That is when Andrea Scharnhorst (workshop moderator, DANS) suggested that we will have to allow for a certain degree of serendipity (‘toeval’ in Dutch).

Andrea Scharnhorst (DANS): ‘Perhaps we have to allow for a certain degree of serendipity’

  • Whatever libraries and archives do, time-stamped versioning will become an essential feature of any archival venture. This is the only way to ensure that scientists can adequately cite anything and verify any research (‘I used version X of software Y at time Z – which can be found in a fixed form in Archive D’). (A minimal sketch of one existing mechanism for this, the Memento protocol, follows after this list.)
  • The archival community introduced the concept of persistent identifiers (PIDs) to manage the uncertainties of the web. But perhaps the concept’s usefulness will be limited to the archival stage. Should we distinguish between operational use cases and archival use cases?
  • Lots of questions remain about roles and responsibilities in this new picture, and about who is to pay for what. Looking at the Netherlands, the traditional distribution of tasks between the KB National Library (books, journals) and the data archives (research data) certainly merits discussion in the framework of the NCDD (Dutch Coalition for Digital Preservation); the NCDD’s new programme manager, Marcel Ras, attended the workshop with interest.

Breakout discussions about infrastructure implications

  • Who or what will filter the stuff that is worth keeping from the rest?
  • Interoperability is key in this complex picture. And thus we will need standards and minimal requirements (as, e.g., in the Data Seal of Approval)
  • Perhaps baffled by so much uncertainty in the big picture, some attendants suggested that we first concentrate on what we have now and/or are developing now, and at least get that right. In other words, let’s not forget that there are segments of the scientific landscape that are being covered even now. The rest of the scholarly communications landscape was characterised by Laurents Sesink (DANS) as ‘the Wild West’.

In this breakout session, clearly discussions focussed on the role of the archive. Selection: when and by whom? Roles and responsibilities?

  • What if the Internet fails? What if it succumbs to hacks and abuse? This possibility is not wholly unimaginable. But the workshop decided not to go there. At least not today.
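As flagged in the list above, one existing building block for time-stamped access deserves a concrete illustration: the Memento protocol (RFC 7089, co-authored by van de Sompel) lets a client ask for a web resource as it existed at a given moment. Below is a hedged sketch against the Internet Archive’s Wayback Machine, which implements Memento (header and endpoint behaviour as publicly documented; details may change):

```python
# Hedged sketch: datetime negotiation with a Memento TimeGate.
# The client asks for a page "as of" a given date via Accept-Datetime.
import requests

resp = requests.get(
    "http://web.archive.org/web/http://www.kb.nl/",   # TimeGate for this URL
    headers={"Accept-Datetime": "Mon, 20 Jan 2014 12:00:00 GMT"},
    timeout=30,
)
print(resp.url)                                # URL of the nearest archived snapshot
print(resp.headers.get("Memento-Datetime"))    # archival timestamp of that snapshot
```

This is exactly the ‘version X at time Z in Archive D’ pattern the discussion called for, applied to web resources.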

In his concluding remarks Peter Doorn, Director of DANS, admitted that there had been doubts about organising this workshop. Even Herbert van de Sompel and Andrew Treloar asked themselves: ‘Do we know enough?’ Clearly, the answer is: no, we do not know what the future will bring. And that is maybe our biggest challenge: getting our minds to accept that we will never again ‘know enough’ at any time. While yet having to make decisions every day, every year, on where to go next. DANS is to be commended for creating a very open atmosphere and for allowing two great minds to help us identify at least some major trends to inspire our thinking.

See also:

  • tweets #rtwsaf (after the official name of the workshop, Riding the Wave and the Scholarly Archive of the Future – the title refers to Riding the Wave, the 2010 European Commission report on scholarly communications, which was the last major report available on the issue).
  • Blog post by Simon Hodson
Where do we go from here? Peter Doorn asked his two visiting fellows

Where do we go from here? Peter Doorn asked his two visiting fellows in Alice-in-Wonderland fashion
