KB Research

Research at the National Library of the Netherlands

Month: September 2013

Presenting European Historic Newspapers Online

As was posted earlier on this blog, the KB participates in the European project Europeana Newspapers. In this project, we are working together with 17 other institutions (libraries, technical partners and networking partners) to make 18 million European newspaper pages available via Europeana at title level. In addition, The European Library is building a dedicated portal to make the newspapers available as full text. However, many of the libraries do not yet have OCR for their newspapers, which is why the project is working with the University of Innsbruck, CCS Content Conversion Specialists GmbH from Hamburg and the KB to enrich these pages with OCR, Optical Layout Recognition (OLR) and Named Entity Recognition (NER).

Hans-Jörg Lieder

Hans-Jörg Lieder of the Berlin State Library presents the Europeana Newspapers project at our September 2013 workshop in Amsterdam.

In June, the project held a workshop on refinement, but now it was time to discuss aggregation and presentation. This workshop took place in Amsterdam on 16 September, during The European Library Annual Event. There was a good group of people, not only from the project partners and the associated partners, but also from outside the consortium. After the project, TEL hopes to offer these institutions too a chance to send in their newspapers for Europeana, so we were very happy to have them join us.

The workshop kicked off with an introduction from Marieke Willems of LIBER and Hans-Jörg Lieder of the Berlin State Library. They were followed by Markus Muhr from TEL, who introduced the aggregation plan and the schedule for the project partners. With so many partners, it can be quite difficult to find a schedule that works well and ensures everyone sends in their material on time. After the aggregation, TEL will then have to do some work on the metadata to convert it to the Europeana Data Model. Markus was followed by a presentation from Channa Veldhuijsen of the KB, who, unfortunately, could not be there in person. However, her elaborate presentation on usability testing provided some good insights into how to make your website the best it can be and how to find out what your users really think when they are browsing your site.


It was then time for Alastair Dunning of TEL to showcase the portal they have been preparing for Europeana Newspapers. Unfortunately, the wifi connection could not cope with so many visitors, and only some people could follow his presentation along on their own devices. Still, there were some valuable feedback points, which TEL will use to improve the portal. The portal is not yet accessible from outside, so people who missed the presentation will need to wait a bit longer to see and browse the European newspapers.

What we can already see, however, are the websites of partners that have been online for some time. It was very interesting to see the different choices each partner made to showcase their collection. We heard from people from the British Library, the National and University Library of Iceland, the National and University Library of Slovenia, the National Library of Luxembourg and the National Library of the Czech Republic.


Yves Mauer from the National Library of Luxembourg presenting their newspaper portal

The day ended with a lovely presentation by Dean Birkett of Europeana who, drawing partly on Channa’s notes, went through all the previously presented websites and offered comments on how to improve them. The videos he used in his talk are available on YouTube. His key points were:

  1. Make the type size large: 16px is the recommended size.
  2. Be careful with colours. Some online newspaper sites use red to highlight important information, but red is normally associated with warning signals and errors in the user’s mind.
  3. Use words to indicate language choices (e.g. ‘English’, ‘français’), not flags. The Spanish flag won’t necessarily be interpreted to mean ‘click here for Spanish’ if the user is from Mexico.
  4. Cut down on unnecessary text. Make it easy for users to skim (e.g. through the use of bullet points).

All in all, it was a very useful afternoon in which I learned a lot about what users want from a website. If you want to see more, all presentations can be found on the Slideshare account of Europeana Newspapers, or join us at one of the following events:

  • Workshop on Newspapers in Europe and the Digital Agenda. British Library, London. September 29-30th, 2014.
  • National Information Days.
    • National Library of Austria. March 25-26th, 2014.
    • National Library of France. April 3rd, 2014.
    • British Library. June 9th, 2014.

1st Succeed hackathon @ KB

In recent weeks, rumors spread at the KB National Library of the Netherlands that a party of programmers would be coming to the library to participate in a so-called “hackathon”. In the beginning, the IT department in particular was rather curious: should we expect port scans from within the National Library’s network? Would special security measures be needed? Fortunately, none of that was necessary.

Normally, a “hackathon” is nothing to be afraid of. On the contrary: these informal gatherings of software developers, who work collaboratively on creating and improving new or existing software tools and/or data, have emerged as a prominent pattern in recent years. In particular, the hack4Europe series of hack days organized by Europeana has shown that this model can also be successfully applied in the context of cultural heritage digitization.

After that was sorted out, the facilities department of the KB deployed a network switch with static IP addresses, ensuring that participants of the event had a fast and robust internet connection at all times, with access both to the public parts of the internet and to the restricted research infrastructure of the KB – which received immediate praise from the hackers. Well done, KB!

So when the software developers from Austria, England, France, Poland, Spain and the Netherlands gathered at the KB last Thursday, everyone already knew they were indeed here to work collaboratively on one of the European projects the KB is involved in: the Succeed project. The project had called in software developers from all over Europe for the 1st Succeed hackathon, to work on the interoperability of tools and workflows for text digitization.

There was a good mix of people from the digitization and digital preservation communities, with some additional Taverna expertise tossed in. While about half of the participants had taken part in Planets, IMPACT or SCAPE, the other half were new to the field and eager to learn about the outcomes of these projects and how Succeed will build on them.

And so, after some introduction followed by coffee and fruit, the 15 participants dove straight into the various topics that had been suggested prior to the event as needing attention. And indeed, the results that the various groups presented after 1.5 days (but only 8 hours of effective working time) were pretty impressive…

Hackers at work @ KB Succeed hackathon

The developers from INL were able to integrate some of the servlets they created in IMPACT and Namescape with the interoperability-framework – although some bugs were also uncovered while doing so. They will be fixed asap, rest assured! Also, with the help of the PSNC digital libraries team, Bob and Jesse were able to create a small training set for Tesseract that outperformed the standard dictionary, despite some problems found in training Tesseract version 3.02. Fortunately it was possible to apply the training to version 3.01 and then run the generated classifier in Tesseract version 3.02, which is the current stable(?) release.
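
For those who have never trained Tesseract 3.x themselves: training is a fixed sequence of command-line tools. The Python sketch below outlines the rough shape of that sequence (file names are hypothetical, and the exact steps differ slightly between 3.01 and 3.02 – consult the Tesseract training documentation for the authoritative recipe):

    # Rough sketch of the classic Tesseract 3.x training sequence.
    # File names are hypothetical; 'nld' stands in for the language code.
    import subprocess

    def run(cmd):
        print(' '.join(cmd))
        subprocess.check_call(cmd)

    base = 'nld.myfont.exp0'  # language.font.exp<n> naming convention

    # 1. Generate a box file (character bounding boxes) for manual correction.
    run(['tesseract', base + '.tif', base, 'batch.nochop', 'makebox'])
    # 2. Run Tesseract in training mode against the corrected box file.
    run(['tesseract', base + '.tif', base, 'box.train'])
    # 3. Extract the character set from the box file.
    run(['unicharset_extractor', base + '.box'])
    # 4. Cluster shape features; font_properties must list the training font.
    run(['mftraining', '-F', 'font_properties', '-U', 'unicharset',
         '-O', 'nld.unicharset', base + '.tr'])
    run(['cntraining', base + '.tr'])
    # 5. After prefixing the outputs (inttemp, pffmtable, normproto, ...)
    #    with 'nld.', combine everything into nld.traineddata.
    run(['combine_tessdata', 'nld.'])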

Even better: the colleagues from Poznań (who have a track record of successful participation in hackathons) had already done some Tesseract training earlier and developed some supporting tools for it. Piotr quickly created a tool description for the “cutouts” tool, which automatically creates binarized clippings of characters from a source image. On the second day another feature of the cutouts application was added: creating an artificial image, suitable for training Tesseract, from the binarized character clippings. Time eventually ran out while the two operations were being wrapped in a Taverna workflow, but given how little work remained, we look forward to seeing the Taverna workflow for Tesseract training become available shortly! Certainly this is also of interest to the eMOP project in the US, in which the KB is a partner as well.
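
The cutouts tool itself is PSNC’s, but the idea behind the artificial training image is simple enough to sketch. The snippet below is a minimal illustration (not the actual tool; directory and file names are made up) of pasting binarized character clippings onto a blank page that Tesseract can then be trained on:

    # Minimal illustration of composing an artificial training image
    # from binarized character clippings (not the PSNC cutouts tool).
    import glob
    from PIL import Image

    clippings = sorted(glob.glob('clippings/*.png'))  # hypothetical input
    page = Image.new('1', (2000, 3000), 1)            # white 1-bit page
    x, y, row_height, padding = 10, 10, 0, 10

    for path in clippings:
        glyph = Image.open(path).convert('1')
        if x + glyph.width + padding > page.width:    # wrap to a new line
            x, y = 10, y + row_height + padding
            row_height = 0
        page.paste(glyph, (x, y))
        x += glyph.width + padding
        row_height = max(row_height, glyph.height)

    page.save('nld.artificial.exp0.tif')  # ready for box file generation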

Meanwhile, another colleague from Poznań was investigating the process of creating packages for Debian-based Linux operating systems from existing (open-source) tools. And despite using a laptop with OS X Mountain Lion, Tomasz managed to present a valid Debian package (even including an icon and a man page) – kudos! Certainly the help of Carl from the Open Planets Foundation was also partly to “blame” for that… next steps will include generating a changelog straight from GitHub. To be continued!
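
For the curious: the heart of a Debian source package is the debian/ directory, with the control file at its core. A minimal control file for a tool like cutouts might look like this (all values are hypothetical, shown only to give an impression of the format):

    Source: cutouts
    Section: utils
    Priority: optional
    Maintainer: Example Maintainer <maintainer@example.org>
    Build-Depends: debhelper (>= 9)
    Standards-Version: 3.9.4

    Package: cutouts
    Architecture: any
    Depends: ${shlibs:Depends}, ${misc:Depends}
    Description: create binarized character clippings from page images
     Longer description of the tool, indented by one space as Debian
     policy requires.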

Two colleagues from PSNC-dl working on a Tesseract training workflow

Another group attending the event was the team from the LITIS lab at the University of Rouen. Thierry demonstrated the newest PLaIR tools, such as the newspaper segmenter, which is capable of automatically separating articles in scanned newspaper images. The PLaIR tools use GEDI as their encoding format, so David immediately invested some work in also supporting the PAGE format, the predominant document encoding format used in the IMPACT tools, thereby in principle establishing interoperability between the IMPACT and PLaIR applications. In addition, since the PLaIR tools are mostly available as web services already, Philippine started creating Taverna workflows for these methods. We look forward to complementing the existing IMPACT workflows with these additional modules from PLaIR!

Screenshot of the PLaIR system for post-correction of newspaper OCR
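
As an aside, PAGE is plain XML, so interoperability work of this kind typically starts small. A minimal Python sketch for reading text regions from a PAGE file might look as follows (the namespace is the 2013-07-15 revision of the PAGE schema; the input file name is made up):

    # Minimal sketch: reading text regions from a PAGE XML file.
    import xml.etree.ElementTree as ET

    NS = {'pc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15'}

    page = ET.parse('newspaper_page.xml').getroot().find('pc:Page', NS)
    print(page.get('imageFilename'), page.get('imageWidth'), page.get('imageHeight'))

    for region in page.findall('pc:TextRegion', NS):
        # In this schema revision the polygon is a 'points' attribute:
        # "x1,y1 x2,y2 ..."
        points = region.find('pc:Coords', NS).get('points')
        polygon = [tuple(map(int, p.split(','))) for p in points.split()]
        print(region.get('id'), polygon[:3], '...')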

All this was done without requiring any help from the PRImA group at the University of Salford, Greater Manchester, who maintain the PAGE format and a number of tools to support it. So, with some free time on his hands, Christian from PRImA instead had a deeper look at Taverna and at the PAGE serialization of the recently released open-source OCR evaluation tool from the University of Alicante, the technical lead of the Centre of Competence, and found it to work quite well. Good to finally have an open-source community tool for OCR evaluation with support for PAGE – and more features shall be added soon: we’re thinking of word accuracy rate, bag-of-words evaluation and more – send us your feature requests (or, even better, a pull request).
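
To make those two proposed metrics concrete: below is a minimal sketch of how they can be computed – word accuracy rate from a word-level edit distance, and a bag-of-words F-score that ignores reading order. This illustrates the metrics only; it is not the Alicante tool’s actual code:

    # Two OCR evaluation metrics in miniature (illustration only).
    from collections import Counter

    def word_accuracy(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # Standard dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + cost)
        return 1.0 - d[len(ref)][len(hyp)] / float(len(ref))

    def bag_of_words_fscore(reference, hypothesis):
        # Order-insensitive: useful when OCR scrambles reading order.
        ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
        overlap = sum((ref & hyp).values())
        precision = overlap / float(sum(hyp.values()))
        recall = overlap / float(sum(ref.values()))
        return 2 * precision * recall / (precision + recall)

    print(word_accuracy('de oude krant', 'de onde krant'))        # ~0.67
    print(bag_of_words_fscore('de oude krant', 'krant de onde'))  # ~0.67

Both toy calls return two-thirds, but for different reasons: the first penalizes the misrecognized word in place, while the second only cares about the multiset of words, regardless of order.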

We were particularly glad that some developers beyond the usual MLA community suspects also found their way to the KB on those two days: a team from the Leiden University Medical Centre attended as well, keen on learning how they could use the T2-Client for their purposes. Initially slowed down by some issues encountered in deploying Taverna 2 Server on a Windows machine (don’t do it!), Reinout and Eelke were eventually able to resolve them simply by using Linux instead. We hope a further collaboration of Dutch Taverna users will arise from this!

Besides all the exciting new tools and features, it was good to see some others getting their hands dirty with (essential) engineering tasks – work progressed well on several issues from the interoperability-framework’s issue tracker: support for output directories is close to being fully implemented thanks to Willem Jan, and a good start was made on future MTOM support. Quique from the Centre of Competence was also able to improve the integration between the IMPACT services and the website Demonstrator Platform.

Without the help of experienced developers Carl from the Open Planets Foundation and Sven from the Austrian National Library (who had just conducted a training event for the SCAPE project in London earlier that same week, and quickly decided to cross the Channel for yet another workshop), this would not have been so easily possible. While Carl was helping out everywhere at once, Sven found some time to fit in a Taverna training session after lunch on Friday, which was hugely appreciated by the audience.

Sven Schlarb from the Austrian National Library delivering Taverna training

After seeing all the powerful capabilities of Taverna in combination with the interoperability-framework web services and scripts in a live demo, no one needed further reassurance that it was well worth spending the time to integrate this technology and work with the interoperability-framework and its various components.

Everyone said they really enjoyed the event and had learned plenty of valuable things they wanted to continue working with. So watch out for the next Succeed hackathon in sunny Alicante next year!

Preservation at Scale: workshop report

Digital preservation practitioners from Portico and from the National Library of the Netherlands (KB) organized a workshop on “Preservation at Scale” as part of iPRES 2013. This workshop aimed to articulate and, where possible, to address the practical problems institutions encounter as they collect, curate, preserve, and make content accessible at Internet scale.

Preservation at scale has entailed continual development of new infrastructure. In addition to preservation of digital documents and publications, data archives are collecting a vast amount of content which must be ingested, stored and preserved. Whether we have to deal with nuclear physics materials, social science datasets, audio and video content, or e-books and e-journals, the amount of data to be preserved is growing at a tremendous pace.

The presenters at this workshop each spoke from the experience of organizations in the digital preservation space that are wrestling with the issues introduced by large scale preservation. Each of these organizations has experienced annual increases in throughput of content, which they have had to meet, not just with technical adaptations (increases in hardware and software processing power), but often also with organizational re-definition, along with new organizational structures, processes, training, and staff development.

There were a number of broad categories addressed by the workshop speakers and participants:

  1. Technological adaptations
  2. Institutional adaptations
  3. Quality assurance at scale and across scale
  4. The scale of the long tail
  5. Economies and diseconomies of scale

Technological Adaptations
Many of the organizations represented at this workshop have gone through one or more cycles of technological expansion, adaptation, and platform migration to manage the current scale of incoming content, to take advantage of new advances in both hardware and software, or to respond to changes in institutional policy with respect to commercial vendors or suppliers.

These include both optimizations and large-scale platform migrations at the Koninklijke Bibliotheek, Harvard University Library, the Data Conservancy at Johns Hopkins University, and Portico, as well as the development by the PLANETS and SCAPE projects of frameworks, tools and test beds for implementing computing-intensive digital preservation processes such as the large-scale ingestion, characterization, and migration of large (multi-terabyte) and complex data sets.

A common challenge was reaching the limits of previous-generation architectures (whether those limits are those of capacity or of the capability to handle new digital object types), with the consequent need to make large-scale migrations both of content and of metadata.

Institutional Adaptations
For many of the institutions represented at this workshop, the increasing scale of digital collections has resulted in fundamental changes to those institutions themselves, including changes to an institution’s own definition of its mission and core activities. For these institutions, a difference in degree has meant a difference in kind.

For example, the Koninklijke Bibliotheek, the British Library, and Harvard University Library have all made digital preservation a library-level mandate. This shift from relegating the preservation of digital content to an organizational sub-unit to ensuring that digital preservation is an organization-wide endeavor is challenging, as it requires changing the mindsets of many in each organization. It has meant reallocation of resources from other activities. It has necessitated strategic planning and budgeting for long-term sustainability of digital assets, including digital preservation tools and frameworks – a fundamental shift from one-time, project-based funding. It has meant making choices; we cannot do everything. It has meant comprehensive review of organizational structures and procedures, and has entailed equally comprehensive training and development of new skill sets for new functions.

Quality Assurance at Scale and Across Scales
A challenge to scaling up the acquisition and ingest of content is the necessity of quality assurance for that content. Institutions are often far downstream from the creators of content, which brings with it many uncertainties and quality issues. There was much discussion of how institutions define just what is “good enough,” and how those decisions are reflected in the architecture of their systems. Some organizations have decided to compromise on ingest requirements as they have scaled up, while other organizations have remained quite strict about the cleanliness of content entering their archives. As the amount of unpreserved digital content continues to grow, this question of “what is sufficient” will persist as a challenge, as will the challenge of moving QA capabilities further upstream, closer to the actual producers of data.
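
One way to picture that discussion is as an explicit policy switch in the ingest pipeline. The toy sketch below (purely illustrative, not any institution’s actual system) shows how a “strict” archive and a “good enough” archive might differ in code:

    # Toy illustration of "good enough" as an explicit ingest policy.
    REQUIRED = ['well_formed_xml', 'checksum_matches']        # always enforced
    DESIRED = ['valid_against_schema', 'complete_metadata']   # strictness-dependent

    def ingest_decision(checks, strict=True):
        """checks maps a check name to its boolean outcome."""
        if not all(checks[c] for c in REQUIRED):
            return 'reject'
        failed = [c for c in DESIRED if not checks[c]]
        if failed and strict:
            return 'reject'
        return 'accept-with-flags' if failed else 'accept'

    print(ingest_decision({'well_formed_xml': True, 'checksum_matches': True,
                           'valid_against_schema': False,
                           'complete_metadata': True},
                          strict=False))  # -> 'accept-with-flags'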

The Scale of the Long Tail
As more and more content is both digitized and born digital, institutions are finding they must scale for increases in both resource access requests and expectations for completeness of collections.

The number of e-journals in the world that are not preserved was a recurrent theme. The exact number of journals that are not being preserved is unknown, but some facts are:

  • 79% of the 100,000 serials with ISSNs – roughly 79,000 titles – are not known to be preserved anywhere. It is not known how many serials without ISSNs are being preserved.
  • In 2012, Cornell and Columbia University Libraries (2CUL) estimated that about 85% of e-serial content is unpreserved.

This digital “dark matter” is dwarfed in scope by existing and anticipated scientific and other research data, including that generated by sensor networks and by rich multimedia content.

Economies and Diseconomies of Scale
Perhaps the most important question raised at this workshop was whether we as a community are really at scale yet. Can we leverage true economies of scale? David Rosenthal noted that as we centralize more and more preserved content in fewer hands, we will be better able to leverage economies of scale, but we will also increase the risk of a single point of failure.

Next Steps
The consensus of the group seemed to be that, as a whole, the digital preservation community is not yet truly at scale. However, the organizations in the room have moved beyond a project mentality into a service-oriented mentality, and are actively seeking ways to avoid wasteful duplication of effort and to engage in active cooperation and collaboration.

Workshop presentations and notes on each presentation are available at: https://drive.google.com/folderview?id=0B1X7I2IVBtwzcGVhWUF0TmJIUms&usp=sharing

iPRES 2013

Author: Barbara Sierman

The iPRES2013 conference took place in beautiful Lisbon, together with the Dublin Core 2013 conference. In total there were around 400 people from 38 countries. Each conference had its own program, but the three (shared) keynote speakers drew the attention of both the bibliographic people and the digital preservation people in the room, and sketched their views on important challenges we need to work on collaboratively. Gildas Illien (BnF) strongly advocated that bibliographic people and digital preservation people cooperate more closely, as both are trying to make collections accessible, but from different angles. User expectations should be leading in both fields and, if they are, will require more collaboration within organizations. Management needs to be convinced of this. Paul Bertone from the European Bioinformatics Institute explained the recent breakthrough in storage: storage in DNA, which might be a solution for massive storage of data. And finally Carlos Morais Pires, from the European Commission, talked about Horizon 2020 and data infrastructures (and here, as libraries, we need to point out again and again that data is not restricted to scientific data generated by instruments, but also includes the big data collections in libraries and data centres for the social sciences! Carlos Morais Pires immediately agreed and changed his slide.)


Barbara Sierman speaking at iPRES

All presentations can be found at http://purl.pt/24107, covering a wide range of aspects. There are simply so many aspects related to digital preservation (web archiving, preservation policies, open-source preservation systems, trust, storage, and so on…). I can only advise you to have a look at the above-mentioned URL.

Is there a trend to be discovered in all these presentations? To me, they demonstrate that there is a lot of national and international collaboration nowadays. The European projects like Blog4Ever, SCAPE, APARSEN, ENSURE and Timbus, national initiatives like Goportis, and international collaboration in the 4C project – they all bring together people from various disciplines. No longer is it only about libraries, archives and data centres; institutional repositories, health care and business are now also tackling the problem and presenting their views. The presentations reflect a greater self-confidence in the digital preservation community: we don’t have answers to all the challenges, but we are developing a methodological way to deal with them – the development of standards, life-cycle models, cost models, monitoring of the environment, borrowing from other communities to create tools, etc. And most important of all, we know how to find each other.


Organiser José Borbinha with all varieties of the conference badge on his shirt

But there was also another topic, raised mainly in discussions and during breaks: our own organisations. The elephant in the room is the fact that our own organisations will need to deal with both analogue and digital material, while the expertise in dealing with analogue material is far more developed in the organisation than the competence in dealing with digital material. Someone said to me, “these are different people”. Maybe that is the case. Look at the sometimes heated debates around reading e-books versus preferring paper ones. I like both and don’t think the paper book will disappear. So as a reader I will integrate both worlds and sometimes prefer a paper book over an e-book. This is the world we need to deal with, and organisations need to integrate both worlds. It will require training to have employees who are familiar with digital as well as print collections. This is a management challenge, but as digital preservation people we cannot close our eyes to it. We need to convince our management, and as the keynote speaker Gildas Illien said (paraphrased by me): “We need to show our added value. Use the rest of the world to convince your management.” This is how we as digital preservation people can exploit our existing collaboration structures!

A deadly sin

Author: Barbara Sierman
Originally posted on: http://digitalpreservation.nl/seeds/a-deadly-sin/

At last week’s iPRES2013 conference in Lisbon, a talk was given about an experiment on the migration of WARC files, done by Tessella, called “Studies on the scalability of web preservation”. One remark in the talk caused some commotion, namely that the presenter suggested adapting the WARC file and deviating from the standard. Why did they? Because – as we were told – the current version of the Wayback Machine software, which enables you to render the WARC file format, is not optimal for rendering WARC files with conversion records. But tweaking the format of the Archival Information Package and storing it like that for the long term is not the way we should go. We preserve information for the long term. Our future custodians will not understand this (unless they are told so via metadata, and even then) and will assume that if they see the WARC format, all the rules in the standard have been followed. Deviating from them is wrong; in fact, it is almost a deadly sin.

After reading the corresponding publication (the conference papers are published by the Portuguese National Library as a free e-book), I saw that things were less straightforward. The approach Tessella chose was to create two WARC files: a correct WARC according to the standard and an adapted WARC for access. From the article:

 This required the development of two different workflows for creating migrated WARC files: one, which is formally correct according to the WARC standard, and maintains the integrity of the WARC schema, and a second which is more pragmatic, and produces a file that can be displayed correctly by current WARC viewers. This pragmatic workflow can also be used for the migration of container formats which do not support conversion records, such as ARC files.
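
For context: a conversion record is an ordinary WARC record with WARC-Type “conversion” that points back, via a WARC-Refers-To header, at the record it was migrated from. The sketch below shows the shape of a standard-conformant conversion record, using the warcio library as a modern stand-in (it post-dates this post; all identifiers are hypothetical):

    # Writing a WARC conversion record with warcio (illustrative sketch).
    from io import BytesIO
    from warcio.warcwriter import WARCWriter

    with open('migrated.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        record = writer.create_warc_record(
            'http://example.org/scan.jp2',      # URI of the original resource
            'conversion',                       # WARC-Type: conversion
            payload=BytesIO(b'...migrated bytes...'),
            warc_content_type='image/png',
            # Point back at the record this one was converted from:
            warc_headers_dict={'WARC-Refers-To': '<urn:uuid:original-record-id>'})
        writer.write_record(record)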

So what should one do when an ISO standard does not meet one’s requirements? In this case, the WARC standard is maintained by the BnF, as can easily be seen if one looks up the standard itself; contact details are published online precisely so that people can get in touch. Another approach is to look for the interested parties behind the Wayback Machine software, which, as everyone involved in web archiving knows, is the Internet Archive. And there is the IIPC, the International Internet Preservation Consortium, which is currently setting up a developers’ working group to improve the Wayback Machine software. So if you have problems with the standard, think about the millions of precious digital objects that need to be preserved in that format and get in touch with the community. But don’t tweak the format itself!
