KB Research

Research at the National Library of the Netherlands

Month: May 2013

Europeana – the case for funding: tweet #AllezCulture

The budget for the Connecting Europe Facility (CEF) has been cut drastically, from 9 billion to 1 billion euros. This will hit Europeana, the infrastructure supporting Europe’s free digital library, museum and archive, very hard. Europeana is now being asked to put the case for funding under the revised guidelines for CEF, which were issued on 28 May 2013. Europeana will face severe competition for the available funding from other digital service infrastructures such as e-Justice, e-Health and Safer Internet. These are all good causes in their own right, but the wonderful digital culture infrastructure that has been built in the last decade will soon get squashed if we do not speak out now! So here goes:

Here is a summary of the three arguments for funding:

1. Europeana supports economic growth.
Some Impact Indicators:

  • To date, 770 businesses, entrepreneurs, educational and cultural organisations are exploring ways of including Europeana information in their offerings (websites, apps, games etc.) through our API. See examples such as inventingeurope.eu and www.zenlan.com/collage/europeana.
  • Digital heritage creates jobs – in Hungary, for example, over 1,000 graduates are now involved in digitising heritage that will feed in to Europeana. Historypin in the UK predicts it will double in size with the availability of more open digital cultural heritage.

2. Europeana connects Europe. 

‘People often speak about closing the digital divide and opening up culture to new audiences but very few can claim such a big contribution to those efforts as Europeana’s shift to cultural commons.’ Neelie Kroes, Vice President of the Commission

3. Europeana makes Europe’s culture available for everyone.

In 2012, all 20m Europeana records were released under a Creative Commons Zero public domain dedication, making them available for re-use both commercially and non-commercially. Europeana’s CC0 release is a ‘coup d’état’ that ‘will help to establish a precedent for other galleries, libraries, archives and museums to follow – which will in turn help to bring us that bit closer to a joined up digital commons of cultural content that everyone is free to use and enjoy.’ Jonathan Gray, Open Knowledge Foundation.

For those unaware of Europeana – here is what they do: 

Europeana has been transformative in opening up data and access to cultural heritage and now leads the world in accessible digital culture that will fuel Europe’s digital economy. Through Europeana today, anyone can explore 27 million digitised objects including books, paintings, films and audio.

Europeana is a catalyst for change for cultural heritage

– Because they make cultural heritage accessible online.

– Because they have standardised the data of over 2,200 organisations, covering all European countries and 29 European languages.

– Because they provide creative industries and business start-ups with rich, interoperable material, complete with copyright information.

– And because they ensure that every citizen, whether young or old, privileged or deprived, can be a digital citizen.

So please support Europeana by tweeting, blogging, facebooking and whatever other media you like, using the hashtag #AllezCulture!

EPUB for archival preservation: an update

Author: Johan van der Knijff
Originally posted on: http://www.openplanetsfoundation.org/blogs/2013-05-23-epub-archival-preservation-update

Last year (2012) the KB released a report on the suitability of the EPUB format for archival preservation. A substantial number of EPUB-related developments have happened since then, and as a result some of the report’s findings and conclusions have become outdated. This applies in particular to the observations on EPUB 3, and the support of EPUB by characterisation tools. This blog post provides an update to those findings. It addresses the following topics in particular:

  • Use of EPUB in scholarly publishing
  • Adoption and use of EPUB 3
  • EPUB 3 reader support
  • Support of EPUB by characterisation tools

In the following sections I will briefly summarise the main developments in each of these areas, after which I will wrap up things in a concluding section.

Use of EPUB in scholarly publishing

Although scholarly publishing is still dominated by PDF, the use of EPUB in this sector is on the rise. This blog post by Todd Carpenter gives the following examples:

At the time of writing, the above publishers are all using EPUB 2.

Adoption and use of EPUB 3

Over the last year a number of organisations that represent the publishing industry have expressed their support of EPUB 3. The Book Industry Study Group (BISG) is a trade association for companies in the publishing industry. Last year (August 2012) BISG released a policy statement in which it endorsed “EPUB 3 as the accepted and preferred standard for representing, packaging, and encoding structured and semantically enhanced Web content — including XHTML, CSS, SVG, images, and other resources — for distribution in a single-file format”. Early this year (March 2013) the International Publishers Association (IPA) issued a press release that also endorsed EPUB 3 as a “preferred standard format for representing HTML and other web content for distribution as single-file publications”. IPA represents over 60 national publishing organisations from more than 50 countries. Finally, the European Booksellers Federation recently released a report on the interoperability of eBook formats. Its authors compared the features and functionality provided by EPUB 3, Amazon’s KF8 (Kindle) and Apple’s e-book formats. They concluded that EPUB 3 “clearly covers the superset of the expressive abilities of all the formats”, and that there is “no technical or functional reason not to use and establish EPUB 3 as an/the interoperable (open) ebook format standard”. This all suggests that EPUB 3 is widely supported by the publishing industry.

Having said that, the actual use of EPUB 3 is still limited at this stage, even though some publishers have already started using the format. Earlier this year technical publisher O’Reilly started releasing all their new eBook bundles in EPUB 3 format. The announcement mentions that their backlist will be updated as well. Interestingly, they decided to create “hybrid” EPUBs that are backward-compatible with EPUB 2. In November 2012 publisher Hachette also announced the launch of their EPUB 3 program.

EPUB 3 reader support

At this time reader support for EPUB 3 is still limited, but there have been a number of significant developments since the second half of 2012:

Support of EPUB by characterisation tools

The 2012 report concluded that EPUB was not optimally supported by characterisation tools. This situation has improved quite a lot since that time.

Identification

EPUB is now included in PRONOM, and has a corresponding DROID signature. This means that Fido should now be able to identify the format as well. On a side note, PRONOM doesn’t differentiate between EPUB 2 and 3, and it appears that the current record (which is only an outline record anyway) either combines both versions, or only refers to EPUB 2. PRONOM should probably be more specific on this.
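To make the idea of identification a bit more concrete: an EPUB is a ZIP container whose first entry is a file named mimetype that contains the string application/epub+zip. The Python sketch below checks only that. It is a simplified stand-in and not the actual PRONOM/DROID signature; the function names are made up for this example. Like the current PRONOM record, it cannot tell EPUB 2 from EPUB 3.

```python
import io
import zipfile

# Media type string that an EPUB container declares in its "mimetype" entry.
EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(data: bytes) -> bool:
    """Return True if the bytes look like an EPUB container.

    Checks that the data is a ZIP archive, that its first listed entry is
    named "mimetype", and that this entry holds exactly the EPUB media type.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            entries = archive.infolist()
            if not entries or entries[0].filename != "mimetype":
                return False
            return archive.read(entries[0]) == EPUB_MIMETYPE
    except zipfile.BadZipFile:
        # Not a ZIP archive at all, so it cannot be an EPUB.
        return False


def make_sample(first_name: str, content: bytes) -> bytes:
    """Build a tiny in-memory ZIP for demonstration (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(first_name, content)
        archive.writestr("META-INF/container.xml", "<container/>")
    return buffer.getvalue()


if __name__ == "__main__":
    print(looks_like_epub(make_sample("mimetype", EPUB_MIMETYPE)))
    print(looks_like_epub(make_sample("readme.txt", EPUB_MIMETYPE)))
```

A check like this says nothing about validity; that is what the validation tools discussed in the next section are for.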

Validation and feature extraction

The 2012 report included tests of 2 EPUB validator tools: epubcheck and flightcrew. While testing epubcheck in 2012, I wasn’t entirely happy with the rather unstructured output that the tool produced. Also, I couldn’t find any tool that was capable of extracting technical meta-information about an EPUB, like the presence of encryption or other digital rights management technology (feature extraction). Happily, starting with version 3.0 epubcheck is capable of extracting this kind of information. Moreover, it added an option to report its output in structured XML format that follows the JHOVE schema. I haven’t done any elaborate testing, but a quick run on some of these EPUB 3 samples showed that epubcheck was able to identify font obfuscation, in which case a property hasEncryption (value true) is reported. I wasn’t able to find any EPUB files with DRM, so I cannot confirm if epubcheck detects this as well.
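Structured output makes it easy to pull out such properties automatically. The Python sketch below looks up a named property in a JHOVE-style XML report. Note that the sample report is hand-written for this example and only approximates the JHOVE layout (property elements with a name and a list of values); it is not real epubcheck output, and the element nesting in an actual report may differ. The function names are hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that imitates a JHOVE-style report (an assumption,
# not actual epubcheck output).
SAMPLE_REPORT = """<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove">
  <repInfo uri="sample.epub">
    <properties>
      <property>
        <name>Info</name>
        <values arity="List" type="Property">
          <property>
            <name>hasEncryption</name>
            <values arity="Scalar" type="Boolean">
              <value>true</value>
            </values>
          </property>
        </values>
      </property>
    </properties>
  </repInfo>
</jhove>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def property_values(report_xml: str, wanted: str) -> list:
    """Return the values of every property called `wanted`, at any depth."""
    root = ET.fromstring(report_xml)
    found = []
    for element in root.iter():
        if local_name(element.tag) != "property":
            continue
        children = list(element)
        name = next(
            (c.text for c in children if local_name(c.tag) == "name"), None
        )
        if name != wanted:
            continue
        # Collect the <value> elements that sit directly under <values>.
        for child in children:
            if local_name(child.tag) == "values":
                found.extend(
                    v.text for v in child if local_name(v.tag) == "value"
                )
    return found


if __name__ == "__main__":
    print(property_values(SAMPLE_REPORT, "hasEncryption"))
```

Matching on local element names keeps the sketch independent of the exact namespace used in the report.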

Flightcrew

As for flightcrew, no new versions of that tool have been released since August 2011, and it looks like it is not under any active development.

Discussion and conclusions

Since the release of the KB report on the suitability of EPUB for archival preservation the EPUB landscape has changed rather a lot. First, a number of academic publishers have started to offer scholarly content in this format. Although EPUB 3 is still in its early stages, various organisations representing the publishing industry have explicitly expressed their support of EPUB 3. A number of software applications now exist that are able to read the format, and work on a high-performance open source EPUB 3 Software Development Kit is backed by major players in the digital publishing industry (including e-reader manufacturers such as Kobo and Sony). EPUB support by characterisation tools has improved as well, mostly thanks to a number of recent enhancements of epubcheck. So, overall, EPUB’s credentials as a preservation format appear to have improved quite a bit over the last year. In the case of EPUB 3 it’s still too early to say anything about actual adoption, but the conditions for adoption to happen look pretty favourable. This is something I will get back to in my next update, perhaps in another year from now.

Re:publica 2013: In/side/out

From 6-8 May the Re:publica conference was held in Berlin – one of the largest international conferences in the field of blogging, social media and our society in the digital age. In numbers alone the event is already quite impressive: with over 5,000 attendees, 300 speakers, 3 TB of video footage (or 95 hours of sessions, talks and panel discussions) and around 27,600 social media activities, it was hard to keep up. Even before the conference started, the Twitter stream of #rp13 exploded with tweets!

Photos by Gregor Fischer (https://www.flickr.com/photos/re-publica/)

But it was not only the size of the conference that was impressive: there was also a large number of interesting speakers and talks. Unfortunately my German was not sufficient to follow every detail, so I mainly attended the sessions in English. This did make for an easier choice, as there were often around 10 parallel sessions and workshops. For most of the presentations a video is online – a good overview of all sessions is available here: http://michaelkreil.github.io/republicavideos/.

Several talks focused on the differences in internet access and internet censorship worldwide. During the session ‘403 Forbidden: A Hands on Experience of the Iranian Internet’ you could gain more insight into how the Iranian internet is censored by participating in a quiz. Several websites and newspaper images were shown, and you had to guess which sites are blocked and which photos are manipulated. The results were often surprising (for example, the sites of the KKK and Absolut Vodka are blocked, but the sites of Smirnoff Vodka and the American Nazi party are not), showing how internet filtering in Iran is both arbitrary and targeted, designed to make you feel insecure online.

#403forbidden: Iranian photo manipulation of Michelle Obama’s dress

Another interesting talk in this area was ‘Internet Geographies: Data Shadows and Digital Divisions of Labour’ (unfortunately the video of this talk is not online yet). Mark Graham of the Oxford Internet Institute showed how, despite the initial hope that internet would offer everyone around the world the opportunity to share their knowledge, there are significant concentrations of knowledge. For example, Europe has just over 10% of the world’s population, but produces nearly 60% of all Wikipedia articles. There are even more Wikipedia articles written about Antarctica than about any country in South America or Africa. He illustrated these numbers with great visualisations, most of which you can find through this blog: http://www.zerogeography.net/2011/09/geographies-of-worlds-knowledge.html

In ‘Investigation 2.0’ Stephanie Hankey and Marek Tuszynski of Tactical Tech discussed ways in which individual hackers and people working with data visualisation are nowadays working together with journalists to make certain underexposed data visible and present it in a way that gives us more insight into complex social and political issues. A good example of this is the visualisation that was made of drone strikes in Pakistan since 2004: http://drones.pitchinteractive.com.

Data visualisation can also be done through food, which the three women behind the cool website http://bindersfullofburgers.tumblr.com call ‘Data cuisine’ (or How to get juicy data from spreadsheets). They are working on visualising data that is usually thought of as boring and dry, such as election results, in creative and appealing ways. In Binders full of Burgers they displayed the US election results of 2012 with burgers and fries, while they used waffles, candy and fruit to show the local Berlin elections (http://wahlwaffeln.tumblr.com/). They also made their own talk more appealing by handing out cookies for audience participation :)

Then there was a good overview presentation by Joris Pekel of the Open Knowledge Foundation on Open Data & Culture – Creating the Cultural Commons. After presenting some impressive numbers of available online records, media files and metadata from Europeana, DPLA and Wikimedia Commons, he focused on the possibilities of enabling the public to connect and contextualise open data in the future through initiatives such as OpenGLAM. His talk is also available from Slideshare at http://t.co/bpqBEB28sb.

A very direct way of promoting open data, transparency and hacker culture was presented by Daniela B. Silva in ‘Hacker culture on the road’. In 2011, a Brazilian hacker community came up with the plan of starting a Hacker bus: they crowdfunded the budget for buying an old school bus and started travelling around Brazil, organising local events focused on promoting government transparency and hacker activism. Volunteer hackers, digital activists, lawyers, artists and journalists joined the bus, listened to the needs of the people they met and helped them to develop answers with the help of available technology, such as local street mapping applications and apps, but also through new local legislation.

Finally, one of the most fascinating talks was that of Neil Harbisson and Moon Ribas on ‘Life with extra senses – How to become a cyborg’. They showed ways of extending and creating new senses and perceptions by applying technology to the human body, as well as the artistic projects they created based on these sensory extensions. For example, due to a medical condition Neil Harbisson only sees in black and white, so he developed an electronic eye that transforms the colours around him into sound, allowing him to experience colour in a different way. Together they have established the Cyborg Foundation to further promote their activities and help more people to become cyborgs, because they believe all humans should have the right to extend their senses and perceptions. Interestingly they see this as a way for people to get closer to nature instead of alienating themselves from it, since many animals have very different sensory perceptions than we do.

Neil Harbisson and Moon Ribas: when Neil scans Moon’s dress with his Eyeborg, he will hear the song ‘Moon River’ (Photo by Tony Sojka, https://www.flickr.com/photos/re-publica/)

All this is just a fraction of everything that was presented at Re:publica: at http://michaelkreil.github.io/republicavideos/ you can find videos of nearly every talk – and if you speak German, you’re even luckier!

Digital Humanities at the National library

About two months ago the Journal of Library Administration (JLA) published an issue devoted entirely to Digital Humanities in libraries. Enthusiastically I printed all the open access articles (I know, not very nature conscious of me…) and put them on my desk. As it often goes with papers on desks, they’ve been lying there ever since. This changed this morning as I took out the stack and started reading them. And I loved it! Article after article I took out my highlighter and marked sentences and paragraphs that sounded all too familiar to me, working in a research library with an interest in Digital Humanities.

The KB started to look at Digital Humanities (DH) as a topic not so long ago, but has been involved with DH-related projects for quite some time, although they were not called DH at the time. Examples are the CATCH projects that started in 2004, but since the beginning of our digitising days our material has been used by a variety of people and institutions. However, the KB is special in a way when it comes to doing DH research. We are the national library of the Netherlands and are thus not connected to a specific university or research institution. This means that we do not employ our own researchers. We do have a Research department, but most of the people here do not dive into our content themselves; they do research to ensure that the public can.

The continuation of CATCH, CATCHPlus, where prototypes are converted to reliable tools

Although the JLA issue I read only discusses university libraries and their associated researchers, this does not mean that the articles by, for example, Miriam Posner or Bethany Nowviskie are not relevant for us. The often-mentioned lack of flexibility that is apparently inherent to a library also exists here, and the desire to only publish something once it is perfect is something I too can relate to. Working with digitised material is never perfect. The software is not perfect, so how can the outcomes be? Nonetheless, the KB chose to show these imperfections in our OCR by opening up the texts to the public, including all mistakes and an estimate of accuracy.

A KB newspaper article. The OCR quality of this article is estimated at 84.9% character accuracy.

Being a research institute with a large digital corpus that we are more than happy to share, but without our own researchers (apart from the occasional research fellow), the KB not only faces the challenges of the university libraries as mentioned by Miriam Posner in her article (i.e. inflexibility, lack of time, authority and incentive, overcautiousness, etc.), but I believe another crucial element can be added to this list: no affiliated researchers. Until not so long ago, when a researcher wanted to use (a section of) our digitised sets, he or she would find someone within the library who could help them get it. There was no official route to obtain the data and no single contact person for a specific set, so it could happen that people left the KB with hard disks full of images, or that they tracked one of our employees down at a conference and badgered them until they got an e-mail with instructions on how to harvest a collection. Luckily, this has changed with the creation of the Data Services team.

The Data Services team are the go-to guys when it comes to our digital sets. They have taken up the responsibility of advertising our datasets on our (unfortunately only in Dutch) website, at events and conferences, and in Dutch open data communities such as Open Data Nederland. We hope these efforts will lead to interesting use of our data and perhaps even some enrichments that we might implement in the future (OCR correction anyone?). But how can we be sure that our data does indeed get used and that we reach the people who might be interested? And how do we know if our methods are in fact what they are looking for?

The KB at the CLIN2013 conference.

This issue is one that I would imagine is easier to solve when you can simply walk to the other side of the building, knock on some doors and talk to researchers whose interests you know, because they teach Data Mining at your university. Unfortunately, we are not in that position, apart from the people who have asked for our data and those who will come to our (currently a work in progress) KB Lab. Now that the instructions to harvest our sets are available on the website, fewer and fewer people will probably need to ask us for data, leaving us more in the dark about what interesting things are happening with the digitised Early Dutch Books Online or the ANP radio bulletins.

So, how do we get and stay in touch with interested parties that might contribute to the enrichment of our collections? How can we be sure that what we are doing is in fact what researchers need? How much can and do we want to adapt our methods to fit the needs of researchers? For example, do we want to offer all possible data formats if there is a demand for it or is that something that the scholars might be able to tackle themselves? (Solutions and ultimate answers of course always welcome in the comment section below!)

We are undertaking several activities to try to find answers, and also our place in the wonderful world of Digital Humanities. The establishment of our own KB Lab, where we will work with scholars who wish to do something with our data, is one such activity. Another is the poster session that we will present together with the BL Labs project at the DH2013 conference this summer. Our aim there is to talk to different researchers about our collections and their ways of working: what types of collections they would like, what data format they would love to see, but also what they would like to do with our data. So, if you’re around in Nebraska, please come and find me at the posters and let’s talk this through!
