In my previous blog post I addressed the detection of broken audio files in an automated workflow for ripping audio CDs. For (data) CD-ROMs and DVDs that are imaged to an ISO image, a similar problem exists: how can we be reasonably sure that the created image is complete? In this blog post I will discuss some possible ways of doing this using existing tools, along with their limitations. I then introduce Isolyzer, a new tool that might be a useful addition to the existing methods.
At the KB we have a large collection of offline optical media. Most of these are CD-ROMs, but we also have a sizeable proportion of audio CDs. We’re currently in the process of designing a workflow for stabilising the contents of these materials using disk imaging. For audio CDs this involves ‘ripping’ the tracks to audio files. Since the workflow will be automated to a high degree, basic quality checks on the created audio files are needed. In particular, we want to be sure that the created audio files are complete, as it is possible that some hardware failure during the ripping process could result in truncated or otherwise incomplete files.
To get a better idea of what software tool(s) are best suitable for this task, I created a small dataset of audio files which I deliberately damaged. I subsequently ran each of these files through a set of candidate tools, and then looked which tools were able to detect the faulty files. The first half of this blog post focuses on the WAVE format; the second half covers the FLAC format (at the moment we haven’t decided on which format to use yet).
EpubCheck is an invaluable tool for assessing the quality of EPUB files. Still, it is possible that EPUBs that are valid according to the format specification (and thus EpubCheck) are nevertheless inaccessible to some users. Some weeks ago a colleague sent me an EPUB 2 file that produced some really strange behaviour across a number of viewer applications. For a start, the text wouldn’t reflow properly after re-sizing the viewer window, and increasing the font size resulted in garbled text. Running the file through EpubCheck did return some validation errors, but none of these were related to the behaviour I was getting. Closer inspection revealed some very peculiar stylesheet and HTML use.
About a month ago the International Digital Publishing Forum, the standards body behind the EPUB format, published an Editor’s Draft of EPUB 3.1. This is meant to be the successor of the current 3.0.1 version. IDPC has set up a community review, which allows interested parties to comment on the draft. The proposed changes relative to EPUB 3.0.1 are summarised in this document. A note at the top states (emphasis added by me):
The EPUB working group has opted for a radical change approach to the addition and deletion of features in the 3.1 revision to move the standard aggressively forward with the overarching goals of alignment with the Open Web Platform and simplification of the core specifications.
As Gary McGath pointed out earlier, this is a pretty bold statement for what is essentially a minor version. The authors of the draft also mention that they expect it “will provoke strong reactions both for and against”, and that changes that raise “strong negative reactions” from the community “will be reviewed for future drafts”.
This blog post is an attempt to identify the main implications of the current draft for libraries and archives: to what degree would the proposed changes affect (long-term) accessibility? Since the current draft is particularly notable for its aggressive removal of various existing EPUB features, I will focus on these. These observations are all based on the 30 January 2016 draft of the changes document.
The KB has quite a large collection of offline optical media, such as CD-ROMs, DVDs and audio CDs. We’re currently investigating how to stabilise the contents of these materials using disk imaging. During the initial phase of this work I did a number of tests with various open-source tools. It’s doubtful whether we’ll end up using these same tools in our actual workflows. The main reason for this is the sheer size of the collection, which we estimated at some 15,000 physical carriers; possibly even more. At those volumes we will need a solution that involves the use of a disk robot, and these often require dedicated software (we still need to investigate this more in-depth).
Nevertheless, throughout the initial testing phase I was surprised at the number of useful tools that are available in the open source domain. Since this will probably be of interest to others as well, I decided to polish a selection from my rough working notes into a somewhat more digestible form (or so I hope!). I edited my original notes down to the following topics:
- How to figure out the device path of the CD drive
- How to create an ISO image from a CD-ROM or DVD
- How to check the integrity of the created ISO image
- How to extract audio from an audio CD
In addition there’s a final section that covers my attempts at imaging a multisession / mixed mode CD. The result of this particular exercise wasn’t all that successful, but I included it anyway, as some may find it useful. All software mentioned here are open-source tools that are available for any modern Linux distribution (I’m using Linux Mint myself). Some can be used under Windows as well using Cygwin.
This is the second and final instalment of a 2-part blog on the use of PDF/A validators for identifying preservation risks in PDF. You can read the first part here. In Part 1 I showed how PDF/A validators can be used to identify preservation risks in a PDF. I illustrated this with an example that uses the PDF/A validator component of Adobe Acrobat’s Preflight tool. Needless to say, Acrobat is not scalabe to situations where you need to verify large volumes of PDFs. Luckily, several stand-alone PDF/A validators exist that are designed especially to do just that.
This is the first installment of a 2-part blog (part 2 is here). It was prompted by the upcoming Digital Preservation Coalition briefing When is a PDF not a PDF?, for which I was asked to prepare a presentation. My initial idea was to give an overview of the work we did on PDF preservation risk assessment using a PDF/A validator in the SCAPE project. Most of this has already been covered by a series of earlier blog posts. Those blogs very much represent different stages of a work in progress, and I think this makes them somewhat challenging for readers who are new to the subject.
The current version of the KB’s digital repository system (e-Depot) doesn’t include any tools for automated file format identification yet. Our previous DIAS system didn’t have identification functionality either. As a result, information on file formats in digital our collections is largely based on publisher metadata and file extensions. Neither are necessarily correct. Moreover, previous analyses revealed a number of prevalent file extensions that could not be easily linked to a specific format. One result of this situation was that we couldn’t even reliably tell to what extent patrons were able to view e-Depot content on the PCs in our reading rooms (the obviously common formats aside).
To get a better view of the formats in our collection, we did an analysis of the “top 50” most prevalent file extensions in our e-Depot: what are the corresponding formats, can these formats be automatically identified, and can we render them in our reading rooms? This blog post summarises the main findings of this work.
Back in 2012 the KB conducted a first investigation of the suitability of the EPUB format for long-term preservation. The KB will soon start receiving publications in this format, and in anticipation of this, our Collection Care department has formulated a policy on the minimum requirements an EPUB must meet to ensure long-term accessibility. The policy largely follows the recommendations from the 2012 report. This blog explores to what extent it is possible to automatically assess the EPUBs that we receive against our policy using a combination of the Epubcheck tool and Schematron rules.