Digital preservation practitioners from Portico and from the National Library of the Netherlands (KB) organized a workshop on “Preservation at Scale” as part of iPres2013. This workshop aimed to articulate and, if possible, to address the practical problems institutions encounter as they collect, curate, preserve, and make content accessible at Internet scale.
Preservation at scale has entailed continual development of new infrastructure. In addition to preservation of digital documents and publications, data archives are collecting a vast amount of content which must be ingested, stored and preserved. Whether we have to deal with nuclear physics materials, social science datasets, audio and video content, or e-books and e-journals, the amount of data to be preserved is growing at a tremendous pace.
The presenters at this workshop each spoke from the experience of organizations in the digital preservation space that are wrestling with the issues introduced by large-scale preservation. Each of these organizations has experienced annual increases in throughput of content, which they have had to meet not just with technical adaptations (increases in hardware and software processing power), but often also with organizational redefinition, along with new organizational structures, processes, training, and staff development.
There were a number of broad categories addressed by the workshop speakers and participants:
- Technological adaptations
- Institutional adaptations
- Quality assurance at scale and across scales
- The scale of the long tail
- Economies and diseconomies of scale
Technological Adaptations
Many of the organizations represented at this workshop have gone through one or more cycles of technological expansion, adaptation, and platform migration to manage the current scale of incoming content, to take advantage of new advances in both hardware and software, or to respond to changes in institutional policy with respect to commercial vendors or suppliers.
These include both optimizations and large-scale platform migrations at the Koninklijke Bibliotheek, Harvard University Library, the Data Conservancy at Johns Hopkins University, and Portico, as well as the development by the PLANETS and SCAPE projects of frameworks, tools and test beds for implementing computing-intensive digital preservation processes such as the large-scale ingestion, characterization, and migration of large (multi-terabyte) and complex data sets.
A common challenge was reaching the limits of previous-generation architectures (whether those limits are those of capacity or of the capability to handle new digital object types), with the consequent need to make large-scale migrations both of content and of metadata.
Institutional Adaptations
For many of the institutions represented at this workshop, the increasing scale of digital collections has resulted in fundamental changes to those institutions themselves, including changes to an institution’s own definition of its mission and core activities. For these institutions, a difference in degree has meant a difference in kind.
For example, the Koninklijke Bibliotheek, the British Library, and Harvard University Library have all made digital preservation a library-level mandate. Shifting the preservation of digital content from an organizational sub-unit to an organization-wide endeavor is challenging, as it requires changing the mindsets of many in each organization. It has meant reallocation of resources from other activities. It has necessitated strategic planning and budgeting for long-term sustainability of digital assets, including digital preservation tools and frameworks, which is a fundamental shift from one-time, project-based funding. It has meant making choices; we cannot do everything. It has meant comprehensive review of organizational structures and procedures, and has entailed equally comprehensive training and development of new skill sets for new functions.
Quality Assurance at Scale and Across Scales
A challenge to scaling up the acquisition and ingest of content is the necessity for quality assurance of that content. Often institutions are far downstream from the creators of content. This brings along many uncertainties and quality issues. There was much discussion of how institutions define just what is “good enough,” and how those decisions are reflected in the architecture of their systems. Some organizations have decided to compromise on ingest requirements as they have scaled up, while other organizations have remained quite strict about the cleanliness of content entering their archives. As the amount of unpreserved digital content continues to grow, this question of “what is sufficient” will persist as a challenge, as will the challenge of moving QA capabilities further upstream, closer to the actual producers of data.
The Scale of the Long Tail
As more and more content is both digitized and born digital, institutions are finding they must scale for increases in both resource access requests and expectations for completeness of collections.
The number of e-journals in the world that are not preserved was a recurrent theme. The exact number is unknown, but some facts are:
- 79% of the 100,000 serials with ISSNs are not known to be preserved anywhere. It is not known how many serials without ISSNs are being preserved.
- In 2012, Cornell and Columbia University Libraries (2CUL) estimated that about 85% of e-serial content is unpreserved.
This digital “dark matter” is dwarfed in scope by existing and anticipated scientific and other research data, including that generated by sensor networks and by rich multimedia content.
Economies and Diseconomies of Scale
Perhaps the most important question raised at this workshop was whether we as a community are really at scale yet. Can we yet leverage true economies of scale? David Rosenthal noted that as we centralize more and more preserved content in fewer hands, we will be able to better leverage economies of scale, but we will also be increasing the risk of a single point of failure.
The consensus of the group seemed to be that, as a whole, the digital preservation community is not yet truly at scale. However, the organizations in the room have moved beyond a project mentality and into a service-oriented mentality, and are actively seeking ways to avoid wasteful duplication of effort and to engage in active cooperation and collaboration.
Workshop presentations and notes on each presentation are available at: https://drive.google.com/folderview?id=0B1X7I2IVBtwzcGVhWUF0TmJIUms&usp=sharing