Issues in Science and Technology Librarianship
Fall 2002

URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed.

[Board accepted]

Building Digital Archives for Scientific Information

Leah Solla
Chemistry Librarian
Cornell University

Researchers, librarians, and publishers have valid concerns about the long-term availability of digital information. Archives of scientific literature in the print format have long existed as library collections and librarians have developed methods to preserve them for continued future use. Some of the preservation issues are parallel between print and digital formats, such as duplication and sustainability. However, the roles of stakeholders are changing in the digital realm. Publishers are now active participants in developing archives of their digital information and are increasingly assuming responsibilities for long-term preservation.

Archiving involves some level of organization and preservation to enable potential use. The preservation of collections archived in libraries has developed into systematic methodologies to sustain the content and infrastructure as well as the physical aspects of the materials. Standards have developed to scale the process and to ensure interoperability among preservation schemes. Effective digital preservation models need to be self-sustaining and to adhere to format standards. Robustness of the approach becomes increasingly critical as more information is born digital, leaving no print duplicates to preserve by the established methods.

Roles of stakeholders and new standards need to be established for archiving and preservation of digital information, requiring communication and collaboration between traditionally distinct groups, including researchers, publishers and libraries. Cornell is one of many libraries working on digital archiving and preservation issues, investigating a role for libraries akin to their traditional role with print, based on subject collections. Archiving across subject areas in the academic environment complements the archiving approach of publishers in the competitive market environment.

A number of studies arose from early digitization projects aimed at broadening access to Cornell's unique collections. After a variety of digital collections were produced, it became clear there was a need to consider the future of these new resources. Project PRISM is one of the earliest of these studies to address digital preservation. PRISM stands for preservation, reliability, interoperability, security, and metadata. Maintaining the integrity of valuable information in any format is more complex than just keeping it around. To ensure integrity and usefulness, the PRISM issues need to be addressed at the planning stage for digital libraries.

{PRISM} is a four-year project funded by Phase 2 of the Digital Libraries Initiative from the National Science Foundation to develop risk assessment strategies for web resources. It is a collaborative project between the Cornell Computer Science Department, the Human-Computer Interaction Group of the Cornell Communication Department, and Cornell University Library. Specific focus areas of the project include: digital object architecture, digital preservation, human-centered research, interoperability, policy enforcement, and web preservation. These research areas are tested using the National Science Digital Library infrastructure also in development at Cornell.

Several disciplines are represented by the research participants, and working across the disciplinary lines has often proven challenging to the vision of the project. The primary archivist involved in the project, Anne R. Kenney, has likened the collaborative experience to the early, self-centered stages of childhood development. "Creating workable digital libraries is a team sport and we need to perfect the art of cooperation to succeed at this game" (Kenney 2000).

The Mellon Foundation funded a program of investigations into the archiving of electronic journals. Several of the projects partnered research libraries and publishers, developing strategies based on the publishers' collections. Cornell was one of two universities to look at the literature within a subject area. {Project Harvest} sought to coordinate preservation efforts between librarians and publishers producing agricultural literature. The librarians initiated dialogue with several agricultural publishers, only to find that the goals and business models of librarians and publishers did not easily intersect (Kenney & McGovern 2001). Agriculture is a broad discipline, covering a vast array of information and often driven by commercial interests, which may not at first coincide with libraries' ideal of sharing information.

The potential for complementary digital archiving efforts within subject areas continues to be explored through other projects at Cornell, primarily in mathematics and physics. Both of these disciplines are well defined and primarily academically driven. Past literature is especially key to research in mathematics and other scientific disciplines that build upon it. Researchers concerned about the future of scholarly communication in their fields are central players along with librarians and publishers in developing strategies for long-term access to diverse digital information resources.

The Electronic Mathematical Archiving Network Initiative (EMANI) is the collaborative effort of several international libraries and content providers to address archiving, repository and dissemination issues of mathematics literature. Initially, digital content will be stored at participating research libraries, with plans developing for enhanced access and linking worldwide. Kizer Walker, digital projects librarian for Engineering, Mathematics and Physical Sciences, reports that: "EMANI applies libraries' growing expertise in archiving, preserving, and providing access to digital materials to the task of managing Springer's growing digital backfiles in mathematics. The result promises to be an archive and online dissemination system for math materials that is responsive to needs of academic libraries and the scholarly communities they serve. Springer is playing a leadership role in involving librarians and scholars in the long-range strategic planning for its electronic publishing processes. In the future, EMANI may expand to include other scientific disciplines and other publishers" (Walker 2002).

Project Euclid is an electronic publishing initiative in mathematics aimed at providing a shared web platform for society and independent publishers of math journals, easing their transition to the electronic environment and allowing them to compete with large commercial publishers. So far, 15 titles are available through Euclid. Concern for long-term data retention is part of the project's mission, and ongoing efforts will monitor the development of file formats and platforms. In January 2003 the project will evaluate the viability of its funding model to determine its future directions.

The Digital Mathematics Library (DML) is a broad, international initiative to build a comprehensive digital library of published knowledge in mathematics. An NSF planning grant aims to coordinate several digitization efforts and set the stage for further collaboration between mathematicians, librarians and publishers. Program goals include developing guidelines for selection and prioritization as well as technical standards and models to address intellectual copyright, management and economic sustainability. The initial meeting in July 2002 was very well attended, and working groups are now investigating the issues.

arXiv is a global knowledge network of physics e-prints developed in 1991 to improve scholarly communication in various areas of physics research. Plans are in development to upgrade the interface and to consider extending the archive to other disciplines. The main site is hosted by Cornell and is mirrored around the world. Cornell also currently supports mirror sites for the PROLA, Zentralblatt MATH and EMIS scientific databases.

The International Union of Pure and Applied Physics is considering a registry of all digital physics information: a database recording scope, format, standards, etc. In the immediate term, the registry would serve as a guide to developing electronic physics projects. In the longer term, it could serve as a jumping-off point for developing a coordinated and interoperable system of digital information in physics.

The scope of these projects begs for a synergistic approach to tie them together. A common depository system is under development at Cornell to monitor these various projects and guide them in mapping to emerging archiving and preservation models and standards. This institutional-level project focuses specifically on long-term file formats, preservation metadata, functional requirements for storage and access, and guidelines for deposit. The library has created the position of {Digital Preservation Officer} (DPO) to oversee the common depository system and to act as liaison among the various digital information projects at Cornell.

To effectively address issues of archiving and preservation, both access and responsibility need to extend beyond individual institutions. Both redundancy and interaction between institutions will make the integrity of digital information more robust. The establishment of mirror sites is one example of how responsibility for digital information is spreading. The complementary nature of archives developed by publishers and the subject collections of research libraries expands on simple redundancy. Collaboration between librarians, content providers and researchers is continuing to develop as disciplines become focused on integrating and preserving electronic data.
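
As a rough illustration of how redundancy across institutions can support integrity, the sketch below compares checksums of copies of one document held at several sites and flags any copy that disagrees with the majority. It is not drawn from any of the projects described here: the site names, the sample content, and the majority-vote rule are all assumptions made for the example, and a real archive would need a more careful policy for deciding which copy is authoritative.

```python
import hashlib
from collections import Counter


def fixity(data: bytes) -> str:
    """Return the SHA-256 digest of one stored copy."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Compare copies of the same object held at several sites.

    Returns the digest shared by the largest number of sites and the
    sorted names of sites whose copy differs from that digest.
    Majority agreement is an assumption of this sketch, not a proof
    that the majority copy is the correct one.
    """
    digests = {site: fixity(data) for site, data in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(
        site for site, digest in digests.items() if digest != majority_digest
    )
    return majority_digest, damaged


# Hypothetical holdings: one main site and two mirrors, one of them corrupted.
copies = {
    "main_site": b"Volume 1, article 3: full text",
    "mirror_a": b"Volume 1, article 3: full text",
    "mirror_b": b"Volume 1, article 3: full t\x00xt",
}

majority, damaged = audit_copies(copies)
print(damaged)
```

With a single copy there would be nothing to compare against; the check only becomes possible because more than one institution holds the material.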


Kenney, A.R. 2000. "Collaborating Across Lines: Librarians, Archivists, and Computer Science Researchers in Cornell's Project Prism." Annual meeting of the Society of American Archivists, Denver, CO. (August 31, 2000).

Kenney, A.R. & McGovern, N.Y. 2001. "Cornell's Project Harvest: Subject-Based Digital Archives." CNI Task Force Meeting. [Online]. Available: {} [November 23, 2002].

Walker, K. 2002. EMANI: Electronic Mathematical Archiving Network Initiative. EMPSL Standard. [Online]. Available: {} [November 23, 2002].
