Issues in Science and Technology Librarianship
Department of Information and Library Science
School of Informatics and Computing
Librarians are increasingly expected to work with researchers to organize and store large amounts of data. In this case study, data management novices undertake responsibility for a legacy public health research dataset. The steps taken to understand and manage the legacy dataset are explained. As a result of the legacy dataset experience, the authors of this study identified three main issues to resolve during a data management project: file organization, contextualizing data, and storage and access platforms. Finally, recommendations are made to help librarians working with legacy data identify solutions to these problems.
Librarians have long been considered information caretakers. In the modern world, the information being collected and preserved often takes the form of vast amounts of electronic data. To continue to fulfill this caretaking role and assist researchers with their data, libraries have begun implementing data management services. Often, administrators ask librarians without formal data management training to take on data management responsibilities.
This article describes a case study in data management. A data management librarian asked the authors, a group of Indiana University Department of Information and Library Science (DILS) graduate students, to manage a legacy dataset. The dataset spanned several decades, contained 856 data files, and included myriad file formats. (In order to observe all ethical practices, we will refer to the researcher who created the data as Dr. Smith.) Based on our examination of the data, we provide recommendations for librarians who have no formal data management training and possess few data management resources.
During our data examination, we needed to agree upon terminology. The terms management, curation, preservation, and stewardship are often used interchangeably when referring to data. For the purpose of this paper, we decided to use the term management to refer to the acts of organizing, providing context, and determining storage and access for our dataset.
When we studied the current literature on data management in libraries, we found the journal articles generally informative. However, much of the information is theoretical and does not provide guidelines for librarians who do not have formal data management training. There is a need for basic guidelines these librarians can follow, including simple recommendations for organizing files, providing context to the files, and determining storage and access platforms. These are the main obstacles faced by librarians and the aspects we have chosen to focus on in this case study.
Current literature outlines many reasons for library involvement in research data management. Library science skills and subject expertise help librarians to address data management needs, including identifying electronic data issues, finding tools to address those issues, choosing proper data formats and metadata schemas, and selecting data transfer and storage protocols in and across subject areas. Librarians regularly collaborate across disciplines, which is important since communication with technology and subject experts is common in data management projects (Bardyn et al. 2012; Ferguson 2012; Garritano & Carlson 2009; Latham & Poe 2012). Furthermore, libraries should supervise data management because they have the capacity to manage diverse data types. A large volume of researchers' data is not properly managed, and researchers need libraries' guidance. Ideally, data management activities will take place through observation and participation within the data's researcher communities from the beginning of a project (Heidorn 2011).
Librarian training opportunities for data management exist and include professional development in the form of workshops and webinars; continuing education certificates offered by library schools; and self-directed training through many freely available online resources. (For examples of these resources, please see Library Training for Data Management and General Best Practices/Records Management in the Appendix.) Formalized education in data management is just beginning to be added to Master's-level library science programs. Because it is not yet widely taught, the lack of new graduates with data management training can be problematic for libraries as they seek someone with this skill set.
At the beginning of this project, we received a dataset created during the 1970s, 1980s, and early 1990s by Dr. Smith, who gathered and worked with public health data. The body of work included:
Files were originally created and stored on mainframe computers and Unix systems, many needing the common statistical software SPSS to run. The two descriptive documents included in the files attempt to provide enough information for other users to navigate the dataset. One document is a Microsoft Word file created by Dr. Smith to explain the names of the folders. The second document is a ReadMe text file with information gathered during the data management librarian's interview with Dr. Smith. This file gives context and explains naming conventions and related documentation, including the survey instruments used to gather the data and articles published on the data.
In general, three major file types (data, input, and output) comprise the dataset. The data files contain coded responses to questionnaires designed by Dr. Smith, which appear to the viewer only as long series of numbers with no contextual information. Input files contain a series of commands that tell the software how to mine the data. Running the input files generates the output files. Output and printout files have tables and charts used by Dr. Smith to present data in published articles. Copy files can be any one of the three major types of files. The file extensions found within the dataset can be split between the major file types as follows:
We ran into several problems when first assessing the dataset, including:
After looking through the data and encountering problems, we needed to do more research about the software programs and files. While we learned a lot about Dr. Smith's research by examining the dataset and consulting her web site, we needed more assistance. We were lucky enough to have a well-established IT department on-campus, including a specialized statistics and math IT center. The staff members at this center were able to provide solutions to many of our identified problems, as well as potential future problems. The questions we asked included:
With these considerations in mind, we focused our efforts on three baseline practices: organizing the data, providing context for the data (if possible), and identifying storage and access options for the future.
Dr. Smith's articles, especially the sections about methodology, and supporting documents can provide context for understanding the dataset's collection and analysis. The questionnaires and articles that accompanied our dataset informed the creation of basic metadata. Using a metadata standard such as Dublin Core or METS is advisable since these are widely used and compatible with many digital repositories. The Digital Curation Centre provides additional guidance in selecting a metadata schema. (You can find more information in our Appendix under Organization & Providing Context.) There are several possible formats for storing metadata, but CSV and XML are preferable due to their extensibility.
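To make the XML option concrete, the following Python sketch writes a simple Dublin Core record using only the standard library. It is illustrative rather than a description of the metadata actually created for this dataset: the function name `build_dc_record`, the wrapper element `metadata`, and all field values are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Return a simple Dublin Core record as an XML string.

    `fields` maps Dublin Core element names (for example "title")
    to a list of values, since elements may repeat.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, values in fields.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only
record = build_dc_record({
    "title": ["Public health survey data (sample folder)"],
    "creator": ["Smith"],
    "format": ["SPSS syntax", "plain text"],
    "description": ["Coded questionnaire responses; see the ReadMe file for context."],
})
print(record)
```

Because each element may repeat, a folder that mixes several file formats can list all of them in a single record.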
After we developed an understanding of our data, we asked ourselves whether we should keep all of it and, if we could not, how we would decide which data to keep. We considered several factors in answering these questions, focusing on who would be using and accessing this data in the future. Because we wanted to open the dataset to all users, we decided to keep the complete dataset. While cost is an important issue, institutions should proceed cautiously when deciding whether or not to remove data. We also chose to err on the side of caution and keep our dataset intact because we did not feel that we had enough information to delete files from it.
We originally utilized the information contained within the dataset's master ReadMe file to try creating a new naming convention for the folders and files. Given the brief time frame of our project, we selected a sample set of our data to organize.
All files within a folder have unique relationships. Data files are only understandable when linked to questionnaires. SPSS files are connected to individual data files. Output files are the direct result of a specific SPSS file. We decided including this contextual information in our file naming convention would be helpful. Future researchers would be able to quickly and easily find all the files connected with a specific document or study. We wanted to ensure that if the files were accessed individually without the context that the folder provided, researchers would still have the best possible understanding of the file. In its current state, if files were taken from the folder there would be almost no way to tell where they originated. Including the folder name in the file name would ensure this situation could be avoided.
As a result, we decided to build each new file name from the folder name, the original file name, and the source documents associated with the file. This was our final file naming convention:
folderName_originalName_sourceName_sourceFormat [questionnaire or data file]
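A convention like this could be applied with a short script. The Python sketch below is illustrative only: the function name `contextual_name`, the example values, the decision to keep the original file extension, and the replacement of spaces and underscores inside each part are all assumptions, not details of the project described here.

```python
from pathlib import Path


def contextual_name(folder, original, source_name, source_format):
    """Build folderName_originalName_sourceName_sourceFormat.

    The original file extension, if any, is kept at the end (an assumption).
    """
    original_path = Path(original)
    parts = [folder, original_path.stem, source_name, source_format]
    # Keep underscores as separators only: replace spaces and
    # underscores that occur inside an individual part with hyphens.
    cleaned = [p.strip().replace(" ", "-").replace("_", "-") for p in parts]
    return "_".join(cleaned) + original_path.suffix


# Hypothetical folder, file, and source names
example = contextual_name("survey1985", "wave2.dat", "questionnaireA", "questionnaire")
print(example)  # survey1985_wave2_questionnaireA_questionnaire.dat
```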
While we were pleased with our final naming convention, we realized our attempt to provide context created a major issue. All SPSS files link back to another data file. To do this, the syntax within the files actually states the full name of the data files. Changing the names of data files would make it impossible for SPSS files to run properly because the connection between the two would be lost. Modifying the real data file name would not affect the name listed in the SPSS file. Ultimately, this problem was too significant to ignore or explain in a ReadMe file. As a result, we had to abandon the idea of renaming files and consider our other option.
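One way to see the scale of this problem before renaming anything is to scan the syntax files for the data file names they mention. The Python sketch below assumes that data files are referenced in the form `FILE='name'`, as in SPSS commands such as `DATA LIST` and `GET`. Legacy mainframe syntax may reference files differently, so the pattern, the function names, and the sample syntax are assumptions to adapt.

```python
import re

# Assumption: data files are referenced as FILE='name' or FILE="name".
FILE_REF = re.compile(r"""\bFILE\s*=\s*(['"])(.+?)\1""", re.IGNORECASE)


def referenced_files(syntax_text):
    """Return the file names quoted after FILE= in a block of syntax."""
    return [match.group(2) for match in FILE_REF.finditer(syntax_text)]


def broken_references(syntax_text, renames):
    """List referenced names that a planned rename would invalidate.

    `renames` maps current data file names to proposed new names.
    """
    return [name for name in referenced_files(syntax_text) if name in renames]


# Hypothetical syntax and rename plan
sample_syntax = """
DATA LIST FILE='wave2.dat' / id 1-4 age 5-6.
FREQUENCIES VARIABLES=age.
"""
plan = {"wave2.dat": "survey1985_wave2_questionnaireA_data-file.dat"}
print(broken_references(sample_syntax, plan))  # ['wave2.dat']
```

A non-empty result means that at least one syntax file would stop running after the rename unless its contents were edited as well.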
Rather than renaming files, a data manager can provide context just as effectively with a more extensive explanatory document. We decided to use the Data Curation Profiles (DCP) Toolkit, a freely available tool developed by the Purdue University Libraries and the Graduate School of Library and Information Science at the University of Illinois Urbana-Champaign, to help us structure a follow-up interview with Dr. Smith. DCPs are documents that are intended to be completed jointly by a data manager and researcher. They contain 13 modules, several of which can be modified to fit the particular dataset. The modules act as prompts for collecting additional metadata and contextual information about the dataset.
We felt that a DCP could capture the context of this data in a more in-depth way than the available ReadMe file (see Appendix). While it would require a follow-up interview, it solved many of the problems we faced. We appreciated the easy-to-follow format of the profiles, as well as the fact that we could use a tool that was already available and being used widely by other institutions.
Next we needed to determine how to securely store our dataset so that it could be accessed by all researchers and interested parties in the future. The main consideration with storing the dataset is whether it is secure and backed up. We also chose to prioritize storage options that provide access options for users, not just preservation. Our university has an institutional repository, so this would be the option we would select for our dataset. However, we wanted to consider the question of storage and access hypothetically to provide ideas for other librarians.
An institutional repository would be the best place to securely store, preserve, and provide access to the dataset. Another option that provides storage, preservation, and accessibility would be a domain repository, though some have costs associated with use. The indexes DataBib and Re3data allow users to search and find worldwide domain repositories. (You can find the URLs for these indexes in the Storage & Access section of the Appendix.)
If an institutional or domain repository is not an option, storing the data on institutional server space is another possibility. This is preferable to storing data on cloud storage, an external hard drive, or another comparable local storage solution. Ideally these methods would only be used to back up the dataset, since they provide only preservation without access options for users.
Regardless of the initial vehicle for storing your data, the data should be backed up following the basic guidelines provided by the MIT Data Management Subject Guide. (You can find this URL in the Storage & Access section of the Appendix.)
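One common way to confirm that a backup copy still matches the original is to compare checksums of the files in both locations. The Python sketch below illustrates the idea with SHA-256. It is not drawn from the MIT guide, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path relative to `root` to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(original_manifest, backup_manifest):
    """Return paths that are missing from the backup or differ in content."""
    return sorted(
        path
        for path, checksum in original_manifest.items()
        if backup_manifest.get(path) != checksum
    )
```

Calling `changed_files(build_manifest(original_dir), build_manifest(backup_dir))` returns an empty list when every original file has an identical copy in the backup.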
After determining a storage platform that handles data preservation, an important consideration is to determine how users will be able to access the data. We were working with public health data, which introduced special considerations. While there is no national standard for managing sensitive data, public health data may involve private health information. It is important to keep human subject implications in mind prior to making any data available. The researcher who created the data would be best able to determine whether there are any human subject implications, but if they are deceased or otherwise unavailable, consult your institution's research ethics committee or Institutional Review Board (IRB).
If there are no human subject implications and the researcher is willing and able to make the data freely available, the library should make it a priority to open the data to all users. Assigning an open access license of the researcher's choice should not be a problem in institutional or domain repositories; be sure that your repository allows the researcher to retain control over licensing the data. If you do not have access to a repository, you can recommend that the researcher use a third-party service that provides access and preservation, such as Figshare. (For more information about Figshare, please see our Appendix under Storage & Access.)
Based upon our experience managing this dataset, we suggest that others follow the guidelines listed.
While data management can be a complex and overwhelming subject, there are simple steps that librarians can take to manage a dataset similar to ours. By sharing our experience with organizing, giving context, and determining storage and access options for our dataset, we have created general guidelines for managing legacy public health research data. Additional case studies of this nature would be beneficial to librarians as the profession continues to define its role in data management work. These case studies would provide even more help for the librarian with data management responsibilities, a role that will increase in importance as data continues to be generated in larger quantities. In addition, these case studies could inform data management training in the library school curriculum, defining which skills are the most important for librarians with data responsibilities.
The authors would like to thank Brian Winterman and Stacy Konkiel for their support and guidance during the writing of this paper.
Bardyn, T.P., Resnick, T., & Camina, S.K. 2012. Translational researchers' perceptions of data management practices and data curation needs: findings from a focus group in an academic health sciences library. Journal of Web Librarianship 6(4), 274-287.
Ferguson, J. 2012. Description and annotation of biomedical data sets. Journal of eScience Librarianship 1(1), 51-56.
Garritano, J.R. & Carlson, J.R. 2009. A subject librarian's guide to collaborating on e-science projects. Issues in Science and Technology Librarianship [Internet]. [Cited 2013 June 11]; 57. Available from http://www.istl.org/09-spring/refereed2.html
Heidorn, P.B. 2011. The emerging role of libraries in data curation and e-science. Journal of Library Administration 51(7-8), 662-672.
Latham, B. & Poe, J. 2012. The library as partner in university data curation: A case study in collaboration. Journal of Web Librarianship 6(4), 288-304.