A Fault-Tolerant Architecture for Supporting Large Scale Digital Libraries

Mariella Di Giacomo
mariella@lanl.gov

Los Alamos National Laboratory
Los Alamos, New Mexico 87545

Abstract

At the Research Library of Los Alamos National Laboratory (LANL), we have developed an affordable multi-terabyte redundant RAID system -- running on Linux -- that addresses the needs of a constantly growing demand for electronic journals, conferences and standards. Although there are several practical issues associated with systems supporting electronic content -- such as acquisition, archiving, access and cataloging -- this paper will focus on archiving and providing access to electronic journals. This article explains the design and development of a scalable, fault-tolerant and affordable system, presents an overview of the architecture, summarizes the problems and architectural issues encountered, and describes the technical solutions adopted.

Introduction

Imagine a digital library serving approximately twenty thousand employees, 24 hours a day, seven days a week. Most of those employees are scientists, staff members, and graduate students spread out over forty-three square miles in a mountainous region. How can the library overcome the challenges of the environment and provide as much scientific information as possible to the LANL employees and 25 external institutions? The solution entails the development of cutting-edge services and systems capable of managing multi-terabytes of data, such as electronic journals, databases, scientific reports, conference literature and standards, while also providing fast access to the source data.

Electronic journals are the most important information resource supporting research and scholarship at scientific laboratories. It is therefore crucial for libraries to ensure the availability of this new medium if they are to meet their goal of providing relevant materials to researchers. The LANL Research Library has, as part of its Library Without Walls project, resolved the issues involved in archiving and providing access to electronic journals, in the rapidly changing world of electronic publishing. There are other practical problems associated with supporting electronic journals, such as acquisition, cataloging, and providing technical support. This paper will focus on system architectural issues and the solutions adopted to archive and provide access to electronic journals at the Research Library, as well as briefly introducing the architectural solution adopted for gathering data and its interaction with the whole system.

When one envisions the concept of digital libraries, and electronic journals in particular, one pictures unlimited information and boundless computing power. Yet the structure of any digital library requires the universal building block of our digital age: disk space. Processing power and memory technology have increased steadily in the last few years and disk capacity has also kept pace with these improvements. Nonetheless, disk space is always at a premium in a digital library. Disk space is analogous to shelving space in physical libraries. Although the surface area of shelving space is an order of magnitude larger than that of disk space, the tensile strength of the latter is orders of magnitude smaller. Moreover, the cost involved in storing and delivering electronic content can be formidable, well outside the reach of many libraries and universities, particularly those using cutting-edge technologies and supporting powerful, flexible, reliable and fault-tolerant systems (Gibson and Patterson 1993).

Design Challenges

High availability (99.999%, often called "five 9s") (Fox and Patterson 2003) is one of the commonly used terms categorizing the degree of reliability in operating a large computer system. In order to build and maintain systems capable of running 24 hours a day, seven days a week, it is important to consider the following challenges: flexibility, fault tolerance, reliability, speed, and cost. Moreover, for digital libraries, it is important to consider the obstacles affecting local archiving. When one considers local archiving, one of the major limiting factors is disk space. Although the disk space requirement is quite relevant, it represents only one portion of the cost and only one of many key points relevant to the issue of local archiving. Local archiving entails more significant costs in hardware, administration and maintenance.

Building a flexible system is critical. The electronic content our library provides to its users increases steadily, approximately two terabytes every year. It is important to have a scalable system, because as application software and data expand in size, storage requirements proportionately increase. Although, over the past few years, processing power has steadily increased and disk capacity has kept pace, the information generated for archiving, indexing and managing the scientific data increases the system and hard disk workload. On one hand, electronic data involve bigger indexes, higher usage of the system and higher backup costs. On the other hand, the system and hard disk workload increases the potential of failure. Hard disk failure causes not only work disruption, but also potentially loss of valuable data and applications. The importance of fault tolerance is substantial because digital libraries cannot afford the risk of losing data; even with backup systems in place, it is better not to rely only on them, because in the event of a disaster, it could take from a few hours to several days to recover data from backup media. Scientific data need to be available at all times and the service must have as few disruptions as possible, not only because of the potential impact on productivity, but also because scientists rely on these resources to meet deadlines, submit articles, and compete for funds that can be very critical in highly competitive fields. These considerations mandate that an indispensable research resource needs to have as little service disruption as possible.

The critical factors for a fault-tolerant archiving and delivery system that should be considered are the following: storage capacity, response time and high availability. Indexed data and content cannot be lost or corrupted, and the system needs to be up and running, otherwise users cannot rely on it.

To provide storage capacity and mitigate hard disk failure, Redundant Arrays of Inexpensive Disks (RAID) (Patterson et al. 1988) technologies are the solution of choice to protect and manage precious digital assets. Today, as the speeds of buses, processors and memory have drastically increased, the performance of the entire system often depends on the disk storage component. RAID technologies offer faster access to hard disks in addition to redundancy. The RAID solution is a method of storing the same data in different places (thus, redundantly). By striping data across multiple disks, I/O operations can overlap in a balanced way, improving performance. Because an array of many disks is more likely to suffer a disk failure than a single disk, and so has a lower mean time between failures (MTBF), storing data redundantly is what gives the array its fault tolerance.
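
The combination of striping and redundancy can be illustrated with a minimal sketch. The article does not state which RAID level the system used, so the example below assumes a simple parity scheme (data blocks plus one XOR parity block per stripe); the function names, block size and disk count are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data, n_data_disks, block_size):
    """Split data into stripes of n_data_disks blocks plus one parity block.

    Assumption: a dedicated parity block per stripe, used here only to
    illustrate the idea; it is not the layout of the system in the article.
    """
    stripe_bytes = n_data_disks * block_size
    # Pad with zero bytes so the data fills a whole number of stripes.
    padded = data + b"\x00" * (-len(data) % stripe_bytes)
    stripes = []
    for start in range(0, len(padded), stripe_bytes):
        blocks = [
            padded[start + i * block_size: start + (i + 1) * block_size]
            for i in range(n_data_disks)
        ]
        stripes.append(blocks + [xor_blocks(blocks)])
    return stripes


def rebuild_block(stripe, failed_index):
    """Recover the block held by one failed disk from the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)
```

Because the XOR of all blocks in a stripe is zero, any single missing block, data or parity, equals the XOR of the others. A second simultaneous failure in the same stripe cannot be recovered by this scheme.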

Fault tolerance, reliability, and availability are characteristics that are intimately interdependent. In order to have a reliable system, not only the storage system but also the entire system needs to be fault-tolerant. This suggests that the server, accepting queries from customers and delivering content, needs to have an uptime of five 9s. The solution in this case was to use two Linux servers (Welsh and Kaufman 1996), with a primary server accepting requests and delivering content, and a secondary server maintaining a copy of the same software and indexes -- and the ability to detect primary server failure. Therefore, the concept of reliability and continuous availability consists of detecting and compensating for any single point of failure.
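
One common way for a secondary server to detect primary failure is a heartbeat with a timeout. The sketch below shows that idea only; the article does not describe the detection mechanism actually used between the two Linux servers, and the class name, timeout value and injected clock are assumptions made for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat received from the primary server.

    Hypothetical helper: the timeout and the clock are parameters so the
    logic can be exercised without waiting in real time.
    """

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_heartbeat = clock()

    def record_heartbeat(self):
        """Call whenever a heartbeat from the primary arrives."""
        self.last_heartbeat = self.clock()

    def primary_is_alive(self):
        """The primary counts as alive until the timeout has elapsed."""
        return self.clock() - self.last_heartbeat <= self.timeout_seconds


def choose_active_server(monitor):
    """Return which server should answer requests right now."""
    return "primary" if monitor.primary_is_alive() else "secondary"
```

The timeout is a trade-off: a short value shortens the outage seen by users, while a long value avoids switching over because of a brief network delay.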

Speed is an important variable; having a powerful and fast system that allows fast access and response time saves time for researchers and creates incentives to use the system repeatedly. It is important for system designers to keep up with processor, memory, and bus technology and announcements in order to build one of the best systems possible at any moment, so that the system does not need to be upgraded or replaced after a short amount of time. This choice may lead to a problematic situation: new equipment can be prone to failures, so it is important to have a testbed environment in which systems and software can be tested and stressed to ensure that the new hardware and software are stable before being placed into general operation.

The final key question and driving issue is cost. The cost involved in acquiring RAID systems and cutting-edge servers can be significant. When addressing the cost issue, the question of whether cost is an overriding concern is most relevant. If the cost component is indeed a minor consideration (which is often not the case for universities and libraries), it makes sense to acquire the best equipment that will provide fault-tolerant systems. When cost is a key issue, it is important to investigate the market and acquire technology products offering the same features as expensive competitors, but at an affordable cost, and to assemble locally (integrate) a fault-tolerant system. The system described in this article was built in this manner. If a commercial redundant RAID system had been purchased, a multi-terabyte system would not have been possible.

Architecture for a Fault-Tolerant System Integration

Figure 1. System Architecture

Figure 1 shows the architecture of the system. Three operational criteria are vital to the system success: no loss of data, continuous availability of the system, and good system performance. To satisfy these criteria, we have built a redundant, fault-tolerant multi-terabyte RAID system with good performance and response time.

This section introduces the initial architecture choice, describes the final fault-tolerant solution explaining the technical solution adopted, and explains the reasons for these choices.

When the project started, the architecture consisted of a Linux server and a RAID system storing approximately two terabytes of data. The Linux server accepted user requests, connected to the RAID system through a Small Computer System Interface (SCSI) card, acquired data, managed the index and delivered electronic content. The RAID system architecture consisted of two theoretically redundant RAID controllers. Each RAID controller could communicate through nine low voltage differential (LVD) SCSI channels, and each channel could support up to sixteen hard drives. It was later learned that all the hard drives on a RAID array needed to have the same size and firmware revision. The hard drives were assembled and rack-mounted in disk array cases (eight hard drives per case), and their SCSI interface was connected to the redundant RAID controller channels through SCSI cables. Each redundant RAID controller channel communicated with sixteen hard drives. The RAID controllers were supposedly redundant components that exchanged heartbeats, so that if one controller failed, the other controller would take over the data from the failed component. Each controller could be monitored for its status, load, and integrity via a serial interface.

Unfortunately, after a short period of use, the RAID controllers started to fail without any obvious reason. Several tests were performed to isolate the problem.

It was later discovered that, although it was theoretically possible to maximize the number of hard drives on each redundant RAID controller channel, this scenario proved to be impractical. Constant communications with the RAID controller manufacturer were needed, including site visits. To overcome the problem, the number of hard drives for each redundant RAID controller channel had to be lowered to eight. In this scenario, the redundant RAID controllers functioned correctly and performance increased; however, additional RAID controllers had to be purchased to manage the same amount of storage. While troubleshooting redundant RAID controller failures, other problems arose: data acquisition was interrupted, indexes became unavailable, and some of the Linux system hardware components failed.
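
The effect of halving the drives per channel on the number of controllers is simple arithmetic, sketched below. The nine channels per controller and the limits of sixteen and eight drives per channel come from the text; the total drive count of 144 is a hypothetical figure, since the article does not state how many drives were installed.

```python
import math

CHANNELS_PER_CONTROLLER = 9  # from the text: nine LVD SCSI channels


def controllers_needed(total_drives, drives_per_channel,
                       channels=CHANNELS_PER_CONTROLLER):
    """Smallest number of controllers that can address total_drives."""
    drives_per_controller = channels * drives_per_channel
    return math.ceil(total_drives / drives_per_controller)


# Hypothetical drive count, used only to show the doubling.
HYPOTHETICAL_DRIVES = 144
at_sixteen = controllers_needed(HYPOTHETICAL_DRIVES, 16)
at_eight = controllers_needed(HYPOTHETICAL_DRIVES, 8)
```

With this hypothetical count, one controller suffices at sixteen drives per channel and two are needed at eight, which matches the doubling described for the final design. If controllers are deployed in redundant pairs, both figures double again.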

After this first experience, the single points of failure were identified and studies were done to compensate for the bottlenecks that would render the system inoperable. The initial project ran into some problems because the system was not totally redundant and the supposedly redundant components were not as redundant as described by the vendor. The following components were subjected to hardware-failure testing in that environment: power supplies, SCSI cards, network interfaces, RAID controllers and disk controllers.

The new design for a fault-tolerant RAID system architecture grew out of previous experience, considering all the components while concentrating on the major players of the system. Due to defined user needs, the system requires continuous operation and access. Three operational criteria are vital to success: no loss of data, continuous availability of the system, and good system performance. To avoid loss of data, the following criteria should be met: the redundant RAID controllers need to perform correctly without any firmware failure and the indexes should never be corrupted. If the search engine index fails, queries cannot be made, electronic journals cannot be accessed and retrieved, and new data cannot be added and made available. The optimal situation for having the RAID controllers performing correctly was to limit the number of hard drives for each redundant RAID controller channel, thus doubling the number of controllers. A side-effect of controller failure or system crash was the corruption of indexes used to search the data. A copy of the index was duplicated on a different redundant RAID system, to be accessed in case of a corruption.

In order to have a high-availability system, the hardware components on the Linux side were made redundant. Each redundant RAID system was connected to the Linux system through two redundant SCSI cards, providing fail-over access to the RAID system. The network card was duplicated, because if the network device on the Linux host fails, data cannot be transferred or transmitted. Moreover, a mirror for the Linux server was designed. The Linux server was replaced by a system consisting of two Linux machines identical to the initial server, both running the same operating system and application software simultaneously and independently. The primary Linux server, accepting user requests and delivering content, consists of the Linux kernel, file system and I/O utility subsystem, Red Hat software and packages, Perl modules, monitoring software, and a locally modified version of software provided by Science Server LLC. The secondary Linux server is an exact copy of the primary server and is kept in sync with it through a daily update of all the files and search engine data. The secondary server is connected to the same RAID systems that are accessed by the primary, so that if the primary server fails, the secondary can provide access to them. Monitoring scripts have been added to check the operating system status and the application program.
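
The daily update that keeps the secondary server in step with the primary can be pictured as a one-way copy of new and changed files. The sketch below is an assumption about how such an update might look, not a description of the tool used at the Research Library; the function names and the size-and-timestamp comparison are illustrative choices.

```python
import os
import shutil


def needs_copy(src, dst):
    """A file is copied if it is missing, differs in size, or is older."""
    if not os.path.exists(dst):
        return True
    src_stat, dst_stat = os.stat(src), os.stat(dst)
    return (src_stat.st_size != dst_stat.st_size
            or src_stat.st_mtime > dst_stat.st_mtime)


def sync_tree(primary_root, secondary_root):
    """Copy missing or out-of-date files; return their relative paths."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(primary_root):
        rel_dir = os.path.relpath(dirpath, primary_root)
        target_dir = os.path.normpath(os.path.join(secondary_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if needs_copy(src, dst):
                # copy2 also copies the modification time, so an unchanged
                # file is not copied again on the next run.
                shutil.copy2(src, dst)
                copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(copied)
```

This sketch never deletes files on the secondary, and a file changed on the primary after the daily run stays out of date on the secondary until the next run.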

Therefore, the result is a stable, accessible and well-supported system. The site is stable and has no history of becoming unavailable for short or long periods without warning or notification to users. The system is always accessible; it is not so limited in hours of availability or so overloaded that users are often turned away by slow response time.

Experimental Testbed

In order to evaluate performance, robustness and effectiveness of a fault-tolerant design, a testbed environment was implemented to test, observe and gather results. More often than not, the maintainers of such a system are not the users of the system. Systems of this complexity engender a similar complexity in verifying their operational status.

Two key components were crucial to the testbed process. First, the RAID controllers were tested by stressing file access, to observe the RAID controllers' I/O, effectiveness and performance. Second, parallel searches were run to make sure the Linux system had a good response to multiple searches.
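
A parallel-search test of the kind described above can be sketched as a small harness that issues several queries at once and records how long each takes. The harness below is illustrative only: `fake_search` is a hypothetical stand-in for the real search engine, and the worker count and queries are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(search, query):
    """Run one search and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = search(query)
    return result, time.perf_counter() - start


def run_parallel_searches(search, queries, workers=8):
    """Issue the queries concurrently and summarize the latencies."""
    queries = list(queries)
    if not queries:
        raise ValueError("at least one query is required")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(lambda q: timed_call(search, q), queries))
    latencies = [elapsed for _result, elapsed in outcomes]
    return {
        "count": len(latencies),
        "max_seconds": max(latencies),
        "mean_seconds": sum(latencies) / len(latencies),
    }


def fake_search(query):
    """Hypothetical stand-in for the real search engine."""
    time.sleep(0.01)
    return [query.upper()]
```

In a real test, `fake_search` would be replaced by a function that sends the query to the server under test, and the summary would be compared against the response time the service aims for.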

The concept of a continuously available design should minimize manual intervention; therefore, an ideal fault-tolerant design avoids the existence of single points of failure in the system. One problem related to keeping a system up and operating twenty-four hours a day is determining when it fails and why. This may, at first glance, sound trivial and obvious, but upon closer examination, it can be seen to be somewhat more complex.

One way of sensing a system's operating status is to check network connectivity on a regular basis. Even though this method was utilized, it proved to be an inadequate means of determining a system's uptime. When monitoring the LANL system, many checks were used. Most of the methods utilized simple Perl scripts that would page the appropriate person when needed. Some of the problems this particular machine experienced were web server failure, corruption of file systems, and corrupt indexes. The web pages resided on one file system while the PDF content files resided on another. One Perl script would monitor changes in the server system logs to see if the kernel was having problems. Another Perl script would monitor the status of the web server and query the search program by using the LWP module. This script would check a page every half-hour, and yet another script would perform a query on the search engine and check the result for correctness. Each and every half-hour this machine would be subjected to three to four different tests, depending on the machine's status at certain times.
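
The checks described above (a page fetch, a search query verified for correctness, and a scan of the system logs) can be sketched as follows. The original scripts were written in Perl with the LWP module; this sketch uses Python's standard `urllib` instead, and the log patterns, check names and notification callback are hypothetical placeholders rather than the ones used in production.

```python
import urllib.request


def fetch(url, timeout=30):
    """Return (status, body); an OSError such as URLError counts as failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except OSError as error:  # URLError and HTTPError are OSError subclasses
        return None, str(error)


def check_page(url, fetcher=fetch):
    """The web server check passes if the page is served with status 200."""
    status, _body = fetcher(url)
    return status == 200


def check_search(url, expected_text, fetcher=fetch):
    """The search check passes only if the result contains the expected text."""
    status, body = fetcher(url)
    return status == 200 and expected_text in body


def scan_log(lines, patterns=("Oops", "I/O error", "SCSI error")):
    """Return log lines matching any pattern (patterns are hypothetical)."""
    return [line for line in lines if any(p in line for p in patterns)]


def run_checks(checks, notify):
    """Run (name, callable) checks; call notify(name) for each failure."""
    failures = [name for name, check in checks if not check()]
    for name in failures:
        notify(name)
    return failures
```

Passing the fetcher and the notifier in as arguments keeps the checks testable without a network or a pager. A scheduler such as cron could run `run_checks` every half-hour, matching the interval described above.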

Experimental Results

This section presents some experimental results that describe the responsiveness of our system and some historical data that show how the level of fault-tolerance evolved over several years.

The average response time for a search on the electronic journal, conference, and standard collection, from a local network access, is typically less than four seconds. It is one of our goals to keep this value as low as possible.

An effective test for detecting the presence of system faults is formatting multiple large Linux file systems at the same time. Concurrently formatting large file systems exposes the problems of the redundant RAID controllers and the Linux kernel. Consequently, this is a good indicator of the status of the system. Table 1 outlines the format times of four large partitions that are run concurrently in order to generate a high system load, much higher than the typical load in normal operating conditions of the server.

RAID Size (GB)	Format time (hh:mm)
235	4:29
233	4:23
200	3:35
200	3:51
Table 1. Format time of multiple partitions.
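
The figures in Table 1 can be turned into approximate formatting rates with a few lines of arithmetic. The sketch below uses only the sizes and times from the table; the aggregate rate additionally assumes that all four formats started at the same moment, which the text implies but does not state exactly.

```python
# (size in GB, format time as hh:mm), copied from Table 1.
FORMAT_RUNS = [(235, "4:29"), (233, "4:23"), (200, "3:35"), (200, "3:51")]


def to_hours(hh_mm):
    """Convert an 'h:mm' string to fractional hours."""
    hours, minutes = hh_mm.split(":")
    return int(hours) + int(minutes) / 60


def rate_gb_per_hour(size_gb, hh_mm):
    """Average formatting rate of one partition."""
    return size_gb / to_hours(hh_mm)


def aggregate_rate(runs):
    """Total GB formatted divided by the longest run (assumes a common start)."""
    total_gb = sum(size for size, _time in runs)
    wall_hours = max(to_hours(hh_mm) for _size, hh_mm in runs)
    return total_gb / wall_hours


per_partition = [round(rate_gb_per_hour(size, t), 1) for size, t in FORMAT_RUNS]
combined = round(aggregate_rate(FORMAT_RUNS), 1)
```

Each partition was formatted at roughly 52 to 56 GB per hour, and under the common-start assumption the four concurrent formats together covered about 194 GB per hour.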

Finally, Table 2 shows how the availability of the system evolved over several years. It is worth noting that all the downtime in 2001, less than an hour, was due to scheduled maintenance. In 2002, we had five hours of scheduled downtime for electrical work in the room where the system is located and three hours of unscheduled downtime following site-wide power outages, for replacing a controller card and repairs to the uninterruptible power supplies (UPS).

Period of Time	Availability
July 1999-December 1999	60.000%
2000	95.000%
2001	99.996%
2002	99.903%
Table 2. Availability history of our system.
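
The availability percentages in Table 2 can be converted into hours of downtime, which makes the yearly figures easier to compare with the five 9s target. The sketch below assumes each period covers whole calendar days (184 days for July through December 1999, 366 days for the leap year 2000); those day counts are assumptions, as the article gives only the percentages.

```python
HOURS_PER_DAY = 24

# (period, days in period, availability in percent), from Table 2.
# The day counts are assumptions based on the calendar.
PERIODS = [
    ("July 1999-December 1999", 184, 60.000),
    ("2000", 366, 95.000),
    ("2001", 365, 99.996),
    ("2002", 365, 99.903),
]


def downtime_hours(days, availability_percent):
    """Hours of downtime implied by an availability figure."""
    return days * HOURS_PER_DAY * (1 - availability_percent / 100)


downtime_by_period = {
    period: round(downtime_hours(days, pct), 2)
    for period, days, pct in PERIODS
}

# Five 9s over a 365-day year, expressed in minutes.
five_nines_minutes = downtime_hours(365, 99.999) * 60
```

Under these assumptions the table corresponds to roughly 1,766 hours of downtime in the second half of 1999, 439 hours in 2000, about 21 minutes in 2001 and about 8.5 hours in 2002. The last two values are in line with the downtime described in the text, while five 9s would allow only a little over five minutes per year.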

Discussion

At the time that this system was initially implemented in 1999, commercial solutions with integrated redundant RAID systems were few. The base system described was built from the ground up, by assembling the RAID system and Linux server components. On one hand, this action dramatically decreased the cost per megabyte of storage; on the other hand, it increased the technical staff support requirements. On the Linux server side, the main contributor to increased support requirements was the fact that many manufacturers had not tested hardware and firmware with Linux systems and Linux was continually coming up with new advancements of the kernel, especially on the I/O subsystem and file system management.

On the RAID system side, there were primarily two factors affecting support requirements. Firmware on the RAID controllers required considerable attention, and the SCSI RAID redundant controllers had not been tested in a large system on the order of three terabytes. The RAID controllers were theoretically supposed to work in redundant mode, but in critical conditions they were failing. Moreover, when trying to assemble several hardware components to build a redundant RAID system, there were many variables involved, such as types of hard drives and firmware releases, communication protocols and equipment, temperature factors, cabling, etc. Constant communication with vendors was necessary.

As with any system, there were theoretical and logical constraints. Although it was theoretically possible to maximize the number of hard drives on each RAID controller channel, that scenario had not been tested. The disk space achieved by this system impressed even the most jaded technical-support personnel.

Conclusions

The prospect of supporting this system has taken its toll on the personnel involved. Although there have been costs involved with technical support, the value of having the local library staff learn the intricacies of this system and the inner workings of Linux was priceless.

The intangibles of this system are numerous. What price should be put on a system that offers its customers unlimited access to primary research resources? What cost does 24-hour support have on overworked employees? What price is the workload associated with this support?

Assembling the system was a difficult project that involved continuous effort for a long period of time. Valuable knowledge was acquired along with the achievement of a fault-tolerant system supporting a large scale digital library service. The lessons learned gave us confidence to build several of these systems, which are now deployed at the LANL Research Library. We believe that some of the development that has been done here is applicable in many other digital library environments.

Acknowledgments

We would like to thank Kip Fischer of LANL who started this project, the Library Without Walls team who took turns monitoring its status during times of need and the rest of the Research Library staff for their patience and support during the worst of times.

References

Fox, Armando and Patterson, David. 2003. Self-Repairing Computers. Scientific American 288(6): 54-61.

Gibson, Garth A. and Patterson, David A. 1993. Designing Disk Arrays for High Data Reliability. Journal of Parallel and Distributed Computing 17(1-2): 4-27.

Patterson, David A., Gibson, Garth A., and Katz, Randy H. 1988. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, June 1-3, 1988 (ed. by Haran Boral and Per-Ake Larson), pp. 109-116. New York: ACM Press.

Welsh, Matt and Kaufman, Lar. 1996. Running Linux. Sebastopol, CA: O'Reilly & Associates.

Issues in Science and Technology Librarianship		Summer 2003
DOI:10.5062/F4TM7832