Although storage administrators tend to think that "hardware failures" and "disk failures" are synonymous, a recent study by researchers at the University of Illinois at Urbana-Champaign and NetApp Inc. suggests that the majority of failures in backup data storage systems, as well as primary storage arrays, are not caused by disk failures.
In a paper titled "Are disks the dominant contributor for storage failures?", Weihang Jiang and his co-authors Chongfeng Hu, Yuanyuan Zhou and Arkady Kanevsky found that up to 80% of storage system failures aren't in the disks at all. The authors surveyed approximately 39,000 storage systems with 155,000 enclosures containing roughly 1.8 million disks over a period of 44 months. The paper was presented at February's USENIX Conference on File and Storage Technologies (FAST '08).
According to Jiang and his co-authors, only between 20% and 50% of the disk system failures were due to the disk drives. The rest came from other causes, most notably interconnection problems.
In fact, the likelihood of an interconnect failure probably increases faster than the likelihood of a drive failure in ultra-crowded enclosures. Crowding definitely makes it harder to get at the connections to check them.
As the paper notes: "There are other storage subsystem failures besides disk failures that are treated as disk faults and lead to unnecessary disk replacements." This helps explain a couple of well-known but puzzling phenomena associated with storage system failures. The most obvious is that users report in-service disk failure rates that are between two and four times the rates calculated by manufacturers.
It also helps explain the phenomenon of Trouble Not Found (TNF) -- returned drives that check out fine back at the factory. Some manufacturers put the rate of TNF drives as high as 50% of all those returned as failed. While there are many other causes for pulling a good drive as failed (such as problems in the protocol stack), replacing a drive is likely to fix a flaky connector, at least temporarily.
But not all of the non-disk failures are interconnect-related. Other causes of trouble include protocol stacks and shelf enclosures. Perhaps the most depressing conclusion in the paper is that storage failures aren't random: the appearance of one failure increases the chances of another failure of a similar type.
The authors have several suggestions for reducing the problem of non-disk failures. The simplest is to span the drives in a RAID group across more than one shelf in the drive enclosure. Another is to use multiple interconnects between drives for greater redundancy. The paper says that storage subsystems with two independent interconnects experienced a 30% to 40% lower annual failure rate than those with a single interconnect.
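To make the shelf-spanning suggestion concrete, here is a minimal Python sketch of one way to lay out a RAID group so that its members are spread across shelves. It is an illustration only, not the paper's method or any vendor's tool: the shelf names, disk IDs and both function names are hypothetical, and the round-robin policy is simply an assumed way to do the spreading.

```python
from collections import Counter


def span_across_shelves(shelves, group_size):
    """Choose group_size disks, taking one from each shelf in turn.

    shelves: dict mapping a (hypothetical) shelf name to its free disk IDs.
    Returns a list of (shelf, disk) pairs.
    """
    # Copy the lists so the caller's inventory is not modified.
    pools = {name: list(disks) for name, disks in shelves.items()}
    group = []
    while len(group) < group_size:
        progressed = False
        for name, pool in pools.items():
            if pool and len(group) < group_size:
                group.append((name, pool.pop(0)))
                progressed = True
        if not progressed:
            raise ValueError("not enough free disks for the requested group")
    return group


def max_disks_lost_to_one_shelf(group):
    """Worst case: how many group members vanish if a single shelf fails."""
    return max(Counter(shelf for shelf, _ in group).values())


if __name__ == "__main__":
    inventory = {
        "shelf0": ["d0", "d1", "d2", "d3"],
        "shelf1": ["d4", "d5", "d6", "d7"],
        "shelf2": ["d8", "d9", "d10", "d11"],
    }
    group = span_across_shelves(inventory, 6)
    print(group)
    print("worst-case loss from one shelf:", max_disks_lost_to_one_shelf(group))
```

The second function is the number to watch. If every member of a group sits on one shelf, a single shelf or interconnect problem takes out the whole group; spreading the members keeps that worst case as small as the available shelves allow.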
About the author: Rick Cook specializes in writing about issues related to storage and storage management.