Don't blame the disks: Why data storage fails

By Rick Cook
31 Mar 2008 | SearchDataBackup.com

Although storage administrators tend to treat "hardware failures" and "disk failures" as synonymous, a recent study by researchers at the University of Illinois at Urbana-Champaign and NetApp Inc. suggests that most failures in backup data storage systems, as well as in primary storage arrays, are not caused by disk failures.

In a paper titled "Are disks the dominant contributor for storage failures?", Weihang Jiang and his co-authors Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky of NetApp found that up to 80% of storage system failures aren't in the disks at all. The authors surveyed approximately 39,000 storage systems with 155,000 enclosures containing roughly 1.8 million disks over a period of 44 months. The paper was presented at February's USENIX Conference on File and Storage Technologies (FAST '08).

According to Jiang and his co-authors, disk drives accounted for only 20% to 50% of the storage system failures. The rest came from other causes, most notably interconnection problems.

Fixing a disk drive interconnect
Sometimes fixing a bad interconnect is as simple as reseating the connector. Other times, a quick squirt of contact cleaner will solve the problem. Don't forget to check the other end of the connection for problems as well.

Of course, everyone who works on hardware knows that connections are a notorious source of mysterious troubles. Interconnect problems become more likely as we cram more and more physically smaller drives into a single system. For example, the research paper points out that the EMC Corp. Symmetrix DMX-4 array can hold up to 2,400 drives.

In fact, the likelihood of an interconnect failure probably rises faster than the likelihood of a drive failure in these ultra-crowded enclosures. Crowding definitely makes it harder to get at the connections to check them.
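
As a rough illustration of why adding connections raises the odds of at least one interconnect fault, the sketch below computes the chance that any one of a number of links fails. It assumes every link fails independently with the same probability, which is a simplification, and an optimistic one given that the paper reports failures are not random. The function name and the per-link probability are hypothetical and are not taken from the paper.

```python
def prob_any_link_failure(per_link_prob: float, num_links: int) -> float:
    """Chance that at least one of num_links links fails.

    Assumes links fail independently with identical probability,
    which is an illustrative simplification only.
    """
    if not 0.0 <= per_link_prob <= 1.0:
        raise ValueError("per_link_prob must be between 0 and 1")
    if num_links < 0:
        raise ValueError("num_links must be non-negative")
    # P(at least one failure) = 1 - P(no link fails)
    return 1.0 - (1.0 - per_link_prob) ** num_links


# Hypothetical per-link failure probability, chosen only for illustration.
P_LINK = 0.001

for links in (24, 240, 2400):
    print(links, round(prob_any_link_failure(P_LINK, links), 3))
```

The point is only the trend: with a fixed per-link risk, the chance of some fault somewhere climbs steadily as the link count grows.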

As the paper notes: "There are other storage subsystem failures besides disk failures that are treated as disk faults and lead to unnecessary disk replacements." This helps explain a couple of well-known but puzzling phenomena associated with storage system failures. The most obvious is that users report in-service disk failure rates that are between two and four times the rates calculated by manufacturers.

It also helps explain the phenomenon of Trouble Not Found (TNF): returned drives that check out fine back at the factory. Some manufacturers put the rate of TNF drives as high as 50% of all those returned as failed. While there are many other reasons a good drive might be pulled as failed (such as problems in the protocol stack), replacing a drive means reseating its connection, which is likely to fix a flaky connector, at least temporarily.
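
The link between the TNF rate and inflated failure reports can be sketched with simple arithmetic. The snippet below assumes, as a simplification, that every drive pulled and returned counts as a user-reported failure and that every TNF drive was in fact healthy. The function name is hypothetical.

```python
def reported_to_true_ratio(tnf_fraction: float) -> float:
    """Ratio of drives reported as failed to drives that really failed.

    tnf_fraction is the share of returned drives that test fine.
    If a fraction f of returns are healthy, only (1 - f) are real
    failures, so reports overstate the true rate by 1 / (1 - f).
    """
    if not 0.0 <= tnf_fraction < 1.0:
        raise ValueError("tnf_fraction must be in the range [0, 1)")
    return 1.0 / (1.0 - tnf_fraction)


# The 50% TNF rate cited by some manufacturers.
print(reported_to_true_ratio(0.5))  # 2.0
```

Under these assumptions, a 50% TNF rate alone would make reported failure rates twice the true rate, which matches the low end of the two-to-four-times gap users describe.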

But not all non-disk failures are interconnect-related. Other causes of trouble include protocol stacks and shelf enclosures. Perhaps the most depressing conclusion in the paper is that storage failures aren't random: one failure increases the chances of another failure of a similar type.

The authors have several suggestions for reducing non-disk failures. The simplest is to span the drives in a RAID group across more than one shelf in the drive enclosure. Another is to use multiple independent interconnects for greater redundancy. The paper says that storage subsystems with two independent interconnects experienced a 30% to 40% lower annual failure rate than those with a single interconnect.
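
The first suggestion can be sketched as a placement check. The example below is a minimal illustration and not the paper's method: it spreads the drives of a RAID group round-robin across shelves and asks whether losing any single shelf would exceed the number of drives the group can lose. The group size, shelf count, tolerance of two lost drives, and all function names are assumptions made for illustration, and a shelf failure is modeled as losing every drive on that shelf at once.

```python
from collections import Counter
from typing import List


def assign_round_robin(num_disks: int, num_shelves: int) -> List[int]:
    """Return the shelf index for each disk, spreading disks evenly."""
    if num_shelves < 1:
        raise ValueError("num_shelves must be at least 1")
    return [disk % num_shelves for disk in range(num_disks)]


def max_disks_on_one_shelf(placement: List[int]) -> int:
    """Largest number of the group's disks sharing a single shelf."""
    return max(Counter(placement).values())


def survives_shelf_loss(placement: List[int], tolerated_losses: int) -> bool:
    """True if losing any one shelf stays within the group's tolerance."""
    return max_disks_on_one_shelf(placement) <= tolerated_losses


# Hypothetical 8-disk group that can tolerate two lost disks.
one_shelf = [0] * 8                    # every disk on shelf 0
spanned = assign_round_robin(8, 4)     # two disks on each of four shelves

print(survives_shelf_loss(one_shelf, tolerated_losses=2))  # False
print(survives_shelf_loss(spanned, tolerated_losses=2))    # True
```

With every drive on one shelf, a single shelf or interconnect fault takes out the whole group. Spanning four shelves limits the damage to two drives, which this hypothetical group can absorb.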

About the author: Rick Cook specializes in writing about issues related to storage and storage management.


