Recently there has been a lot of discussion about shortening rebuild times for large disk drives in enterprise data storage environments. Faster rebuild technology is available, but many storage administrators don't think enough in terms of hardware RAID and individual drive rebuild times. Perhaps the best way to shorten rebuild times and still have reliable data protection is to not have to rebuild in the first place.
Storage vendors are beginning to understand that data protection is not about protecting disks, but protecting information, and their data protection schemes are evolving to reflect this. There are some novel approaches in the market to solving the problems produced by large and slow drives. Some technologies reduce the overall number of rebuilds a system performs. Other technologies have shifted to information-based data protection schemes in which, rather than mirroring a disk, they mirror information (files, chunks or objects). And some even do a little of each. So how does this impact rebuild times? When you think in terms of rebuilding information rather than a single disk, you can put the power of the system architecture to work, leveraging the massive parallelism opportunity presented by multidisk architectures.
Several technologies in the market today reduce the overall number of drive failures, and thus the number of rebuilds required. In some instances, vendors take unresponsive drives offline to diagnose problems and return them to service if no trouble is found. This is a great approach, as it eliminates the need to perform a full rebuild. While the drive is offline and the system attempts to recover it, the system journals all writes that would have gone to that drive. After a successful recovery, only the data in the journal needs to be rebuilt, not the entire disk.
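The journaling idea can be sketched in a few lines of Python. This is a toy model only, not any vendor's implementation: the class `JournaledDrive` and its methods are hypothetical names, and a real system would journal to other drives rather than to memory.

```python
class JournaledDrive:
    """Toy model of a drive whose writes are journaled while it is offline."""

    def __init__(self):
        self.blocks = {}    # block number -> data stored on the drive
        self.journal = {}   # writes captured while the drive is offline
        self.online = True

    def write(self, block, data):
        if self.online:
            self.blocks[block] = data
        else:
            # Drive is being diagnosed: record the write instead of losing it.
            self.journal[block] = data

    def take_offline(self):
        self.online = False

    def return_to_service(self):
        """Replay only the journaled blocks, then resume normal operation."""
        replayed = len(self.journal)
        self.blocks.update(self.journal)
        self.journal.clear()
        self.online = True
        return replayed


drive = JournaledDrive()
for block in range(1000):
    drive.write(block, b"old")

drive.take_offline()
drive.write(7, b"new")    # journaled, not written to the drive
drive.write(42, b"new")   # journaled, not written to the drive
replayed = drive.return_to_service()
print(replayed, "of", len(drive.blocks), "blocks had to be rewritten")
```

In this sketch the drive holds 1,000 blocks, but only the two blocks written during the outage are touched when it returns to service.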
Some vendors have a two-pronged approach that reduces the overall number of rebuilds required and speeds rebuild time by leveraging grid storage architectures. The first prong kicks in when a drive doesn't respond immediately to an access request. The system responds by doing a mini parity rebuild of the requested data and returning the rebuilt data while taking the non-responsive drive temporarily out of service. This drive then undergoes a brief diagnosis and is returned to service, thereby eliminating the need for a rebuild. Any data written while the drive is offline is written to other available space in the system.
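A mini parity rebuild of a single requested block might look like the following sketch. It assumes simple XOR single parity across three data blocks, which is an illustrative choice; the article does not say which parity scheme these vendors use, and the names `xor_blocks` and `read_block` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks on three drives, plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)


def read_block(index, unresponsive=None):
    """Return a block, rebuilding it from parity if its drive is not answering."""
    if index != unresponsive:
        return data[index]
    # Rebuild only the requested block from the surviving blocks and parity.
    survivors = [blk for i, blk in enumerate(data) if i != index]
    return xor_blocks(survivors + [parity])


rebuilt = read_block(1, unresponsive=1)
print(rebuilt)
```

Only the requested block is reconstructed, so the read can be answered without waiting for the slow drive and without rebuilding anything else.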
The second prong speeds rebuilds by putting the system's grid architecture to work. Most grid-based architectures have capacity or storage nodes and separate processor nodes. Typically, all processor nodes can access all capacity nodes. When data is written, it's broken into a number of fragments. These fragments are then distributed across as many storage nodes as are in the system. Using a default of nine data fragments and three parity fragments (the exact number of parity fragments is user configurable), each of 12 storage nodes gets a fragment. If there are four storage nodes (the minimum configuration), each node gets three fragments.
In the event of a drive failure, the data from that drive is rebuilt, just like in conventional hardware RAID. But unlike conventional RAID, data isn't rebuilt to a single drive; the data is redistributed across the storage nodes, leveraging any available storage capacity. If an entire storage node fails, the data from those drives is rebuilt across the remaining storage nodes. We've seen this type of technology implemented for both parity-protected data and mirrored data. Because the system protects data rather than disk drives and can draw on the power of a grid architecture, rebuilds happen in a fraction of the time it would take for a conventional drive rebuild. This is because it's the information that's being rebuilt, not the exact drive layout.
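The fragment arithmetic above can be checked with a short sketch. Round-robin placement is an assumption made here for illustration, since the article doesn't describe the actual placement algorithm, and `place_fragments` is a hypothetical helper.

```python
def place_fragments(num_nodes, data_fragments=9, parity_fragments=3):
    """Assign fragment numbers to storage nodes in round-robin order."""
    total = data_fragments + parity_fragments
    placement = {node: [] for node in range(num_nodes)}
    for fragment in range(total):
        placement[fragment % num_nodes].append(fragment)
    return placement


twelve = place_fragments(12)  # one fragment per node
four = place_fragments(4)     # minimum configuration: three fragments per node

print({node: len(frags) for node, frags in twelve.items()})
print({node: len(frags) for node, frags in four.items()})
```

With the default of nine data and three parity fragments, 12 nodes hold one fragment each and four nodes hold three each, matching the figures in the text.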
Other vendors seek to leverage their architectures to speed rebuild time and reduce the risk of data loss if multiple drives fail. When a file is written, the data and parity are distributed across the available disk drives in the cluster. In the event of a drive failure, the data required for a rebuild is spread across multiple nodes in the cluster, so drives across the entire cluster are leveraged.
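To get a feel for why spreading the rebuild across the cluster matters, here is a back-of-the-envelope estimate. All of the numbers (a 4 TB drive, 100 MB/sec of rebuild throughput per drive, 40 participating drives) are hypothetical, and the model is idealized: it ignores network, CPU and foreground I/O, and assumes the work divides evenly.

```python
def rebuild_hours(capacity_tb, mb_per_sec_per_drive, participating_drives=1):
    """Idealized rebuild time: capacity divided by aggregate throughput."""
    capacity_mb = capacity_tb * 1_000_000  # decimal terabytes to megabytes
    seconds = capacity_mb / (mb_per_sec_per_drive * participating_drives)
    return seconds / 3600


# Conventional rebuild onto a single spare drive.
single = rebuild_hours(4, 100)
# The same amount of data rebuilt in parallel across 40 drives.
spread = rebuild_hours(4, 100, participating_drives=40)

print(f"single drive: {single:.1f} hours")
print(f"40 drives:    {spread * 60:.1f} minutes")
```

Under these assumptions the single-drive rebuild takes about 11 hours, while the distributed rebuild finishes in under 20 minutes. Real systems won't scale this cleanly, but the direction of the effect is the point.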
Furthermore, shifting data protection strategies from a hardware-based approach to a software-based approach creates even more possibilities. With a hardware-based protection scheme, the choice is often between protecting all of the data or none of it. Information-based data protection opens the door to the possibility of more granular, policy-based information protection.
In the end, different data types require different storage characteristics. Hardware RAID schemes continue to be a good solution for lower-capacity, faster drives and won't go away any time soon. But it wouldn't be surprising to see information-based data protection schemes become more mainstream in tier 1 storage products over time, as vendors continue to simplify administration and build information-centric systems.
There are plenty of vendors offering information-based data protection schemes or rapid rebuild technology. Even in a tough economy, the number of vendors offering technology that accelerates or reduces the need for rebuilds seems to be growing. Remember that when you're evaluating technology that leverages high-capacity commodity disk drives, you should ask your vendor what they're doing to reduce your exposure to data loss during rebuilds.
This article originally appeared in Storage magazine.
About this author: Terri McClure is a senior analyst at the Enterprise Strategy Group, Milford, Mass.