Published: 12 Jan 2006
Cut big backups down to size
Data-reduction technologies can slash the amount of data that gets backed up, making disk-based backup a cost-effective alternative.
The world of backup is in a state of flux, although it may not appear so at ground level. There's more innovation and a greater variety of choices than ever before. The underlying enabler for this torrent of change is the availability of low-cost, high-capacity disk. But it would be misleading to view disk-based backup as a monolithic approach.
If you've recently invested in disk-based backup--or are considering it--you may have experienced sticker shock at the overall cost of moving to disk. The benefits of disk may be apparent, but when your vendor plans a configuration for you, the amount of capacity required may surprise you. Accurately sizing a tape library and planning for growth are important tasks in tape-based architectures, but it's even more critical to get it right for disk.
Traditional backups require a significant multiple of the primary data being protected. A commonly used ratio for tape backup is 10:1, but depending on retention policies and administrative practices, this can grow to upwards of 50:1 in some dysfunctional cases. Now shift to a disk-based backup scenario and consider the impact. Realistically, you'll likely do fewer full backups and more incrementals with disk technology; with traditional compression techniques (assuming your disk-based backup product supports compression), you may "only" need storage for three to six times your primary data.
Can you afford to buy sixfold capacity for backup? Factoring in data growth rates, this is a huge problem. The power and cooling impact alone, not to mention the equipment cost, could make you reconsider. Without a means to address this issue, hard-dollar total cost of ownership analysis would continue to favor tape, and the justification of disk-based backup purchases would be based largely on risk reduction or improved service rather than on cost savings--a much tougher sell to a cost-conscious CFO.
But if backup data required dramatically less storage than primary data, the value proposition would swing heavily in favor of disk. Data-reduction technology makes that tantalizing possibility a reality. It's currently being deployed in a range of products and providing efficiencies of 10 times or more in storage utilization.
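The arithmetic is worth working through. The sketch below combines the figures cited above -- a 6x backup-to-primary multiplier and a 10x reduction factor -- against a hypothetical 10TB of primary data; all numbers are illustrative, not any vendor's sizing guidance.

```python
# Back-of-the-envelope backup capacity sizing (all figures illustrative).
primary_tb = 10.0      # hypothetical amount of primary data under protection
multiplier = 6         # backup-to-primary ratio with compression (3x-6x range)
reduction = 10         # data-reduction factor (10x or more, per vendor claims)

conventional_tb = primary_tb * multiplier   # disk needed without data reduction
reduced_tb = conventional_tb / reduction    # disk needed with 10:1 reduction

print(conventional_tb)  # 60.0 -- sixfold capacity without data reduction
print(reduced_tb)       # 6.0  -- less disk than the primary data itself
```

At 10:1, the backup store shrinks below the size of the primary data it protects, which is what flips the disk-vs.-tape cost comparison.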
Factoring and commonality
Each time a file is modified and backed up, it's likely that most of the data in that file has been previously backed up. Backup vendors are aware of this, and while products like IBM's Tivoli Storage Manager (TSM) and Symantec/Veritas' NetBackup have options to enable subfile or block-level incremental backups, these have generally been deployed on a limited basis, largely because of the performance impact of restoring many subfiles from tape.
Other vendors have taken the concept of subfile "deltas" and added commonality factoring techniques that extend it much further. Unfortunately, it seems that each vendor has coined a different term to describe its approach, including de-duplication, commonality factoring, single-instance store, data coalescence, capacity-optimized storage, content-addressed storage (CAS), common file elimination and many others.
The goal is to improve backup performance and minimize capacity requirements by capturing only the actual changes, and then minimizing the amount of data backed up and stored. Capacity savings are achieved by recognizing redundant data and avoiding the storage of multiple copies of data.
The data coalescence process examines data at a specified unit of granularity to identify redundancies, indexes common units of data and then stores only additional unique data. Specific approaches vary based on several factors, such as the level of granularity. With a finer granularity (i.e., smaller data units), greater commonality can be discovered and, therefore, greater storage savings can be realized. Indexing at the file level, for example, eliminates storing multiple backups of the same file; but adding one word to a 1MB Word document represents a change that would require another entire file to be saved. Indexing at the subfile or block level would store only the portion of the file with changed data. But greater granularity requires more indexing, which could impact performance when accessing information.
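The indexing mechanism described above can be sketched in a few lines: split the data into blocks, index each block by a hash of its contents, and store only blocks not already in the index. This is a minimal illustration, not any vendor's implementation; the 4KB block size and SHA-256 hash are assumptions for the example.

```python
import hashlib

BLOCK_SIZE = 4096
store = {}  # hash -> block data: the "single-instance" store

def backup(data: bytes) -> list:
    """Return a recipe (list of hashes) referencing stored unique blocks."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:      # only additional unique data is stored
            store[digest] = block
        recipe.append(digest)
    return recipe

def restore(recipe: list) -> bytes:
    return b"".join(store[h] for h in recipe)

original = b"A" * 8192
backup(original)                     # two blocks, but only one unique
modified = original + b"B" * 4096    # append one new block of data
backup(modified)                     # only the new block gets stored
print(len(store))                    # 2 -- one all-A block, one all-B block
```

The trade-off in the text is visible here: smaller blocks find more duplicates but make the index larger, and every restore must chase references through that index.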
Another consideration is the use of fixed- vs. variable-length data elements. A fixed-element approach may be fine for structured data such as databases, but with unstructured data, many commonalities may be missed. Consider the files shown in the figure "Fixed-block commonality factoring" at right. Using a fixed-length approach, no commonality would be discovered and each file would be stored as a separate set of objects.
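The fixed- vs. variable-length trade-off can be demonstrated directly. In the sketch below, inserting a single byte shifts every downstream fixed block, so almost nothing matches; a content-defined boundary rule realigns after the edit, so most chunks still match. The window size and checksum-mask boundary rule are simplified illustrations, not any product's algorithm.

```python
import hashlib
import random

def fixed_chunks(data: bytes, size: int = 64):
    """Split data into fixed-length blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, window: int = 16, mask: int = 0x1F):
    """Cut a chunk wherever a checksum of the last `window` bytes hits
    the mask -- boundaries depend on content, not on byte offsets."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if sum(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

def unique(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}

base = random.Random(42).randbytes(4096)
edited = base[:100] + b"X" + base[100:]   # insert one byte near the front

fixed_shared = len(unique(fixed_chunks(base)) & unique(fixed_chunks(edited)))
var_shared = len(unique(variable_chunks(base)) & unique(variable_chunks(edited)))
# fixed_shared: only the block before the insertion still matches;
# var_shared: boundaries resynchronize, so most chunks are shared.
print(fixed_shared, var_shared)
```

This is why variable-length approaches find commonality in unstructured data that fixed-length schemes miss, at the cost of more complex boundary detection.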
Regardless of the level of factoring, the true benefits of a data-reduction technology are realized when aggregated across multiple servers. A single backed-up system has significant redundancy, but when many systems are backed up to a common server, there's likely to be an even greater potential for redundancy and, therefore, for data reduction. Imagine coalescing backup data down to one-tenth or one-twenty-fifth of its original total size. Properly deployed, this technology can make disk cheaper than tape, turning the tape vs. disk TCO comparison on its head.
Data-reduction product options
Data-reduction products have been available from different vendors for some time. EMC Corp. put object-based CAS on the map with its Centera product, but the firm draws a sharp distinction by positioning this product for fixed-content data archiving, not backup.
In the backup realm, there are several software and hardware products using data-reduction technology. Here's a brief rundown:
- Iron Mountain Inc.'s Connected DataProtector family of products uses block-level data-reduction technologies to protect laptops and desktops, as well as Windows file servers.
- Storactive Inc.'s LiveBackup products marry two technologies--continuous data protection and data reduction--to protect laptops and desktops. The company's LiveServ for Exchange product also applies these technologies to protect Microsoft Exchange using less space than an Exchange information store.
- Avamar Technologies Inc. is a pioneer in applying commonality factoring and CAS to backup. Its Axion family minimizes network bandwidth and storage capacity through a full range of client and server applications and appliances designed to supplant the traditional backup environment.
- Data Domain Inc.'s DD400 and DD200 appliances provide an NFS- or CIFS-accessible "capacity-optimized storage" array that can provide disk storage for traditional backup or other applications.
- Asigra Inc.'s Asigra Televaulting for Enterprises uses an agentless backup design and incorporates common file elimination techniques to minimize storage requirements. Designed to eliminate tape, this product enables highly efficient remote replication.
- Virtual tape libraries (VTLs) are among the most prevalent of the many disk-based backup technology approaches. Diligent Technologies Corp., with its ProtecTier platform, is the first VTL vendor to apply data-reduction techniques to virtual tape libraries, providing data reduction of 25 times or more through its HyperFactor technology.
- Software toolkit developer Rocksoft Ltd. offers Blocklets to enable other developers to incorporate data-reduction functionality into their products. Blocklets' technology is called Data Streamlining, and it uses variable-sized, sub-block-level commonality factoring to provide reductions of up to 98%.