Reduce storage costs through data deduplication

Data deduplication reduces the amount of data that must be stored, thereby reducing storage costs. This article explores the different approaches to data deduplication.

As the volume of data in an organization grows, repeated data takes an increasing toll on available storage capacity.

For example, a 10 MB PowerPoint presentation copied to 100 users will require 1 GB of storage for the attachments on an Exchange server. The problem gets worse when that 1 GB of duplicated storage is backed up every week. After a year, that 1 GB of wasted space can ultimately demand 52 GB on tape or other backup storage.
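The arithmetic behind this example can be sketched in a few lines of Python. The sketch assumes decimal units (1 GB = 1,000 MB) and one full, undeduplicated backup per week that is retained for the whole year; the variable names are illustrative only.

```python
# Worked version of the example above, using decimal units (1 GB = 1000 MB).
ATTACHMENT_MB = 10    # size of the presentation
RECIPIENTS = 100      # mailboxes that each hold a copy
WEEKLY_BACKUPS = 52   # assumed: one full backup per week, all retained

primary_mb = ATTACHMENT_MB * RECIPIENTS   # copies on the mail server
backup_mb = primary_mb * WEEKLY_BACKUPS   # the same copies, backed up weekly
single_instance_mb = ATTACHMENT_MB        # what one stored instance would need

print(f"Primary storage: {primary_mb / 1000:.0f} GB")
print(f"Backup storage after a year: {backup_mb / 1000:.0f} GB")
print(f"Single stored instance: {single_instance_mb} MB")
```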

Data deduplication technology has emerged to combat the problem of repetitious data. With data deduplication, only one iteration of a file, block or byte is saved to the actual storage media.

Data deduplication offers several benefits. It can achieve data reduction ratios ranging from 10:1 to 50:1. Less stored data means fewer disks and less frequent disk purchases, which reduces storage costs. Less data also means smaller backups, which translates into shorter backup windows and shorter recovery time objectives (RTOs). Smaller backups also allow for longer retention times on virtual tape libraries (VTLs) or archives.

But for deduplication to be effective, data must be held long enough for a comprehensive index of data to develop to deduplicate against. Deduplication yields little benefit for data that is only kept for a week.
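To put reduction ratios in perspective, a ratio of N to 1 means that only 1/N of the original data is stored. The following Python sketch converts a ratio into the share of storage saved; the function name `space_saved` is a hypothetical helper, not part of any product.

```python
def space_saved(ratio: float) -> float:
    """Return the fraction of storage saved at a reduction ratio of ratio:1."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 1 - 1 / ratio

for ratio in (10, 20, 50):
    print(f"{ratio}:1 reduction stores {1 / ratio:.0%} of the data, "
          f"saving {space_saved(ratio):.0%}")
```

Most of the saving arrives early: moving from 10 to 1 up to 50 to 1 raises the share saved only from 90% to 98%.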

Deduplication essentials

Data deduplication (also called intelligent compression or single-instance storage) scans data for repetitious content. At the simplest level, this means locating multiple copies of the same file. But file-level deduplication only matches identical files, so two files that differ by just a few bits will still be considered different.

Today's data deduplication can go much deeper to locate repetitious instances of blocks or bytes, thereby yielding greater storage savings. When the data is actually moved to a backup, archive or replication platform, only the first instance of that data is committed to disk. Subsequent instances are simply denoted with a small stub that references the saved instance.
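The difference between file-level and block-level matching can be illustrated with a short Python sketch. It is an illustration only, under several assumptions: a fixed 4,096-byte block size, SHA-256 from Python's standard `hashlib` module as the hash, and a made-up 10-block file. Real products choose their own block sizes and algorithms.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary

def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each fixed-size block of data separately."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

# Hypothetical 10-block file in which every block has different content.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))

# A second copy that differs from the first by a single byte.
edited = bytearray(original)
edited[5] ^= 0xFF
edited = bytes(edited)

# File level: the whole-file hashes differ, so both copies are stored in full.
same_file = hashlib.sha256(original).digest() == hashlib.sha256(edited).digest()

# Block level: only the block containing the changed byte is new.
shared_blocks = set(block_hashes(original)) & set(block_hashes(edited))

print(f"Identical at file level: {same_file}")
print(f"Blocks shared between the two copies: {len(shared_blocks)} of 10")
```

Fixed-size blocks are the simplest choice for a sketch. An edit that inserts or removes bytes shifts everything after it and changes every later block, which is one reason some products use variable-size chunks instead.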

Each piece of data to be deduplicated is processed using a "hash algorithm" such as MD5 or SHA-1, or sometimes a combination of the two. The hash algorithm returns a value that is effectively unique to each piece of data (in rare cases, two different pieces of data can produce the same hash, which is known as a collision), and the hash is stored in an index. When another piece of data is processed, its hash result is compared to the results already in the index. If the current result already exists in the index, that piece of data is a duplicate, so the new data is not saved. Instead, only a "stub" that points to the existing data is inserted.
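The index-and-stub process can be sketched as a toy in-memory store in Python. This is a minimal sketch rather than any vendor's implementation: the `DedupStore` class and its method names are hypothetical, the hash itself doubles as the stub, and SHA-256 stands in for the MD5 or SHA-1 algorithms named above.

```python
import hashlib

class DedupStore:
    """Toy in-memory store: one saved copy per unique chunk, stubs for the rest."""

    def __init__(self) -> None:
        self.index: dict[str, bytes] = {}  # hash -> the single saved copy
        self.duplicates_skipped = 0

    def write(self, chunk: bytes) -> str:
        """Save the chunk only if its hash is new; always return a stub."""
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in self.index:
            self.duplicates_skipped += 1   # duplicate: nothing new is saved
        else:
            self.index[digest] = chunk     # first instance: commit the data
        return digest                      # the stub that references the data

    def read(self, stub: str) -> bytes:
        """Follow a stub back to the saved copy."""
        return self.index[stub]

store = DedupStore()
chunks = [b"quarterly report", b"meeting notes", b"quarterly report"]
stubs = [store.write(chunk) for chunk in chunks]

print(f"Chunks written: {len(chunks)}")
print(f"Chunks actually saved: {len(store.index)}")
print(f"Duplicates skipped: {store.duplicates_skipped}")
```

Reading back through `store.read(stubs[2])` returns the copy saved by the first write, which is how a duplicate remains recoverable even though its data was never stored a second time.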

Deploying deduplication

Data deduplication can be implemented as hardware appliances or software products. Either implementation can take on various forms, as vendors try to differentiate themselves in this emerging marketplace.

Deduplication can be performed in-band, deduplicating data while it's being written to storage. It can also be performed out-of-band as a separate or secondary process after the data has been written. The in-band process can be more space-efficient but may be slower, because the additional processing required at storage time could lengthen the backup window. The out-of-band process won't slow the backup itself, but it will use more disk space, because data is written in full before it is deduplicated, and it may cause some disk contention during deduplication. Storage administrators should test several deduplication approaches to determine how each works in their particular environment.
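The trade-off between the two approaches can be sketched in Python by tracking how many bytes land on disk before deduplication has taken effect. This is a simplified model under stated assumptions: the function names, the in-memory "disk" and the sample backup stream are all hypothetical, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib

def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def in_band(chunks: list[bytes]) -> tuple[dict[str, bytes], int]:
    """Hash every chunk on the write path; only unique chunks reach disk."""
    store: dict[str, bytes] = {}
    for chunk in chunks:
        store.setdefault(digest(chunk), chunk)  # hashing delays each write
    landed = sum(len(c) for c in store.values())
    return store, landed

def out_of_band(chunks: list[bytes]) -> tuple[dict[str, bytes], int]:
    """Write everything first, then deduplicate in a separate pass."""
    staging = list(chunks)                      # fast write, no hashing yet
    landed = sum(len(c) for c in staging)       # full copy sits on disk
    store: dict[str, bytes] = {}
    for chunk in staging:                       # later pass rereads the disk
        store.setdefault(digest(chunk), chunk)
    return store, landed

# Hypothetical backup stream: four 1,000-byte chunks, two of them unique.
backup = [b"A" * 1000, b"B" * 1000, b"A" * 1000, b"A" * 1000]

inline_store, inline_landed = in_band(backup)
post_store, post_landed = out_of_band(backup)

print(f"In-band: {inline_landed} bytes written before dedup completes")
print(f"Out-of-band: {post_landed} bytes written before dedup completes")
print(f"Same final contents: {inline_store == post_store}")
```

Both paths end with the same stored data. They differ in when the hashing cost is paid and in how much disk is needed in the meantime.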

Hardware-based implementations tend to be more expensive, but typically perform better and are easier to deploy. Data Domain Inc. offers a DD410 hardware appliance for branch offices and a DDX series array. Quantum Corp. offers its DXi3500 and DXi550 appliances. When selecting a hardware appliance, be sure it's compatible with your current backup software and that it will support your current storage volume (e.g., up to 20 petabytes [PB]).

Deduplication is also built into several storage products, including the ProtecTier VTL from Diligent Technologies Inc., the network attached storage (NAS) backup appliance from ExaGrid Systems Inc., the HydraStor grid backup appliance from NEC Corp. of America, the NearStore R200 and FAS storage systems from Network Appliance Inc. (NetApp) and the S2100-ES2 VTL from Sepaton Inc.

When deduplication is software-based, it is generally performed at the backup server (the source) rather than at the backup target (the storage system). This eases network congestion between the backup server and storage system and can be handy when backing up across a WAN. EMC Corp.'s Avamar product and Symantec Corp.'s NetBackup offer software-based deduplication. Software-based deduplication is often less expensive than hardware appliances but involves the use of agents on each system to be backed up, which can increase management and maintenance overhead for IT.

Scalability should be a concern when you consider deploying deduplication. You should understand how storage performance changes as the data deduplication system grows. For example, very large hash index tables may hurt performance. Deduplication vendors are taking steps to address scaling performance issues.
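A rough estimate shows how quickly a hash index can grow. The Python sketch below is a back-of-the-envelope calculation, not a sizing guide: the 20 TiB of unique data, the 8 KiB block size and the 28 bytes per index entry (a 20-byte SHA-1 hash plus an assumed 8-byte location pointer) are all hypothetical figures, and real index formats differ.

```python
def index_bytes(unique_data_bytes: int, block_size: int, entry_bytes: int) -> int:
    """Estimate index size as one entry per unique block (rounded up)."""
    blocks = -(-unique_data_bytes // block_size)  # ceiling division
    return blocks * entry_bytes

TIB = 2 ** 40
GIB = 2 ** 30

unique_data = 20 * TIB   # assumed amount of unique data after deduplication
block_size = 8 * 1024    # assumed 8 KiB blocks
entry_size = 20 + 8      # assumed: 20-byte SHA-1 hash + 8-byte location

size = index_bytes(unique_data, block_size, entry_size)
print(f"Index entries: {unique_data // block_size:,}")
print(f"Estimated index size: {size / GIB:.0f} GiB")
```

Under these assumptions the index is too large to hold in the memory of a typical server, so lookups start to involve disk reads, which is one way a large index can hurt performance.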
