Simply put, data deduplication is the process of eliminating redundant bits in a storage system. But the market is still very much in its growth stage, and the multitude of approaches taken by different vendors and their products can make investigating data deduplication anything but simple.
Among the vendors there are two essential categories: those that perform data deduplication "in-line" and those that perform it "post-process." In-line data deduplication is performed as data flows into the secondary storage system; post-process deduplication is performed once data is already stored.
The advantage of in-line deduplication is that the process is performed only once. Some in-line vendors argue that, at high enough capacities, post-process deduplication can exceed backup windows. However, the advantage of post-process deduplication is that there are no worries about the CPU-intensive deduplication process creating a bottleneck between the backup server and the secondary storage target.
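To make the trade-off concrete, here is a minimal Python sketch of the two approaches. It is an illustration under stated assumptions, not any vendor's implementation: the class names, the 4-byte fixed-size chunks and the in-memory dictionaries are all hypothetical, and real products use far larger, often variable-size chunks and persistent indexes.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical, tiny on purpose so the example is easy to follow


def dedupe_into(data: bytes, blocks: dict) -> list:
    """Store unseen chunks in `blocks` and return the list of digests
    (the "recipe") needed to rebuild `data`."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha1(chunk).hexdigest()
        blocks.setdefault(digest, chunk)  # keep only the first copy
        recipe.append(digest)
    return recipe


class InlineStore:
    """Deduplicates each backup as it flows in; raw data never lands on disk."""

    def __init__(self):
        self.blocks = {}    # digest -> chunk bytes
        self.recipes = []   # one recipe per backup

    def ingest(self, data: bytes) -> None:
        # Hashing happens on the ingest path, so it can slow the backup.
        self.recipes.append(dedupe_into(data, self.blocks))

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.blocks.values())

    def restore(self, index: int) -> bytes:
        return b"".join(self.blocks[d] for d in self.recipes[index])


class PostProcessStore(InlineStore):
    """Lands backups at full size first, then deduplicates in a later pass."""

    def __init__(self):
        super().__init__()
        self.landing = []   # raw backups waiting to be deduplicated

    def ingest(self, data: bytes) -> None:
        self.landing.append(data)  # no hashing during the backup itself

    def deduplicate(self) -> None:
        # The second pass costs extra time and needs full-size landing space.
        for data in self.landing:
            self.recipes.append(dedupe_into(data, self.blocks))
        self.landing.clear()

    def stored_bytes(self) -> int:
        return sum(len(d) for d in self.landing) + super().stored_bytes()
```

In this sketch, the in-line store never holds more than the unique chunks, while the post-process store holds every backup at full size until `deduplicate()` has run, which is the extra pass in-line vendors point to.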
In both cases, experts warn that users shouldn't be too cavalier with disk purchases, especially not in the beginning. "A common misunderstanding is that users will hear that they only need, say, a terabyte to store 10 terabytes (TB) of backups," said W. Curtis Preston, vice president of data protection services at GlassHouse Technologies Inc. "Then they go out and buy a terabyte of disk, only to realize that by definition they need 10 TB for the initial backup," since it's only after that initial backup that bit-level comparisons can be made.
Beyond the in-line vs. post-process debate, there's no shortage of differences -- and further debates to be had -- among different vendors and their approaches to deduplication.
Data Domain Inc. has been shipping product longest and has the largest install base at just over 750 customers. Its appliances, which can be accessed through either a virtual tape library (VTL) or network-attached storage (NAS) interface, range from the branch office-sized DD410 model to the multipetabyte DDX series array. Data Domain performs in-line deduplication and uses the SHA-1 algorithm and a proprietary algorithm as a secondary check. It keeps the comparison index cached in nonvolatile RAM. With Data Domain, an individual data stream is limited to 110 megabytes per second (MBps). The company says it's working on moving to a clustered architecture to aggregate performance, which should be out next year.
Diligent Technologies Corp. offers data deduplication within its ProtecTier VTL product, which is also resold by Hitachi Data Systems (HDS). Diligent performs in-line deduplication by keeping the comparison index in cache on Fibre Channel disk, which it claims makes the process go faster, but could also get expensive. Also in contrast to Data Domain, Diligent uses a proprietary hashing algorithm throughout its deduplication process. Diligent claims better performance numbers than Data Domain, at 400 MBps throughput. Diligent and Data Domain largely target different market segments -- Diligent at the high end and Data Domain in the midrange. Diligent claims 150 customers.
Avamar, founded in 1999, was picked up by EMC Corp. last year for $165 million. It was the first data deduplication company to be acquired by a major vendor. Avamar also performs data deduplication in-line using SHA-1, but does so at the source (the backup server), rather than at the backup target. It uses a central management node to keep track of data for comparison over the whole environment, but does the deduplication in small chunks at each server before the data is sent over the network to the backup target. As such, Avamar's deduplication can reduce network congestion in addition to reducing data at the secondary storage target. Avamar's deduplication product requires the replacement of the existing backup environment. EMC has stated plans to incorporate it into its Legato portfolio and its VTL by next year.
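The network savings of source-side deduplication can be sketched in a few lines of Python. This is a simplified illustration, not Avamar's protocol: `CentralIndex`, `backup_from_source` and the 4-byte chunk size are hypothetical names and values chosen for the example.

```python
import hashlib


class CentralIndex:
    """Hypothetical central node that knows every chunk already stored."""

    def __init__(self):
        self.stored = {}  # SHA-1 digest -> chunk bytes

    def missing(self, digests):
        """Return the digests the backup target does not hold yet."""
        return [d for d in digests if d not in self.stored]

    def upload(self, digest, chunk):
        self.stored[digest] = chunk


def backup_from_source(data: bytes, index: CentralIndex, chunk_size: int = 4) -> int:
    """Chunk and hash `data` on the source, then send only the new chunks.

    Returns the number of payload bytes that had to cross the network.
    """
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Building a dict also collapses duplicates within this one backup.
    by_digest = {hashlib.sha1(p).hexdigest(): p for p in pieces}
    sent = 0
    for digest in index.missing(list(by_digest)):
        index.upload(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent
```

Because only digests are exchanged before any data moves, a second server backing up content the index already knows sends little or nothing over the wire.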
ExaGrid Systems Inc.'s post-process data deduplication comes as part of its NAS backup appliance. Unlike other data deduplication products, ExaGrid does comparisons at the byte level rather than the bit level, claiming this makes for simpler hash tables and better scalability, and leaves less room for bit-level fragmentation errors. ExaGrid's product is also "content aware," which means it understands the common data patterns in major backup software products and can find duplicates accordingly.
FalconStor Software Corp.'s Single-Instance Repository (SIR) feature on its VTL and IPStor product lines has yet to make a full-fledged appearance on the market. The post-process product uses the IPStor virtualization engine and the SHA-1 algorithm (with a secondary check using the MD5 algorithm) to create a separate deduplicated repository for long-term archive data after it is backed up to the VTL. IBM and Sun Microsystems Inc. both OEM the VTL product, though IBM does not offer SIR, and Sun will not offer it until later this year.
Quantum Corp. folded intellectual property acquired with Advanced Digital Information Corp. (ADIC) last year into the DXi3500 and DXi5500 appliances in December. The in-line VTL-based deduplication product uses a patented algorithm belonging to ADIC subsidiary RockSoft. That deduplication has also recently been added as a feature within Quantum's StorNext filesystem, also from the ADIC acquisition, which Quantum bills as an all-in-one data migration and management engine.
NEC Corp. of America, a subsidiary of NEC Corp. of Japan, offers data deduplication as a feature within its HydraStor grid backup appliance, released in March. HydraStor's proprietary deduplication technology, dubbed DataRedux, eliminates data duplication at the subfile level across and within incoming data streams. With HydraStor's grid architecture, controllers are added as capacity is added and every node is aware of every other node, easing performance and management issues sometimes associated with in-line products. NEC claims it reduces storage capacity by up to 75% without interrupting performance.
Network Appliance Inc. (NetApp) announced general availability of block-level data deduplication within its NearStore R200 and FAS storage systems on May 15 after beta testing it in customer environments for the first quarter of this year. The data deduplication development is based on NetApp's Advanced Single Instance Storage (A-SIS), from its SnapLock product. NetApp used a feature of its Write Anywhere File Layout (WAFL) to add A-SIS to its filers. WAFL already calculates a 16-bit checksum for each block of data it stores. For data deduplication, these checksums are pulled into a database and "redundancy candidates" that look similar are identified. Those blocks are then compared bit by bit, and if they are identical, the new block is discarded. The license key is free for NearStore users, and the feature deduplicates data at the block level on primary storage, which makes it unique among data deduplication schemes. However, NetApp has yet to add the capability to its VTL, citing performance concerns.
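The two-stage check described above, a cheap checksum to find candidates followed by a full comparison, can be sketched as follows. This is an assumption-laden illustration rather than NetApp's code: `checksum16` uses the low 16 bits of CRC-32 purely as a stand-in for WAFL's checksum, and `deduplicate_blocks` is a hypothetical helper.

```python
import zlib
from collections import defaultdict


def checksum16(block: bytes) -> int:
    """Stand-in 16-bit checksum (low 16 bits of CRC-32), not WAFL's own."""
    return zlib.crc32(block) & 0xFFFF


def deduplicate_blocks(blocks, checksum=checksum16):
    """Return (unique, refs): the blocks kept, and for each input block
    the index in `unique` that now represents it."""
    unique = []
    candidates = defaultdict(list)  # checksum -> indices into `unique`
    refs = []
    for block in blocks:
        key = checksum(block)
        for idx in candidates[key]:
            # A matching checksum only marks a candidate; the full
            # comparison decides whether the block is really a duplicate.
            if unique[idx] == block:
                refs.append(idx)  # identical, so the new block is discarded
                break
        else:
            unique.append(block)
            candidates[key].append(len(unique) - 1)
            refs.append(len(unique) - 1)
    return unique, refs
```

A 16-bit checksum has only 65,536 possible values, so collisions between different blocks are routine at any real scale. That is why the full comparison is essential: passing a deliberately useless checksum such as `lambda b: 0` still yields correct results in this sketch, only more slowly.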
Sepaton Inc. offers data deduplication on its S2100-ES2 VTL through a software option called DeltaStor. The post-process deduplication uses a proprietary "content-aware" algorithm. Sepaton's claim to fame so far in the data deduplication world is the fact that it uses a process called forward referencing, while other products use reverse referencing. Reverse referencing keeps the original data and replaces each further occurrence with a pointer to it; forward referencing writes the latest version of the data and turns the previous occurrences into pointers to the most recent version. Sepaton claims this method makes restores quicker by keeping the most recent backups intact, since more recent backups are the ones most likely to be restored as a general rule.
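The difference between the two referencing schemes is easiest to see side by side. The Python sketch below is a toy model, not Sepaton's DeltaStor: backups are lists of already-identified chunks, each chunk is assumed to appear at most once per backup, and the function names are hypothetical.

```python
def reverse_reference(backups):
    """Keep data in the OLDEST backup holding it; later copies become pointers."""
    first_seen = {}
    out = []
    for i, chunks in enumerate(backups):
        row = []
        for chunk in chunks:
            if chunk in first_seen:
                row.append(("ptr", first_seen[chunk]))  # points back in time
            else:
                first_seen[chunk] = i
                row.append(("data", chunk))
        out.append(row)
    return out


def forward_reference(backups):
    """Keep data in the NEWEST backup holding it; older copies become pointers."""
    last_seen = {}
    for i, chunks in enumerate(backups):
        for chunk in chunks:
            last_seen[chunk] = i  # later backups overwrite earlier ones
    return [
        [("data", c) if last_seen[c] == i else ("ptr", last_seen[c]) for c in chunks]
        for i, chunks in enumerate(backups)
    ]


def pointers_to_follow(row):
    """Count the indirections needed to restore one backup."""
    return sum(1 for kind, _ in row if kind == "ptr")
```

With three backups such as `[["a", "b"], ["a", "c"], ["a", "b", "d"]]`, the newest backup contains two pointers under reverse referencing and none under forward referencing, which is the basis for the faster-restore claim. The cost, in this model, is that older backups must be rewritten each time newer data arrives.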
Symantec Corp. has the product most comparable to Avamar's, a software feature called PureDisk that it's currently integrating with its NetBackup software. Like Avamar, the product deduplicates data in-line and at the source, though it uses a proprietary algorithm to do so. The latest version of NetBackup, 6.2, supports PureDisk to tape targets and integrates PureDisk into the Backup Reporter backup monitoring tool. Version 6.2 also supports failover between multiple PureDisk servers. The next big release for NetBackup, version 6.5, slated for announcement in June, will offer even more integration between NetBackup and PureDisk, according to early reports.