Appliances that provide global deduplication can deliver the following key benefits:
- Create smaller storage footprints
- Decrease network bandwidth requirements for data replication
- Eliminate data silos
- Lower storage costs
- Simplify and centralize the management of deduplication appliances
The choice of a data deduplication appliance depends on how much data a company anticipates storing and how many geographic locations it needs to protect. Companies with a single location and combined deduplicated archive and backup stores of less than 20 TB will find that almost any deduplication appliance meets their needs. Once deduplicated data stores grow beyond 20 TB, or multiple sites need protection, the need for a backup appliance that supports globally deduplicated data stores becomes more evident.
Global deduplication for ROBOs
Global deduplication is most often implemented as part of a company's scheme for protecting remote and branch offices (ROBOs). Configured in a hub-and-spoke architecture, deduplication appliances are deployed at each ROBO, usually with a larger, master deduplication appliance located at the home office.
Global deduplication occurs only after the data at each ROBO has been backed up, deduplicated and stored on the appliance at that site. At regularly scheduled intervals, either nightly or on weekends, the deduplicated data at each ROBO is replicated back to the master backup appliance in the home office.
To minimize the amount of data replicated back to the home office, an index of the deduplicated chunks of data at the ROBO is first sent to the master backup appliance in the home office. The master backup appliance then compares this list to its own, larger index to identify which chunks of data it already has in its data store.
After this comparison is completed, the master appliance creates a list of the chunks it doesn't have and sends that list back to the appliance at the ROBO, which then replicates only those chunks. This minimizes the amount of data that needs to be sent, the network bandwidth required and the time replication takes to complete. Products that support these types of hub-and-spoke global deduplication configurations include Data Domain's DDX Arrays, EMC Corp.'s new 3D disk libraries, NEC's Hydrastor and Quantum Corp.'s DXi-Series of backup appliances. Appliances from ExaGrid Systems Inc. and Sepaton Inc. have similar but more limited global deduplication features.
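The index exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any of the named products work: the fixed-size chunking, the SHA-256 fingerprints, and the names chunk_ids, MasterAppliance and replicate are all hypothetical choices made for the example.

```python
import hashlib


def chunk_ids(data, chunk_size=4):
    """Split data into fixed-size chunks keyed by SHA-256 fingerprint.

    Fixed-size chunking is an assumption made for brevity.
    """
    chunks = {}
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        chunks[hashlib.sha256(chunk).hexdigest()] = chunk
    return chunks


class MasterAppliance:
    """Hypothetical home-office appliance holding the global chunk store."""

    def __init__(self):
        self.store = {}  # fingerprint -> chunk bytes

    def missing(self, fingerprints):
        # Compare the ROBO's index with the master's own index and
        # return only the fingerprints the master does not hold yet.
        return [fp for fp in fingerprints if fp not in self.store]

    def receive(self, chunks):
        self.store.update(chunks)


def replicate(robo_store, master):
    """Replicate one ROBO's chunks; return how many chunks crossed the WAN."""
    # Step 1: the ROBO sends its index (fingerprints only, no data).
    wanted = master.missing(list(robo_store))
    # Step 2: the ROBO sends only the chunks the master asked for.
    master.receive({fp: robo_store[fp] for fp in wanted})
    return len(wanted)


master = MasterAppliance()
robo_a = chunk_ids(b"AAAABBBBCCCC")
robo_b = chunk_ids(b"BBBBCCCCDDDD")
print(replicate(robo_a, master), replicate(robo_b, master))
```

In this toy run the second office shares two of its three chunks with the first, so only one chunk has to be sent for it; the fingerprints travel in both cases, but they are far smaller than the data they describe.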
It's important to note that, in enterprise environments, global deduplication appliances can have capacity and performance limitations. These limitations may make themselves evident in the following ways:
- The amount of data to deduplicate exceeds the capacity of the master backup appliance. In these circumstances, companies may need to purchase a larger appliance and migrate all of the data to it, or purchase a second appliance. If a second backup appliance is purchased, verify that it can access the deduplication index created by the first appliance. If it cannot, it needs to start deduplicating data from scratch. This creates a separate data silo and recreates the problem that global deduplication was initially intended to solve.
- The master backup appliance has insufficient processing power and memory to support all of the replication and global deduplication functions. The master backup appliance may have to concurrently receive and deduplicate data from ROBOs while handling and deduplicating incoming backup streams in the home office. Managing all of these jobs on a nightly basis could extend backup windows while slowing the deduplication and replication of data from edge appliances to the master appliance.
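The silo effect in the first point can be made concrete with a small Python sketch. It is a simplified model under assumptions: each backup is treated as a list of already-cut chunks, fingerprints are SHA-256 digests, and the function names and sample data are invented for the illustration.

```python
import hashlib


def fingerprint(chunk):
    return hashlib.sha256(chunk).hexdigest()


def stored_chunks(backups_per_appliance, shared_index):
    """Count the chunks physically stored across all appliances."""
    if shared_index:
        # One global index: a chunk is stored once, whichever
        # appliance received it.
        seen = set()
        for backups in backups_per_appliance:
            for chunk in backups:
                seen.add(fingerprint(chunk))
        return len(seen)
    # Separate indexes: each appliance deduplicates only against
    # its own data, so chunks common to both are stored twice.
    total = 0
    for backups in backups_per_appliance:
        total += len({fingerprint(chunk) for chunk in backups})
    return total


first = [b"os-image", b"mail-db", b"docs"]
second = [b"os-image", b"docs", b"crm-db"]
print(stored_chunks([first, second], shared_index=True))
print(stored_chunks([first, second], shared_index=False))
```

With these made-up inputs, a shared index stores four unique chunks while two independent indexes store six, because the two chunks common to both appliances are kept twice.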
About the author: Jerome M. Wendt is lead analyst and president of DCIG Inc.
This was first published in June 2008