The pros and cons of globally deduplicating data backup appliances

Global data deduplication is a way for companies to achieve higher data reduction ratios, use less storage capacity and keep data online for longer periods of time.

Data deduplication appliances are often deployed under the premise of expediting backups while keeping large amounts of data on disk. But as data retention periods grow and companies look to keep their archived and backup data stores together, individual data deduplication appliances can create new scaling, scope and management problems. To overcome these issues, global deduplication is emerging as a way for companies to achieve higher data reduction ratios and use less storage capacity, while gaining more flexibility to keep data online for longer periods of time.

Appliances that support global deduplication offer the following key benefits:

  • Create smaller storage footprints
  • Decrease network bandwidth requirements for data replication
  • Eliminate data silos
  • Lower storage costs
  • Simplify and centralize the management of deduplication appliances

The choice of a data deduplication appliance depends on how much data a company anticipates storing and how many geographic locations it needs to protect. Companies with a single location and combined deduplicated archive and backup stores of fewer than 20 TB will find that almost any deduplication appliance meets their needs. It is when companies grow their deduplicated data stores beyond 20 TB, or need to protect multiple sites, that the need for a backup appliance supporting globally deduplicated data stores becomes more evident.

Global deduplication for ROBOs

The most prevalent way that global deduplication is implemented is as part of a company's scheme for protecting remote and branch offices (ROBOs). Configured in a hub-and-spoke architecture, deduplication appliances are deployed at each ROBO, usually with a larger, master deduplication appliance located at the home office.

Global deduplication occurs only after the data at each ROBO is backed up, deduplicated and stored on that site's appliance. At regularly scheduled intervals, typically nightly or on weekends, the deduplicated data at each ROBO is replicated back to the master backup appliance in the home office.
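The local deduplication step can be illustrated with a minimal in-memory sketch. This is a simplified model, not any vendor's implementation: it uses fixed-size chunks and SHA-256 hashes, whereas real appliances typically use variable-size (content-defined) chunking and their own index structures.

```python
import hashlib

CHUNK_SIZE = 4096  # bytes; illustrative only -- real appliances tune chunk sizing

def deduplicate(stream: bytes, store: dict) -> list:
    """Store unique chunks in `store` and return the recipe (ordered list
    of chunk hashes) needed to reassemble the stream later."""
    recipe = []
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:       # only previously unseen chunks consume capacity
            store[digest] = chunk
        recipe.append(digest)
    return recipe

store = {}
data = b"A" * 8192 + b"B" * 4096      # two identical chunks plus one unique chunk
recipe = deduplicate(data, store)
print(len(recipe), len(store))        # prints "3 2": 3 chunk references, 2 unique chunks stored
```

Three chunks' worth of backup data consume only two chunks of capacity; the more repetitive the backup streams, the higher the reduction ratio.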

To minimize the amount of data replicated back to the home office, an index of the deduplicated chunks of data at the ROBO is first sent to the master backup appliance in the home office. The master backup appliance then compares this list to its own, larger index to identify which chunks of data it already has in its data store.

After this comparison is completed, the master appliance creates a list of the chunks it lacks and sends it back to the appliance at the ROBO. This minimizes the amount of data that must be sent, the network bandwidth required and the time replication takes to complete. Products that support these types of hub-and-spoke global deduplication configurations include Data Domain's DDX arrays, EMC Corp.'s DL3D disk libraries, NEC's HYDRAstor and Quantum Corp.'s DXi-Series backup appliances. Appliances from ExaGrid Systems Inc. and Sepaton Inc. offer similar but more limited global deduplication features.
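The index-exchange step described above can be sketched as follows. All function and variable names here are illustrative, not a vendor API; chunk stores are modeled as plain dictionaries keyed by chunk hash.

```python
def missing_chunks(robo_index: set, master_index: set) -> set:
    """Master compares the ROBO's chunk index against its own global index
    and returns the set of chunk hashes it does not yet hold."""
    return robo_index - master_index

def replicate(robo_store: dict, master_store: dict) -> int:
    """Ship only the missing chunks to the master; return how many were sent."""
    # Step 1: only the (small) index of hashes crosses the network.
    missing = missing_chunks(set(robo_store), set(master_store))
    # Step 2: only chunks the master lacks are actually transferred.
    for digest in missing:
        master_store[digest] = robo_store[digest]
    return len(missing)

master = {"h1": b"...", "h2": b"..."}               # chunks already at the home office
robo = {"h2": b"...", "h3": b"...", "h4": b"..."}   # chunks backed up at the branch
sent = replicate(robo, master)
print(sent, len(master))  # prints "2 4": 2 chunks sent; master now holds 4 unique chunks
```

Of the ROBO's three chunks, only the two the master lacks cross the WAN, which is the source of the bandwidth savings the article describes.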

It's important to note that, in enterprise environments, global deduplication appliances can have capacity and performance limitations. These limitations may make themselves evident in the following ways:

  • The amount of data to deduplicate exceeds the capacity of the master backup appliance. In these circumstances, companies may need to purchase a larger appliance and migrate all of the data to it, or purchase a second appliance. If a second backup appliance is purchased, verify that it can access the deduplication index created by the first appliance. If it can't, it must start deduplicating data from scratch. This creates a separate data silo and recreates the problem that global deduplication was initially intended to solve.

  • The master backup appliance has insufficient processing power and memory to support all of the replication and global deduplication functions. The master backup appliance may have to receive and deduplicate data from the ROBOs while simultaneously handling and deduplicating incoming backup streams in the home office. Managing all of these jobs on a nightly basis can extend backup windows while slowing the deduplication and replication of data from edge appliances to the master appliance.

Global deduplication allows companies to consolidate and centralize their deduplicated data, but not all data deduplication appliances are created equal. Companies looking to deduplicate and then replicate data from their ROBOs back to a central site need to ensure that the appliances at the central site can scale in performance and capacity to meet their global deduplication needs, and that they provide the needed granularity of control when replicating data. Properly implemented, though, global deduplication can centralize and enhance corporate-wide data management, protection and recovery options while minimizing data stores and related storage costs.

About the author: Jerome M. Wendt is lead analyst and president of DCIG Inc.
