
Why global deduplication is important in backup systems today


By W. Curtis Preston

Global data deduplication removes redundant data when backing up to multiple deduplication devices. With global dedupe, when data is sent to a second node, that node recognizes that another node already has a copy of the data, and doesn't make an additional copy.

Global deduplication is also the most important feature missing from target deduplication systems today. The lack of this feature increases both the acquisition cost and the operational cost of such systems. The first step in understanding the value of global dedupe is to understand why people often confuse the term.

Hash-based deduplication systems

Consider the hash-based deduplication systems shown in "Table 1: Hash-based and delta-based deduplication products" below. Hash-based deduplication systems slice data into chunks, create a hash for each chunk, and then perform a hash table lookup to see if that hash has ever been seen before. Delta-based deduplication systems (also shown in Table 1) compare an entire backup (e.g., saveset, image, dump) to a similar previous backup. For example, they look for the delta between the most recent full backup of Elvis and the previous full backup of Elvis.
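
To make the hash-based approach concrete, here is a minimal Python sketch. It uses fixed-size chunks and an in-memory hash table; real products typically use variable-size chunking and persistent indexes, and the chunk size here is purely illustrative:

    import hashlib
    import io

    CHUNK_SIZE = 8 * 1024  # illustrative; real products often use variable-size chunks

    chunk_store = {}  # chunk hash -> chunk data (the pool of unique chunks)

    def dedupe_backup(stream):
        """Split a backup stream into chunks; store only chunks never seen before."""
        recipe = []  # ordered chunk hashes -- enough to reconstruct the backup later
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:
                break
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in chunk_store:    # the hash table lookup
                chunk_store[digest] = chunk  # new chunk: store the data once
            recipe.append(digest)            # duplicates cost only a reference
        return recipe

    # Backing up the same data twice stores the chunks only once.
    dedupe_backup(io.BytesIO(b"A" * 32768))
    dedupe_backup(io.BytesIO(b"A" * 32768))  # adds nothing to chunk_store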

Vendors of hash-based dedupe products often tout how they compare every incoming backup to every other backup they have ever seen, saying that their dedupe is more "global." However, delta-based vendors tout how their dedupe is more granular. Hashing is more efficient with dissimilar data; deltas are more efficient with similar data. "Bake-offs" that compare the deduplication ratio of the two methods often result in a draw.

Since hash-based vendors consider their dedupe to be more global, some representatives of those vendors refer to what they do as global deduplication. The matter is further complicated by EMC Data Domain, which often uses the term "global compression" to describe what it does. (In all fairness, they were using that term long before "deduplication" came into common usage.) However, this is not what we mean when we use the term global deduplication.

The concept of global dedupe comes into play when you purchase multiple dedupe appliances (i.e., nodes). If a vendor supports global dedupe (also known as multi-node dedupe), they will have a cluster or grid of multiple nodes that work together as one. Data sent to one node in the grid is compared to previous data sent to that appliance, and to data sent to any other node in that grid. This allows the customer to load balance backups across all nodes in the grid, while being assured that data common to more than one node will only be stored on one node. All known source deduplication systems support global dedupe; it is with target deduplication systems that this discussion is most relevant.
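
A toy model of the idea, assuming a single shared in-memory index (real grids distribute the index itself across nodes), might look like this:

    import hashlib

    class DedupeNode:
        """One appliance (node) in the grid; holds its share of the unique chunks."""
        def __init__(self, name):
            self.name = name
            self.chunks = {}

    class DedupeGrid:
        """Nodes consult one global index, so a chunk already held by any node
        is never stored again -- whichever node receives the backup."""
        def __init__(self, nodes):
            self.nodes = nodes
            self.global_index = {}  # chunk hash -> name of the node that owns it

        def ingest(self, node, data, chunk_size=8192):
            for i in range(0, len(data), chunk_size):
                chunk = data[i:i + chunk_size]
                digest = hashlib.sha256(chunk).hexdigest()
                if digest in self.global_index:
                    continue                 # some node in the grid already has it
                node.chunks[digest] = chunk  # first copy lands on this node
                self.global_index[digest] = node.name

    # Send the same data to two different nodes; only node1 stores it.
    node1, node2 = DedupeNode("node1"), DedupeNode("node2")
    grid = DedupeGrid([node1, node2])
    grid.ingest(node1, b"B" * 32768)
    grid.ingest(node2, b"B" * 32768)  # node2 stores nothing new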

As you can see in Table 1, most target deduplication vendors offer some level of global deduplication. The more nodes a vendor supports in a globally deduped cluster, the easier things will be for its customers. NEC Corp. supports 55 nodes, ExaGrid Systems Inc. supports 10 nodes, Sepaton Inc. supports eight nodes, FalconStor Software supports four nodes, IBM Corp. supports two nodes, and EMC Data Domain supports two nodes, but only with its fastest system (the DD880) and only for NetBackup OST customers. Quantum Corp. is the only target dedupe vendor to offer no support for global deduplication.

Table 1: Hash-based and delta-based deduplication products

Vendor                                       Hash/Delta   Global dedupe support
EMC Data Domain Global Deduplication Array   Hash         2 nodes, NBU and OST only
EMC Data Domain (all other products)         Hash         No
ExaGrid Systems Inc. EX Series               Delta        10 nodes
FalconStor VTL                               Hash         4 nodes
IBM ProtecTier                               Delta        2 nodes
NEC Corp. HydraStor                          Hash         55 nodes
Quantum Corp. DXi                            Hash         No
Sepaton                                      Delta        8 nodes

How global deduplication lowers operational and acquisition costs

The easiest thing to understand is how global dedupe lowers operational costs. If you can treat all of your global dedupe nodes as one grid, and load balance all of your backups across every node in that grid, configuring the backup system is extremely easy. However, if you purchase multiple nodes that don't work together, you must create and maintain multiple subsets of your backups, and point each subset at only one node. Creating and maintaining these backup subsets makes the backup system harder and costlier to maintain, and pointing each subset at only one node makes the backup system less reliable. The latter also increases management cost, due to the manual workarounds that must be performed if a node fails. The contrast is sketched below.
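
As a hypothetical illustration (the node and subset names here are invented): with global dedupe, any backup can go to whichever node is least loaded; without it, the administrator maintains a static mapping by hand:

    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        used_capacity: int  # bytes currently stored on this node

    # With global dedupe: any backup can go to any node, so just pick
    # the least-loaded one -- no per-subset configuration to maintain.
    def assign_with_global_dedupe(nodes):
        return min(nodes, key=lambda n: n.used_capacity)

    # Without it: a hand-maintained static mapping. It must be rebalanced
    # manually as data grows, and offers no automatic fallback if the
    # assigned node is down.
    STATIC_MAP = {"clients-a-through-m": "node1", "clients-n-through-z": "node2"}

    def assign_without_global_dedupe(subset):
        return STATIC_MAP[subset]  # any new subset means editing the map by hand

    nodes = [Node("node1", 900), Node("node2", 400)]
    assign_with_global_dedupe(nodes)                      # -> node2, the least loaded
    assign_without_global_dedupe("clients-a-through-m")   # -> "node1", fixed by hand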

Global deduplication also reduces acquisition cost, for three reasons. First, consider the difference between the types of nodes used to build the ExaGrid, FalconStor, NEC and Sepaton systems, and those used to build the Data Domain, IBM and Quantum systems. The former can use less expensive nodes because their performance is not capped at one or two nodes. The latter tend to use much more expensive CPUs and RAM, because the power of each node is what drives their performance numbers; buying the latest and greatest CPUs and components costs more than buying the previous generation.

The second reason global deduplication reduces acquisition cost is that it allows a customer to buy today what they need today, and buy tomorrow what they need tomorrow -- without throwing away anything they bought yesterday. Customers buying target dedupe systems from vendors that don't offer global dedupe (e.g., Quantum), offer it only across two nodes (e.g., IBM, EMC Data Domain), or offer it only to certain customers (e.g., EMC Data Domain's NetBackup OST requirement) face a very different proposition. Because they cannot grow their current system by simply adding more nodes, they are forced to do one of three things.


Their first choice is to buy today what they need tomorrow. For example, a customer that needs a Quantum DXi 7500 today will probably be advised to buy a DXi 8500 instead, even if they won't need its performance or capacity for a year or more. In a world of ever-decreasing disk prices, buying today what you need tomorrow will always cost you more money.

Data Domain customers have a slightly different choice. A customer that needs a DD 690 today but grows into a DD 880 in a year or so can purchase just the head of the DD 880 appliance and replace the DD 690 head with it, while keeping the disk they already purchased. The DD 690 head will, of course, go to waste.

Finally, a customer could "upgrade" any of these systems by simply purchasing another system and using the two side-by-side. Besides the operational cost, mentioned above, of performing backups with multiple discrete nodes that don't work together, there is also the acquisition cost created by the difficulty of properly sizing each node. Since it is impossible to get the sizing exactly right, and undersizing such a system would be even worse (resulting in even more waste), customers tend to oversize such systems.

The final reason that global deduplication helps reduce acquisition cost is that it ensures that all backups are compared to previous backups -- regardless of which node they were sent to. Since backup systems are always in a state of flux, and customers must constantly change their backup policies to meet various needs, it helps if they at least don't have to worry about hurting their deduplication ratio by changing their backup configuration. Global deduplication should ultimately result in the best deduplication ratio, and it also reduces the amount of disk the customer must purchase.
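
A back-of-the-envelope example, with purely hypothetical numbers, shows how the ratio suffers without global dedupe. Suppose 10 clients each hold 1 TB of data, 90% of which is common across clients (operating system images, shared applications):

    TB = 1.0
    clients = 10
    per_client = 1 * TB  # logical data per client
    shared = 0.9         # fraction of each client's data common to all clients

    unique_total = clients * per_client * (1 - shared)  # 1.0 TB of truly unique data
    common_once = per_client * shared                   # 0.9 TB of common data

    # Two isolated nodes, five clients each: each node keeps its own copy
    # of the common data, because neither knows what the other holds.
    isolated = 2 * common_once + unique_total  # 2.8 TB stored

    # A two-node grid with global dedupe stores the common data exactly once.
    grid = 1 * common_once + unique_total      # 1.9 TB stored

    print(f"isolated nodes: {isolated:.1f} TB, global grid: {grid:.1f} TB")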

Global, or multi-node, deduplication reduces the acquisition and operational cost of target-based deduplication systems. While it isn't the only decision factor in the complicated job of choosing such a system, it certainly should be a high-priority feature.

About this author: W. Curtis Preston (a.k.a. "Mr. Backup"), executive editor and independent backup expert, has been singularly focused on data backup and recovery for more than 15 years. From starting as a backup admin at a $35 billion credit card company to being one of the most sought-after consultants, writers and speakers in this space, it's hard to find someone more focused on recovering lost data. He is the webmaster of BackupCentral.com, the author of hundreds of articles, and the books "Backup and Recovery" and "Using SANs and NAS."


This was first published in December 2010
