Data deduplication FAQ

Will you be able to achieve the huge data reduction ratios that data deduplication vendors are touting? In this FAQ guide, analyst Jerome Wendt answers the most common data deduplication questions he's hearing today.

Table of contents:

>>Data reduction ratios
>>Data deduplication appliances and backup software
>>Data deduplication products to watch for
>>Achieving data reduction ratios
>>When to deduplicate data
>>Data deduplication using backup software on the host
>>Data deduplication benefits
>>Target data deduplication on a virtual tape library (VTL)
>>Scalability issues
>>Growth issues
>>Data loss from hash collisions

What type of data reduction ratios should you realistically expect using deduplication?

There are two ways you can look at it. Most people are looking at data deduplication in conjunction with backup: backup appliances performing deduplication, or even VTLs. So, looking at it in that context, you'll hear advertised ratios of anywhere from 10X to 500X.

But, realistically, I think it's safe to assume a ratio of anywhere between 13X and 17X. You'll probably see lower ratios with target-based deduplication and higher ratios with source-based deduplication, just because of how they are architected.

Now the flip side of that is the kind of data reduction ratios you might see in an archiving environment. I recently had a chance to speak with companies like Permabit and NEC that are seeing their appliances deployed more in that context. They're seeing ratios from as small as 2X to as large as 200X. So again, your mileage will vary; a ratio of 13X to 15X is a good rule of thumb, but you could see extraordinary results, if you are in the right kind of environment.

Which data deduplication appliances and backup software do you view as enterprise ready?

If you have 20 TB of data to back up, I would say most data deduplication appliances on the market today can handle that without too much of a problem. When you start scaling to 50 TB, 100 TB or 1 PB of backup data, the dynamics really change. You have backups going on. You have recoveries going on. You have data being moved off to tape. It just becomes a much more complex, dynamic environment.

Some of the technologies I personally think are enterprise-ready are Diligent Technologies and Sepaton. Both companies have had products on the market for some time now, and both are having pretty good success with them.

Are there any data deduplication products that listeners should watch for?

Permabit has a really interesting appliance, and hopefully they can pull it together and get their message out there because they have a robust product that's been around for a few years. It's probably more mature than a number of the products out there.

The other product I'm really fascinated by is the NEC Hydrastor. It just had its first product release at Storage Decisions in New York City earlier this year, and I've talked to a couple of customers that had good stories to share about it. It's got a really solid architecture, but it's so new that users haven't had much time to deploy and evaluate it, so NEC hasn't yet been able to work out whatever kinks emerge and get it ready for the enterprise.

How long does it take for companies to achieve these data reduction ratios?

This is another area where your mileage is really going to vary. I had a pretty extensive conversation about this with Network Appliance a while back. Assuming a 5% change in data from the previous week's full backup, we calculated that over 90 days you'll see a data reduction ratio of about 20X. It takes about 90 days to see any significant reduction.

In the short term, you might see a reduction of 2X or 3X over the course of the first month or so. But the longer you retain deduplicated data, the larger the ratios you'll start to see.
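To make that arithmetic concrete, here is a minimal sketch of the kind of calculation involved. It assumes one full backup per week with a fixed percentage of changed data each week; that simplification won't reproduce the exact 20X figure above, which also reflects incrementals and redundancy within each backup, but it does show why the ratio keeps climbing the longer deduplicated backups are retained.

```python
# Simplified model: weekly full backups, where only `change_rate` of each
# new full is unique data that actually has to be stored.

def dedup_ratio(weeks: int, change_rate: float = 0.05) -> float:
    """Logical data backed up divided by physical data stored."""
    logical = float(weeks)                       # each weekly full is 1 unit of logical data
    physical = 1.0 + (weeks - 1) * change_rate   # first full stored whole, then only changes
    return logical / physical

for weeks in (1, 4, 13, 26, 52):
    print(f"{weeks:2d} weeks retained -> ~{dedup_ratio(weeks):.1f}X")
```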

Do you think data should be deduplicated while the backup is occurring or after it is complete?

There are two approaches to this. There's the post-processing architecture, which accepts all the incoming data, stores it on disk and deduplicates it afterward. Then, there is the more common in-line architecture, which deduplicates the data as it arrives.

Personally, I'm more in the post-processing camp right now. I'm still trying to fully understand the benefits of doing it in-line. My concern at the enterprise level is whether these products can do consistent restores, backups and offloads to tape, and whether they can do all of those things while maintaining performance.

From a tactical perspective, I like the post-processing approach. As long as you keep buying disk, you can keep doing backups. It may not be as elegant or nicely designed as the in-line approach, but you can always do the backup and the data can be deduplicated later.

However, my opinion is in a state of flux in this area.
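For reference, here is a rough sketch of the difference between the two approaches. The store object and its methods are hypothetical placeholders for illustration, not any vendor's API.

```python
# Illustrative contrast between in-line and post-processing deduplication.
# The `store` object and its methods are hypothetical, not a real product's API.

def inline_backup(stream, store):
    """Deduplicate each chunk in the data path, before it ever lands on disk."""
    for chunk in stream:
        store.write_deduplicated(chunk)

def post_process_backup(stream, store):
    """Land the raw backup first, then deduplicate in a later pass."""
    for chunk in stream:
        store.write_raw(chunk)        # backup completes at full ingest speed
    store.schedule_dedup_pass()       # space is reclaimed after the backup window
```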

When does data deduplication using backup software on the host make sense?

There are a couple of factors that you really need to consider. If you're bandwidth constrained and you have large amounts of backup data coming over the network, then using data deduplication at the host makes a lot of sense. It can dramatically reduce the amount of bandwidth the backups consume.

It is important that the host can sustain the initial hit. This technology requires memory and CPU processing to perform the data deduplication. It might be a good idea to run the initial backup over a weekend when the backup window is a bit longer.

Some companies are taking steps to mitigate this initial performance hit. I recently talked to Symantec, and, for that initial backup, they are putting some of the intelligence out on the individual nodes so that some level of deduplication takes place before the data is sent, which helps reduce some of the overhead.
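As a rough sketch of why host-side (source) deduplication saves bandwidth: the host hashes each chunk locally, asks the backup target which hashes it already holds and sends only the chunks that are new. The fixed-size chunking and the `have_hashes`, `send_chunk` and `record_manifest` calls below are illustrative assumptions, not any particular vendor's protocol.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # fixed-size chunks; real products often use variable-size chunking

def backup_file(path, target):
    """Source-side dedup sketch: send only the chunks the target doesn't already have."""
    manifest = []      # ordered list of chunk hashes describing the whole file
    chunk_data = {}    # hash -> bytes, kept in case the chunk has to be sent

    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            digest = hashlib.sha256(data).hexdigest()
            manifest.append(digest)
            chunk_data[digest] = data

    # One query to learn which hashes the target is missing (hypothetical API).
    missing = target.have_hashes(list(chunk_data))
    for digest in missing:
        target.send_chunk(digest, chunk_data[digest])   # only new data crosses the network

    target.record_manifest(path, manifest)  # the file is rebuilt from chunk references
```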

Are there any instances where data deduplication will not provide any benefit? Are there instances where it provides superior benefits?

With photos, videos, etc., there's not a lot of duplicate information. If there are a lot of new images being created, then you'll see very little benefit from data deduplication. In that case, you're better off just running differential or incremental backups.

As far as areas in which data deduplication can provide superior benefits, databases have a lot of redundant data and deduplication can have dramatic results. Also, large file systems with small numbers of changes are a good match for deduplication.

How about when doing target data deduplication on the VTL or backup appliance? Is there a similar performance impact?

It has a somewhat different performance impact. The knock on target-based data deduplication is that you aren't really reducing the amount of data you send over the network. You are still sending the full amount of data over the network every time you do a backup. You are reducing the amount of data that you store, but there is always going to be that network hit with target-based deduplication.

There were early concerns about scalability. Have manufacturers largely dealt with those concerns, or do scalability issues still remain?

There are still concerns about scalability. Up to 20 TB you are probably OK. Once you get up over that, you really need to take a look at how the product is architected, how to manage performance and scalability, and even data destruction on the back end. How does the product remove expired data from the index? How is the index rebuilt over time? How do you add more capacity? It really becomes a much more complex proposition at the enterprise level.

So, these issues are still there, and companies need to be aware of them. On the plus side, this technology is so hot right now, and vendors are throwing resources at it trying to bring their products up to snuff. They know that companies want this, and they know it's a huge problem. Expect a lot of changes in the next 12 months.

What happens if you fill up a storage system with deduplicated data and it needs to grow?

This is really the Achilles heel of data deduplication right now.

If you have a lot of growth and wind up filling up an entire system, most products on the market today don't have the ability to add another system behind it. So, you have to invest in a new system. And that leads to another problem: if you get a system that's the same size, you can't simply add the previously deduplicated data to the new system; you have to start the whole deduplication process all over again. If you spend the money to get a much bigger appliance, you can migrate all the data over and start building from there.

At the enterprise level, you are looking for a product that can scale performance and storage independently.

There are also concerns about data loss due to hash collisions. How have manufacturers addressed that issue today?

I've had a chance to speak to a number of vendors about this, and for the most part they say that this is pretty much resolved with the newest hashing algorithms. I was talking to Permabit about this subject, and it said that the only way you can be absolutely sure that each chunk of data is completely unique is to take each new chunk of data and compare it to every other chunk of data stored. It also said that's really not practical because if you do that, it creates a huge time delay. So, the compromise is to use a hashing algorithm.

Other companies take a different approach. ExaGrid, for example, cuts data into large segments, analyzes the contents of each segment to see how the segments relate to each other, and then performs byte-level differencing on each segment and stores the data that way.
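A minimal sketch of the tradeoff Permabit describes: key deduplication decisions on a strong hash, and optionally byte-compare the stored chunk when a hash matches to rule out a collision. The toy ChunkStore class and its verify_bytes flag below are illustrative assumptions, not any vendor's implementation.

```python
import hashlib

class ChunkStore:
    """Toy dedup store: index chunks by SHA-256, optionally byte-verify on a hash hit."""

    def __init__(self, verify_bytes: bool = False):
        self.index = {}                  # hash -> stored chunk bytes
        self.verify_bytes = verify_bytes

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        existing = self.index.get(digest)
        if existing is None:
            self.index[digest] = chunk   # genuinely new data, store it once
        elif self.verify_bytes and existing != chunk:
            # Same digest, different bytes: a hash collision. With SHA-256 this is
            # astronomically unlikely, which is why many products trust the hash alone
            # rather than pay for a byte-by-byte comparison on every match.
            raise RuntimeError("hash collision detected; handle with a secondary key")
        return digest

store = ChunkStore(verify_bytes=True)
ref1 = store.put(b"the same block of data")
ref2 = store.put(b"the same block of data")   # deduplicated: same reference, stored once
assert ref1 == ref2 and len(store.index) == 1
```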

Jerome M. Wendt is the founder and lead analyst of The Datacenter Infrastructure Group, an independent analyst and consulting firm that helps users evaluate the different storage technologies on the market and make the right storage decision for their organization.

This was first published in October 2008
