The pros and cons of data deduplication tools

Let us help you decide what type of deduplication tool is right for your organization and weigh the pros and cons of different deduplication technologies. Assistant Editor John Hilliard spoke with Rachel Dines, an analyst with Forrester Research in Cambridge, Mass., about deduplication tools and innovations coming from the dedupe space. Learn about software vs. hardware deduplication tools, inline vs. post-processing deduplication, and the pros and cons of global deduplication. Check out our podcast about deduplication tools or read the transcript below.

Let’s talk about the benefits and drawbacks of deduplication software vs. a dedupe appliance. When is software dedupe a better choice than an appliance, and vice versa?

The main benefit with software deduplication is being able to manage the deduplication out of the same portal you’re managing the backup. In addition, you can do target, you can do source-side, you get a little more flexibility on how you do the deduplication. As of yet, there are not too many offerings out there that allow you to do both source and target, although there are a few. But you do get that with software deduplication.

You can use whatever disk you want on the backend, [but] the major drawback to me on the software-based solutions is that they’re slower and less scalable. To this point, most of the software-based solutions just can’t match the speed and scale of the hardware deduplication tools.

I do think the software solutions are going to get there eventually; those products will catch up and give the hardware solutions a run for their money. But for now, a lot of the software solutions just aren’t going to be able to handle the massive amounts of backup data that large companies deal with.

Organizations also have the options of source and target deduplication. What are the pros and cons of each method of dedupe?

The main benefit you get out of source-side deduplication is that there is less data being backed up to begin with, so it really helps with network congestion. It can also really help with backup windows. The main issue with source-side deduplication is that it can put some overhead on your host. So if you have a host that is running hot, that is running at almost peak utilization, putting another agent on there that’s going to run deduplication is going to be a challenge.

I see people using source-side deduplication a lot in virtual environments and it works great there; it works great in file systems; and it works great in certain applications. In other areas, potentially transactional databases, applications that run hot, it is not going to work quite as well. In general, source-side deduplication is a great tool for backing up virtual environments, file systems and branch offices as well. If you don’t want to put any hardware at the branch office, you can just deploy source-side agents and back up all data over the WAN.

With target-side deduplication, you do have to transfer all the data to the device or the media server before you start deduplication, so you don’t get that reduced bandwidth, but you transfer all of the processing over to the appliance. So that would be the main use case for target-side deduplication.
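To make the bandwidth trade-off concrete, here is a minimal Python sketch rather than any vendor's implementation. It assumes fixed-size chunks and SHA-256 fingerprints, and every name in it (`CHUNK_SIZE`, `split_chunks`, `source_side_bytes_sent`, `target_side_bytes_sent`) is hypothetical. It also ignores the small cost of sending fingerprints over the network.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks (an assumption; products may chunk differently)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def source_side_bytes_sent(data: bytes, known_hashes: set) -> int:
    """The host fingerprints each chunk and sends only chunks not already known."""
    sent = 0
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()  # this hashing is the host overhead
        if digest not in known_hashes:
            known_hashes.add(digest)
            sent += len(chunk)
    return sent


def target_side_bytes_sent(data: bytes) -> int:
    """The host sends everything; the appliance or media server does the work."""
    return len(data)


backup = b"AAAABBBBAAAABBBBCCCC"
print(source_side_bytes_sent(backup, set()))  # 12
print(target_side_bytes_sent(backup))         # 20
```

In this toy backup, two of the five chunks are repeats, so the source-side path sends 12 bytes instead of 20, at the price of hashing every chunk on the host.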

What about the pros and cons of inline dedupe and post-process dedupe?

Inline and post-process [dedupe] both describe different kinds of target deduplication. What that means is, with inline deduplication, the data is processed immediately as it’s ingested. With post-process, the data is deduplicated after the data hits the disk. It doesn’t necessarily have to wait for the backup job to complete, but it will run as soon as the data hits the disk. The main issue to consider with post-process deduplication is you do need a little more storage as overhead, because you’re storing that data and then deduplicating it.

In general, though, the major benefit of post-process is that it is slightly faster at this point. Post-process deduplication used to be a lot faster than inline, and now inline has pretty much caught up. Also, with post-process you can pick and choose what you want to deduplicate and what you don’t. Because it doesn’t have to deduplicate on ingest, you can have certain workloads that are deduplicated and some that aren’t. With inline, everything that passes through has to be deduplicated.
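The storage overhead described above can be illustrated with a rough Python sketch. It is a simplification under stated assumptions: the post-process path here lands the whole stream before deduplicating, which is the worst case (as noted, real products can start as soon as data hits the disk). The function names `inline_peak_storage` and `post_process_storage` are hypothetical.

```python
import hashlib


def inline_peak_storage(chunks) -> int:
    """Deduplicate each chunk on ingest; only unique chunks are ever written."""
    store = {}
    peak = 0
    for chunk in chunks:
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
        peak = max(peak, sum(len(c) for c in store.values()))
    return peak


def post_process_storage(chunks):
    """Write everything first, deduplicate afterward. Returns (peak, final) bytes."""
    landing_area = list(chunks)               # raw data on disk at full size
    peak = sum(len(c) for c in landing_area)  # extra capacity needed up front
    store = {}
    for chunk in landing_area:                # later deduplication pass
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    final = sum(len(c) for c in store.values())
    return peak, final


stream = [b"AAAA", b"BBBB", b"AAAA", b"AAAA"]
print(inline_peak_storage(stream))   # 8
print(post_process_storage(stream))  # (16, 8)
```

Both paths end with the same 8 bytes stored, but the post-process path briefly needs room for all 16 bytes, which is the "little more storage as overhead" mentioned in the answer.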

And global dedupe – is it only important for large data sets?

With global deduplication, there’s a single deduplication index that can span over multiple system components, multiple nodes, multiple systems potentially. Local deduplication means that the index is only on a single component. If you’re buying multiple nodes of a deduplication appliance, can it deduplicate across those nodes? This becomes part of your scale-out approach for your deduplication appliance.
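The difference between a local and a global index can be sketched in a few lines of Python. This is an illustrative model only, assuming each node receives a list of chunks and fingerprints them with SHA-256; `stored_chunk_count` and its `global_index` flag are hypothetical names, not part of any product.

```python
import hashlib


def stored_chunk_count(nodes_data, global_index: bool) -> int:
    """Count chunks stored across all nodes under a local or a global index."""
    shared = set()  # single index spanning every node (global case)
    total = 0
    for node_chunks in nodes_data:
        # Local deduplication: each node starts with its own empty index.
        index = shared if global_index else set()
        for chunk in node_chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in index:
                index.add(digest)
                total += 1
    return total


nodes = [
    [b"os-image", b"db-file"],  # chunks landing on node 1
    [b"os-image", b"logs"],     # chunks landing on node 2
]
print(stored_chunk_count(nodes, global_index=False))  # 4
print(stored_chunk_count(nodes, global_index=True))   # 3
```

With local indexes, the chunk shared by both nodes is stored twice; with a global index it is stored once, and the savings grow as more nodes hold overlapping data.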

What are the most significant developments in the dedupe space in the past year or so?

Deduplication as an algorithm itself really hasn’t changed in a long time. What has been going on is that deduplication appliances have been getting faster and larger. We’re starting to see ingest speeds in the tens and twenties of terabytes per hour, which would have been unheard of a few years ago.

That ability to integrate and potentially move the deduplication index into the software is really interesting. This is basically the ability to take the EMC Data Domain deduplication index and move it into the software, shifting that workload to the media server or potentially the agent.

I think we will be seeing more of that type of solution out there, and I think that’s one of the most exciting things that’s been happening in this space.
