Source-based deduplication tutorial

Learn how one manufacturing firm decided on source-based deduplication, and how you can determine if source dedupe is right for your needs.

By Andrew Burton

Julian Cooper, senior IT administrator at Integrated Control Corp (ICC), recently deployed source-based deduplication after an overhaul of the medium-sized business's backup strategy. The company was performing backup to tape but was struggling with slow restores and failed backups. After moving to a disk backup approach, Cooper began looking for ways to reduce backup data, and considered both data deduplication and archiving.

In this source deduplication tutorial, learn whether source dedupe can help your organization shorten its backup windows. Learn about source deduplication products, source vs. target deduplication, and the pros and cons of each.


Choosing source-based deduplication
How to use source dedupe
Source vs. target deduplication
Additional data deduplication resources


Most of the leading data backup applications now include source deduplication, including CA Inc.'s ARCserve Backup, CommVault's Simpana, EMC Corp.'s Avamar, IBM Corp.'s Tivoli Storage Manager (TSM), and Symantec Corp.'s Backup Exec and NetBackup. ICC is a Symantec Backup Exec shop; it backs up to a Dell PowerVault MD1000 direct-attached storage (DAS) array onsite and uses the Symantec Online Backup Service for off-site backup. "We were backing up between 350 GB and 375 GB for full backups," Cooper said. "When we started deduping, we saved about 50 GB [on a full backup], so that was a huge benefit."
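To put Cooper's figures in proportion, the short Python calculation below turns the quoted numbers (350 GB to 375 GB full backups, about 50 GB saved) into a percentage reduction. The function name is illustrative only, and the result is a rough estimate from the rounded figures in the quote, not a measurement from ICC's environment.

```python
def reduction_percent(full_backup_gb, saved_gb):
    """Return the share of a full backup removed by deduplication, in percent."""
    return 100.0 * saved_gb / full_backup_gb


# Figures quoted by Cooper: full backups of 350-375 GB, about 50 GB saved.
for full_gb in (350, 375):
    pct = reduction_percent(full_gb, 50)
    print(f"{full_gb} GB full backup: about {pct:.1f}% smaller after dedupe")
```

On these numbers the saving works out to roughly 13% to 14% of each full backup.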

And, though they were using Backup Exec before they decided to run deduplication, Cooper took his time evaluating how running source deduplication would affect the rest of his environment. The decision to use it wasn't as simple as just turning it on. He said, "I wish it was that simple. I wish [products] always just worked. Then it'd be so easy, like 'oh new product? This is great.'"

"It was a combination of things. What's the cost? What's the learning curve? What's the cost of potentially expanding out my servers? In the end, it made more sense to use what we had more efficiently rather than buying more or bigger systems," he added.

Editor's Tip: For more information, learn how source data deduplication can ease remote-office backup and recovery.


As the name implies, source dedupe products process deduplication on the server running the backup software before sending data across a network to the backup target. This is a particularly compelling benefit for users looking to alleviate bandwidth constraints. For example, a company might choose source-based dedupe when backing up a remote office to a central data center, reducing the amount of data they have to send across the WAN. This is a major driver for source dedupe today, according to Jeff Boles, senior analyst with the Taneja Group. "If you've only got a couple machines at a remote office, maybe you don't see a need to invest in an expensive Riverbed-type WAN optimization appliance," he said.
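The idea of deduplicating before data crosses the network can be illustrated with a minimal Python sketch. It is not how any of the products named in this tutorial are implemented: the fixed 4 KB chunk size, the SHA-256 fingerprints, and the BackupTarget class are all assumptions made for illustration. The source hashes each chunk, asks the target whether it already holds that fingerprint, and sends only the chunks the target lacks.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real products may chunk differently


class BackupTarget:
    """Hypothetical stand-in for the backup target; stores each unique chunk once."""

    def __init__(self):
        self.chunks = {}         # fingerprint -> chunk bytes
        self.bytes_received = 0  # bytes that actually crossed the "network"

    def has_chunk(self, digest):
        return digest in self.chunks

    def store_chunk(self, digest, data):
        self.chunks[digest] = data
        self.bytes_received += len(data)


def backup_from_source(data, target):
    """Hash chunks on the source and send only those the target lacks.

    Returns the ordered list of fingerprints needed to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if not target.has_chunk(digest):
            target.store_chunk(digest, chunk)  # only new chunks are transmitted
        recipe.append(digest)
    return recipe


def restore(recipe, target):
    """Rebuild the original data from its fingerprint list."""
    return b"".join(target.chunks[digest] for digest in recipe)


# Two remote-office machines that share eight chunks of common data
# and differ in one chunk each.
shared = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))
machine_a = shared + b"x" * CHUNK_SIZE
machine_b = shared + b"y" * CHUNK_SIZE

target = BackupTarget()
recipe_a = backup_from_source(machine_a, target)
recipe_b = backup_from_source(machine_b, target)

logical_bytes = len(machine_a) + len(machine_b)
print(f"Logical data: {logical_bytes} bytes, sent: {target.bytes_received} bytes")
```

In this toy run the second machine sends only its one unique chunk, which is the bandwidth saving described above. The hashing happens on the source machine, so the sketch also shows where the extra processing load lands.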

Within the data center, reducing data at the source can take a lot of strain off your local network. This can be particularly useful in virtualized environments. "If you look at the data within a virtual machine disk file, there is a lot of redundancy across virtual machines," said Lauren Whitehouse, senior analyst with Enterprise Strategy Group. "If you look at a physical system you have your operating system running once and then whatever applications and data. On a host system running multiple virtual machines, that operating system exists multiple times."

"However, there is a tradeoff," said Whitehouse. "The I/O processing to perform the deduplication might put a strain on the physical server being shared by the virtual machines." She went on to say that while you might see resource contention while the backup is running, deduplicated backups take considerably less time to complete. So, it's a matter of weighing one against the other.

Cooper is currently evaluating just that in ICC's environment. He has deployed server virtualization, using Microsoft's Hyper-V platform, but has not virtualized tier-one critical applications yet. "The main thing I want to see is the performance impact of deduping on the virtual system itself," he said. "Anything you add on to the virtual system takes resources, so you have to find the balance between the benefits you see and the strain you are putting on the server." On the flip side, "[source-based deduplication] reduces the horsepower you need in your storage destination, because the target is not responsible for processing the deduplication," said Boles.

Cooper also said that he uses Symantec's backup reporting tools to further optimize backups. "The reporting makes things so much easier when, say, we have an extra 40 gigs in a month," he said. "What's included in the backup? Is it an anomaly? Is it natural growth? Can we archive some of this? It helps us understand what's going on in our environment."

Editor's Tip: For more information, listen to this podcast on implementing data deduplication in a virtual environment.


Certain organizations are not able to live with the performance hit on physical servers that comes with source dedupe. "If you have a high performance, processing intensive environment that doesn't have a lot of downtime, you might be hard pressed to implement source dedupe gracefully," said Boles. That type of environment is generally better served by a target-based system such as Quantum Corp.'s DXi series, IBM's ProtecTIER, NEC Corp.'s HYDRAstor series, FalconStor Software Inc.'s File-interface Deduplication System (FDS), or EMC's Data Domain series.

Target deduplication may also be a better fit for an organization running a backup application that does not have built-in deduplication capabilities or running multiple backup applications. Some organizations may find that source makes sense for some backup jobs and target makes sense for others. Today, some backup software vendors, such as CommVault and Symantec, are responding to this need in the market, offering products that can perform deduplication at the source or at the backup target. However, these products are just emerging.

Finally, there is a perception that source-based deduplication is cheaper than target-based deduplication. "The fact that it is built into a lot of backup software solutions may contribute to the perception that it is cheaper," said Whitehouse. "But most of those vendors typically charge extra for that feature. And, just because you are not buying a target deduplication system doesn't mean that you don't need storage. I think if you look at the total cost of ownership it might be cheaper, but not significantly so."

Editor's Tip: For more information, learn about source vs. target deduplication in this tip.


Data deduplication is one of the hottest technologies in backup and recovery today. Bookmark our special section on data reduction and deduplication for the latest news and expert advice. If you're new to the technology, check out our dedupe for beginners tutorial on deduplication best practices. Finally, test your knowledge with our deduplication explained quiz.
