
Cloud-based backup and software deduplication explained

In the second part of our series on deduplication in 2013, independent backup expert Brien Posey explores software deduplication and cloud backups.

This is the second part of a nine-part series on deduplication. For easy access to all nine parts, check out our quick overview of Deduplication 2013.

Today, practically every backup product vendor includes a deduplication feature in its software. Over the last few years, backup deduplication has become the norm for major backup software vendors, such as Veeam, Symantec and Acronis. Although deduplication has become little more than a checklist item for backup software providers, not all deduplication features are created equal. Deduplication has become an essential feature for cloud-based backup providers as well. All of the major Backup as a Service (BaaS) providers employ some form of data deduplication. Part 2 of our series on deduplication in 2013 takes a look at software dedupe and cloud backups.

Software deduplication

Some backup applications perform source-based deduplication. This form of deduplication compacts the data on the source server before the backup is performed. Although the deduplication process is completely transparent, deduplication is, by its very nature, CPU-intensive; it also tends to generate a considerable amount of disk I/O.
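To make the CPU and I/O cost concrete, here is a minimal sketch of source-side deduplication. It is an illustration only, not any vendor's implementation: the fixed 4 KB chunk size, the SHA-256 fingerprint and the names `deduplicate`, `restore` and `store` are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real products often use variable-size chunks


def deduplicate(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Split data into chunks and keep only chunks the store has not seen.

    Returns the ordered list of chunk fingerprints (the "recipe") needed
    to rebuild the data later.
    """
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()  # hashing every chunk is the CPU-heavy step
        if digest not in store:
            store[digest] = chunk  # only new chunks would be sent to the backup target
        recipe.append(digest)
    return recipe


def restore(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the original data from its recipe."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical data: four chunks, of which only two are distinct.
store: dict[str, bytes] = {}
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
recipe = deduplicate(data, store)
```

Every byte on the source has to be read and hashed, even when nothing new ends up being sent, which is why the work lands on the source server's CPU and disks.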

Not all backup software performs source deduplication. Some products perform target deduplication, which runs the deduplication process on the backup target rather than on the source server. In a software-based solution, this usually means deduplicating data on the backup server.

Target deduplication can occur inline, as data is being written to the backup server, but often the deduplication is post-process, which means that deduplication is scheduled rather than occurring in real time.
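The difference between the two modes can be sketched as follows. This is a simplified model, not a real backup server: the `BackupTarget` class, its method names and the use of SHA-256 fingerprints are assumptions for illustration.

```python
import hashlib


class BackupTarget:
    """Toy backup target that supports inline and post-process deduplication."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes (deduplicated store)
        self.landing = []  # raw chunks waiting for a scheduled pass

    def write_inline(self, chunk: bytes) -> None:
        """Deduplicate while the chunk is being written."""
        self.chunks.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)

    def write_raw(self, chunk: bytes) -> None:
        """Accept the chunk as-is; deduplication happens later."""
        self.landing.append(chunk)

    def post_process(self) -> None:
        """Scheduled pass: fold the landing area into the deduplicated store."""
        for chunk in self.landing:
            self.chunks.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
        self.landing.clear()

    def bytes_used(self) -> int:
        stored = sum(len(c) for c in self.chunks.values())
        pending = sum(len(c) for c in self.landing)
        return stored + pending


# Hypothetical backup stream: three 1 KB chunks, two of them identical.
backup = [b"x" * 1024, b"y" * 1024, b"x" * 1024]

inline = BackupTarget()
for chunk in backup:
    inline.write_inline(chunk)

post = BackupTarget()
for chunk in backup:
    post.write_raw(chunk)
peak = post.bytes_used()  # space needed before the scheduled pass runs
post.post_process()
```

In this model both targets end up storing the same amount of data, but the post-process target briefly needs room for the full, undeduplicated backup. In exchange, the write path does no hashing.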

Backup vendors are not the only ones performing software-based deduplication. Microsoft has included a native file-system deduplication feature in Windows Server 2012. This native deduplication occurs at the data source, but it is post-process. Because source-based deduplication can be resource-intensive, Microsoft has taken measures to make the deduplication process less disruptive to the server.

One measure Microsoft has taken is to delay the deduplication of new data. When a new file is created, that file is not scheduled for deduplication for several days. This prevents Windows from wasting resources deduplicating temporary files. Another measure is to allow exclusions. Compressed data, such as ZIP files or compressed media files (MP3, MPEG-2, JPEG, etc.), tends not to deduplicate very well, because it is already compressed. System resources can be conserved by configuring Windows not to deduplicate these types of files.
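The two policies described above, a minimum file age and a list of excluded file types, amount to a simple eligibility check. The sketch below is not Windows code; the three-day threshold, the extension list and the `is_eligible` function are hypothetical values chosen for illustration.

```python
from pathlib import Path

MIN_FILE_AGE_DAYS = 3  # hypothetical threshold standing in for "several days"
EXCLUDED_EXTENSIONS = {".zip", ".mp3", ".mpg", ".jpg", ".jpeg"}  # already-compressed formats
SECONDS_PER_DAY = 86400


def is_eligible(path: str, modified_at: float, now: float) -> bool:
    """Return True if a file should be handed to the deduplication job.

    modified_at and now are timestamps in seconds.
    """
    if Path(path).suffix.lower() in EXCLUDED_EXTENSIONS:
        return False  # compressed data rarely deduplicates well
    age_days = (now - modified_at) / SECONDS_PER_DAY
    return age_days >= MIN_FILE_AGE_DAYS  # skip new files that may be temporary
```

A ten-day-old document passes the check, while a file written yesterday or a ZIP archive of any age is skipped, so the job spends its CPU time only where savings are likely.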

Cloud backups

Deduplication is essential to making cloud-based backups practical. The deduplication process reduces the amount of data that must be uploaded to the cloud, which improves the speed of the backup and reduces storage and Internet bandwidth costs for the BaaS provider.

Some BaaS providers, such as Mozy and LiveDrive, use file-level deduplication (single-instance storage). The way this type of deduplication is implemented varies considerably from one provider to the next. Some providers perform file-level deduplication on a per-computer basis. In other words, if multiple copies of a file exist on a computer, only a single copy of the file is backed up.
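Per-computer single-instance storage can be sketched in a few lines. This is a generic illustration, not the implementation used by any provider named here; the `single_instance` function, the SHA-256 fingerprint and the sample file paths are assumptions.

```python
import hashlib


def single_instance(files: dict[str, bytes]) -> tuple[dict[str, bytes], dict[str, str]]:
    """Store each distinct file content once and map every path to its fingerprint."""
    blobs = {}  # fingerprint -> file content, one copy per distinct file
    index = {}  # path -> fingerprint, so every path can still be restored
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        blobs.setdefault(digest, content)
        index[path] = digest
    return blobs, index


# Hypothetical files on one computer; two of the three are identical copies.
files = {
    "C:/docs/report.docx": b"quarterly report",
    "C:/backup/report-copy.docx": b"quarterly report",
    "C:/docs/notes.txt": b"meeting notes",
}
blobs, index = single_instance(files)
```

Because the comparison is made on whole files, a one-byte change to a large file produces a new fingerprint and a full new copy. That is the main trade-off against chunk-level approaches.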

Mozy uses a slightly more sophisticated form of single-instance storage that is based on private key encryption. If multiple computers share the same private key, then single-instance storage deduplication can occur across all of those computers.
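One way to picture key-scoped deduplication is to treat the key as a namespace: files are compared only with other files backed up under the same key. The sketch below is a conceptual model and makes no claim about how Mozy actually works. The `ProviderStore` class and the sample keys are hypothetical, and the encryption step itself is omitted.

```python
import hashlib


class ProviderStore:
    """Toy provider-side store that deduplicates only within a shared key."""

    def __init__(self):
        self.blobs = {}  # (key fingerprint, content fingerprint) -> content

    def upload(self, key: bytes, content: bytes) -> bool:
        """Return True if the content had to be stored, False if it was a duplicate."""
        scope = hashlib.sha256(key).hexdigest()       # identifies the key without storing it
        digest = hashlib.sha256(content).hexdigest()
        if (scope, digest) in self.blobs:
            return False  # another computer with the same key already backed this up
        self.blobs[(scope, digest)] = content  # a real service would hold encrypted data
        return True


provider = ProviderStore()
shared_key = b"company-key"   # hypothetical key used by two office computers
other_key = b"personal-key"   # hypothetical key used by an unrelated computer
handbook = b"employee handbook contents"

first = provider.upload(shared_key, handbook)   # first office computer
second = provider.upload(shared_key, handbook)  # second office computer, same key
third = provider.upload(other_key, handbook)    # different key, so no sharing
```

In this model the second office computer uploads nothing, while the computer with a different key stores its own copy, since data held under one key cannot be matched against data held under another.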

Some BaaS providers perform deduplication behind the scenes without really exposing the process to the administrator. However, some BaaS providers, such as CrashPlan, let administrators configure the deduplication process. CrashPlan lets administrators choose the degree to which data is deduplicated. Full deduplication is CPU-intensive, but very effective in reducing the size of the data that needs to be uploaded. CrashPlan also offers less CPU-intensive, but less efficient, deduplication modes. It also has an automated setting, in which the software automatically chooses the most appropriate form of deduplication.

Part 3 of this series compares hardware and software dedupe and discusses using them together.

About the Author:
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server (IIS). Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the department of information management at Fort Knox.
