This is the second part of a nine-part series on deduplication. For easy access to all nine parts, check out our quick overview of Deduplication 2013.
Some backup applications perform source-based deduplication. This form of deduplication compacts the data on the source server prior to performing a backup. Although the deduplication process is completely transparent, deduplication, by its very nature, is CPU-intensive; it also tends to generate a considerable amount of disk I/O.
Not all backup software performs source deduplication. Some products perform target deduplication, which runs the deduplication process on the backup target rather than on the source server. In the case of a software-based solution, this usually means deduplicating data on the backup server.
Target deduplication can occur inline, as data is being written to the backup server, but often it is post-process, which means that deduplication is scheduled rather than occurring in real time.
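The difference between the two timings can be illustrated with a minimal Python sketch. It is not any vendor's implementation: the `BackupTarget` class, its method names and the use of SHA-256 digests to spot duplicates are all assumptions made for illustration. Inline writes are deduplicated as they arrive, whereas staged writes sit untouched until a scheduled pass runs.

```python
import hashlib


class BackupTarget:
    """Hypothetical backup target supporting both deduplication timings."""

    def __init__(self):
        self.chunks = {}   # digest -> data: the deduplicated store
        self.staging = []  # raw data waiting for a scheduled dedup pass

    def write_inline(self, data: bytes) -> str:
        # Inline: fingerprint the data while it is being written and
        # store it only if an identical copy is not already present.
        digest = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(digest, data)
        return digest

    def write_staged(self, data: bytes) -> None:
        # Post-process: accept the write immediately, with no dedup work.
        self.staging.append(data)

    def run_post_process(self) -> int:
        # Scheduled job: deduplicate everything that has been staged.
        processed = 0
        while self.staging:
            self.write_inline(self.staging.pop())
            processed += 1
        return processed
```

In this sketch the post-process path keeps the write itself cheap but temporarily needs space for the undeduplicated data, while the inline path pays the hashing cost during the backup window.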
Backup vendors are not the only ones who are performing software-based deduplication. Microsoft has included a native file-system deduplication feature in Windows Server 2012. This native deduplication occurs at the data source, but it is post-process. Because source-based deduplication can be resource-intensive, Microsoft has taken measures to make the deduplication process less disruptive to the server.
One of the things that Microsoft has done is to delay the deduplication of new data. When a new file is created, that file is not scheduled for deduplication for several days. This prevents Windows from wasting resources deduplicating temporary files. Another thing that Microsoft has done is to allow exclusions. Compressed data, such as ZIP files or compressed media files (MP3, MPEG-2, JPEG, etc.), tends not to deduplicate very well, because it is already compressed. System resources can be conserved by configuring Windows not to deduplicate these types of files.
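The two measures described above, a minimum file age and a list of excluded file types, amount to a simple eligibility check. The Python sketch below shows that logic only; it is not how Windows Server implements it. The function name, the five-day threshold and the extension list are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta
from pathlib import PurePath

# Hypothetical policy values; the article says only "several days".
MIN_FILE_AGE = timedelta(days=5)
EXCLUDED_EXTENSIONS = {".zip", ".mp3", ".mpg", ".jpg", ".jpeg"}


def is_eligible(path: str, last_modified: datetime, now: datetime) -> bool:
    """Return True if a file should be queued for post-process deduplication."""
    extension = PurePath(path).suffix.lower()
    if extension in EXCLUDED_EXTENSIONS:
        # Already-compressed formats rarely shrink further, so skip them.
        return False
    # Skip files that are too new; they may be temporary.
    return now - last_modified >= MIN_FILE_AGE
```

A scheduled job could call such a check on each file and deduplicate only those that pass, so CPU time and disk I/O are not spent on short-lived or already-compressed data.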
Deduplication is essential to making cloud-based backups practical. The deduplication process reduces the amount of data that must be uploaded to the cloud, which improves the speed of the backup and reduces storage and Internet bandwidth costs for the backup-as-a-service (BaaS) provider.
Some BaaS providers, such as Mozy and LiveDrive, use file-level deduplication (single-instance storage). The way in which this type of deduplication is implemented varies considerably from one provider to the next. Some providers perform file-level deduplication on a per-computer basis. In other words, if multiple copies of a file exist on a computer, then only a single copy of the file is backed up.
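Per-computer, file-level deduplication can be sketched in a few lines of Python. This is an illustrative model rather than any provider's actual code: the function name is hypothetical, files are represented as an in-memory dictionary, and SHA-256 is assumed as the way to recognize identical content.

```python
import hashlib


def single_instance_backup(files: dict) -> tuple:
    """Back up a mapping of path -> file bytes, storing each unique file once.

    Returns (store, catalog): store maps a content digest to the file data,
    and catalog maps every original path to the digest of its content.
    """
    store = {}    # digest -> data, one entry per unique file
    catalog = {}  # path -> digest, so every path can still be restored
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)  # upload only if not already stored
        catalog[path] = digest
    return store, catalog
```

If two paths hold byte-for-byte identical files, the catalog records both paths but the store holds a single copy. A file that differs by even one byte is stored in full, which is the main limitation of file-level deduplication.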
Mozy uses a slightly more sophisticated form of single-instance storage that is based on private key encryption. If multiple computers share the same private key, then single-instance storage deduplication can occur across all of those computers.
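One way to picture key-scoped deduplication is the Python sketch below. It is not Mozy's actual design, which the article does not describe. It assumes, purely for illustration, that the file fingerprint is an HMAC computed with the shared key, so that identical files match only when the computers use the same key. The `fingerprint` function and `BackupStore` class are hypothetical names.

```python
import hashlib
import hmac


def fingerprint(key: bytes, data: bytes) -> str:
    # Keyed fingerprint: same key and same data give the same value,
    # while a different key gives an unrelated value.
    return hmac.new(key, data, hashlib.sha256).hexdigest()


class BackupStore:
    """Hypothetical provider-side store shared by many computers."""

    def __init__(self):
        self.blobs = {}  # fingerprint -> stored data

    def upload(self, key: bytes, data: bytes) -> bool:
        """Store data unless a copy with the same key already exists.

        Returns True if the data had to be uploaded, False if it was
        deduplicated against an existing copy.
        """
        fp = fingerprint(key, data)
        if fp in self.blobs:
            return False
        # Assumption: a real service would store encrypted data here.
        self.blobs[fp] = data
        return True
```

Under this assumption, two computers that share a key upload a common file only once, while a computer with a different key uploads its own copy of the same file.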
Some BaaS providers perform deduplication behind the scenes without really exposing the process to the administrator. However, some BaaS providers, such as CrashPlan, let administrators configure the deduplication process. CrashPlan lets administrators choose the degree to which data is deduplicated. Full deduplication is CPU-intensive, but very effective in reducing the size of the data that needs to be uploaded. CrashPlan also offers deduplication models that are less CPU-intensive but less efficient. It also has an automated setting, in which the software automatically chooses the most appropriate form of deduplication.
Part 3 of this series compares hardware and software dedupe and discusses using them together.
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server (IIS). Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the department of information management at Fort Knox. You can visit Brien's personal website at www.brienposey.com.
This was first published in April 2013