Data deduplication technology can be implemented in two ways: in software products installed on a dedicated server or integrated into backup/archiving software. Software-based data deduplication is typically less expensive to deploy than dedicated hardware, such as the Data Domain Inc. DDX series appliances, and it should require no significant changes to the physical network. However, software-based deduplication can be more disruptive to install and more difficult to maintain. Lightweight agents are typically required on each host system (or client) that must be backed up, allowing the client to communicate with a backup server running the same software. This client/server software will need updating as new versions become available or as each host's operating environment changes over time. The disruption can be even greater if you're replacing the backup engine with an entirely new product, because the backup administrator must recreate backup job configurations, schedules and alerts from scratch. Deduplication at the source is also processing-intensive, so each host server must be sized and configured to handle the added load.
Software-based data deduplication products
EMC Corp.'s Avamar software product performs in-band deduplication at the host server (the source) using the SHA-1 algorithm. Avamar employs a central management scheme to inspect data across the entire environment, but the actual deduplication is performed at each server before the data is sent to the backup storage platform. This saves storage space at the backup target and reduces network congestion. EMC reports plans to incorporate Avamar technology into its own backup software and virtual tape library (VTL) system in the near future.
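To make the idea of hash-based deduplication at the source concrete, here is a minimal Python sketch. It is not Avamar's implementation: the fixed 4 KB chunk size, the `BackupTarget` class and the `backup_from_source` function are assumptions made up for illustration. Only the use of SHA-1 to fingerprint data comes from the description above.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real products may chunk differently


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield consecutive fixed-size pieces of data."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


class BackupTarget:
    """Hypothetical backup store that keeps one copy of each unique chunk."""

    def __init__(self):
        self.store = {}  # SHA-1 hex digest -> chunk bytes

    def has(self, digest: str) -> bool:
        return digest in self.store

    def put(self, digest: str, chunk: bytes) -> None:
        self.store[digest] = chunk


def backup_from_source(data: bytes, target: BackupTarget):
    """Fingerprint chunks on the source and send only the ones the target lacks.

    Returns the list of digests needed to rebuild the data (the "recipe")
    and the number of payload bytes that actually crossed the network.
    """
    recipe, bytes_sent = [], 0
    for chunk in chunks(data):
        digest = hashlib.sha1(chunk).hexdigest()
        if not target.has(digest):
            target.put(digest, chunk)
            bytes_sent += len(chunk)
        recipe.append(digest)
    return recipe, bytes_sent


def restore(recipe, target: BackupTarget) -> bytes:
    """Rebuild the original data from its recipe."""
    return b"".join(target.store[digest] for digest in recipe)


# Two nightly backups that share most of their content.
target = BackupTarget()
monday = b"A" * 8192 + b"B" * 4096
tuesday = b"A" * 8192 + b"C" * 4096
monday_recipe, monday_sent = backup_from_source(monday, target)
tuesday_recipe, tuesday_sent = backup_from_source(tuesday, target)
print(monday_sent, tuesday_sent, len(target.store))
```

In this toy run, the second backup sends only the one chunk that changed, which is the source of both the storage savings and the reduced network traffic described above. The cost is that every chunk is hashed on the host being backed up.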
Symantec Corp. provides software-based deduplication in its Veritas NetBackup product through a feature called PureDisk, which uses a proprietary hash algorithm to perform deduplication inline at each host server. NetBackup PureDisk 6.2 supports tape targets and the Backup Reporter monitoring tool. NetBackup 6.5 offers even better integration and support for deduplication, VTL and third-party appliances.
Sepaton Inc. implements deduplication using DeltaStor software, an option on its S2100-ES2 VTL hardware product. Like PureDisk, DeltaStor uses a proprietary hash algorithm, but the S2100 deduplicates data at the VTL (the storage target). This means backup traffic is sent to the VTL before deduplication is performed, so there is no reduction in network traffic. DeltaStor also works differently than other deduplication schemes. Where other schemes write the first iteration of data in full and give later iterations pointers back to it, DeltaStor writes the latest version in full and replaces the previous iterations with pointers. This technique, called forward referencing, promises faster restores.
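The difference between the two pointer schemes can be sketched with a simplified model. The Python below is an illustration only and does not reflect how DeltaStor is actually built: the `build_store` and `restore` functions, the in-memory layout and the "hops" counter are all hypothetical. It treats each backup as a list of chunks and counts how many pointers must be followed to restore a given version.

```python
def build_store(versions, newest_first):
    """Store each version as a mix of real data and pointers.

    newest_first=False models the conventional scheme: the oldest copy of a
    chunk holds the data and later versions point back to it.
    newest_first=True models forward referencing: the newest copy holds the
    data and older versions point forward to it.
    """
    order = range(len(versions))
    if newest_first:
        order = reversed(order)
    owner = {}  # chunk -> (version index, chunk index) holding the real bytes
    store = [None] * len(versions)
    for v in order:
        entries = []
        for i, chunk in enumerate(versions[v]):
            if chunk in owner:
                entries.append(("ref", owner[chunk]))
            else:
                owner[chunk] = (v, i)
                entries.append(("data", chunk))
        store[v] = entries
    return store


def restore(store, v):
    """Rebuild version v and count how many pointers had to be followed."""
    out, hops = [], 0
    for kind, value in store[v]:
        if kind == "ref":
            ref_version, ref_index = value
            value = store[ref_version][ref_index][1]
            hops += 1
        out.append(value)
    return out, hops


# Three successive backups of the same small data set.
versions = [
    [b"a", b"b", b"c"],
    [b"a", b"b", b"d"],
    [b"a", b"e", b"d"],
]
backward = build_store(versions, newest_first=False)
forward = build_store(versions, newest_first=True)

latest = len(versions) - 1
_, backward_hops = restore(backward, latest)
_, forward_hops = restore(forward, latest)
print("pointers followed to restore latest:", backward_hops, "vs", forward_hops)
```

In this model the most recent backup, which is usually the one being restored, is read straight through under forward referencing, while older versions take on the pointer lookups instead.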
Compression, encryption and data deduplication
One of the stickiest issues with data deduplication is its relationship to compression and encryption. Traditional compression works by eliminating redundancy within files, and deduplication eliminates redundant files, blocks or bits. Encryption, by contrast, turns data into a stream that is effectively random. So if you encrypt data first, it may be impossible to compress or deduplicate it. Ideally, data should be compressed and deduplicated first, and then encrypted as needed. This isn't difficult when compression and deduplication are performed at the host server using backup software, and the resulting data stream is encrypted on the way to the backup target using a dedicated appliance or at the tape library or LTO-4 drive. However, it may present difficulties when deduplicating at the target storage system. For example, if the backup data is encrypted by an inline appliance and then sent to a deduplication-capable storage system like the Sepaton S2100, it may be impossible to further compress or deduplicate the encrypted data.
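A short experiment shows why the order of operations matters. The Python sketch below uses `zlib` for compression and SHA-1 chunk fingerprints for deduplication. Because the Python standard library has no real cipher, `toy_encrypt` is a made-up stand-in that XORs the data with a SHA-256 counter keystream. It is not secure and is not what any encryption appliance does, but its output looks random in the same way real ciphertext does. The 4 KB chunk size and all function names are assumptions.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed deduplication chunk size


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream.

    Stand-in for a real cipher, for illustration only. NOT secure.
    Applying it twice with the same key returns the original data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        block = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(block, pad))
    return bytes(out)


def unique_chunks(data: bytes, size: int = CHUNK_SIZE) -> int:
    """Count distinct chunk fingerprints, i.e. chunks a deduplicator must keep."""
    return len({
        hashlib.sha1(data[i:i + size]).digest()
        for i in range(0, len(data), size)
    })


# Highly redundant backup data: the same 4 KB block repeated 32 times.
block = bytes(range(256)) * 16
plain = block * 32
key = b"example key"

encrypted_first = toy_encrypt(plain, key)

# Deduplication: one unique chunk before encryption, none shared after.
print("unique chunks, plain:    ", unique_chunks(plain))
print("unique chunks, encrypted:", unique_chunks(encrypted_first))

# Compression: compare the two possible orderings.
compress_then_encrypt = toy_encrypt(zlib.compress(plain), key)
encrypt_then_compress = zlib.compress(encrypted_first)
print("compress, then encrypt:", len(compress_then_encrypt), "bytes")
print("encrypt, then compress:", len(encrypt_then_compress), "bytes")
```

With this test data, compressing before encrypting yields a small output, while encrypting first leaves the compressor and the deduplicator with nothing to remove.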