A comparison of data compression and data deduplication technologies for SMBs

The term "data deduplication" is often applied to technologies that aren't really deduplication at all. In this tip, we'll discuss the different types of data reduction, the pros and cons of each, and offer recommendations for SMB storage environments.

Marc Staimer, Dragon Slayer Consulting

Published: 09 Mar 2009

Data deduplication is one of the hottest technologies in data reduction today. But the term "data deduplication" can be confusing because it's often used to describe technologies that aren't really deduplication at all. There are five primary types of data reduction: hardware and software compression; file deduplication; block/variable block deduplication; delta block optimization; and application-aware data reduction. In this article, we'll explain the different types of data reduction, the pros and cons of each, and offer recommendations for SMB storage environments.

Hardware and software compression

Hardware and software compression is transparent to applications and storage hardware. It uses an algorithm to reduce the size of files by eliminating redundant bits. However, if the files have been stored multiple times, no matter how good the compression algorithm is, there will be multiple copies of the compressed files.

Compression typically provides an aggregate data reduction ratio of approximately 1.2:1 to as high as 10:1, depending on the type of data. Unfortunately, compression has little effect on files that are already compressed, such as Microsoft Office files, .pdfs, .jpegs, .mpegs and .zip files. The best results are achieved when compression is applied to backup or secondary data that is not already compressed.
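The difference between compressible and already compressed data can be seen in a minimal sketch. It assumes Python's standard zlib module as a stand-in for whatever algorithm a given product uses, and uses random bytes as a stand-in for already compressed content; the sample data and the compression_ratio helper are illustrative, not taken from any vendor.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 9))


# Highly repetitive text stands in for uncompressed business data.
repetitive = b"quarterly sales report, region north\n" * 2000
# Random bytes stand in for data that is already compressed (assumption).
incompressible = os.urandom(len(repetitive))

print(f"repetitive data: {compression_ratio(repetitive):.1f}:1")
print(f"incompressible:  {compression_ratio(incompressible):.2f}:1")
```

Note that compressing two stored copies of the same file this way still yields two compressed copies, which is the gap deduplication is meant to close.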

File-level deduplication

File-level deduplication eliminates multiple copies of the same file. Redundant files are replaced with a pointer to the unique version of the file. File deduplication typically provides an aggregate deduplication ratio range of approximately 2:1 to as high as 10:1 (data type dependent). File-level deduplication is a coarse-grain type of dedupe, so it doesn't reduce files that may change only slightly from previous versions. File deduplication fits best in content-addressable storage (CAS) where files cannot be altered, in backup or secondary data, and in many remote offices or branch offices (ROBOs).
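The pointer mechanism described above can be sketched as follows. This is a toy in-memory model, not any vendor's implementation: the FileStore class, its method names and the use of SHA-256 as the file fingerprint are all assumptions made for illustration.

```python
import hashlib


class FileStore:
    """Toy file-level deduplicating store (hypothetical, in-memory)."""

    def __init__(self):
        self.contents = {}  # digest -> the single stored copy of a file
        self.pointers = {}  # file name -> digest of its content

    def put(self, name: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        self.contents.setdefault(digest, data)  # keep only the first copy
        self.pointers[name] = digest            # later copies become pointers

    def get(self, name: str) -> bytes:
        return self.contents[self.pointers[name]]

    def stored_bytes(self) -> int:
        return sum(len(d) for d in self.contents.values())


store = FileStore()
report = b"annual report " * 1000
store.put("alice/report.doc", report)
store.put("bob/report.doc", report)              # exact duplicate: pointer only
store.put("carol/report-v2.doc", report + b"!")  # one byte differs: stored in full
print(len(store.pointers), "files,", len(store.contents), "unique copies")
```

The third file shows the coarse-grained limitation: a one-byte change defeats the match, so the whole file is stored again.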

File-level deduplication is primarily available from network-attached storage (NAS) vendors, including NetApp Inc.'s Ontap and Sun Microsystems Inc.'s ZFS, and from CAS vendors, including Active Circle, Bycast Inc., Caringo Inc., EMC Corp., Hewlett-Packard (HP) Co. and Permabit Technology Corp.

Block/variable block deduplication

Block/variable block deduplication eliminates redundant or duplicate data by retaining just one unique instance or copy of each block or chunk of data. Redundant data is replaced with a pointer to the unique copy of the block. Block/variable block deduplication typically provides an aggregate deduplication ratio of approximately 3:1 to as high as 80:1, depending on data type and scalability. Block/variable block deduplication fits best for archive data and ROBOs.
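A minimal sketch of the block-level approach follows, assuming fixed 4 KB blocks and SHA-256 fingerprints. The BlockStore class and BLOCK_SIZE value are hypothetical; variable block products choose chunk boundaries from the content itself, which this sketch does not attempt.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


class BlockStore:
    """Toy block-level deduplicating store (hypothetical, in-memory)."""

    def __init__(self):
        self.blocks = {}   # digest -> the single stored copy of a block
        self.recipes = {}  # file name -> ordered list of block digests

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store unseen blocks only
            recipe.append(digest)                  # pointer to the unique block
        self.recipes[name] = recipe

    def get(self, name: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.recipes[name])

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())


store = BlockStore()
v1 = os.urandom(BLOCK_SIZE * 10)
# Second version: only the fourth block is overwritten in place.
v2 = v1[:BLOCK_SIZE * 3] + os.urandom(BLOCK_SIZE) + v1[BLOCK_SIZE * 4:]
store.put("monday.bak", v1)
store.put("tuesday.bak", v2)
logical = len(v1) + len(v2)
print(f"ratio: {logical / store.stored_bytes():.2f}:1")
```

Fixed blocks work here because the change overwrites data in place. An insertion that shifts every later byte would misalign all following blocks, which is the case variable block schemes are designed to handle.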

Both types of deduplication become increasingly efficient as the same data gets backed up or archived multiple times to the data repository. More data means higher data reduction ratios and more value. Additional value comes from longer backup/archive disk retention periods, which allow faster recovery time objectives (RTOs) and further decrease costs by reducing or eliminating tape backups.
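The way ratios climb with repeated backups can be illustrated with simple arithmetic. The sketch below assumes a full backup is taken each time and that a fixed fraction of the data changes between backups; the 2% change rate and the dedupe_ratio helper are illustrative assumptions, not figures from this article.

```python
def dedupe_ratio(backups: int, change_rate: float) -> float:
    """Logical data sent divided by unique data stored.

    Sizes are expressed in units of one full backup.
    """
    logical = backups                            # every full backup is sent whole
    physical = 1 + (backups - 1) * change_rate   # first full plus later changes
    return logical / physical


for n in (1, 7, 30, 90):
    print(f"{n:3d} backups retained: {dedupe_ratio(n, 0.02):.1f}:1")
```

Under these assumptions the ratio rises with every retained backup, which is why longer disk retention periods and higher reduction ratios tend to go together.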

A downside to both deduplication methods is the read/write performance penalty that shows up when they are used with primary data storage. On writes, the deduplication database must check incoming data against what has already been written before it completes the write, which adds noticeable latency. On reads, the deduplication database must reconstitute the files to full hydration, which again adds very noticeable latency. This is why dedupe is best suited for secondary or backup data.

However, some implementations of deduplication are post-processing, which means deduplication occurs after the data has been written. FalconStor Software's SIR, Sepaton Inc.'s virtual tape library (VTL) and NetApp's VTL all use this method. This methodology eliminates much of the write latency penalty, but does nothing for the read penalty.
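The trade-off can be sketched in a few lines. This is a hypothetical model, not how any of the named products work internally: the PostProcessStore class, its landing area and the dedupe_pass method are assumptions used to show why the write path gets shorter while the read path does not.

```python
import hashlib


class PostProcessStore:
    """Toy post-processing deduplication model (hypothetical)."""

    def __init__(self):
        self.landing = []  # raw blocks, written without any index lookup
        self.unique = {}   # digest -> block, populated by the later pass
        self.layout = []   # ordered digests describing deduplicated data

    def write(self, block: bytes) -> None:
        # No fingerprinting or index check here, so the write returns quickly.
        self.landing.append(block)

    def dedupe_pass(self) -> None:
        # Runs after the fact, for example once the backup window has closed.
        for block in self.landing:
            digest = hashlib.sha256(block).hexdigest()
            self.unique.setdefault(digest, block)
            self.layout.append(digest)
        self.landing.clear()

    def read(self) -> bytes:
        # Deduplicated data must still be reassembled from unique blocks.
        deduped = b"".join(self.unique[d] for d in self.layout)
        return deduped + b"".join(self.landing)


store = PostProcessStore()
for block in (b"A" * 8, b"B" * 8, b"A" * 8):
    store.write(block)
before = store.read()
store.dedupe_pass()
after = store.read()
print(len(store.unique), "unique blocks,", "data intact:", before == after)
```

One consequence of this design is that the landing area needs enough capacity to hold data in its full, unreduced form until the pass runs.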

Block/variable block deduplication can be found in VTLs, NAS appliances or in backup software:

VTLs

Copan
Data Domain
Dell
EMC
FalconStor
HP
Quantum
Sepaton

NAS

Copan
Data Domain
EMC
ExaGrid
Dell
FalconStor
Quantum
NEC HYDRAstor
NetApp

Backup Software

Asigra Televaulting
CommVault Simpana
EMC Avamar
Symantec NetBackup PureDisk

Delta block optimization

Delta block optimization is designed to reduce the amount of data backed up from the source and the amount of data stored. When the most recent version of a file that has already been backed up is backed up again, the software attempts to figure out which blocks are new. Then it writes only these blocks to backup and ignores the blocks in the file that haven't changed.
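The "figure out which blocks are new" step can be sketched by comparing per-block fingerprints against those recorded at the previous backup. The fixed 4 KB block size, the SHA-256 fingerprints and the function names below are assumptions for illustration; actual backup software may track changes differently.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size


def block_digests(data: bytes) -> list:
    """Fingerprint every block of a file, in order."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def changed_blocks(previous_digests: list, current: bytes) -> dict:
    """Return {block index: block bytes} for new or modified blocks only."""
    delta = {}
    for index, digest in enumerate(block_digests(current)):
        is_new = index >= len(previous_digests)
        if is_new or previous_digests[index] != digest:
            offset = index * BLOCK_SIZE
            delta[index] = current[offset:offset + BLOCK_SIZE]
    return delta


v1 = bytes(range(256)) * 16 * 8                  # eight 4 KB blocks
v2 = (v1[:2 * BLOCK_SIZE] + b"\x00" * BLOCK_SIZE  # block 2 overwritten
      + v1[3 * BLOCK_SIZE:] + b"new data")        # short block 8 appended
delta = changed_blocks(block_digests(v1), v2)
print("blocks to back up:", sorted(delta))
```

Only the blocks in the delta are sent to backup. A real tool would also need to record the new file length so that truncated files restore correctly, which this sketch omits.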

This technique has a similar shortcoming to file deduplication and compression though. If two users sitting in the same office or two separate servers have identical copies of the same file or identical blocks, then delta block optimization will create two identical backups instead of storing just one.

Application-aware data reduction

Application-aware data reduction eliminates duplicate storage objects within and between different files. It's designed for and best suited to primary data storage. It works as a post-process: after the files are initially written, it reads them and expands them from their compressed formats (.pdf, .jpeg, Microsoft Office, .mpeg, .zip, etc.). It then looks for and eliminates common storage objects across all of the files, and optimizes and recompresses the files.

So if a .jpeg image exists on its own and is also inserted in both a Word document and a PowerPoint presentation, only one of the three copies of the image is stored. A reader on the user's PC, server or NAS head eliminates any noticeable read latency penalty. Application-aware data reduction ratios typically range from 4:1 to 10:1, which is usually two to five times greater than other data reduction technologies when used on primary data storage.
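The idea of expanding container formats and matching the objects inside them can be sketched with Python's standard zipfile module. The sketch assumes zip-based containers as a stand-in for compressed document formats; the make_container and unique_objects helpers, the member names and the random bytes standing in for an image are all hypothetical, and this is not a description of how any shipping product works.

```python
import hashlib
import io
import os
import zipfile


def make_container(members: dict) -> bytes:
    """Build an in-memory zip container from {member name: payload}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def unique_objects(containers: list) -> tuple:
    """Expand each container; keep one copy of every distinct member."""
    seen = {}
    total = 0
    for blob in containers:
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            for name in zf.namelist():
                payload = zf.read(name)  # decompressed member content
                total += 1
                seen.setdefault(hashlib.sha256(payload).hexdigest(), payload)
    return seen, total


image = os.urandom(20000)  # stand-in for a .jpeg payload
document = make_container({
    "word/document.xml": b"<doc>quarterly results</doc>",
    "word/media/image1.jpeg": image,
})
deck = make_container({
    "ppt/slides/slide1.xml": b"<slide>quarterly results</slide>",
    "ppt/media/image1.jpeg": image,
})
objects, total = unique_objects([document, deck])
print(total, "embedded objects,", len(objects), "unique")
```

The two containers differ as whole files, so file-level matching would store both in full; looking inside them is what exposes the shared image.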

Application-aware data deduplication is currently only available from Ocarina as an appliance (also resold by BlueArc and HP).

Data reduction recommendations for SMBs

So how does an SMB know which of these technologies to implement, if any? It depends on the data reduction solution, how it is implemented and what is currently in place.

But doing an apples-to-apples comparison of the different data reduction technologies, implementations and vendors can be a daunting task because they come in so many different packages. Compression is available as a hardware appliance (examples include those from Hifn Inc. and Storwize) and is sometimes available as a hardware option on storage systems (NetApp).

About the author: Marc Staimer is the founder, senior analyst, and CDS of Dragon Slayer Consulting in Beaverton, OR. The consulting practice of 11 years has focused in the areas of strategic planning, product development, and market development. With over 28 years of marketing, sales and business experience in infrastructure, storage, server, software and virtualization, he's considered one of the industry's leading experts. Marc can be reached at [email protected].
