In my previous article, I identified the different types of data reduction and data deduplication technologies, including hardware or software compression, file deduplication, block/variable block deduplication, delta-block optimization, and application-aware deduplication. I also discussed how data reduction and dedupe can be implemented in-line or as post-processing, and in backup software, network-attached storage (NAS) appliances or virtual tape libraries (VTLs). This tip examines the strengths and weaknesses of these data deduplication and data reduction technologies.
Hardware compression or software compression
Compression is a good choice for data that is uncompressed and unencrypted. It is also useful for extending the life of older storage systems. Hardware compression tends to be faster than software compression, with measurably lower latency per transaction. However, software compression, such as LZO, can be downloaded at no cost.
Neither type of compression is very good at reducing duplicate data. If the files have been stored multiple times, there will be multiple copies of the compressed files, no matter how good the compression algorithm is.
Compression is also not a very effective choice when the data or files are already encrypted or compressed (such as Microsoft Excel, Word and PowerPoint files, PDFs, JPEGs, MPEGs, Zip files, compressed data streams, and even some databases). The data reduction benefits in these cases are negligible. It's like attempting to zip a Zip file: sometimes the end result is larger than the original.
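The "zip a Zip file" effect is easy to reproduce. The following is a minimal sketch using Python's standard zlib module, not any vendor's compression engine. The sample text, the vocabulary and the seeded random bytes (standing in for encrypted data) are assumptions made up for illustration.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical "document": ordinary text built from a small vocabulary.
vocabulary = ["backup", "restore", "volume", "snapshot", "tape", "disk",
              "server", "block", "file", "archive", "policy", "window"]
plain_text = " ".join(rng.choice(vocabulary) for _ in range(40_000)).encode()

# Stand-in for encrypted data: bytes with no exploitable pattern.
random_like = bytes(rng.getrandbits(8) for _ in range(100_000))

compressed_once = zlib.compress(plain_text, 9)
compressed_twice = zlib.compress(compressed_once, 9)  # "zipping a Zip file"
compressed_random = zlib.compress(random_like, 9)


def ratio(original: bytes, reduced: bytes) -> float:
    """Reduction ratio: original size divided by reduced size."""
    return len(original) / len(reduced)


print(f"plain text:         {ratio(plain_text, compressed_once):.2f}:1")
print(f"already compressed: {ratio(compressed_once, compressed_twice):.2f}:1")
print(f"encrypted-like:     {ratio(random_like, compressed_random):.2f}:1")
```

The first ratio should be well above 1:1, while the other two should hover around 1:1 or dip slightly below it, which is the "larger than the original" case.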
File deduplication
File deduplication is very strong at reducing duplicate VMware .vmdk files, especially ISO and template files such as virtual desktop infrastructure (VDI) templates. It is also strong in content-addressable storage (CAS), where files must carry proof (usually a hash) that they haven't changed since they were stored, for compliance reasons. Because it limits data reduction to identical files only, it preserves each unique file in its entirety.
File deduplication is not the best choice for storage reduction. It does not reduce duplicate data that resides in multiple different files or in files that vary slightly from the original. There is also a rehydration penalty when the deduped data is read or restored.
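To make the "identical files only" limitation concrete, here is a minimal sketch of single-instance storage keyed by a SHA-256 hash of the whole file. The class name, method names and file names are hypothetical, and real products add metadata handling, integrity checks and persistence that are omitted here.

```python
import hashlib


class FileDedupeStore:
    """Toy single-instance store: one copy is kept per unique file content."""

    def __init__(self):
        self.contents = {}  # SHA-256 digest -> file bytes (stored once)
        self.catalog = {}   # file name -> digest of its content

    def put(self, name: str, data: bytes) -> bool:
        """Store a file; return True only if new content had to be written."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.contents
        if is_new:
            self.contents[digest] = data
        self.catalog[name] = digest
        return is_new

    def get(self, name: str) -> bytes:
        """Read a file back by following its catalog entry."""
        return self.contents[self.catalog[name]]

    def stored_bytes(self) -> int:
        return sum(len(data) for data in self.contents.values())


template = b"golden VDI template image " * 1000  # made-up file content

store = FileDedupeStore()
store.put("desktop-01.vmdk", template)
store.put("desktop-02.vmdk", template)         # identical: no new storage
store.put("desktop-03.vmdk", template + b"!")  # one extra byte: stored in full

print(len(store.catalog), "files,", len(store.contents), "unique contents")
print("bytes stored:", store.stored_bytes())
```

The third file differs from the template by a single byte, yet the whole file is stored again. That is the gap block-level approaches aim to close.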
Block/variable block deduplication
Block/variable block deduplication is extremely effective at reducing backed up, replicated or snapshot data. As the amount of deduped data increases within the datastore, so do the data deduplication ratios. Systems that scale well in raw capacity (not just in projected effective capacity) will provide the greatest value. Block deduplication also allows more data to remain on disk before it has to be moved off to archive targets, which makes recovery of that data much faster.
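The way ratios climb as more backups accumulate can be shown with simple arithmetic. This is a simplified model rather than a measurement: it assumes repeated full backups of the same data set, a fixed fraction of blocks changing between backups, and perfect detection of unchanged blocks. The 1,000 GB size and 2% change rate are hypothetical.

```python
def dedupe_ratios(full_size_gb: float, change_rate: float, backups: int) -> list:
    """Cumulative dedupe ratio (logical / physical) after each full backup."""
    logical = 0.0   # what the backup application believes it has written
    physical = 0.0  # what actually lands on disk after deduplication
    ratios = []
    for n in range(backups):
        logical += full_size_gb
        # The first backup is all new data; later ones add only changed blocks.
        physical += full_size_gb if n == 0 else full_size_gb * change_rate
        ratios.append(logical / physical)
    return ratios


ratios = dedupe_ratios(full_size_gb=1000, change_rate=0.02, backups=10)
for number, value in enumerate(ratios, start=1):
    print(f"after backup {number:2d}: {value:.2f}:1")
```

Under these assumptions the ratio starts at 1:1 and reaches roughly 8.5:1 after ten backups, since 10,000 GB of logical data occupies about 1,180 GB on disk.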
Another strength is reducing duplicate VMware .vmdk files (ISO, template files, VDI) and duplicate data between .vmdk files.
However, data reduced with block/variable block deduplication must first be rehydrated when it is moved to other types of storage, such as tape, optical or even another target disk storage device. Rehydration also adds noticeable time to the restoration of backup or snapshot data. Though there are exceptions when performance is irrelevant, the additional latency usually makes it unacceptable as primary storage for many applications because of the slower response times.
Additionally, like file deduplication, block/variable block deduplication is not very effective with already compressed and/or encrypted data, and will not dedupe data it does not see.
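The mechanics of block deduplication and rehydration can be sketched as follows. This illustration uses fixed 4 KB blocks and SHA-256 fingerprints, both of which are assumptions; variable block products choose block boundaries from the content instead. The class and variable names are hypothetical, and the seeded random bytes are only a convenient way to get blocks that differ from one another.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


class BlockStore:
    """Toy block-level dedupe store with rehydration."""

    def __init__(self):
        self.blocks = {}  # block digest -> block bytes (stored once)

    def write(self, data: bytes) -> list:
        """Split data into blocks, keep unseen ones, return the recipe."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe

    def rehydrate(self, recipe: list) -> bytes:
        """Rebuild the original data by looking up every block in order."""
        return b"".join(self.blocks[digest] for digest in recipe)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
monday = bytes(rng.getrandbits(8) for _ in range(4 * BLOCK_SIZE))
tuesday = bytearray(monday)
tuesday[-1] ^= 0xFF  # a one-byte change in the last block
tuesday = bytes(tuesday)

store = BlockStore()
monday_recipe = store.write(monday)
tuesday_recipe = store.write(tuesday)

print("blocks written by the backups:", len(monday_recipe) + len(tuesday_recipe))
print("unique blocks kept on disk:   ", len(store.blocks))
```

Only the one changed block is stored a second time, whereas file deduplication would have kept both backups in full. The rehydrate step is where the read penalty comes from: every restore or copy to tape has to look up and reassemble each block.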
Delta block optimization
Delta block optimization reduces the amount of data being backed up or snapped. This reduces the amount of storage and bandwidth needed for ongoing data protection.
What it doesn't do is reduce the blocks of data that may be duplicated across different servers or even files, even when they are backed up by the same software and stored in the same data store. Some backup software also carries artifacts in its code from having originally been written for tape backup. It may require a periodic full-volume or synthetic full (virtual full-volume) backup, significantly reducing the overall benefit.
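The difference between delta tracking and deduplication can be sketched in a few lines. This is a hedged illustration, not how any particular backup product tracks changes: the 4 KB block size, the function names and the sample volumes are assumptions, and many products use change journals or bitmaps rather than re-hashing the volume.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def block_digests(volume: bytes) -> list:
    """Fingerprint every fixed-size block of a volume."""
    return [hashlib.sha256(volume[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(volume), BLOCK_SIZE)]


def changed_blocks(previous_digests: list, volume: bytes) -> dict:
    """Return {block index: block bytes} for blocks that differ from last time."""
    changes = {}
    for index, digest in enumerate(block_digests(volume)):
        if index >= len(previous_digests) or previous_digests[index] != digest:
            start = index * BLOCK_SIZE
            changes[index] = volume[start:start + BLOCK_SIZE]
    return changes


# Hypothetical volume on server A: eight distinct blocks.
monday = b"".join(bytes([n]) * BLOCK_SIZE for n in range(8))
tuesday = bytearray(monday)
tuesday[2 * BLOCK_SIZE] ^= 0xFF  # modify one byte inside block 2
tuesday = bytes(tuesday)

delta = changed_blocks(block_digests(monday), tuesday)
print("server A, second backup sends blocks:", sorted(delta))

# Server B holds identical data but has no backup history of its own,
# so every block is sent even though the data store already has them all.
server_b_first = changed_blocks([], monday)
print("server B, first backup sends blocks: ", len(server_b_first))
```

Server A's second backup moves a single block, but server B's first backup moves all eight, because the comparison is made against each server's own history rather than against everything in the data store.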
Application-aware deduplication
Application-aware deduplication is the most effective data reduction for primary storage because it's so successful with compressed files while adding only nominal file rehydration latency. However, this technology requires a "reader" or filter that runs either on the application or in the NAS to keep file rehydration latency nominal. It also requires an appliance that looks at the data store, performs all of the application-aware dedupe after data is first written to the data store, and then provides the ongoing metadata.
In-line deduplication
In-line dedupe requires less storage than post-processing deduplication because it dedupes the data before it is written to the data store.
Unfortunately, its maximum performance tends to decrease as the target approaches its limits, because as the data store grows, so does the database that incoming data is compared against. Eventually, there are diminishing marginal returns, where the performance degradation is greater than the increased benefits of a single data store. The one exception to this is the NEC Corp. HYDRAstor grid architecture, which allows processing nodes to be added independently of capacity.
Post-processing deduplication
Post-processing dedupe allows streaming data to be accepted by the data store at much higher speeds than in-line. This is especially effective for virtual tape libraries where the backup window (performance) is more important than the extra savings in storage. It also allows application-aware dedupe to work with primary storage and have zero impact on write performance. The downside is that all of the data must first be written before it is deduplicated, meaning more disk storage is required than with in-line dedupe.
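The storage trade-off between the two approaches can be sketched by tracking peak disk usage for the same incoming stream. This is a simplified model under stated assumptions: the function names and the stream are hypothetical, the post-processing landing area is assumed to hold the whole stream before deduplication begins, and ingest speed (where post-processing has the advantage) is not modeled at all.

```python
import hashlib


def inline_ingest(blocks: list) -> tuple:
    """Dedupe before writing: only unseen blocks ever reach the disk."""
    store = {}
    used = peak = 0
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest not in store:
            store[digest] = block
            used += len(block)
        peak = max(peak, used)
    return peak, used  # (peak bytes on disk, final bytes on disk)


def post_process_ingest(blocks: list) -> tuple:
    """Write everything first, then dedupe the landed data afterwards."""
    landing_area = list(blocks)  # the full stream is written as-is
    peak = sum(len(block) for block in landing_area)
    store = {}
    for block in landing_area:   # later pass: keep one copy of each block
        store.setdefault(hashlib.sha256(block).digest(), block)
    final = sum(len(block) for block in store.values())
    return peak, final


# Hypothetical stream: ten backups of the same five distinct 4 KB blocks.
distinct = [bytes([n]) * 4096 for n in range(5)]
stream = distinct * 10

inline_peak, inline_final = inline_ingest(stream)
post_peak, post_final = post_process_ingest(stream)
print("in-line:         peak", inline_peak, "final", inline_final)
print("post-processing: peak", post_peak, "final", post_final)
```

Both approaches end at the same 20,480 bytes, but post-processing needs 204,800 bytes of disk at its peak in this example, because the full stream has to land before any of it is reduced.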
Backup software
Deduplication can be found in six different backup software packages: Asigra Inc.'s Televaulting, CommVault Simpana, EMC Corp.'s Avamar, Hewlett-Packard (HP) Co.'s Data Protector, IBM Corp.'s Tivoli Storage Manager (TSM) and Symantec Corp.'s NetBackup PureDisk. These packages allow any data storage to be the target storage and do not require specialized high-priced dedupe storage or appliances. This means fewer storage vendors and less hardware or fewer systems to manage.
While the deduplication capabilities are typically included in the software with little to no license fees, the downside is that they tend to create backup software lock-in (it is difficult to migrate data between backup software products). In some cases, a different agent has to be installed. Asigra's offering is the exception: it requires no agents, but it does require either a physical or virtual server appliance on each LAN segment to be backed up.
NAS appliances/target storage
Deduplication is best known in the form of a NAS appliance or NAS system. The key advantage is that this type of storage target can be added to the current backup infrastructure with minimal changes while delivering immediate benefit. It tends to work with most of the primary backup, server replication and snapshot software packages on the market today (with the exception of backup software that already deduplicates and/or encrypts its data).
The downside is the inverse of the backup software downside: it tends to create lock-in to the dedupe storage hardware vendor. Migrating data to other data stores is complicated. The primary storage vendor is often not the dedupe storage vendor, which adds another system and vendor to manage.
Virtual tape libraries
Virtual tape library (VTL) deduplication systems are ideal candidates for environments that back up to tape today but require faster backups and restores at a lower cost. VTL deduplication makes that possible with zero changes to the backup methodologies.
The downsides may include an inability to export to real tape. VTL dedupe also can't save much in actual tapes because deduplicated data must be rehydrated before it is moved to actual tape (Sepaton's product doesn't have this issue because it scales very high and doesn't export to tape). There is also a lack of integration with all of the media servers and, overall, VTL dedupe is a temporary solution.
About the author: Marc Staimer is the founder, senior analyst, and CDS of Dragon Slayer Consulting in Beaverton, OR. The consulting practice of 11 years has focused in the areas of strategic planning, product development, and market development. With over 28 years of marketing, sales and business experience in infrastructure, storage, server, software, and virtualization, he's considered one of the industry's leading experts. Marc can be reached at firstname.lastname@example.org.