I see parallels between the way "data deduplication" is being used and the way continuous data protection (CDP) was used. CDP was all the rage for about a year or two, but then people came to the realization that what was important was granular recovery, not CDP. CDP was just one way to get there. When it comes to managing exploding storage growth cost-effectively, what people care about is optimizing the use of their physical storage capacity to reduce cost, not necessarily "deduplication." So do we need a broader term?
The need to optimize physical storage capacity is not new; it's just more pressing than ever these days. Technologies designed to do this have been available for years and are probably already in use in most of your shops. Let's review some of them:
- Data compression. Enterprises have been using compression technology to save space on physical tapes from vendors like IBM Corp., Quantum (ADIC), Spectra Logic Corp. and Sun Microsystems Inc. for decades. Many applications, such as Oracle and other databases, already compress data before it is stored. Even Microsoft Office 2007 now compresses most files before it stores them.
- Enhanced, real-time compression. Pioneered in 2005 by Storwize, this approach compresses data in real time as it is stored using enhanced but proprietary compression algorithms, and achieves capacity optimization ratios that are three times to five times what standard Lempel-Ziv-based compression algorithms achieve.
- Incremental and differential backups. For most enterprises, primary storage does not change much -- likely less than 5% day to day. If you're backing that data up in full every day, then you're backing up a lot of data that didn't change. Why? This is the question that drove the development of incremental and differential backup strategies. These capabilities are supported by all major enterprise backup software, and once the ability to create "synthetic full backups" without operator intervention hit the market, these approaches became much more usable since they no longer complicated restore and recovery procedures.
- Copy-on-write snapshots. As disk has become more affordable, the use of snapshots has increased for data protection, test and development, and other regular IT operations. When successive snapshots are used for backup (e.g., snapshots taken of an application state every four hours), copy-on-write technology records only the delta between the older snapshot and the newer one, relying on reference pointers to assemble the rest of the later snapshot if and when it is needed. Especially in data protection operations, where so much data is redundant anyway, copy-on-write snapshots can save a lot of disk space. Most systems vendors, including Dell, Hewlett-Packard (HP) Co., IBM, and Sun, support copy-on-write snapshots through an integrated volume manager that ships with their system platform. Most storage appliance vendors, including DataCore Software Corp., EMC Corp. (RecoverPoint), FalconStor Software, IBM (SAN Volume Controller), and InMage Systems Inc., support them, and most storage array vendors, including EMC, HP, Hitachi Data Systems (HDS), and IBM, do as well. Scale-out secondary storage platform providers such as NEC, Permabit Technology Corp., and Tarmin Technologies Ltd. also support space-efficient snapshot technologies on their massively scalable file system platforms.
- File-level single instancing. This approach has been used for years in archive products such as EMC Centera, Hitachi Content Archive Platform, Symantec Corp. Enterprise Vault and many others. As files are stored, they are fingerprinted using hash algorithms and the fingerprints are retained in an index. As new files are stored, their fingerprints are compared against those in the index, and when duplicates are found, they are replaced with pointers. Note, however, that if one bit in a file is changed, single-instancing algorithms view it as a unique file and store it in its entirety. Single instancing can be thought of as an older, less efficient version of data deduplication. It is generally included at no additional charge with the platforms on which it is available, and when other capacity optimization technologies are not available, it clearly offers storage capacity and cost savings compared to not using it.
- Data deduplication. A variant of the concepts used in file-level single instancing, data deduplication segments incoming data streams into sub-file level chunks, fingerprints the chunks, and then stores the fingerprints and the associated chunks in an index. When identical chunks are found, they are removed and replaced with pointers. This ability to "chunk" at the sub-file level allows data deduplication to achieve greater capacity optimization ratios than file-level single instancing; e.g., a Word file where only the date was changed still exhibits a high level of commonality with the previous version, and data deduplication will recognize this redundancy while storing only the net new bits separately. After the deduplication process is complete, data is generally compressed using standard compression algorithms prior to being stored. Data deduplication works well with certain data types but not others, with images (.tiff, .gif, etc.) presenting particular problems.
- De-layering combined with application-specific capacity optimization algorithms. One approach, introduced in 2008 by San Jose, Calif.-based startup Ocarina Networks, seems to work quite well for images: it converts objects in the data stream to their native formats, and then uses file type-specific algorithms to store them very space-efficiently. For example, a compound document such as a PDF would be de-layered to discover that it actually includes a Word document, several .gif images, and an Excel spreadsheet. Algorithms specific to each of these file types are then used to process them separately.
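To make the difference between file-level single instancing and sub-file data deduplication concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular vendor's product works: the function names, the sample "documents" and the 64-byte fixed chunk size are all hypothetical. Fixed-size chunking only finds matches when an edit does not shift the data that follows it, so the example changes a date of the same length.

```python
import hashlib

CHUNK_SIZE = 64  # hypothetical fixed chunk size, chosen only for this demo


def fingerprint(data: bytes) -> str:
    """Fingerprint a file or chunk with a hash algorithm (SHA-256 here)."""
    return hashlib.sha256(data).hexdigest()


def single_instance_size(files) -> int:
    """Bytes stored when only whole-file duplicates are replaced by pointers."""
    index = {}
    for content in files:
        index.setdefault(fingerprint(content), content)
    return sum(len(stored) for stored in index.values())


def chunk_dedup_size(files, chunk_size: int = CHUNK_SIZE) -> int:
    """Bytes stored when identical sub-file chunks are replaced by pointers."""
    index = {}
    for content in files:
        for start in range(0, len(content), chunk_size):
            chunk = content[start:start + chunk_size]
            index.setdefault(fingerprint(chunk), chunk)
    return sum(len(stored) for stored in index.values())


# Two versions of a made-up document that differ only in the date,
# plus an exact copy of the first version.
body = b"".join(b"Paragraph %04d of the quarterly report. " % i for i in range(200))
v1 = b"Date: 2008-01-15\n" + body
v2 = b"Date: 2008-02-15\n" + body
files = [v1, v1, v2]

raw = sum(len(f) for f in files)
single = single_instance_size(files)
chunked = chunk_dedup_size(files)

print("raw bytes:            ", raw)
print("single instancing:    ", single)
print("chunk-level dedup:    ", chunked)
```

Single instancing removes the exact copy but stores both versions in full, while the chunk-level pass stores the second version's changed chunk only.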
Finally, data tiering with automatic file migration can be viewed as another "capacity optimization" technology in the sense that it is designed to reduce the need for expensive primary storage capacity while still making files accessible online. File-stubbing was an early version of this approach used by many archiving vendors. As files aged and were accessed less, they were automatically migrated to a less expensive secondary storage tier, but a stub was left on primary storage. The stub took up very little space, but for all practical purposes made the file look like it was still being kept in primary storage (except for some additional latency in accessing it). Newer approaches leverage tiered storage architectures, giving the appearance that all files are still stored on "primary storage" and easily accessible by end users when in fact most of them are stored on very cost-effective SATA disk tiers, generally combined with some form of file-level single instancing or data deduplication. This approach allows 70% or more of the data currently sitting in primary storage to be moved to disk-based tiers that cost 1/10 as much on a $/GB basis. Vendors with products in this space include Active Circle, Permabit, and Tarmin Technologies.
Keep in mind that sometimes these technologies can be used in a complementary manner and sometimes they cannot. Running compression on a deduplicated data stream almost always improves the capacity optimization ratio, which is why vendors always combine deduplication with compression. Attempting to "deduplicate" a compressed data stream (e.g., a Word file that was compressed by Office before it was sent out across the LAN/WAN) will generally not produce capacity optimization ratios as high as deduplicating that same data stream in non-compressed form. The order here is important. If you are going to both deduplicate and compress, you generally need to deduplicate first, then compress. (There is one exception to this with CommVault's source-based capacity optimization technology.) If incremental backups are already in use, capacity optimization ratios from deduplication will be lower than if deduplication were used against a series of successive nightly fulls.
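The ordering point can be shown with a small Python sketch. This is a toy model under assumptions, not a benchmark: the helper names, the 4 KB fixed chunk size and the synthetic files (five files that each pair one unique block with the same three shared blocks) are all made up for illustration, and standard zlib stands in for whatever compression a real product uses. The CommVault exception mentioned above is not modeled.

```python
import hashlib
import zlib

CHUNK = 4096  # hypothetical fixed chunk size for this demo


def make_block(label: str, size: int = CHUNK) -> bytes:
    """Deterministic, moderately compressible filler text (lines of hex digests)."""
    out = bytearray()
    i = 0
    while len(out) < size:
        out += hashlib.sha256(f"{label}-{i}".encode()).hexdigest().encode() + b"\n"
        i += 1
    return bytes(out[:size])


def dedupe(streams, chunk_size: int = CHUNK):
    """Return the unique fixed-size chunks found across all streams."""
    unique = {}
    for stream in streams:
        for start in range(0, len(stream), chunk_size):
            chunk = stream[start:start + chunk_size]
            unique.setdefault(hashlib.sha256(chunk).digest(), chunk)
    return list(unique.values())


# Five synthetic files: one unique block each, followed by three shared blocks.
shared = make_block("shared-a") + make_block("shared-b") + make_block("shared-c")
files = [make_block(f"unique-{n}") + shared for n in range(5)]

raw_size = sum(len(f) for f in files)

# Deduplicate the uncompressed data, then compress each surviving chunk.
unique_chunks = dedupe(files)
dedup_only = sum(len(c) for c in unique_chunks)
dedup_then_compress = sum(len(zlib.compress(c)) for c in unique_chunks)

# Compress each file first, then try to deduplicate the compressed streams.
compressed_files = [zlib.compress(f) for f in files]
compress_then_dedup = sum(len(c) for c in dedupe(compressed_files))

print("raw:                  ", raw_size)
print("dedup only:           ", dedup_only)
print("dedup, then compress: ", dedup_then_compress)
print("compress, then dedup: ", compress_then_dedup)
```

In this model the shared blocks are found only while the data is still uncompressed. Once each file has been compressed as a whole, the shared content is encoded differently in every stream, so the deduplication pass has little or nothing left to match.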
If the goal is to reduce the costs associated with storage infrastructure, then it's clear there are many approaches, all of which seem to be focused on optimizing the need for and/or cost of storage capacity. Since no vendor's product that I'm aware of just does data deduplication (at a minimum they also do compression), and there are very viable approaches that have nothing to do with data deduplication, I think we need a more descriptive term. Perhaps a better umbrella term to describe the technologies targeted at the problem is "storage capacity optimization."
About this author: Eric Burgener is a senior analyst with The Taneja Group. His areas of focus include data protection, disaster recovery, storage capacity optimization and archiving.