First, let's define what we mean by data deduplication and how it differs from primary storage optimization. The main technology for primary storage optimization (PSO) is compression, while the cornerstone of secondary capacity optimization (SCO) is data deduplication. SCO tools reduce disk capacity on secondary storage (backup and archiving).
Data compression vs. data deduplication
Compression technologies look at a data stream and algorithmically eliminate redundant bits such that no data is lost when it’s uncompressed. Data deduplication at the file level deletes a duplicate file and replaces it with a pointer. Dedupe at the sub-file level does the same, except it uses a number of pointers, one for each sub-file chunk. Data deduplication doesn’t try to crunch the file as compression does; it looks for duplicates within a repository at either the file or sub-file level. Compression can also be applied to secondary storage -- a lot of backup data is compressed before it’s written to tape, and most SCO products add compression on top of deduplicated data.
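The chunk-and-pointer mechanics can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: real products typically use variable-size chunking and persistent indexes, while this sketch assumes fixed 4 KB chunks and an in-memory dictionary.

```python
import hashlib

# Illustrative sketch of sub-file (chunk-level) deduplication.
# Fixed-size 4 KB chunks are an assumption for simplicity; shipping
# products generally use variable-size chunking.
CHUNK_SIZE = 4096

def dedupe_store(data: bytes, store: dict) -> list:
    """Split data into chunks, store each unique chunk once, and
    return a list of hashes that act as pointers to the chunks."""
    pointers = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # duplicate chunks collapse here
        pointers.append(digest)
    return pointers

def reconstruct(pointers: list, store: dict) -> bytes:
    """Rebuild the original data from its chunk pointers."""
    return b"".join(store[p] for p in pointers)
```

Storing 12 KB made of three chunks, two of them identical, leaves only two unique chunks in the repository, yet `reconstruct` returns the data byte for byte.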
Global data deduplication vs. local dedupe explained
Now let’s look at global data deduplication tools. When you install your first data deduplication system for your backups, it needs time to extract duplicates at the sub-file level. The capacity reduction may be only 2 to 1 in the first week -- with one full and six incrementals, for instance. As more weekly fulls and daily incrementals accumulate, the ratio improves, often to about 20 to 1.
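A back-of-the-envelope calculation shows why the ratio climbs. Every rate below is a hypothetical assumption chosen to mirror the numbers above: the first backup set dedupes about 2 to 1 internally, later fulls are ~98% identical to prior ones, and incrementals are ~85% duplicate data.

```python
# Sketch of how a dedupe ratio grows as backup cycles accumulate.
# All sizes and change rates are hypothetical assumptions.
def dedupe_ratio(weeks: int) -> float:
    full, incr = 500.0, 25.0           # GB per weekly full / daily incremental
    weekly_logical = full + 6 * incr   # one full plus six incrementals
    logical = physical = 0.0
    for week in range(weeks):
        logical += weekly_logical
        if week == 0:
            physical += weekly_logical * 0.5           # ~2:1 in week one
        else:
            physical += full * 0.02 + 6 * incr * 0.15  # only new data lands
    return logical / physical

print(round(dedupe_ratio(1), 1))    # roughly 2:1 after the first week
print(round(dedupe_ratio(52), 1))   # climbs toward ~20:1 over a year
```

The logical (protected) capacity grows by a full week's worth of backups every cycle, while the physical footprint grows only by the small fraction of genuinely new data, so the quotient keeps rising.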
If the backups were done intelligently and the same data wasn’t repeatedly dragged to the backup disk, we wouldn’t need data dedupe. Global data deduplication comes into play when a single system can squeeze out the duplication across the entire enterprise, rather than only across each system.
Current data deduplication systems vary in the way they perform dedupe, such as whether they do it inline or post-processing, if they use a virtual tape library (VTL) or network-attached storage (NAS) interface, and so forth. But a major architectural difference is whether the systems are single node or scale-out (sometimes called clustered). Scale-out solutions can perform global data deduplication simply by adding nodes. Even a system of just two nodes enhances reliability as the configuration can withstand a failure of a disk in any node, or the failure of an entire node. The nodes can be managed as a single system to create a global deduplication solution. A single-node system has no visibility into other nodes, so even though there may be chunks or files that are exactly the same on multiple nodes they’ll be viewed as unique data and stored on each node.
The merits of global data deduplication may seem obvious, but in practical terms they’re somewhat diminished. Would you want to put all your eggs in one basket with a single solution for the entire enterprise? And if you have multiple subsidiaries, would it be desirable for them to share one backup repository? Probably not. But that’s not to suggest that scale-out solutions are bad; I see scale-out as the preferred architecture as it mitigates proliferation and lets you decide whether to create one monolithic system or multiple standalone systems.
But I’m not convinced that global deduplication is necessarily the primary reason for preferring scale-out products. Global deduplication will yield better reduction ratios, but when compared to standalone systems, the difference is often less than dramatic.
When global dedupe truly matters
However, there are places where global deduplication matters immensely. Take Symantec Corp.’s NetBackup PureDisk as an example. You install it in the main data center with smaller versions at each of your remote locations. All of them are scale-out, but it’s likely the data center installation is indeed multinode while the remote ones are single-node (or dual-node) systems. The data is chunked up at each remote site, automatically checked against the master unit in the data center to see if it already exists there, and then either moved or marked with a pointer. Because the data center unit is the reference point, all data across all remote sites is deduplicated and the master unit is indeed very efficient in terms of data deduplication. Keep in mind that even single-node solutions, such as EMC Data Domain and others, allow such elimination across remote sites. But having a large, scalable unit at the data center does make a difference in this case. EMC recently added a two-node Data Domain system where each node is indeed aware of the other and eliminates duplication across both nodes. We consider this a “quasi” scale-out system that’s perhaps a precursor to a full-blown scale-out solution in the future.
FalconStor Software Inc.’s product is a bit different architecturally. Its VTL is scale-out but doesn’t have integrated data deduplication. Another scale-out FalconStor product, Single Instance Repository (SIR), sits on the same local-area network (LAN) and performs data deduplication on a post-process basis. One would consider this an example of a system capable of doing global data deduplication.
NetApp adds compression to dedupe
NetApp’s approach is unique in the industry because it offers a dedupe feature that’s shipped at no additional charge (although it must be licensed). This is the only case we’ve seen where deduplication, as we define it, is used to optimize capacity on both primary and secondary storage. It uses a post-process method that looks for block-level redundancies to shrink the data. When data is needed by the application it’s presented in the original format, perhaps with a small amount of latency since the file has to be reconstituted. If the storage is being used to store backups, the capacity optimization is done in exactly the same way. Recently, NetApp added compression for both primary and secondary data. This makes NetApp unique in its approach to capacity optimization, and blurs the lines I defined earlier, but it currently doesn’t offer global deduplication.
Permabit Technology Corp. is another noteworthy player. Until recently, the firm offered an appliance designed for archival data. It’s classic scale-out, and uses data deduplication and compression to optimize capacity utilization. Permabit recently isolated the deduplication engine and made it available to OEMs who lack PSO or SCO technologies. Because Permabit’s archival system, Permeon, can be used as a backup target, as an archive or as tier 3 primary storage, the company claims its data reduction engine combines the benefits of all capacity optimization technologies and applies equally to primary or secondary storage. BlueArc, LSI and Xiotech Corp. have signed on as OEMs for this technology. From a global deduplication perspective, Permabit’s architecture does indeed meet our definition, as do NEC’s HydraStor and Sepaton’s VTL with DeltaStor appliances.
Global data deduplication is an important feature, especially as it applies to data management across remote sites. And it’s hard to argue against having a smaller number of systems to manage. Being able to scale simply by adding another node, with the system automatically redistributing the data on the back end, is a great benefit, too, as are the few extra percentage points of deduplication across the enterprise. But don’t chase global data deduplication at all costs. Choose a scalable architecture, but not just for the global data deduplication it delivers.
About this author: Arun Taneja is founder and president of Taneja Group, an analyst and consulting group focused on storage and storage-centric server technologies. He can be reached at firstname.lastname@example.org.
This article was previously published in Storage magazine.
This was first published in March 2011