Data deduplication software has become more reliable and affordable in recent years, giving users of deduplication hardware a viable alternative. But, according to users and analysts, deduplication appliances are still good choices for many types of organizations.
According to Rachel Dines, an analyst at Forrester Research, dedupe hardware generally appeals more to larger companies that store massive amounts of data and need fast throughput. “The dedupe algorithms in the software and hardware solutions are really very similar,” she said. The difference, Dines said, is that dedupe hardware is optimized to support data deduplication; it isn’t just a disk with software running on it. Dedupe appliances also have the advantage of usually being self-managing, so they can do their own provisioning. Many of them also have different ways of interfacing with backup software.
“These kinds of capabilities first emerged with virtual tape libraries [VTLs] that provided tape emulation to ease that technology transition. Now, with developments like the Symantec Corp. OpenStorage (OST) API, deduplication hardware can interface and connect with backup software and provide more granular management,” she said. Dines also noted that almost all of the big players in backup (for example, CommVault Systems' Simpana, EMC Corp.’s NetWorker, IBM Corp.'s Tivoli Storage Manager (TSM) and Symantec’s NetBackup) offer both source-side and target-side deduplication in their software.
However, “Source-side deduplication doesn’t work well on databases or anything transactional in general because it involves processing overhead on the host, potentially up to 25%,” Dines said. “When these types of workloads are deduplicated on the target side, it won’t have that overhead, whether it is hardware or software deduplication.”
Now that so many backup software products offer source and target deduplication, a lot of organizations will move in that direction, she said. Still, data dedupe appliances like EMC Data Domain's won’t disappear; they will just tend to migrate upstream to larger companies with larger environments, especially those in the multi-petabyte range.
Deduplication appliances not just for enterprises
One midsized organization that is keeping its commitment to dedupe hardware is the Jackson County Intermediate School District in Michigan, which has adopted Exagrid Systems' appliances. Greg Wade, a network engineer at the district, said that before installing Exagrid, the district was backing up to a LUN on the SAN. With “traditional” Symantec backup, Wade said, he was able to retain only 10 days of data in the 9 TB he had reserved.
Wade said the organization initially looked at EMC Data Domain's dedupe hardware but thought it was too expensive. It then considered Exagrid, decided it was more affordable and selected a 4 TB system. “Once we did that we were able to immediately go to 12 weeks of retention, right off the bat. We put that unit at our offsite data center,” he explained.
Wade said that because the district had its own private fiber connection, speed was not a problem. After a year, he added another Exagrid unit to accommodate growth. The two units are stacked and function as one unit for management purposes. Between the two units, Wade said, the district now backs up 60 TB at a compression ratio of almost 7:1.
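As a rough check on those figures, the disk space implied by a reduction ratio is the logical backup data divided by the ratio. The short sketch below applies that to the 60 TB and 7:1 numbers Wade cited. The function name is illustrative only, and because Wade said "almost" 7:1, the real footprint would be somewhat larger than the result shown.

```python
def physical_tb(logical_tb: float, ratio: float) -> float:
    """Disk space consumed after data reduction at the given ratio (7.0 means 7:1)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return logical_tb / ratio


# Figures quoted in the article: 60 TB backed up at (almost) 7:1.
stored = physical_tb(60, 7)
print(f"{stored:.1f} TB on disk for 60 TB of backups")  # 8.6 TB on disk for 60 TB of backups
```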
In terms of management, “it has basically been set it and forget it,” he said. The organization still uses Symantec Backup Exec. “All we had to do was change the target location when we created the share on Exagrid,” he added.
Of course, having the deduplication appliance provided immediate benefits in terms of backup speed, too. “Before Exagrid, I would start backups at 5:00 p.m. on Friday, and on Monday they were still running. They ran all weekend. Now I have all kinds of headroom. Backups finish on Sunday and there is plenty of dead space because the jobs don’t all run back-to-back. So it has significantly cut down on our backup window,” he said.
Dedupe hardware vs. software: What's the best choice for your organization?
James Brissenden, senior strategy consultant for GlassHouse Technologies, said there are a few different ways to accomplish deduplication. However, he agreed that users usually get more performance from hardware. “Hardware is especially useful where you have a large data store,” he added.
According to Brissenden, the first things to look at when choosing between hardware and software are the backup software you currently have and your performance requirements. Your existing software may already be able to meet your performance requirements for dedupe. Replication is another thing to consider, he added. “You may want to leverage your dedupe system to get data offsite for DR. Some of the backup hardware platforms offer replication and some don’t,” he explained.
Brissenden advised users to be wary of competing deduplication ratio claims. “There are a lot of scare tactics out there but the reality is that the ratios are mostly driven by the nature of your data,” he said.
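A toy illustration shows why the data matters more than the product. The sketch below is not any vendor's algorithm: it assumes simple fixed-size 4 KB chunks identified by SHA-256 hashes, whereas commercial products typically use more sophisticated chunking. The function name and chunk size are assumptions made for the example.

```python
import hashlib
import os


def dedupe_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Logical size divided by the size of the unique chunks that must be stored."""
    unique = {}  # chunk hash -> chunk length, kept once per distinct chunk
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    stored = sum(unique.values())
    return len(data) / stored if stored else 1.0


repetitive = b"A" * 4096 * 100         # 100 identical chunks
random_like = os.urandom(4096 * 100)   # repeated chunks are vanishingly unlikely

print(dedupe_ratio(repetitive))   # 100.0
print(dedupe_ratio(random_like))  # effectively no reduction
```

The same code yields a very high ratio on highly repetitive input and essentially none on random input, which is the point behind Brissenden's caution about headline ratios.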
Brissenden also said that support for the Symantec OST API is crucial, particularly with virtualization. “You want to be able to create copies and have your applications and dedupe devices be ‘aware,’” he added.
Finally, added Dines, it's important to remember that all deduplication is essentially software, but there are two different ways to implement it in backup environments: packaged with hardware (e.g., EMC Data Domain or IBM ProtecTIER) or packaged with software (TSM, NetBackup, etc.). Dines said disk libraries with deduplication features usually run between $5,000 and $7,000 per usable TB (before deduplication). Software products are usually an incremental add-on to the backup software. They include neither the compute resources needed to do the hashing (which come from your host for source-side deduplication, or from the media server for target-side deduplication) nor the actual storage, “so it is hard to compare the two on a terabyte basis,” she said.
About this author: Alan Earls is a frequent contributor to SearchDataBackup.