As data deduplication becomes more common in data backup implementations, a slew of vendors have thrown their hats into the ring alongside Data Domain (recently acquired by EMC Corp.), the dedupe market's heavyweight.
Though the data deduplication market has matured quickly, "it's still the Wild West," in terms of widespread implementation, according to Arun Taneja, founder and consulting analyst at Taneja Group. And he said that because vendors haven't done apples-to-apples comparisons of their products, "there's a continuing mystery in the market about who does well under what circumstances. Every one has places they do well and don't do well."
The data deduplication effect
"Deduplication is an effect, not even a technology," said Brian Biles, founder and vice president of product management at Data Domain. "There are lots of different ways you can do it, and a lot of them have varying side effects. A lot of key issues boil down to where it is the best fit."
Data Domain continues to dominate the market. Its appliance dedupes inline, as data is being written to disk. Bill Andrews, CEO of Exagrid, considers Data Domain its No. 1 competitor. "Data Domain was first, so they got the world thinking inline," said Andrews. Exagrid deduplicates post-process, after backup data has been written to disk.
Appliance maker Sepaton Inc. uses the concurrent dedupe method, which is similar to post-processing but starts working on data after the first backup has completed, rather than after all backups have finished. Nexsan Systems co-branded its recent entry into the market with FalconStor Software, coming out this month with a data deduplication backup appliance that includes MAID technology and can dedupe using either concurrent or post-processing technology.
Inline vs. post-processing dedupe
Tom Cook, CEO of Permabit, argues for inline deduplication, which Permabit uses in its dedupe appliance. "The reason you do dedupe as post-processing is because you can't do it quickly enough to keep up," he said. "It introduces the challenge of ending up with a dedupe window."
But analyst Taneja said that it matters less and less whether you deduplicate inline or post-process, especially as the concurrent method has developed. "When deduplication started, there was a genuine large difference between the two technologies," he said. "The post-processing guys have become more sophisticated over the past 12 months." When running concurrent dedupe, prioritizing a smaller job to get backed up first is a best practice, according to Taneja. "Don't do Oracle finances, but do a job that only takes five minutes. Once that job is done, you start parallel processing and dedupe happens almost in parallel with data ingestion."
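The difference between the two approaches comes down to where the hashing work sits relative to the write path. The sketch below is a minimal illustration of that trade-off, not any vendor's actual implementation: it assumes fixed-size chunking and SHA-256 fingerprints, whereas shipping products typically use variable-size chunking and proprietary algorithms. Inline dedupe hashes each chunk as it arrives and never writes a duplicate; post-process dedupe lands raw data first for fast ingest, then collapses duplicates in a later pass (the "dedupe window" Cook describes).

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for illustration only


def chunks(data: bytes):
    """Split a backup stream into fixed-size chunks."""
    for i in range(0, len(data), CHUNK_SIZE):
        yield data[i:i + CHUNK_SIZE]


def inline_dedupe(stream: bytes, store: dict) -> list:
    """Hash each chunk in the write path; store only chunks not already seen."""
    recipe = []  # ordered fingerprints from which the stream can be rebuilt
    for chunk in chunks(stream):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:      # duplicate chunks are never written to disk
            store[digest] = chunk
        recipe.append(digest)
    return recipe


def post_process_dedupe(stream: bytes, staging: list, store: dict) -> list:
    """Land raw chunks first (fast ingest), then collapse duplicates afterward."""
    staging.extend(chunks(stream))   # write path does no hashing at all
    recipe = []
    while staging:                   # this loop is the later "dedupe window"
        chunk = staging.pop(0)
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return recipe
```

Both paths end with an identical chunk store and recipe; the difference is that the post-process path needs extra staging capacity and a window of time after ingest, while the inline path pays the hashing cost during the backup itself.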
Michael Passe, storage architect for Beth Israel Deaconess Medical Center in Boston, chose Data Domain two years ago and is now running three generations of the hardware in his shop. When it comes to using dedupe technology, "You've really got to know your data," said Passe. He likes the inline deduplication: "It made more sense because it's not altering the data after the fact." Doing it post-processing, said Passe, means "you've got this thing basically mucking with the data, and it makes us worry about longer term corruption. If we're going to compress data, we're going to compress as it goes in."
As for scalability, "the logical progression of processor technology has gotten us to the next step," he said, since Data Domain's architecture depends on Intel making its CPUs faster every year.
Bob Rader, lead storage administrator at the University of New Hampshire (UNH), chose Sepaton as part of a backup consolidation project when moving from streaming tape drives to a virtual tape library (VTL), and uses Sepaton's appliance with deduplication for the school's primary backup. Rader liked the concurrent dedupe method because "we didn't want anything in the way as the backups are running," he said. The other main factor was scalability; Sepaton adds nodes as data grows. "We can grow in small incremental upgrades to become very large," Rader said. "With most of the other vendors, you have to buy other boxes."
Rader compares buying multiple dedupe appliances to the problem of adding primary storage arrays. "You have new storage, you fill it up, you have to buy a second one, and now you have X number of arrays with all this isolated free space," he said.
Taneja calls this the "global repository syndrome" that could develop as the dedupe market matures further. "As customers start to make disk-based protection larger, they want a single repository," he said. "Because customers haven't reached the point where they have 12 nodes already, we haven't tested global repository syndrome. But we're headed toward that."
Deduplication technology signaling end of tape backup?
Deduplication discussions inevitably lead to the perennial hot topic of disk vs. tape. Exagrid's Andrews thinks that dedupe effectively signals the end of tape. "Nobody really likes tape, but you couldn't afford to get off it" before deduplication, he said, predicting an end to tape in the midmarket within five years.
Permabit CTO Jered Floyd agreed. "What we've seen already is that dedupe on disk has accelerated the death of tape," he said. "The main thing that tape had going for it was the measure of cost efficiency, which dedupe has destroyed."
But Jay Livens, director of marketing at Sepaton, disagrees. "A lot of companies have very strict requirements around retaining data for years," he said. "There's a period of time in which tape does make sense for deeper archive."
Other vendors argue that deduplicating data with software is the way to go. "It's better to move deduplication closer to the information sources," said Matthew Lodge, senior director of product marketing for information management at Symantec Corp., which incorporates dedupe into its archiving and backup software. "As it moves forward, it's going to be built into the infrastructure." CommVault performs dedupe with a "hybrid approach," said Dipesh Patel, senior product marketing manager. "We do it further upstream, closer to the client but not on the client," he said.
Ultimately, each vendor handles dedupe differently, according to UNH's Rader, and users should get as much information as they can from their vendor when deciding how to set up their deduplication process. "They each have their own algorithms," he said. "The more you know about how a vendor's technology works, the more you're able to do things right from the get-go."
About this author: Christine Cignoli is a Boston-based technology writer. Visit her at www.christinecignoli.com.