The evolution of data deduplication technology continues

Data deduplication technology for backup has evolved enormously in the last decade, and it's poised to go beyond just backup. Get Arun Taneja's thoughts on the subject.

New technologies often come to market as value-add features for existing mainstream products and are later merged into those products. It's frequently the only way a new vendor can bring a product to market, as happened with data deduplication technology. In the early 2000s, data deduplication came in the form of appliances from companies like Data Domain, Diligent Technologies, ExaGrid Systems, Quantum and Sepaton. The appliances worked in conjunction with existing backup software and the value proposition was simple: install the appliance, point the backup software to it, and instead of backing up to tape, the backup app would stream the data to the appliance. Backups and restores got faster and more reliable, and with deduplication, disk costs were close to those of tape.

The market for the appliances soared in terms of revenue and acceptance. But why not merge dedupe functionality into the backup software itself instead of using a separate appliance? Today, most backup software vendors offer deduplication as a standard feature in their products.

But the technology continued to advance and source-based data deduplication technology emerged. Rather than dedupe data in the backup software, it could be done at the application server under the control of the backup app. Then only deduplicated data would traverse the network, reducing congestion and improving performance. This concept was introduced by Avamar -- a startup at the time -- and because its method meant using a brand new backup app, few enterprises wanted to risk their valuable corporate data with a product from a relative unknown. But then Avamar was acquired by EMC, and in the hands of a big, well-known company, product sales took off. And now conceptually similar source-based deduplication backup software is available from most major players.
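To make the source-based idea concrete, here is a minimal Python sketch of an application server that fingerprints its data locally and sends only the chunks the backup target has not already stored. Everything in it is an illustrative assumption rather than any vendor's design: the `BackupTarget` class, the `backup_from_source` and `restore` functions, the fixed 4 KB chunk size and the use of SHA-256 are all hypothetical, and the "network" is simply a method call.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; chosen only for illustration


class BackupTarget:
    """Hypothetical backup server that stores each unique chunk once."""

    def __init__(self):
        self.chunks = {}         # fingerprint -> chunk bytes
        self.bytes_received = 0  # stands in for data sent over the network

    def has(self, fingerprint):
        return fingerprint in self.chunks

    def put(self, fingerprint, chunk):
        self.chunks[fingerprint] = chunk
        self.bytes_received += len(chunk)


def backup_from_source(data, target):
    """Runs on the application server: hash each chunk locally and send
    only the chunks the target has not seen. Returns the ordered list of
    fingerprints needed to rebuild the data."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if not target.has(fingerprint):
            target.put(fingerprint, chunk)
        recipe.append(fingerprint)
    return recipe


def restore(recipe, target):
    """Rebuild the original data from its ordered fingerprints."""
    return b"".join(target.chunks[fp] for fp in recipe)
```

If a second backup differs from the first by a single chunk, only that chunk is passed to `target.put`, which is the effect the paragraph above describes: the network carries new data, not repeated data.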

The fundamental evolution from "value-add feature" to "mainstream" is now complete. What can you expect next? While it seems as if there are a lot of data deduplication technology choices to make, for new IT initiatives it makes little sense to go back to 20th-century data protection methods. I believe data should be reduced at the point closest to where it's created and kept in its reduced form through its entire lifecycle, except when it's needed in its full format to be viewed for audits, compliance, analytics and so forth. That logic dictates that all new IT initiatives should use source-based backup software to seal in maximum efficiency from the outset. This approach also works very well for virtualized server environments, given the level of duplication found in virtual machines.
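The point about virtual machines can be illustrated with a rough Python sketch that measures how much a set of images shrinks when identical chunks are stored once. The `dedupe_ratio` function, the 4 KB chunk size and the SHA-256 fingerprint are assumptions made for the example, and the ratio it reports depends entirely on the data it is given; it is not a measurement of any real environment.

```python
import hashlib


def dedupe_ratio(images, chunk_size=4096):
    """Return logical bytes divided by unique-chunk bytes across all images.

    A ratio of 1.0 means nothing was duplicated; higher means more savings.
    """
    seen = set()  # fingerprints of chunks already stored
    logical = 0   # bytes before deduplication
    stored = 0    # bytes after deduplication
    for image in images:
        for offset in range(0, len(image), chunk_size):
            chunk = image[offset:offset + chunk_size]
            logical += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return logical / stored if stored else 1.0
```

Images built from the same operating system template share most of their chunks, so the stored total grows only by each machine's unique data while the logical total grows by the full image size.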

For existing environments, it's trickier. The particular situation will determine whether a dedupe appliance, new backup software with target-based data deduplication or source-based backup software is most appropriate.

The main point, however, is that data deduplication is now mainstream and should be treated that way. Make decisions around data deduplication at a strategic level and not as a patch to an existing problem. We're in a different phase of the technology, but that doesn't mean the era of data deduplication appliances is over. The amount of data under the purview of legacy backup software is immense and needs to be tamed. Whether you choose to tame it via a data deduplication appliance or with new backup software depends on factors that are unique to your environment. Most backup vendors can now offer source-based, target-based and appliance-based products as options. And some startups offer unique solutions that make up for the lack of breadth in their offerings.

Eventually, we won't need to talk about data deduplication technology, as it will simply happen when the data is created by the application, and the data will stay deduplicated throughout its lifespan. But, despite some promising starts, this primary storage data deduplication approach hasn't happened at the pace I expected. NetApp has had this capability for a long time, but it's not inline and requires post-processing. IBM's Storwize technology is compression-based and solves an important but adjacent problem, and it's still not available across the entire product line. Dell bought Ocarina, but little has been done with that technology in almost two years. Still, without question this concept will come to pass and we will quietly put data deduplication to sleep. Until then, work needs to be done and IT life must go on. When you have to make a data deduplication decision, make that decision as strategic as possible given the state of technology and its evolution.

About the author
Arun Taneja is founder and president at Taneja Group, an analyst and consulting group focused on storage and storage-centric server technologies.
