John Merryman, services director for recovery services at GlassHouse Technologies Inc., discusses the pros and cons of virtual tape library (VTL) data deduplication and the state of the VTL deduplication market today in this Q&A. His answers are also available below as an MP3 for you to download.
Listen to the VTL deduplication FAQ
>>The benefits of using a VTL for data deduplication
>>The drawbacks of using a VTL for data deduplication
>>Restores from inline or post-process deduplication
>>VTL vendors offering deduplication
>>Forward-referencing and backward-referencing deduplication
Well, there are a few big use cases right now that are driving data deduplication to the forefront of the decision-making that's going on in backup environments. The bottom line is that data deduplication is dramatically changing the economics of using disk as a backup target as opposed to tape. You can only buy so much disk before you run out of funding, but there are a few other drivers. One is tape reduction -- that's the most obvious. Also, VTL deduplication provides an alternative to first-generation VTLs (from the 2003 to 2004 timeframe). And the last is an encryption-avoidance strategy. Companies that are looking at media loss risk -- and are facing operational, litigation or reputation risk as a result -- have started to compare tape encryption with a tape reduction strategy that uses a deduplicating VTL and site-to-site replication. Other benefits include reducing the complexity of operations and requiring less hands-on tape management at remote sites or even at core sites.
Another factor here is that there are various VTL interfaces. You have the classic VTL interface, in which the device emulates a tape library using Fibre Channel (FC) connectivity. However, there are also devices that connect via CIFS or NFS. These are just different means for getting your data into a data deduplication device. At this point, it's less about the interface and more about your protocol and connectivity preferences. Do you want to use FC or do you want to use iSCSI? More and more, this is less about the actual functionality, and more about the performance and integration strategy that you want to take.
With the first-generation VTLs, that testing was limited, and a lot of customers really paid the price by learning about the technology in house and how it affected production environments. These customers really educated the vendors about how disk technology performed, scaled and integrated with the various backup products. Also, there were a lot of integrated or tape-staging-type of architectures where the VTL was writing off to tape. There were very few vendors out there at that time that really had a deduplication capability; there was Data Domain, EMC Corp.'s Avamar and Permabit Technology Corp., and there were bottlenecks in all of the designs out there.
Data deduplication is generally available in almost all of the second-generation VTLs out today, and where it is not generally available, it is in beta testing. All of the innovative vendors are making a hard push into data deduplication. Things are getting better and faster, and the vendors are overcoming some of the engineering challenges. There's also a big move away from tape staging. Users are now using hybrid approaches -- using current disk and tape assets with data deduplication in their architecture.
There are really no alternatives for disk-based backup outside of backup software options. For example, Symantec Corp. has PureDisk that offers a data deduplication function at the client. IBM Corp. is soon to release a similar function in its software product. Other products like Avamar or Asigra have a software capability. And while these have a lot of play in smaller production environments, they have limited reach into enterprise environments where scale and performance requirements are really high.
Another potential drawback is that, with the exception of a few of the vendors, the technology is less proven. In fact, some vendors have only had data deduplication in production for a matter of months. It's great that it works, but no one really has a long-term understanding of the risks associated with the technology. You'll hear a lot of arguments around hash collisions, for example, which are statistically based -- not based on actual data loss in production.
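To put the statistical nature of the hash collision argument in concrete terms, here is a minimal sketch of the standard birthday-bound approximation. It assumes an ideal, uniformly distributed hash; the chunk size, data volume and hash width are illustrative assumptions and are not taken from any vendor's product.

```python
import math

def collision_probability(num_chunks: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes an ideal hash whose outputs are uniformly distributed.
    """
    exponent = -num_chunks * (num_chunks - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(exponent)

# Hypothetical workload: 1 PB of unique data cut into 8 KB chunks,
# fingerprinted with a 160-bit hash (all three figures are assumptions).
chunks = 10**15 // (8 * 1024)
p = collision_probability(chunks, 160)
print(f"chunks: {chunks:.3e}, estimated collision probability: {p:.3e}")
```

The estimate only says how likely two different chunks are to share a fingerprint under these assumptions; it says nothing about how a given product detects or handles such an event.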
Also, if you look at what happened with the first generation of VTL products, most of the vendors approached this from a really simplistic point of view. They tell you that you can take their product and plug it into the backup environment and all your problems will go away. But, almost every customer I've dealt with has not had that experience. In many instances, the first-generation VTLs were pretty miserable because things didn't work because there were bottlenecks and backup software integration was not really considered as part of the picture.
So, when you look at the typical backup environment today, the fundamental design has been tweaked but not exactly overhauled since the initial deployment of VTLs. For many companies, a refresh of the design and technology is long overdue, and right now data deduplication is a really good catalyst to do this.
With inline deduplication, one of the pros is that when it's done, it's done. So when your data is written into the device, dedupe is accomplished and you keep on trucking. You may have less upfront performance compared to some of the post-process vendors because you're doing a lot more work as you're doing inline deduplication. But then another pro is that you could have less complexity and, essentially, less time needed to accomplish everything.
So, if you look at the production backup process and then the back-end work like dedupe, copies and replication in batch, you may be able to get a lot of your batch work done in less time by having a single step. In the post-process models you have multiple steps: you write into a staging area, then you read out to a dedupe area.
One of the bigger advantages of inline is that replication of a device from a source to a target can start right away. With post-process, there are some other things to think about. You can have a really fast initial write speed, with extremely high performance writing your inbound backup data. This can be very good from the point of view of meeting your backup window and accomplishing extremely fast restores from this device back to clients for the most recent data.
A little caveat here is whether your network and your clients can really keep up with some of the technologies that are out there today. With post-process, dedupe on the back-end also needs to keep up. If anyone has ever run a TSM environment, they know that if you fall behind with your batch operations, say more than two days behind, it's really bad and you will struggle to catch up. That same paradigm really applies to deduplication. With post-process dedupe, it's great if you can write in a lot of data at fast speeds, but it's not great if your dedupe capability on the back-end can't keep up.
So, let's say you're writing in, but you can't write out to your dedupe environment before your next backup. What happens when your front-end cache builds up? That means you pretty much have to wait and try to catch up before your front-end is available or has enough capacity for the next inbound batch of backups.
So, realistically you're going to size your front-end cache for multiple days, or with some headroom to grow into, so you don't run into that situation. But from a technology selection and design point of view, you've got to push your vendors to make sure that they're going to design the back-end of post-processing to keep up with the front-end capabilities. Otherwise you're going to be buying a half-baked solution.
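The sizing concern for a post-process front-end cache can be sketched with a simple day-by-day backlog model. This is a rough illustration only: the function name `simulate_staging_backlog` and all of the figures (daily ingest, back-end dedupe rate, cache capacity) are hypothetical, and a real device will behave in more complicated ways.

```python
def simulate_staging_backlog(daily_ingest_tb, dedupe_rate_tb_per_day, cache_capacity_tb):
    """Return (per-day leftover backlog in TB, first day the cache overflows or None)."""
    backlog = 0.0
    history = []
    overflow_day = None
    for day, ingest in enumerate(daily_ingest_tb, start=1):
        backlog += ingest  # the night's backups land in the staging cache
        if overflow_day is None and backlog > cache_capacity_tb:
            overflow_day = day  # cache cannot hold the inbound batch
        # the back-end dedupe pass drains the cache before the next window
        backlog = max(0.0, backlog - dedupe_rate_tb_per_day)
        history.append(backlog)
    return history, overflow_day

# Hypothetical week: 10 TB incrementals on weekdays, 25 TB fulls on the weekend,
# a back-end that dedupes 12 TB per day, and a 30 TB staging cache.
week = [10, 10, 10, 10, 10, 25, 25]
history, overflow_day = simulate_staging_backlog(week, 12, 30)
print("leftover backlog per day (TB):", history)
print("first overflow day:", overflow_day)
```

In this toy run the back-end keeps up during the week but falls behind once the larger weekend backups arrive, which is the situation that extra cache headroom or a faster back-end is meant to avoid.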
Another con for post-process is that you're going to need more physical resources because you have to have a staging area as well as capacity on the back-end for deduped storage. But these are all tradeoffs, and honestly I've seen cases where each approach can be very good for large and small customers. So it varies based on your requirements and what you want to accomplish.
I think it's a fair statement to say that today most of the leading VTL vendors and a lot of the network-attached storage (NAS) vendors are offering deduplication. There are some incumbents and some newcomers. FalconStor Software and all of the FalconStor partners have FalconStor Single Instance Repository (SIR) options for deduplication. Data Domain has been in the game for a long time. Diligent Technologies Corp., same story. It just got bought by IBM and is still working with Hitachi Data Systems (HDS) and is offering deduplication with the VTL.
A few more that offer deduplication -- and there is a longer list than this -- include Sepaton, NEC Corp., NetApp and Quantum Corp. NetApp's offering is due later this year, and I think it will be available in Q4. If deduplication is not on their product roadmap and companies are sticking with the first-generation VTL story with no dedupe, they are very likely to fall behind in the market.
Sure, I'll hit that first and then talk about the other aspect of dedupe: inline versus post-process. Basically, reverse referencing means that your newest backup, as it's written into a dedupe environment, is going to be made up of pointers within that environment that really go back to the older backups, maybe even back to the original full if you have a full and incremental-forever approach.
Your newest backup as a result is really going to have a lot of pointers back to older data. So when you "redupe" the data back to its original state, you're going to be going back and following lots and lots of pointers because you have a reverse-referencing-type methodology.
Forward referencing is kind of the opposite and completely different. There is only one vendor out there that does this and that's Sepaton. With forward referencing, the newest backup is maintained in its entirety. So, your old backups reference the newest data via pointers.
As a result, when you do a restore with forward referencing, the newest backup is going to be readily accessible and you don't have to do a lot of reconstitution or reduplication of that data. But when you have older backups that you need to restore and retrieve, you will have to do some reconstitution, so that will take longer.
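The difference between the two referencing schemes can be illustrated with a toy model that counts how many chunks of each backup generation have to be resolved through a pointer into another generation. This is only a sketch under simple assumptions (each unique chunk is physically kept with exactly one generation); the function `pointer_counts` and the sample data are hypothetical and do not describe any vendor's actual implementation.

```python
def pointer_counts(backups, forward):
    """Count, per generation, the chunks that live in some other generation.

    backups: list of generations, oldest first, each a list of chunk ids.
    forward=False: reverse referencing, the oldest backup holding a chunk owns it.
    forward=True:  forward referencing, the newest backup holding a chunk owns it.
    """
    owner = {}
    order = range(len(backups))
    for gen in (reversed(order) if forward else order):
        for chunk in backups[gen]:
            owner.setdefault(chunk, gen)  # first generation visited keeps the data
    return [
        sum(1 for chunk in chunks if owner[chunk] != gen)
        for gen, chunks in enumerate(backups)
    ]

# Three hypothetical nightly backups, oldest first; letters stand for chunks.
backups = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e"],
    ["a", "b", "f", "e"],
]
reverse_counts = pointer_counts(backups, forward=False)
forward_counts = pointer_counts(backups, forward=True)
print("reverse referencing, pointers per generation:", reverse_counts)
print("forward referencing, pointers per generation:", forward_counts)
```

In this model, the newest generation needs no pointer lookups under forward referencing, while under reverse referencing it is the oldest generation that restores without them.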
When you look at a restore histogram for your average backup production environment, there's always a big peak around day one or two, and it goes up and up and then drops down around day seven. Then you have little spikes on the radar after that. So, most of your restores are usually happening within a week of the time the data was backed up.
Forward referencing from a technology point of view actually fits that histogram very well. If you think about the deduplication process, deduplication itself is definitely a CPU-intensive process and scale and performance are the big limiters. How much data can you crank through these devices, keeping in mind that backup is the most I/O-intensive application in the data center? You are definitely creating some CPU-intensive workloads when you're opening up the fire hose into a VTL device, so to speak.
You have the initial read, the analysis, comparisons, indexes, pointers and the writing of the data itself, or maybe just the writing of a pointer, and all of that has to happen in a rapid fashion. So, how do the vendors do that? There are a couple of things. One is inline deduplication, where your data is inbound into a device and is being deduplicated in real time as it's written to usable storage. Post-process is a different methodology, where your data is written into a non-deduplicated disk cache and is later deduplicated and written out from that initial cache to a deduped area, kind of like a de-staging-type strategy.
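As a rough illustration of the two methodologies, the sketch below deduplicates the same backup streams both ways. It assumes fixed-size chunking and SHA-256 fingerprints purely for simplicity; the class `DedupeStore` and the helper functions are hypothetical names, and real devices use their own chunking, indexing and storage layouts.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, chosen only to keep the example readable

class DedupeStore:
    """Toy deduplicated area: one stored copy per unique chunk fingerprint."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def write(self, data: bytes):
        """Store a stream and return its recipe (ordered list of fingerprints)."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            piece = data[i:i + CHUNK_SIZE]
            fingerprint = hashlib.sha256(piece).hexdigest()
            self.chunks.setdefault(fingerprint, piece)  # write data only if new
            recipe.append(fingerprint)                  # otherwise just a pointer
        return recipe

    def read(self, recipe):
        """Rebuild the original stream by following the recipe's pointers."""
        return b"".join(self.chunks[fp] for fp in recipe)

    def stored_bytes(self):
        return sum(len(piece) for piece in self.chunks.values())

def inline_backup(store, streams):
    """Deduplicate as data arrives; no staging capacity is used."""
    return [store.write(s) for s in streams], 0

def post_process_backup(store, streams):
    """Land raw data in a staging cache first, then deduplicate in a later pass."""
    staging = list(streams)                       # fast, non-deduplicated writes
    peak_staging = sum(len(s) for s in staging)   # capacity the cache must provide
    recipes = [store.write(s) for s in staging]   # back-end dedupe pass
    staging.clear()                               # cache is freed afterwards
    return recipes, peak_staging

streams = [b"AAAABBBBCCCC", b"AAAABBBBDDDD"]
inline_store, post_store = DedupeStore(), DedupeStore()
inline_recipes, inline_peak = inline_backup(inline_store, streams)
post_recipes, post_peak = post_process_backup(post_store, streams)
print("deduped bytes:", inline_store.stored_bytes(), post_store.stored_bytes())
print("peak staging bytes:", inline_peak, post_peak)
```

Both paths end with the same deduplicated content; what differs in this model is when the fingerprinting work happens and how much non-deduplicated staging space is needed in the meantime.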
A few vendors, notably NetApp and Quantum, are really looking at this from a very interesting perspective of variable processing -- being able to toggle or configure, respectively, between inline and post-process, depending on CPU and workload.
John Merryman is responsible for service design and delivery worldwide. Merryman often serves as a subject matter expert in data protection, technology risk and information management related matters, including speaking engagements and publications in leading industry forums.