John Merryman, services director for recovery services at GlassHouse Technologies Inc., discusses the pros and cons of virtual tape library (VTL) data deduplication and the state of the VTL deduplication market today in this Q&A. His answers are also available below as an MP3 for you to download.
Listen to the VTL deduplication FAQ
>>The benefits of using a VTL for data deduplication
>>The drawbacks of using a VTL for data deduplication
>>Restores from inline or post-process deduplication
>>VTL vendors offering deduplication
>>Forward-referencing and backward-referencing deduplication
Well, there are a few big use cases right now driving data deduplication to the forefront of decision-making in backup environments. The bottom line is that data deduplication is dramatically changing the economics of using disk, as opposed to tape, as a backup target. You can only buy so much disk before you run out of funding, but there are a few other drivers. One is tape reduction -- that's the most obvious. Also, VTL deduplication provides an alternative to first-generation VTLs (from the 2003 to 2004 timeframe). And, the last is an encryption-avoidance strategy. Companies that are looking at media loss risk -- and are facing operational, litigation or reputation risk as a result -- have started to compare tape encryption against a tape reduction strategy that uses a deduplicating VTL and site-to-site replication. Other benefits include reducing operational complexity and requiring less hands-on tape management at remote sites, or even at core sites.
Another factor here is that there are various VTL interfaces. You have the classic VTL interface, in which the device emulates a tape library using Fibre Channel (FC) connectivity. However, there are also devices that connect via CIFS or NFS. These are just different means for getting your data into a data deduplication device. At this point, it's less about the interface and more about your protocol and connectivity preferences. Do you want to use FC or do you want to use iSCSI? More and more, this is less about the actual functionality, and more about the performance and integration strategy that you want to take.
With the first-generation VTLs, that testing was limited, and a lot of customers really paid the price by learning about the technology in house and how it affected production environments. These customers really educated the vendors about how disk technology performed, scaled and integrated with the various backup products. Also, there were a lot of integrated or tape-staging architectures where the VTL was writing off to tape. There were very few vendors out there at that time that really had a deduplication capability -- there was Data Domain, EMC Corp.'s Avamar and Permabit Technology Corp. -- and there were bottlenecks in all of the designs out there.
Data deduplication is generally available in almost all of the second generation of VTLs out today, and where it is not generally available, it is in beta testing. All of the innovative vendors are making a hard push into data deduplication. Things are getting better and faster, and the vendors are overcoming some of the engineering challenges. There's also a big move away from tape staging. Users are now taking hybrid approaches -- combining current disk and tape assets with data deduplication in their architecture.
There are really no alternatives for deduplicated disk-based backup outside of backup software options. For example, Symantec Corp. has PureDisk, which offers a data deduplication function at the client. IBM Corp. is soon to release a similar function in its software product. Other products like Avamar or Asigra have a software capability. And while these have a lot of play in smaller production environments, they have limited reach into enterprise environments where scale and performance requirements are really high.
Another potential drawback is that, with the exception of a few of the vendors, the technology is less proven. In fact, some vendors have only had data deduplication in production for a matter of months. It's great that it works, but no one really has a long-term understanding of the risks associated with the technology. You'll hear a lot of arguments around hash collisions, for example, which are statistically based -- not based on actual data loss in production.
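To put a rough number on that statistical argument, here's an illustrative sketch of the "birthday bound" that those hash-collision debates rest on. The chunk size, data volume and 160-bit digest below are assumptions chosen for illustration, not figures from any particular vendor.

```python
import math

def collision_probability(n_chunks: int, hash_bits: int) -> float:
    """Birthday-bound approximation of the chance that any two
    distinct chunks are assigned the same hash value."""
    n_buckets = 2.0 ** hash_bits
    # P(collision) ~= 1 - exp(-n^2 / (2 * buckets)) for n << buckets;
    # expm1 keeps precision when the probability is vanishingly small.
    return -math.expm1(-(n_chunks ** 2) / (2.0 * n_buckets))

# Assumed example: 1 PB of backups in 8 KB chunks, 160-bit digests
chunks = (10 ** 15) // (8 * 1024)
print(collision_probability(chunks, 160))  # roughly 5e-27
```

The point of the exercise is simply that the quoted risk is a mathematical estimate: a probability this small has never been observed as an actual data-loss event in production, which is exactly the speaker's caveat.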
Also, if you look at what happened with the first generation of VTL products, most of the vendors approached this from a really simplistic point of view. They tell you that you can take their product and plug it into the backup environment and all your problems will go away. But almost every customer I've dealt with has not had that experience. In many instances, the first-generation VTLs were pretty miserable: things didn't work, there were bottlenecks, and backup software integration was not really considered as part of the picture.
So, when you look at the typical backup environment today, the fundamental design has been tweaked but not exactly overhauled since the initial deployment of VTLs. For many companies, a refresh of the design and technology is long overdue, and right now data deduplication is a really good catalyst to do this.
With inline deduplication, one of the pros is that when it's done, it's done. So when your data is written into the device, dedupe is accomplished and you keep on trucking. You may have less upfront performance compared to some of the post-process vendors because you're doing a lot more work as you're doing inline deduplication. But another pro is that you could have less complexity and, essentially, less time to accomplish everything.
So, if you look at the backup process in production and then back-end things like dedupe, copies and replication in batch, you may be able to get a lot of your batch work done in less time by having a single step. In the post-process models you have multiple steps: you write into a staging area, then you read out to a dedupe area.
One of the bigger advantages of inline is that you can allow replication from a source device to a target and start right away. With post-process, there are some other things to think about. You can have a really fast initial write speed -- extremely high performance writing your inbound data backups. This can be very good from the point of view of meeting your backup window, and of accomplishing extremely fast restores of the most recent data from the device back to clients.
A little caveat here is whether your network and your clients can really keep up with some of the technologies that are out there today. With post-process, dedupe on the back end also needs to keep up. If anyone has ever run a TSM environment: if you fall behind with your batch operations, say more than two days behind, it's really bad and you will struggle to catch up. That same paradigm really applies to deduplication. With post-process dedupe, it's great if you can write in a lot of data at fast speeds, but it's not great if your dedupe capability on the back end can't keep up.
So, let's say you're writing in, but you can't write out to your dedupe environment before your next backup. What happens when your front-end cache builds up? That means you pretty much have to wait and try to catch up before your front-end is available or has enough capacity for the next inbound batch of backups.
So, realistically you're going to size for multiple days or some head room to grow into for your front-end cache so you don't run into that situation. But from a technology selection and design point of view, you've got push your vendors to make sure that they're going to design the back-end of post-processing to keep up with the front-end capabilities. Otherwise you're going to be buying a half-baked solution.
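As a back-of-envelope illustration of that sizing exercise, here's a minimal sketch. The daily volumes, throughput and headroom figures are made-up example numbers, and the formula is a simplification of what a real capacity-planning exercise would involve.

```python
def staging_capacity_tb(daily_backup_tb: float,
                        dedupe_throughput_tb_per_day: float,
                        headroom_days: float = 2.0) -> float:
    """Rough post-process staging cache sizing: hold a day's inbound
    backups plus the backlog that accrues each day the back-end
    dedupe engine runs slower than the front end, with headroom."""
    backlog_per_day = max(0.0, daily_backup_tb - dedupe_throughput_tb_per_day)
    return daily_backup_tb + backlog_per_day * headroom_days

# Assumed example: 20 TB/day inbound, back end dedupes 15 TB/day,
# two days of headroom before the cache is exhausted
print(staging_capacity_tb(20, 15, 2))  # 30.0 TB
```

The takeaway matches the speaker's point: if the back end keeps pace (backlog is zero), the cache only needs to cover a day's ingest; every terabyte per day of shortfall multiplies out across your headroom window.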
Another con for post-process is that you're going to need more physical resources, because you have to have a staging area plus capacity on the back end for deduped storage. But these are all tradeoffs, and honestly I've seen cases where each approach can be very good for large and small customers. So it varies based on your requirements and what you want to accomplish.
I think it's a fair statement to say that today most of the leading VTL vendors and a lot of the network-attached storage (NAS) vendors are offering deduplication. There are some incumbents and some newcomers. FalconStor Software and all of the FalconStor partners have FalconStor Single Instance Repository (SIR) options for deduplication. Data Domain has been in the game for a long time. Diligent Technologies Corp., same story. It just got bought by IBM and is still working with Hitachi Data Systems (HDS) and is offering deduplication with the VTL.
A few more that offer deduplication -- and there is a longer list than this -- include Sepaton, NEC Corp., NetApp and Quantum Corp. NetApp's offering is due later this year; I think it will be available in Q4. Companies that are sticking with the first-generation VTL story, with no dedupe on their product roadmap, are very likely to fall behind in the market.
Sure, I'll hit that first and then talk about the other aspect of dedupe: inline versus post-process. Basically, reverse referencing means that as your newest backup is written into a dedupe environment, it is going to be made up of pointers that go back to the older backups, maybe even back to the original full if you have a full and incremental-forever approach.
As a result, your newest backup is really going to have a lot of pointers back to older data. So when you "redupe" the data back to its original state, you're going to be following lots and lots of pointers, because you have a reverse-referencing methodology.
Forward referencing is the opposite. There is only one vendor out there that does this, and that's Sepaton. With forward referencing, the newest backup is maintained in its entirety, so your older backups reference the newest data via pointers.
As a result, when you do a restore with forward referencing, the newest backup is going to be readily accessible and you don't have to do a lot of reconstitution or reduplication of that data. But when you have older backups that you need to restore and retrieve, you will have to do some reconstitution, so that will take longer.
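The reverse-referencing model described above can be sketched with a toy chunk store. This is an illustrative simplification (tiny 4-byte chunks, an in-memory dict), not any vendor's actual design: older backups hold the literal data, and newer backups are mostly recipes of pointers back to it. A forward-referencing design would invert the ownership, keeping the literal chunks with the newest backup and pointing older recipes forward.

```python
import hashlib

def chunks(data: bytes, size: int = 4):
    """Split a byte stream into fixed-size chunks (toy chunking)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

class ReverseReferencingStore:
    """Each unique chunk is stored the first time it is seen; later
    backups are recipes of pointers back to those original chunks."""
    def __init__(self):
        self.store = {}     # hash -> literal chunk bytes (oldest copy)
        self.backups = []   # each backup: ordered list of chunk hashes

    def backup(self, data: bytes):
        recipe = []
        for c in chunks(data):
            h = hashlib.sha1(c).hexdigest()
            self.store.setdefault(h, c)  # keep the first (oldest) copy
            recipe.append(h)
        self.backups.append(recipe)

    def restore(self, index: int) -> bytes:
        # "Redupe": follow each pointer back to where the chunk landed
        return b"".join(self.store[h] for h in self.backups[index])

vtl = ReverseReferencingStore()
vtl.backup(b"AAAABBBBCCCC")   # original full: all literal chunks
vtl.backup(b"AAAABBBBDDDD")   # newest backup: mostly pointers back
print(vtl.restore(1))         # b'AAAABBBBDDDD'
```

Notice that restoring the newest backup chases pointers into older data, which is exactly why forward referencing's inverted layout favors the most recent restore.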
When you look at a restore histogram for your average backup production environment, there's always a big peak around day one or two, then it drops down around day seven, and after that you just have little spikes on the radar. So, most of your restores are usually happening within a week of the time the data was backed up.
Forward referencing from a technology point of view actually fits that histogram very well. If you think about the deduplication process, deduplication itself is definitely a CPU-intensive process and scale and performance are the big limiters. How much data can you crank through these devices, keeping in mind that backup is the most I/O-intensive application in the data center? You are definitely creating some CPU-intensive workloads when you're opening up the fire hose into a VTL device, so to speak.
You have the initial read, the analysis, comparisons, indexes, pointers and the writing of the data itself -- or maybe just writing the pointer -- but all of that has to happen in a rapid fashion. So, how do the vendors do that? There are a couple of things. One is inline deduplication, where your data, inbound into the device, is deduplicated in real time as it's written to usable storage. Post-process is a different methodology, where your data is first written to a non-deduplicated disk cache, then later deduplicated and written out from that initial cache to a deduped area -- kind of like a de-staging strategy.
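The two methodologies can be contrasted in a short sketch. This is a toy model (pre-chunked data, in-memory structures) meant only to show the difference in timing: inline pays the hash-and-compare cost on the ingest path, while post-process ingests raw and defers that work to a batch pass over a staging cache.

```python
import hashlib

def sha(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()

def inline_dedupe(incoming, store):
    """Inline: each chunk is hashed and deduplicated as it arrives,
    so only unique chunks ever reach usable storage."""
    for chunk in incoming:
        store.setdefault(sha(chunk), chunk)

def post_process_dedupe(incoming, staging, store):
    """Post-process: chunks land in a non-deduplicated staging cache
    at raw write speed; a later batch pass de-stages them."""
    staging.extend(incoming)          # phase 1: fast raw ingest
    while staging:                    # phase 2: batch dedupe pass
        store.setdefault(sha(staging.pop()), None)

data = [b"aaaa", b"bbbb", b"aaaa", b"cccc"]
inline_store, pp_store, cache = {}, {}, []
inline_dedupe(data, inline_store)
post_process_dedupe(data, cache, pp_store)
print(len(inline_store), len(pp_store))  # 3 3 -- same end state
```

Both paths end with the same three unique chunks; what differs is when the CPU-intensive hashing happens and whether a staging cache has to absorb the raw inbound stream in the meantime.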
A few vendors, notably NetApp and Quantum, are looking at this from a very interesting perspective of variable processing -- being able to toggle or configure, respectively, between inline and post-process depending on CPU and workload.
John Merryman is responsible for service design and delivery worldwide. Merryman often serves as a subject matter expert in data protection, technology risk and information management related matters, including speaking engagements and publications in leading industry forums.