Implementing data deduplication technology in a virtualized environment

More organizations are showing an interest in implementing data deduplication in their virtualized environments. Find out what's driving this interest and what to watch out for.

More and more businesses are showing an interest in implementing data deduplication technology in their virtualized environments because of the amount of redundant data in virtual server environments.

In this Q&A with Jeff Boles, senior analyst with the Taneja Group, learn about why organizations are more interested in data dedupe for server virtualization, whether target or source deduplication is better for a virtualized environment, what to watch out for when using dedupe for virtual servers, and what VMware's vStorage APIs have brought to the scene.

Table of contents:

>>  Have you seen more interest in data deduplication technology among organizations with a virtualized environment?
>>  Is source or target deduplication being used more? Does one have benefits over the other?
>>  Does deduplication introduce any complications when you use it in a virtual server environment?
>>  Are vendors taking advantage of vStorage APIs for Data Protection?

 Have you seen more interest in data deduplication technology among organizations that have deployed server virtualization? And, if so, can you explain what's driving that interest and the benefits people might see from using dedupe when they're backing up virtual servers?

Absolutely. There's lots of interest in using deduplication for virtualized environments because there's so much redundant data in virtual server environments. Over time, we've become more disciplined as IT practitioners in how we deploy virtual servers.

We've done something we should've done a number of years ago with our general infrastructures: creating a better separation of our core OS data from our application data. Consequently, virtualized environments that follow best practices today have core OS images that contain most of the operating system files and configuration data, kept separate from application and file data. And because so many virtual servers are built from very similar golden images, with similar core OS image files behind each virtual machine, you end up with lots of redundant data across all those images. If you start deduplicating across that pool, you get better deduplication ratios, even with simple algorithms, than you do in a lot of non-virtualized production environments. There can be lots of benefits from using deduplication in these virtual server environments just from a capacity-utilization perspective.
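As a rough illustration of why shared golden images deduplicate so well, here is a minimal sketch using fixed-size chunking and SHA-256 digests. The chunk size, the synthetic images and the function names are all assumptions made for the example; commercial products typically use more sophisticated chunking and indexing.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, in bytes


def chunk_digests(image, chunk_size=CHUNK_SIZE):
    """Yield a SHA-256 digest for each fixed-size chunk of an image."""
    for offset in range(0, len(image), chunk_size):
        yield hashlib.sha256(image[offset:offset + chunk_size]).digest()


def dedupe_ratio(images, chunk_size=CHUNK_SIZE):
    """Return (logical chunks) / (unique chunks) across all images."""
    total = 0
    unique = set()
    for image in images:
        for digest in chunk_digests(image, chunk_size):
            total += 1
            unique.add(digest)
    return total / len(unique) if unique else 1.0


# Synthetic data: every VM shares the same 8-chunk "golden" OS image
# and carries 2 chunks of its own application data.
golden_os = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))


def make_vm_image(vm_id):
    """Build a hypothetical VM image: shared OS chunks plus unique app chunks."""
    app = b"".join(bytes([100 + vm_id * 2 + j]) * CHUNK_SIZE for j in range(2))
    return golden_os + app


images = [make_vm_image(vm_id) for vm_id in range(10)]
ratio = dedupe_ratio(images)
print(f"dedupe ratio: {ratio:.2f}:1")
```

With ten such images there are 100 logical chunks but only 28 unique ones (8 shared OS chunks plus 20 application chunks), so even this simple approach stores the OS data once.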

 What kind of data deduplication is typically being used for this type of application? Do you see source dedupe or target, and does one have benefits over the other? 

There are some differences in data deduplication technologies today. You can apply deduplication in one of two places: at the backup target (generally the media server), or at the source, using technologies like Symantec's PureDisk, EMC Avamar or products from some of the other virtualization-specialized vendors out there today.

Source deduplication is being adopted more today than it ever has been, and it's particularly useful in a virtual environment. First, you have a lot of contention for I/O in a virtualized environment, and you see it when you start running backup jobs there. Generally, when folks start virtualizing, they try to stick with the same approach: a backup agent that backs up data to an external media server or target, following the same old backup catalog jobs, the same way they did in physical environments. But you end up packing all of that into one piece of hardware that has all these virtual machines (VMs) on it, so you're running a whole bunch of backup jobs across one piece of hardware. You get a whole lot of I/O contention, especially across WANs, but across LANs as well. Any time you're going out to the network, you're getting quite a bit of I/O bottlenecking at that physical hardware layer. So the traditional backup approach ends up stretching out your backup windows and messes with your recovery time objectives (RTOs) and recovery point objectives (RPOs) because everything is a little slower going through that piece of hardware.

So source deduplication has some interesting applications because it can chunk all that data down to non-duplicate data before it comes off the VM. Almost all of these agent approaches that are doing source-side deduplication push out a very continuous stream of changes. You can back it up more often because there's less stuff to be pushed out, and they're continually tracking changes in the background; they know what the deltas are, and so they can minimize the data they're pushing out.

Also, with source-side deduplication you get a highly optimized backup stream for the virtual environment. You're pushing very little data from your VMs, so much less data is going through your physical hardware layer and you don't have to deal with those I/O contention points. Consequently, you can get much finer-grained RTOs and RPOs and much smaller backup windows in a virtual environment.
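The source-side idea described above can be sketched in a few lines: the agent hashes its chunks, asks the target which digests it has never seen, and sends only those. This is a simplified model under stated assumptions. The `BackupTarget` class, the fixed 4 KB chunks and the function names are hypothetical and do not represent any vendor's protocol.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, in bytes


class BackupTarget:
    """Hypothetical stand-in for a backup repository keyed by chunk digest."""

    def __init__(self):
        self.chunks = {}

    def missing(self, digests):
        """Return the subset of digests the repository does not hold yet."""
        return {d for d in digests if d not in self.chunks}

    def store(self, digest, data):
        self.chunks[digest] = data


def source_side_backup(data, target):
    """Back up data and return the number of payload bytes actually sent."""
    pieces = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    digests = [hashlib.sha256(p).digest() for p in pieces]
    needed = target.missing(digests)  # only digests cross the wire here
    sent = 0
    for digest, piece in zip(digests, pieces):
        if digest in needed:
            target.store(digest, piece)
            needed.discard(digest)  # send each distinct chunk only once
            sent += len(piece)
    return sent


# A 16-chunk synthetic disk, backed up twice with one chunk changed in between.
disk = b"".join(bytes([i]) * CHUNK_SIZE for i in range(16))
target = BackupTarget()
first_sent = source_side_backup(disk, target)

changed_disk = disk[:3 * CHUNK_SIZE] + bytes([200]) * CHUNK_SIZE + disk[4 * CHUNK_SIZE:]
second_sent = source_side_backup(changed_disk, target)
print(first_sent, second_sent)
```

The first run sends all 16 chunks; the second sends only the single changed chunk, which is why less data crosses the shared physical hardware layer.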

 Does data deduplication introduce any complications when you use it in a virtualized environment? What do people have to look out for?

When you're going into any environment with a guest-level backup and pushing full streams of data out, you can end up stretching out your backup windows. The other often-overlooked dimension of deduplicating behind the virtual server environment is that you are now dealing with lots of primary I/O that's pushed into one piece of hardware. You may have many failures behind one server at any point in time. Consequently, you may be pulling a lot of backup streams off of the deduplicated target or out of the source-side system, and you may be trying to push that data back onto disk or into a recovery environment very rapidly.

Dedupe can have lots of benefits in capacity but it may not be the single prong that you want to attack your recovery with because you're doing lots of reads from this deduplicated repository. Also, you're pulling a batch of disks simultaneously in many different threads. There may be 20 or 40 VMs behind one piece of hardware, and you're likely not going to get the recovery window that you want -- or not the same recovery window you could've gotten when pulling from multiple different targets into multiple pieces of hardware. So think about diversifying your recovery approach for those "damn my virtual environment went away" incidents. And think about using more primary protection mechanisms. Don't rely just on backup, but think about doing things like snapshots where you can fall back to the latest good snapshot in a much narrower time window. You obviously don't want to try to keep 30 days of snapshots around, but have something there you can fall back to if you've lost a virtual image, blown something up, had a bad update happen or something else. Depending on the type of accident, you may not want to rely on pulling everything out of the dedupe repository, even though it has massive benefits for optimizing the capacity you're using in the backup layer.
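To get a feel for the recovery-window concern, a back-of-the-envelope estimate can help. Every number below (VM count, VM size, throughput) is an assumption chosen purely for illustration, and the model ignores rehydration overhead and contention effects, so treat it as a rough sketch only.

```python
def restore_hours(vm_count, gb_per_vm, throughput_mb_s, parallel_paths=1):
    """Estimate hours to restore all VMs.

    throughput_mb_s is the assumed sustained read rate of one restore path;
    parallel_paths models pulling from several independent targets at once.
    """
    total_mb = vm_count * gb_per_vm * 1024
    seconds = total_mb / (throughput_mb_s * parallel_paths)
    return seconds / 3600


# Assumed scenario: 40 VMs of 50 GB each behind one failed host.
single_repository = restore_hours(40, 50, throughput_mb_s=200)
four_targets = restore_hours(40, 50, throughput_mb_s=200, parallel_paths=4)
print(f"one repository: {single_repository:.1f} h, four targets: {four_targets:.1f} h")
```

Under these assumptions, funneling every restore through one repository takes four times as long as spreading the same restores over four independent paths, which is the gap that snapshots and other primary protection mechanisms are meant to cover.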

 Last year VMware released the vStorage APIs for Data Protection and some other APIs as a part of vSphere. Are you seeing any developments in the deduplication world taking advantage of those APIs this year?

The vStorage APIs are where it started getting interesting for backup technology in the virtual environment. We were dealing with a lot of crutches before then, but the vStorage APIs brought some interesting technology to the table. They have implications for all types of deduplication technology, but I think they have particularly interesting implications for source-side deduplication, and they make source-side more relevant. One of the biggest things about the vStorage APIs was Changed Block Tracking (CBT); with that, you can tell what changed between different snapshots of a VM image. Consequently, it made the idea of using a proxy very useful inside a virtual environment, and source-side has found some application there, too. You can use a proxy with some source-side technology to get the benefits of deduplicating inside the virtual environment after taking a snapshot, while deduplicating only the blocks that have changed since the last snapshot.
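The combination described here, changed-block tracking feeding a deduplicating proxy, can be modeled with a toy example. This sketch does not call any VMware API; the `TrackedDisk` class and `dedupe_delta` function are hypothetical names that only imitate the idea of recording which blocks were written between snapshots and deduplicating just that delta.

```python
import hashlib


class TrackedDisk:
    """Toy virtual disk that records which blocks were written."""

    def __init__(self, block_count, block_size=4096):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(block_count)]
        self.changed = set()

    def write(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.blocks[index] = data
        self.changed.add(index)

    def snapshot_changes(self):
        """Return {index: data} for blocks written since the last call, then reset."""
        delta = {i: self.blocks[i] for i in sorted(self.changed)}
        self.changed.clear()
        return delta


def dedupe_delta(delta, known_digests):
    """Return {digest: data} for changed blocks whose content is new to the repository."""
    to_send = {}
    for data in delta.values():
        digest = hashlib.sha256(data).digest()
        if digest not in known_digests:
            known_digests.add(digest)
            to_send[digest] = data
    return to_send


disk = TrackedDisk(block_count=1000)
disk.write(3, b"a" * 4096)
disk.write(7, b"b" * 4096)
disk.write(7, b"c" * 4096)   # overwrites block 7 before the snapshot
disk.write(42, b"a" * 4096)  # same content as block 3

delta = disk.snapshot_changes()
to_send = dedupe_delta(delta, known_digests=set())
print(len(delta), len(to_send))
```

Out of 1,000 blocks, the proxy only has to read the three that changed, and deduplication trims those to two unique payloads before anything leaves the environment.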

Some of these vStorage API technologies have had massive implications for how quickly data can be extracted from a virtual environment. Now you can recognize what data has changed since a given point in time, and you can blend your source-side deduplication technologies with your primary virtual environment protection technologies and get the best of both worlds. The problem with proxies before was that they were kind of an all-or-nothing approach. You took the snapshot, and then you came out through a proxy in the virtual environment, through a narrow bottleneck that made you go through a whole bunch of steps and forced compromises in the way you got data out of your virtual environment.

You could choose to go with source-side, but you have lots of different operations going on in your virtual environment. Now you can blend technologies with the vStorage APIs. You can take a snapshot and run source-side deduplication against it to get rapid extraction inside your virtual environment, along with a finer application of the deduplication technology: it's still source-side, but it runs through this one proxy pipe, which mounts the snapshot image, deduplicates the data and pushes it out of the environment. The vStorage APIs have a lot of implications for deduping an environment and for blending deduplication technologies with higher-performing approaches inside the virtual environment. You should check with your vendors to see how they have implemented the vStorage APIs in their products to speed the execution of backups and the extraction of backup data from your virtual environment.
