If you have been following Part 1 and Part 2 of my series on Demystifying VMware data protection, up to this point the methods and applications I have presented have been focused around the restoration of local files or virtual machines (VMs). In the event of a large-scale ESX server failure or data center disaster, local restore capabilities, especially those utilizing tape, are simply insufficient if application recovery time objectives (RTOs) and recovery point objectives (RPOs) are on the order of minutes to hours (which is typical of mission-critical applications). Such stringent recovery objectives require a more real-time method of protecting data.
Data recovery vs. restore
Before going into data replication as a data protection method, I have to explain the differences between a data recovery operation and a data restore operation. I define recovery as the ability for a server or application to access an alternate copy of its data immediately or nearly immediately. RTOs and RPOs range from instantaneous to minutes.
I define restore as the process of having to copy back data to a specific location before the server or application can access it. RTOs and RPOs usually range from hours to days.
Thus, it is very typical to see both recovery technologies and restore technologies employed in order to provide the full spectrum of recoverability.
Host-based vs. storage-based replication
Host-based replication refers to the process of replicating data from within the application to an alternate copy of the application. This allows both copies of the application to be aware of the replication and can usually provide up to the last transaction recovery point capability. However, the software utilized for the replication operation must be installed on every instance of every type of application that you wish to protect this way.
With storage-based replication, the storage to which the data is being written also builds a copy of the information at an alternate location -- either locally or remotely. Like host-based replication, this approach can provide up to the last transaction recovery point capability, but with the advantage of being able to protect all applications with the same common replication method. However, application integrity can't be guaranteed with this method without the use of specialized agents that communicate between the application and storage in order to get a consistent point-in-time copy.
Local storage-based replication
There are a number of vendors that provide local storage array recovery technologies that can be utilized to augment standard backup and restore software and procedures you may already have in place. These technologies allow you to do local onsite recovery of data without having to restore it back to its original location. Examples include:
- EMC Corp. SnapView and EMC RecoverPoint
- IBM Corp. FlashCopy and IBM Volume Copy
- Hewlett-Packard (HP) Co. StorageWorks Business Copy
- LSI Corp. StoreAge MultiView and StoreAge MultiCopy
- NetApp Inc. Snapshot
Each technology described here can provide, at the minimum, crash-consistent recovery of a VM. This can provide an application recovery point down to minutes and recovery time down to seconds, depending on manufacturer and implementation.
To obtain recovery that is better than crash-consistent (i.e., application-consistent), some of the above technologies provide specialized agent-based software that ensures application integrity during the snapshot process. For example, EMC Replication Manager and NetApp SnapManager provide this capability when integrated with their storage arrays. EMC's RecoverPoint CDP is a bit of an exception because it is application-aware; it creates application-consistent "bookmarks" at key points in time based on context. The other products mentioned here require scripts to be written.
But if specialized storage software is used, then there is a significant implication for VMware from a storage standpoint: Each technology (at least as of this writing) requires that the applications within each VM have direct access to storage volumes. In other words, a VM's application must directly own its volumes, either through the use of VMware's Raw Device Mapping, an iSCSI LUN or an NFS mount. The application can't reside on a VMFS volume as would be the traditional approach to provisioning storage for VMware.
This is because all reads and writes to the volume must be under the direct control of the replication software so that I/O operations can be suspended and writes can be flushed to the volume. If the volumes were to reside on a VMFS, which has its own cache/buffering, then there would be no guarantee that all I/O activity undertaken would have been committed fully down to the storage hardware. Also, if a snapshot of a VMFS volume were done, and if other virtual machine disk files for other unrelated virtual machines were residing on the VMFS volume, then they would be snapped/cloned as well, regardless of their state or whether it was necessary for them to be snapped/cloned.
So, in a nutshell, any application volumes that will be snapped through the use of application-level snapshot software must have their volumes isolated and provisioned directly to them in order to guarantee an application-consistent snapshot/clone.
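To make the difference between a crash-consistent and an application-consistent copy concrete, here is a minimal sketch. It is a toy model only: `ToyApp`, `quiesced` and both snapshot functions are hypothetical names I made up for illustration, and none of this is the API of any product mentioned above. It simply assumes an application that buffers writes in memory before committing them to a volume it owns directly.

```python
from contextlib import contextmanager


class ToyApp:
    """Hypothetical application that buffers writes before committing them."""

    def __init__(self):
        self.pending = []       # writes accepted but not yet on the volume
        self.volume = []        # writes committed to the dedicated volume
        self.suspended = False

    def write(self, record):
        if self.suspended:
            raise RuntimeError("I/O is suspended while a snapshot is taken")
        self.pending.append(record)

    def flush(self):
        # Commit every buffered write down to the volume.
        self.volume.extend(self.pending)
        self.pending.clear()


@contextmanager
def quiesced(app):
    """Suspend new I/O and flush buffers; always resume afterwards."""
    app.suspended = True
    try:
        app.flush()
        yield app
    finally:
        app.suspended = False


def crash_consistent_snapshot(app):
    # Copies whatever happens to be on the volume; buffered writes are missed.
    return list(app.volume)


def application_consistent_snapshot(app):
    # Quiesce first so the copy reflects every write the application accepted.
    with quiesced(app):
        return list(app.volume)
```

In this model the crash-consistent copy misses anything still sitting in the buffer, which is the same risk described above when a cache or buffering layer sits between the application and the storage being snapped.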
Remote storage-based replication
In addition to local recovery capabilities, most storage system manufacturers offer remote recovery capability. This is where data is replicated from a local onsite storage system to a remote or offsite storage system. Some technologies are synchronous in nature -- where the remote data is a real-time copy of the local data. Others are asynchronous in nature, where the remote data is not real-time, but rather is a delayed copy of the data, which usually reflects a specific point in time. Synchronous data replication has far more stringent network speed and latency requirements (read: expensive networking) than asynchronous, thus making the latter far more popular.
Some examples of replication technology include: EMC's MirrorView, SAN Copy and RecoverPoint CRR; IBM Enhanced Remote Mirroring; LSI StoreAge MultiMirror; and NetApp SnapMirror and SnapVault.
For asynchronous replication, most manufacturers usually leverage their local snapshot/clone capabilities to obtain a consistent point-in-time copy of the data, in turn replicating this point-in-time copy to the remote storage system. In this scenario, all the same requirements related to provisioning storage volumes directly to the VMs that were presented for local storage-based replication also apply to asynchronous replication. Unless the volumes are owned directly by the VM, the integrity can't be guaranteed for the replicated copy of the data at the remote storage system. On the plus side, it is possible to have application-consistent copies of the data at the remote site by utilizing the aforementioned specialized agent-based software that integrates with the storage and its replication capabilities.
Synchronous replication, although more expensive, is actually simpler. Because every write that occurs at the local storage system is replicated to the remote storage system, there is no need to be concerned about trying to obtain a consistent point-in-time copy of the application's data. However, the replicated copy at the remote storage system will be, at best, crash-consistent, forcing you to rely more heavily on the application's built-in recovery capabilities. But you'll have a copy of every bit and byte of info that was committed to the local system.
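The recovery point difference between the two modes can be shown with a small toy model. This is a sketch under simplifying assumptions of my own: the function name `remote_state_at_failure` is hypothetical, asynchronous replication is modeled as a point-in-time copy shipped at a fixed interval, and transfer time is ignored.

```python
def remote_state_at_failure(writes, failure_time, interval=None):
    """Return the writes present at the remote site when the local site fails.

    writes: list of (timestamp_seconds, payload) committed locally, in time order.
    interval: None models synchronous replication; a number models asynchronous
              replication that ships a point-in-time copy every `interval` seconds.
    """
    if interval is None:
        # Synchronous: every committed write is already at the remote site.
        cutoff = failure_time
    else:
        # Asynchronous: only the last completed point-in-time copy is there.
        cutoff = (failure_time // interval) * interval
    return [payload for ts, payload in writes if ts <= cutoff]


writes = [(10, "a"), (250, "b"), (310, "c"), (590, "d")]
sync_copy = remote_state_at_failure(writes, failure_time=595)
async_copy = remote_state_at_failure(writes, failure_time=595, interval=300)
```

With a failure at 595 seconds, the synchronous copy holds all four writes, while the asynchronous copy with a 300-second interval holds only the first two. The synchronous copy is complete but still only crash-consistent; the asynchronous copy is older but can be taken from a consistent point in time.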
Just as asynchronous replication is more popular than synchronous replication due to cost, more and more people are turning to replicating data that has been backed up to disk as a remote offsite recovery solution, or should I say, remote offsite restore solution.
More and more I've been finding that many small- to medium-sized businesses don't have very stringent recovery objectives in the event of a site failure or disaster. Typical recovery objectives have ranged from multiple hours (12 to 24 hours) to a couple of days.
This is the sweet spot for backup to disk. There are many products that have gained ground in this space. Some of the more popular backup-to-disk storage systems are the products that incorporate data deduplication.
In addition to storage consumption savings, another huge benefit is replication. Most hardware-based data deduplication storage systems have replication built into them. When these devices replicate, only the unique blocks are transferred between the devices. This allows for a significantly smaller WAN link to be provisioned (the rule of thumb has been 5% of what you would normally have required), making backup to disk with replication for disaster recovery a viable, cost-effective and more secure alternative to backup to tape that has to be trucked offsite.
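As a rough worked example of that rule of thumb, the sketch below estimates the sustained link speed needed to replicate a nightly backup. The function name, the 500 GB backup size and the 10-hour replication window are all hypothetical figures chosen for illustration; only the 5% factor comes from the rule of thumb above, and real results depend on your data and change rate.

```python
def required_link_mbps(daily_backup_gb, window_hours, unique_fraction=1.0):
    """Sustained megabits per second needed to move one backup within the window.

    unique_fraction is the share of the data that actually crosses the WAN
    (1.0 without deduplication, 0.05 for the 5% rule of thumb).
    """
    bits = daily_backup_gb * 8 * 1000**3 * unique_fraction
    return bits / (window_hours * 3600) / 1_000_000


full = required_link_mbps(500, 10)                         # no deduplication
dedup = required_link_mbps(500, 10, unique_fraction=0.05)  # 5% rule of thumb
print(f"without dedup: {full:.1f} Mbps, with dedup: {dedup:.1f} Mbps")
```

Under these assumed numbers the requirement drops from roughly 111 Mbps to under 6 Mbps, which is the kind of difference that makes an offsite copy affordable for a smaller shop.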
Some examples of backup-to-disk storage system hardware that incorporate data deduplication with replication are Data Domain's DDR series, EMC's DL3D series and Quantum Corp.'s DXi Series. These products are hardware storage systems that deduplicate once the data is sent to them. This is also called "target-based data deduplication."
There are many other products with "source-based data deduplication," where the data is deduplicated before it is sent across the network. These are usually software-based and examples include EMC Avamar and Symantec Corp. Veritas NetBackup PureDisk. In fact, Avamar can be deployed as a VM instead of a physical server.
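The practical difference between the two approaches is where the duplicate detection happens, and therefore how much data crosses the network. Here is a minimal sketch using fixed-size blocks and SHA-256 fingerprints. It is a toy of my own: `DedupStore` and both backup functions are hypothetical, the 4-byte block size is only there to keep the example readable, the small cost of exchanging fingerprints is ignored, and none of this reflects how any vendor's product is actually implemented.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def split_blocks(data, size=BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Toy content-addressed store: one copy of each unique block."""

    def __init__(self):
        self.blocks = {}

    def has(self, digest):
        return digest in self.blocks

    def put(self, block):
        digest = hashlib.sha256(block).hexdigest()
        self.blocks.setdefault(digest, block)
        return digest


def target_based_backup(data, store):
    """Send every block over the network; the store deduplicates on arrival."""
    sent = 0
    for block in split_blocks(data):
        sent += len(block)
        store.put(block)
    return sent


def source_based_backup(data, store):
    """Hash at the source and send only blocks the store does not already hold."""
    sent = 0
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        if not store.has(digest):
            sent += len(block)
            store.put(block)
    return sent
```

Backing up `b"AAAABBBBAAAACCCC"` to an empty store sends 16 bytes with the target-based function and 12 bytes with the source-based one, yet both end up storing the same three unique blocks.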
Data deduplication has been a game-changing technology. It's not a stretch to say that data deduplication is to backup/restore what VMware has been to server consolidation, thus it is not surprising that they go hand-in-hand.
I've seen real-world deduplication ratios as high as 240:1 when virtual machine disk files are backed up using products such as IBM Tivoli Storage Manager (TSM), Symantec Veritas NetBackup and VizionCore vRanger. Add to this an alternate site with a second storage device and a second backup server combined with deduplicated replication, and you now have disaster restore capability at bargain basement prices compared to the replication offered by primary storage systems.
Remote recovery: Host-based replication
If you want the ultimate in remote application-consistent recovery, host-based replication offers the finest level of granularity from a recovery point standpoint and the most rapid RTO.
This form of replication essentially sets up an application in a "geo-cluster" type of configuration, where the local and remote hosts act as active-passive nodes of a cluster, but stretched across a geographical distance. This technology involves the installation of specialized software that supports the specific application you wish to replicate or geo-cluster.
The unique aspects of this replication are that it is undertaken by the specialized software within the host, not the storage; and that it is application-aware, so it is able to fail over almost instantaneously to the passive remote node.
Some examples of software that provide agents for commonly used applications such as Exchange, SQL Server, SharePoint, Domino and BlackBerry Enterprise Server include Double-Take Software, Neverfail Software and Microsoft Cluster Services.
So how does this apply to VMware? All the above software works with VMware VMs. In other words, if the application you wish to protect resides within a VM, then the software can be installed within a virtual server in addition to a physical server. There is a two-fold bonus to utilizing host-based replication within VMware.
The first is that not only can the software replicate virtual-to-virtual or physical-to-physical, but also physical-to-virtual. This means that for production, you can have a physical server assigned to your application for performance reasons, but in the event of server failure or disaster, a VM can take over at the recovery site, thus reducing infrastructure costs at the recovery site, albeit at the cost of some performance.
The second is that no special provisioning of storage is required for the VM to utilize host-based replication. The replication software resides within the context of the virtual machine's operating system and all the way through to the application being protected; therefore, it does not matter if the volumes provisioned are RDMs, iSCSI LUNs or virtual machine disk files residing on a VMFS volume.
If continuous application availability, even in the event of total site failure, is your goal, and you want the flexibility to use any VMware-supported volume provisioning method with any combination of virtual and physical server infrastructure, then host-based replication is for you.
All of the technologies I've presented here have their advantages and disadvantages. In reality, it is very common to end up having a combination of technologies, each suited for a different level of recoverability.
For example, one can utilize a host-based replication product for the most mission-critical applications that have the most stringent recovery requirements. For other less critical applications, which still require rapid recovery capability, local storage-based snapshots can be utilized with asynchronous replication to your recovery site. And for applications that require a rapid restore capability, you can back up to deduplicated disk and replicate to the recovery site, which can also be your fall-back technology when all else fails.
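The tiering described above can be written down as a simple lookup keyed on each application's RTO. This is only a sketch: the function name is hypothetical, and the 5-minute and 4-hour cut-offs are illustrative assumptions of mine, not figures from any vendor or standard, so you would substitute the thresholds your own business agrees on.

```python
def choose_protection(rto_minutes):
    """Map an application's recovery time objective to a protection tier.

    The thresholds are illustrative assumptions, not recommendations.
    """
    if rto_minutes <= 5:
        # Most stringent: near-instant failover to a remote passive node.
        return "host-based replication"
    if rto_minutes <= 240:
        # Rapid recovery: local snapshots replicated asynchronously offsite.
        return "storage snapshots with asynchronous replication"
    # Restore-class objectives: hours to days.
    return "backup to deduplicated disk with replication"


# Hypothetical application names and RTOs, in minutes.
applications = {"order-entry": 2, "file-server": 120, "archive": 1440}
plan = {name: choose_protection(rto) for name, rto in applications.items()}
```

In practice the decision also depends on RPO, budget and which applications the replication software supports, so treat the output as a starting point for discussion rather than a design.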
OK, I know, it's still not a simple picture; but I would argue that super-stringent recoverability rarely is plug-and-play. Challenges will always exist as long as application developers continue to write code without the necessary parallelism required for the application to have the capacity to be natively highly available or replication capable. On a side note, I should acknowledge that with Exchange's on-board replication capabilities, Microsoft might be setting a trend to fix this. But that's another story.
About the author: Ashley D'Costa architects and designs advanced computer solutions and has technical experience with a broad spectrum of IT infrastructures.