Array-based data replication has always been considered the gold standard for creating the copies of data that are a key factor in a disaster recovery strategy. However, changes in the way applications are deployed today mean data protection can be achieved in other ways, namely through software. Also, some of the issues and limitations of hardware-based replication can be overcome with software solutions that provide greater flexibility and more choice in protecting data.
Array-based replication has been around almost as long as the Integrated Cached Disk Array (ICDA), otherwise known as today's SAN storage, which combines disk, cache memory and microcode intelligence to deliver optimized I/O and replication from one location to another.
Features such as EMC's Symmetrix Remote Data Facility (SRDF) or Hitachi's TrueCopy marked the evolution of peer-to-peer copy techniques deployed in IBM's mainframe storage, which replicated data using hardware called the I/O subsystem. Array-based replication moved the "heavy lifting" work of replicating data away from the processor to be handled directly by the array. This process provided (and still provides) a number of distinct benefits:
- Workload offload. The array takes on all replication tasks, saving processor and memory resources on the host.
- Data consistency. The array itself is responsible for write-order integrity, ensuring that as data is copied to another location, updates are applied to guarantee consistency.
- Performance. The array is able to cache I/O in non-volatile memory (at both locations) and in doing so improve the performance of synchronous and asynchronous transmissions.
- Granularity. The array is able to replicate and manage the failover of a single LUN, a group of LUNs (typically called a consistency group) or the entire array.
- Data integrity. If any data is failed over (when operations are moved to a target array), the array is able to keep track of data changes and bring back only those changes to the primary site without needing a full re-sync. This is extremely important in full failover scenarios where the "reversion" process can mean running in an unprotected disaster recovery (DR) state for some time (when a full re-sync has to be performed).
- Optimization. Replication processes typically optimize the bandwidth between locations by replicating only changed data based on a block size within the underlying architecture, such as the track (harkening back to the mainframe days when everything was a track on a rotating disk) or block.
- Scale efficient. Failover at the array level allows data for many servers to be replicated and made available at a remote location in a very short time. This process scales well when combined with scripting, reducing the need for many storage administrators to be involved in the failover process.
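The "data integrity" and "optimization" benefits above both rest on tracking which blocks have changed, so that only those blocks cross the link. The following is a minimal sketch of that idea, not any vendor's implementation: the class `TrackedVolume`, the function `resync` and the four-byte `BLOCK_SIZE` are all hypothetical names and values chosen for illustration.

```python
BLOCK_SIZE = 4  # bytes per block; unrealistically small, chosen for illustration


class TrackedVolume:
    """A volume that records which blocks were written since the last sync."""

    def __init__(self, num_blocks: int) -> None:
        self.data = bytearray(num_blocks * BLOCK_SIZE)
        self.dirty = set()  # indexes of blocks changed since the last sync

    def write(self, offset: int, payload: bytes) -> None:
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write falls outside the volume")
        self.data[offset:offset + len(payload)] = payload
        # Mark every block the write touched, including partial blocks.
        first = offset // BLOCK_SIZE
        last = (offset + len(payload) - 1) // BLOCK_SIZE
        self.dirty.update(range(first, last + 1))

    def read_block(self, index: int) -> bytes:
        start = index * BLOCK_SIZE
        return bytes(self.data[start:start + BLOCK_SIZE])

    def apply_block(self, index: int, block: bytes) -> None:
        start = index * BLOCK_SIZE
        self.data[start:start + BLOCK_SIZE] = block


def resync(source: TrackedVolume, target: TrackedVolume) -> int:
    """Copy only the changed blocks to the target; return how many were sent."""
    sent = 0
    for index in sorted(source.dirty):
        target.apply_block(index, source.read_block(index))
        sent += 1
    source.dirty.clear()
    return sent


primary = TrackedVolume(num_blocks=8)
dr_site = TrackedVolume(num_blocks=8)
primary.write(0, b"AAAA")      # touches block 0
primary.write(6, b"BBBB")      # spans blocks 1 and 2
blocks_sent = resync(primary, dr_site)   # 3 of 8 blocks are copied
```

The same function works in the opposite direction after a failover: if the remote copy tracks its own writes while it is primary, `resync(dr_site, primary)` brings back only those changes rather than the full volume.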
Failover vs. failback
Failover: The process of invoking a move of primary data I/O operations from one array to another.
Failback: The process of returning I/O operations to the primary array.
The failover and failback process should maintain state for each location (where possible), enabling a return to normal operations with synchronization of only the changed data.
Despite those considerable benefits, array-based replication does have some important disadvantages:
- Cost. Replication can be an expensive solution, with vendors typically charging for it by the terabyte, so it should be applied only to data that absolutely has to be replicated. It also requires costly network links, which in many cases are implemented as dedicated Fibre Channel connections.
- Proprietary. Array-based solutions are proprietary to a specific vendor and don't offer cross-vendor support. The replication design typically makes use of the underlying physical architecture, moving data at a block or track level. That means data can't be moved from one vendor's gear to another's, creating lock-in for the user.
- Complexity. Array-based replication uses local non-volatile memory to cache I/O to improve performance. Value-add functions such as snapshots and clones also have to be managed within the framework of replication, which can lead to a complex configuration. This typically isn't a problem unless a failure scenario or software bug intervenes to upset the status quo.
- Crash copy. Without agents, array-based replication is unaware of the data itself, simply moving anonymous blocks between locations, albeit with write-order integrity. This means if failover is invoked, systems appear as if the server itself crashed, which may cause data corruption issues or elongated boot times as systems are restarted.
- Granularity. The minimum replication level is the LUN for block-based systems, which in virtual environments could encapsulate multiple virtual machines (VMs). Failing over a single LUN results in failover for all VMs, regardless of whether failover is required or not. (File-based systems can replicate at the file level, which offers better granularity.)
The alternatives to array-based replication fall into a number of categories, depending on the specific problem being addressed.
Whatever happened to network replication?
In dedicated storage networks such as those built on Fibre Channel, replicating data in the network seems the most obvious choice; as data is passed to the array, it can be replicated elsewhere.
In practice, network-based replication had too many issues: data was replicated to another location before being confirmed on the local array, creating integrity problems; and the solutions required complex configuration and dedicated hardware. Security was also an issue.
As storage networks diversify, network replication will become less practical and likely finally fade into obscurity.
Using a dedicated replication appliance offloads the task of copying data between storage arrays to a separate piece of hardware or to software running on a physical or virtual server. A good example of this is EMC's RecoverPoint (from the acquisition of Kashya Inc.). The RecoverPoint software effectively sits in the data path, capturing each write I/O and copying it (with local protection through journaling) to an equivalent appliance at the remote site where the data is applied to the remote copy.
This process allows data moving between arrays from different vendors to be managed quite easily while maintaining data consistency, so it offers all the benefits of built-in array-based replication. However, many problems are not solved, especially those concerning data consistency, as the solution is still effectively moving data at the array level.
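The capture, journal and apply flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of how RecoverPoint works internally: the classes `ReplicationAppliance` and `RemoteCopy`, their methods and the per-write sequence number are hypothetical.

```python
from collections import deque


class RemoteCopy:
    """Remote-site copy that only accepts writes in their original order."""

    def __init__(self) -> None:
        self.luns = {}       # lun name -> {offset: payload}
        self.last_seq = -1   # sequence number of the last write applied

    def apply(self, entry) -> None:
        seq, lun, offset, payload = entry
        if seq != self.last_seq + 1:
            raise ValueError("write arrived out of order")
        self.luns.setdefault(lun, {})[offset] = payload
        self.last_seq = seq


class ReplicationAppliance:
    """Hypothetical in-path appliance: journals each write, ships it later."""

    def __init__(self) -> None:
        self.journal = deque()  # writes captured but not yet applied remotely
        self.next_seq = 0

    def capture(self, lun: str, offset: int, payload: bytes) -> int:
        """Record a write in the local journal and return its sequence number."""
        seq = self.next_seq
        self.journal.append((seq, lun, offset, bytes(payload)))
        self.next_seq += 1
        return seq

    def ship(self, remote: RemoteCopy, max_entries=None) -> int:
        """Send journaled writes in order; return how many were shipped."""
        shipped = 0
        while self.journal and (max_entries is None or shipped < max_entries):
            remote.apply(self.journal[0])  # apply first, then drop from journal
            self.journal.popleft()
            shipped += 1
        return shipped


appliance = ReplicationAppliance()
remote = RemoteCopy()
appliance.capture("lun0", 0, b"v1")
appliance.capture("lun1", 512, b"log")
appliance.capture("lun0", 0, b"v2")   # later write to the same location
appliance.ship(remote, max_entries=2) # link only keeps up with two writes
pending = len(appliance.journal)      # one write still protected by the journal
```

Because an entry leaves the journal only after the remote side has applied it, an interrupted link leaves unsent writes in the journal rather than losing them, and the sequence check preserves write order across LUNs.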
Replication within the hypervisor
Eliminating some of the traditional elements involved with replication means getting closer to the application managing the data. For virtual systems, this means the hypervisor, such as VMware's vSphere or Microsoft's Hyper-V. The hypervisor understands the content of the data (at least as far as knowing it's a VM) and is therefore in a good position to manage the replication of data from one location to another at a very granular level (i.e., that of a VM itself rather than an entire LUN).
Both VMware and Microsoft offer replication within their hypervisor technology. VMware vSphere provides vSphere Replication, which is managed by vCenter Site Recovery Manager. Hyper-V implements replication through Hyper-V Replica, managed through System Center Virtual Machine Manager (SCVMM).
Of course, the work of replication still has to be performed somewhere, and in this instance that responsibility is put back onto the hypervisor with the subsequent increases in processor, memory and networking required. These solutions aren't free either, so using the hypervisor for replication attracts a cost premium.
As with replication appliances, hypervisor replication offers the ability to place some data on lower cost storage or hardware from another vendor that doesn't need to provide the full features of expensive arrays. However, choosing cheaper and lower-performing storage for a target array could be a false economy, unless the workload profile is different (for example, test data not replicated is simply lost in a disaster).
What about application-based replication?
Ideally, replication within the application is the most logical solution, as the application understands the data itself. While there's a place for tools such as Oracle Data Guard or replication within Microsoft Exchange (among other examples), these solutions don't offer the same scalability as array-based replication when implemented in large-scale server farms, due to the effort involved in managing each replication pair. However, application-based replication may be the ideal way of moving data in and out of public cloud infrastructures.
Replication within a VM
Rather than using the features of the hypervisor, another option is to use a VM to perform replication. This can be achieved in two ways. Some solutions use a VM to intercept data from virtual machines, while other replication apps configure a VM to act as a proxy datastore with the actual data stored on external or locally attached storage. The "datastore virtual machine" now sees all I/O traffic, allowing write I/O to be replicated to another location as well as committed to local storage.
Products such as Zerto Inc.'s replication for VMware use VMware APIs to monitor and intercept write I/O, replicating the data to another location using virtual appliances on the source and target hosts that Zerto calls Virtual Replication Appliances (VRAs). Atlantis Computing Inc.'s ILIO USX has the ability to pool storage resources for VMs from any type of storage and can replicate data between systems to ensure high availability.
VM-based replication consumes resources on the hypervisor and obviously needs to be carefully monitored as more virtual machines are added to a cluster. In addition, prioritization needs to be assigned to the replicating VM to ensure it continues to keep up with replication demand.
Replication in the LVM
Logical volume managers (LVMs) sit between physical storage resources and the logical representation of data in the form of LUNs and volumes. The LVM represents the perfect location to manage replication as it sees all I/O between the host and physical storage. Products such as DataCore's SANsymphony-V and StarWind Software Inc.'s Virtual SAN implement logical volume abstraction from physical storage resources, adding advanced features that include synchronous mirroring, asynchronous replication and near-continuous data protection functionality. These solutions add another layer of complexity into the design of a storage setup, but that may be acceptable to gain all the advanced features available.
Of course, there's a school of thought that would question the use of point-to-point replication as being last decade's technology. Object storage vendors are implementing data protection through simple mirroring or data dispersal techniques such as erasure coding (also known as forward error correction). Erasure coding algorithms geographically disperse multiple redundant fragments of data, allowing recovery from only a partial subset of those fragments. Data protection and replication are effectively built into the read and write process.
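To show how recovery from a partial subset of fragments is possible, here is a deliberately simplified sketch that uses a single XOR parity fragment. It tolerates the loss of only one fragment; production object stores typically use stronger codes that survive several simultaneous losses. The function names `encode` and `decode` are hypothetical and do not reflect any vendor's API.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list:
    """Split data into k equal fragments and append one XOR parity fragment."""
    frag_len = -(-len(data) // k)               # ceiling division
    padded = data.ljust(frag_len * k, b"\0")    # pad so fragments are equal
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def decode(fragments: list, original_len: int) -> bytes:
    """Rebuild the data when at most one fragment (marked None) is missing."""
    missing = [i for i, frag in enumerate(fragments) if frag is None]
    if len(missing) > 1:
        raise ValueError("single parity can recover only one lost fragment")
    if missing:
        survivors = [frag for frag in fragments if frag is not None]
        fragments = list(fragments)
        # XOR of all surviving fragments equals the missing one.
        fragments[missing[0]] = reduce(xor_bytes, survivors)
    return b"".join(fragments[:-1])[:original_len]


data = b"replicate me"
stored = encode(data, k=3)   # three data fragments plus one parity fragment
stored[1] = None             # simulate losing the site that held fragment 1
recovered = decode(stored, len(data))
```

In a dispersed deployment each element of `stored` would live at a different location, so a read can succeed using whichever three of the four fragments remain reachable.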
Poor performance would seem the obvious drawback of these solutions; however, companies such as Scality Inc. and Cleversafe Inc. are integrating flash into their architectures (both of which can be delivered as software-only offerings) to boost performance. Open source projects such as Ceph are building distributed data stores capable of storing block, file and object data using mirroring (data replicas) and erasure coding techniques. Although it's still early days for this technology, we can expect to see significant maturity achieved in the coming years now that Inktank (the company providing Ceph development and support) has been acquired by Red Hat Inc.
As a parting statement, we should acknowledge the increasing use of private and public cloud infrastructures. Software-based replication provides the opportunity to move data in and out of cloud infrastructures in a more practical way than could be achieved with array-based replication. Although bandwidth may make shifting entire VMs impractical, software-based solutions provide the greatest level of flexibility in offering data mobility for private and public cloud solutions.
Some hardware vendors are providing solutions that place their hardware into colocation with cloud service providers. These are merely short-term fixes as we move to a heterogeneous world where software-based replication rules.