Published: 14 Jul 2003
My first article in this two-part series discussed the basic concepts of data replication for disaster recovery (see "Cost-effective business continuance"). It analyzed the differences between host-based disk mirroring and subsystem-based store-and-forward replication, and outlined how the two technologies can be used together. This installment looks at data replication tools, including two new products that could change the landscape over the next few years.
Disasters are unexpected events that wreak havoc. IT organizations are increasingly being asked to assume the responsibility of finding technological methods to overcome the physical devastation that disasters cause. The whole point of using remote copy data replication technology is to provide a computing environment in which the impact of a disaster is as transparent and minimal as possible. In other words, remote data replication is expected to provide an organization with immediate access to data without first having to fix or repair the data. This means the remote site must have complete data integrity. If administrators have to spend hours troubleshooting data integrity problems at the remote recovery site, the value of the remote data replication product is significantly reduced.
One of the challenges in preparing for disasters is accommodating both instantaneous events and events that may take minutes--or even hours--to do their damage. In the case of instantaneous loss, the data that has been replicated to a remote site must be complete as is. Where databases are concerned, an incomplete transaction might have been replicated to the remote site, which would require it to be rolled back. This is fine, and should be fairly easy for a skilled DBA to recognize, as long as all of the transaction information has been accurately replicated.
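To make the rollback step concrete, here is a minimal sketch that assumes a hypothetical replicated log. In this log, each record carries a transaction ID, and a commit is marked by an explicit COMMIT record. Any transaction that wrote data but has no commit record at the remote site is a candidate for rollback. This check only works if every log record was replicated accurately. The `LogRecord` type and `incomplete_transactions` function are illustrative and do not come from any particular database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogRecord:
    txn_id: int
    kind: str        # "WRITE" or "COMMIT" (hypothetical record types)
    block: int = -1  # target block for WRITE records; unused for COMMIT

def incomplete_transactions(log):
    """Return IDs of transactions that wrote data but never committed."""
    started, committed = set(), set()
    for rec in log:
        if rec.kind == "WRITE":
            started.add(rec.txn_id)
        elif rec.kind == "COMMIT":
            committed.add(rec.txn_id)
    return sorted(started - committed)

# The disaster struck after txn 2 wrote block 9 but before it committed.
replicated = [
    LogRecord(1, "WRITE", 4),
    LogRecord(1, "WRITE", 5),
    LogRecord(1, "COMMIT"),
    LogRecord(2, "WRITE", 9),
]
print(incomplete_transactions(replicated))  # [2]
```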
A rolling disaster
A rolling disaster is a difficult scenario that unfolds over an extended period of time. Systems continue to run after the onset of the disaster before eventually failing. The processing that occurs in the meantime may produce data errors, which are then replicated to the remote site. Some remote data replication products have been designed to address this problem by allowing an administrator to select the historical point in time at which to stop applying replicated data. Selecting the cut-off time might involve more art than science, but it's easy to see why IT executives would want the additional safeguard that products with this feature provide.
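The cut-off selection can be sketched as follows. The sketch assumes that each replicated write carries a source-side time stamp and arrives at the remote site in write order. Writes stamped after the administrator's chosen cut-off are held back instead of being applied. `Write` and `apply_until` are hypothetical names used only for illustration.

```python
from typing import NamedTuple

class Write(NamedTuple):
    timestamp: float  # source-side time stamp in seconds (assumed format)
    block: int
    data: bytes

def apply_until(writes, cutoff, volume):
    """Apply writes in arrival order up to and including `cutoff`.

    Returns the writes that were held back because they were stamped
    after the cut-off chosen by the administrator.
    """
    for i, w in enumerate(writes):
        if w.timestamp > cutoff:
            return list(writes[i:])  # stop here; nothing later is applied
        volume[w.block] = w.data
    return []

volume = {}
stream = [Write(100.0, 1, b"good"), Write(101.5, 2, b"good"), Write(103.0, 1, b"suspect")]
held = apply_until(stream, cutoff=102.0, volume=volume)
print(volume)     # {1: b'good', 2: b'good'}
print(len(held))  # 1
```

The function stops at the first late write instead of skipping it and continuing. Because the stream is assumed to be in write order, every write after that point is also suspect.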
Remote data replication products all have a similar generic structure. A selection and queuing process is used to collect and stage the I/Os for transmission over a MAN or WAN; a transmission system is used to copy the I/Os to the remote site; and a remote writer is used to apply the I/Os at the remote site.
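The three-stage structure can be sketched in a few lines of Python. In this sketch, a `deque` stands in for both the staging queue and the MAN/WAN link, and a dict stands in for the remote volume. All function names are hypothetical. This version forwards every selected write in order, which is the simple brute-force approach described below.

```python
from collections import deque

def select_writes(io_stream, queue):
    """Selection and queuing: keep only write I/Os, preserving their order."""
    for op, block, data in io_stream:
        if op == "write":
            queue.append((block, data))

def transmit(queue, link):
    """Transmission: drain the queue onto the link in FIFO order."""
    while queue:
        link.append(queue.popleft())

def remote_writer(link, remote_volume):
    """Remote writer: apply received writes in the order they arrived."""
    while link:
        block, data = link.popleft()
        remote_volume[block] = data

io_stream = [("read", 7, None), ("write", 1, b"x"), ("write", 2, b"y"), ("write", 1, b"z")]
queue, link, remote = deque(), deque(), {}
select_writes(io_stream, queue)
transmit(queue, link)
remote_writer(link, remote)
print(remote)  # {1: b'z', 2: b'y'}
```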
Queuing write I/Os
Remote data replication systems must collect write I/Os as they occur and prepare them for transmission over a network to a remote site. This process begins by analyzing every I/O instruction and selecting the writes. Each write instruction is reflected by an entry in a temporary data structure, such as a queue of actual write instructions or a list of pointers to pending write I/Os.
One brute force method of ensuring data integrity and write ordering at the remote site is to forward each and every write I/O that occurs on the local site. While this technique might not be the most efficient in terms of network bandwidth, it does have the advantage of being fairly simple to understand.
An alternative technique, one that reduces the network bandwidth costs of remote data replication, involves weeding out write I/Os to hot spots that are quickly overwritten by repetitive updates. For instance, a data hot spot may be written to several times in close succession. There's no need to replicate every one of those I/Os, as long as the final result is accurately recorded at the remote site.
Consider a scenario in which four storage blocks, A, B, C and D, are being updated by an application. Block A is a hot spot that's updated whenever block B, C or D is updated. In a short period of time, block A could be updated three times while blocks B, C and D are each updated once. Bandwidth can be saved by not sending the first two updates to block A, and sending only the third update together with the updates to blocks B, C and D. The trick is making sure that the data replication product transmits all four blocks as a unit and writes them to remote storage as an atomic unit. In other words, if all four blocks can't be written for some reason, then none of them should be written until the problem is corrected. Handled this way, write ordering isn't violated: the remote site moves directly from the state before the batch to the state after it, and never exposes an intermediate state.
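Here is a minimal sketch of coalescing plus all-or-nothing application, using the block A through D scenario. The sketch stages writes in memory before touching the volume. A real product would more likely stage them in a journal on stable storage. The `failing_block` parameter is a hypothetical hook for simulating a failed write.

```python
def coalesce(writes):
    """Collapse repeated writes to the same block, keeping only the newest data.

    The result is one batch that must be applied atomically at the remote site.
    """
    latest = {}
    for block, data in writes:
        latest[block] = data  # a later write to the same block replaces the earlier one
    return list(latest.items())

def apply_batch_atomically(batch, volume, failing_block=None):
    """Apply every write in the batch, or none of them.

    `failing_block` is a hypothetical hook that simulates a failed write.
    """
    staged = {}
    for block, data in batch:
        if block == failing_block:
            raise OSError(f"write to block {block} failed; batch discarded")
        staged[block] = data
    volume.update(staged)  # reached only if every write in the batch succeeded

# Block A is rewritten each time B, C or D changes: six writes in total.
writes = [("A", b"a1"), ("B", b"b1"), ("A", b"a2"), ("C", b"c1"),
          ("A", b"a3"), ("D", b"d1")]
batch = coalesce(writes)
print(batch)  # [('A', b'a3'), ('B', b'b1'), ('C', b'c1'), ('D', b'd1')]
```

Six local writes shrink to four transmitted writes. If any write in the batch fails, the remote volume is left exactly as it was.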
The problem with using this approach is that some write I/Os are temporarily delayed before being sent to the remote site. This means there is some potential to lose delayed I/Os if a disaster occurs. While this could be a serious problem for some applications, it's probably not for most.
One of the challenges with remote data replication is dealing with MAN or WAN problems, such as outages that prevent write I/Os from being transmitted to the remote site. In these situations, it's normal for the local process to write to an extended queue that "stocks up" write I/Os until the remote communications link is restored. It's not rocket science to see how this works. However, if the link is down long enough, the replication system's queue can be pushed past its breaking point, and other techniques may be needed to resynchronize data between the two sites.
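The following sketch shows an extended queue with a fixed capacity and one possible fallback for an outage that outlasts the queue. The fallback assumed here is to stop queuing individual writes and record only which blocks changed, so those blocks can be bulk-copied once the link returns. During such a bulk resynchronization, the remote copy is generally not write-order consistent. `ReplicationQueue` and its `capacity` parameter are hypothetical.

```python
from collections import deque

class ReplicationQueue:
    """Write-pending queue that stocks up writes while the link is down.

    `capacity` stands in for the disk space reserved for queuing (assumed).
    On overflow, the sketch falls back to recording only which blocks changed.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()  # ordered (block, data) writes awaiting transmission
        self.dirty = set()      # blocks that must be bulk-copied after an overflow
        self.overflowed = False

    def enqueue(self, block, data):
        if not self.overflowed and len(self.pending) < self.capacity:
            self.pending.append((block, data))
            return
        if not self.overflowed:
            # Queue is full: convert everything queued so far into dirty-block entries.
            self.overflowed = True
            self.dirty.update(b for b, _ in self.pending)
            self.pending.clear()
        self.dirty.add(block)

q = ReplicationQueue(capacity=2)
for block, data in [(1, b"a"), (2, b"b"), (3, b"c"), (1, b"d")]:
    q.enqueue(block, data)
print(q.overflowed, sorted(q.dirty), len(q.pending))  # True [1, 2, 3] 0
```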
Transmitting write I/Os
While most people associate remote replication with the end nodes involved in the process, the transmission technology plays a key role. For years, Minneapolis-based Computer Network Technology has been a leading vendor of networking equipment that connects local with remote storage subsystems running remote data replication applications.
Essentially, the transmission equipment is a network gateway that tunnels write I/Os from the local site to the remote site. In a storage area network (SAN), write I/Os are sent from the local data replication queue to the replication gateway. The gateway encapsulates and segments them for transmission over the MAN or WAN. On the receiving side, a gateway reassembles the write I/Os and sends them to the corresponding remote writer.
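Encapsulation, segmentation and reassembly can be illustrated with Python's standard `struct` module. The 14-byte header layout (sequence number, block number, payload length) is a hypothetical format and not any vendor's wire protocol. The sketch also assumes that segments arrive intact and in order, for example over TCP.

```python
import struct

# Hypothetical frame header: sequence number, block number, payload length.
HEADER = struct.Struct("!QIH")

def encapsulate(seq, block, data):
    """Wrap one write I/O in a header so it can cross the MAN or WAN."""
    return HEADER.pack(seq, block, len(data)) + data

def segment(frame, mtu):
    """Split a frame into pieces no larger than the link's payload size."""
    return [frame[i:i + mtu] for i in range(0, len(frame), mtu)]

def reassemble(segments):
    """Receiving gateway: rebuild the frame and recover the write I/O."""
    frame = b"".join(segments)
    seq, block, length = HEADER.unpack_from(frame)
    data = frame[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("frame truncated in transit")
    return seq, block, data

payload = bytes(512)  # one 512-byte block of zeros
segments = segment(encapsulate(7, 42, payload), mtu=100)
print(len(segments))  # 6
```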
Just as the local queuing process may need additional storage space to hold pending writes when the remote communications link goes down, the transmission gateway may also provide its own write-pending queue to help manage temporary network outages.
As networking technologies continue to evolve, it's likely that more off-the-shelf networking products will be used for transmitting write I/Os. For instance, dense wavelength division multiplexing (DWDM) optical networking equipment could be used effectively for remote data replication in MANs. However, given the relatively short distances spanned by most MANs, an organization might opt for host-based disk mirroring instead of the expensive and complex store-and-forward data replication technologies.
In general, the easiest part of remote data replication is applying the writes on the receiving end. Once write I/Os are received, they need to be verified for accuracy and then written to the remote storage target.
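Here is a minimal sketch of "verify, then write" at the remote writer. It assumes that a CRC-32 computed with Python's `zlib.crc32` serves as the integrity check. Real products may use other checks. `make_packet` and `apply_verified` are hypothetical names.

```python
import zlib

def make_packet(block, data):
    """Sender side: attach a CRC-32 of the payload (assumed integrity check)."""
    return block, data, zlib.crc32(data)

def apply_verified(packet, volume):
    """Remote writer: verify the payload, then write it to the target."""
    block, data, checksum = packet
    if zlib.crc32(data) != checksum:
        return False  # reject; a real writer would request retransmission
    volume[block] = data
    return True

volume = {}
good = make_packet(3, b"payload")
bad = (3, b"paylaod", good[2])  # same checksum, corrupted payload
print(apply_verified(good, volume), apply_verified(bad, volume))  # True False
```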
However, I've already discussed two scenarios where more intelligence is required on the receiving end. The first involves the ability to select a safe cut-off time for applying writes after a rolling disaster. The second involves working with coalesced or collapsed I/Os that must be applied together to maintain proper write ordering.
An interesting nuance regarding writers involves acknowledging write I/Os that are committed to cache memory, as opposed to being written to non-volatile storage media. This isn't a problem as long as the cache is flushed to disk on a first-in, first-out basis.
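The following sketch illustrates why first-in, first-out flushing keeps early acknowledgment safe. The writer acknowledges a write as soon as it reaches cache. Because the cache destages writes oldest first, the disk at any moment holds the result of some prefix of the acknowledged writes, never a later write without an earlier one. `WriteCache` is a hypothetical model. It does not cover what happens to cached data on a power loss.

```python
from collections import deque

class WriteCache:
    """Acknowledge writes once cached; destage them to disk first in, first out."""

    def __init__(self):
        self.fifo = deque()

    def write(self, block, data):
        self.fifo.append((block, data))
        return True  # acknowledged before the data reaches non-volatile media

    def flush_one(self, disk):
        block, data = self.fifo.popleft()  # oldest acknowledged write goes first
        disk[block] = data

writes = [(1, b"v1"), (2, b"v1"), (1, b"v2")]
cache, disk = WriteCache(), {}
for block, data in writes:
    cache.write(block, data)
cache.flush_one(disk)
cache.flush_one(disk)
print(disk)  # {1: b'v1', 2: b'v1'}
```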
Why network replication makes sense
Because replication is more communications-intensive than storage-intensive, network-based replication devices would be an efficient way to manage the process.
Using replicated data
Remote data replication systems need to be able to fulfill their primary purpose if a disaster strikes. The replication system will need to be switched from its receiving-writer mode to a live production mode. When this happens, the remote storage target will be used by a different production system.
Data replication may be implemented in such a way that the remote replicated volumes are established as read-only volumes. This is done to ensure data integrity--if the only writer to the volume is the replication writer, then there's no chance that external processes would inadvertently corrupt the replicated data. Obviously, before the replication storage target can be used for live I/O, it must be changed from read-only to read-write.
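The read-only safeguard and the failover switch can be modeled in a few lines. In this sketch, only the replication path may modify the replica until it is promoted, and ordinary host writes are refused until then. Refusing further replication writes after promotion is an assumption made here for illustration. `ReplicaVolume` and its methods are hypothetical.

```python
class ReplicaVolume:
    """Remote target that only the replication writer may modify until failover."""

    def __init__(self):
        self.blocks = {}
        self.read_only = True

    def replicate(self, block, data):
        """Replication writer path; allowed only before promotion."""
        if not self.read_only:
            raise RuntimeError("volume is in production; replication has stopped")
        self.blocks[block] = data

    def write(self, block, data):
        """Ordinary host path; refused while the volume is read-only."""
        if self.read_only:
            raise PermissionError("replica is read-only until failover")
        self.blocks[block] = data

    def promote(self):
        """Failover: switch from receiving-writer mode to live production."""
        self.read_only = False

vol = ReplicaVolume()
vol.replicate(1, b"replicated")
try:
    vol.write(1, b"stray")
except PermissionError as err:
    print(err)  # replica is read-only until failover
vol.promote()
vol.write(2, b"production")
```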
If the remote replication storage target must come online for production, it may be desirable to replicate its new write I/Os to yet another remote site. This doesn't happen by luck; it can only be done through diligent disaster planning. Companies that have more than one data center will often replicate data between them. Many remote data replication products allow bidirectional operation, so that two sites can form a replication pair that protects the information assets of both sites.
Another common use for replicated data is the creation of snapshots. In general, the remote storage target has far less activity than the primary storage target. This makes it much easier to perform snapshot operations. There are different methods for creating snapshots, which are beyond the scope of this article, but the general idea is to provide either historical, read-only access to replicated data or to generate additional copies of replicated data that can be used for analysis, testing and backup.
One technique that's sometimes used in remote data replication is time stamping. A data replication product that uses this approach attaches a system time stamp to each write I/O that's transmitted to remote storage. These time stamps can be incorporated in the algorithms of the data replication logic, or they can be used by an administrator who is managing the replication or failover process.
Just as virtual storage can be implemented in the host systems, storage subsystems and network devices, remote data replication can also be implemented in any of these locations along the I/O path.
Historically, remote data replication has been sold by enterprise disk subsystem vendors as an optional software offering. These subsystem-based solutions tend to be pricey, but customers who have used them will tell you that they were worth every penny.
On the negative side, one of the shortcomings of subsystem-based replication is that these solutions are homogeneous: They require similar subsystems from the same vendor on both ends of the connection. This means you can't use cheaper storage from another vendor at the remote site. Disk subsystems aren't exactly open platforms that encourage third-party innovation and open-systems pricing. It's expensive for the subsystem vendors to develop for these platforms, and the task is an impossible proposition for anybody else. To raise the ante on the pricing front, subsystem-based products tend to lock customers into their subsystem vendors for extended periods of time.
Architecturally, the main problem with subsystem-based replication is that the replication process is contained within a single subsystem. This may not seem like a big deal, but confining an important function to a particular resource is certainly a scalability limitation. To maintain write ordering and data integrity, applications must be restricted to storage exported by a single subsystem, because subsystems performing remote data replication don't synchronize with one another. Unless the systems, subsystems and their communications are tightly coupled, there's no way to guarantee write ordering and data integrity.
One of the most compelling arguments for adopting SANs is the ability to connect virtually any storage resource to the network and make it available to almost any application. However, if the application requires availability insurance via remote data replication, subsystem-based replication severely restricts the flexibility that SANs were designed to deliver.
Another way to replicate remotely is to place replication software in the host system. Several companies, such as NSI, have provided host-based file replication over the years. However, only one vendor, Veritas, with its Veritas Volume Replicator (VVR), currently sells real-time store-and-forward remote volume replication that competes with subsystem-based remote data replication.
Host-based remote volume replication partially solves the big limitation of subsystem-based replication: To work together, all the subsystems must come from the same vendor. Volume replication at the host, on the other hand, can be established to work with nearly any storage target that the host can address in the SAN. However, the host software approach restricts the replication function to applications running on that particular host. Applying remote volume replication to multiple host systems may require a separate license for each system. Special cluster-aware volume replication that uses time stamps would be necessary to synchronize write I/Os between cluster nodes to ensure write ordering.
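Time-stamp-based coordination between cluster nodes can be sketched as a merge of per-node write streams. The sketch assumes that each node's stream is already in time-stamp order and that node clocks are synchronized closely enough for time-stamp order to match the real order of dependent writes. Keeping clocks that closely synchronized is the hard part in practice. `merge_node_streams` is a hypothetical name. `heapq.merge` with a `key` argument is part of Python's standard library.

```python
import heapq

def merge_node_streams(*streams):
    """Merge per-node write streams, each already sorted by time stamp,
    into one stream ordered by time stamp across the whole cluster."""
    return list(heapq.merge(*streams, key=lambda w: w[0]))

# (time stamp, node, block): hypothetical write records from two cluster nodes
node_a = [(1.0, "a", 10), (4.0, "a", 11)]
node_b = [(2.0, "b", 20), (3.0, "b", 21)]
print([w[0] for w in merge_node_streams(node_a, node_b)])  # [1.0, 2.0, 3.0, 4.0]
```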
Host-based remote volume replication depends on the availability of system resources to manage the queuing and transmission of replicated write I/Os. This is much more resource-intensive than most volume management software functions. Organizations that want to use host-based remote volume replication should plan to oversize their systems. They need enough disk capacity to queue write I/Os (including during periods when the remote communications link is down) and enough CPU capacity to process replication functions. As these systems mature, unless additional disk and CPU resources can be added, the burden of host-based remote volume replication can become more noticeable.
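Here is a back-of-the-envelope way to size the queuing disk, assuming the queue must absorb the host's sustained write rate for the longest link outage you plan to ride out: queue size ≈ write rate × outage duration × safety margin. The 2 MiB/s rate, the 8-hour outage and the 25% headroom are hypothetical figures, not vendor guidance. The calculation does not cover CPU sizing.

```python
def queue_capacity_gib(write_mib_per_s, outage_hours, headroom=1.25):
    """Disk space needed to queue writes for the length of a link outage.

    `headroom` is an assumed safety margin, not a vendor recommendation.
    """
    seconds = outage_hours * 3600
    return write_mib_per_s * seconds * headroom / 1024  # MiB -> GiB

print(queue_capacity_gib(2, 8))  # 70.3125
```

At 2 MiB/s, an 8-hour outage produces 57,600 MiB of queued writes. With 25% headroom, that becomes roughly 70 GiB of disk set aside for queuing alone.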
Network device-based replication
The third architectural choice offered by SANs is to put the data replication function inside a network device, such as a router, switch or dedicated appliance. Unlike in-band storage virtualization, which I have viewed with suspicion for years, remote data replication may be an excellent application to run within the network. Whereas virtualization is primarily a storage function, remote data replication is mostly a communication function. Very little storage work is done in remote data replication, and its difficult aspects, data integrity and write ordering, can be addressed most directly with effective communications technology, not storage technology.
A dedicated system or appliance for remote data replication could be established and used by almost any application running on any server in the SAN. In addition, it could work with a wide variety of storage products, circumventing the vendor lock-in that exists with subsystem-based solutions today. The ability to manage data replication, including the failover process, from a single management point shouldn't be overlooked (see "Why network replication makes sense," above).
Furthermore, placing remote data replication in a dedicated system or appliance almost completely removes the burden of replication from host systems. Host write I/O redirectors that mirror writes to a replication appliance would likely still be necessary, but these redirection agents would be thin, and they would consume far fewer resources than a host-based volume replication system.
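The following sketch shows how thin such a redirector could be. It writes locally, stamps a sequence number to preserve host write order, and hands a copy to the appliance. Queuing, transmission and remote application all stay on the appliance. `WriteRedirector` is hypothetical, and a list stands in for the appliance's intake.

```python
class WriteRedirector:
    """Hypothetical thin host agent: write locally, mirror a copy to the appliance."""

    def __init__(self, local, appliance):
        self.local = local          # dict standing in for the local volume
        self.appliance = appliance  # list standing in for the appliance's intake
        self.seq = 0

    def write(self, block, data):
        self.local[block] = data
        self.seq += 1
        # The sequence number lets the appliance preserve this host's write order.
        self.appliance.append((self.seq, block, data))

local, appliance = {}, []
agent = WriteRedirector(local, appliance)
agent.write(5, b"x")
agent.write(6, b"y")
print(appliance)  # [(1, 5, b'x'), (2, 6, b'y')]
```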
On the flip side, there are many success stories for subsystem-based remote data replication products, but none yet for network device-based replication products. They appear to have key architectural advantages, but they aren't yet proven in the market. (For an overview of remote data replication companies and their products, see "Key remote data replication companies.")