Published: 15 Jun 2003
There are no cookie-cutter business continuity implementations. If disaster strikes, each organization has its own unique requirements to make sure its business can continue.
|Outside the disaster horizon|
A key requirement of business continuity is that there's sufficient distance between data storage locations so that a disaster that wipes out one site is unlikely to wipe out another site.
In fact, an organization should consider creating several types of business continuity solutions so it can respond to different types of risks. This article and a second one being published next month will look at some of the issues, technologies and products to keep your business afloat in case of a disaster.
Traditional disk mirroring offers a familiar method for business continuity that assumes low latency and high bandwidth connections. This can be done with dark fiber or dense wavelength division multiplexing (DWDM). DWDM is a technology that puts data from different sources together on an optical fiber, with each signal carried at the same time on its own separate light wavelength. Unfortunately, costs for those solutions may be higher than many budgets allow.
Alternatively, store-and-forward solutions in disk subsystems have been successful in providing both short- and long-distance business continuity. In all cases, synchronous acknowledgements should be preferred over asynchronous ones unless it's clear that they will hurt host system or application performance.
Local and remote storage
Spanning extended distances means that the assumptions about bandwidth and cost must be reevaluated. The MANs and WANs used for business continuity are usually much slower and/or much more expensive than the buses and networks used for local storage. In addition, it's difficult to maintain low-latency communications to remote storage. Depending on the distance involved and the application's latency requirements, there may be several compromises involved in building a business continuity solution.
In general, there are no hard rules for what distance qualifies as remote storage. One organization could consider remote storage to be five miles away, while another organization might consider it to be 100 miles away. In fact, it's highly likely that different applications and platforms could have different definitions for remote storage distances based on the relative importance of the data and bandwidth costs.
It's not necessary to dedicate a whole storage subsystem for remote storage. Selected resources--such as certain LUNs exported from a disk subsystem--can be used as remote storage. For example, an organization might have two data centers in two different geographies where some of the storage resources on disk subsystems in both locations would be used for remote storage by the other data center.
|Disadvantages of disk mirroring for disaster recovery|
There are essentially two different techniques for creating redundant copies of data for business continuity purposes. The first is host-based disk mirroring and the second is subsystems-based store and forward. Disk mirroring is one of the most basic and common forms of data redundancy protection. The concept of disk mirroring is simple: For every storage I/O created in a host system, two identical I/Os are sent to different disk targets. If one of the disks fails, the system can continue to work with the disk that's still functioning. Host-based disk mirroring has been implemented in several different kinds of products, including operating system software, volume management software, device driver software for host adapters, hardware chips on host adapters and in storage subsystem controllers.
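The mirroring behavior described above can be illustrated with a toy model. This is only a sketch: the class `MirroredVolume` and its methods are hypothetical names invented for illustration, and the "disks" are plain in-memory dictionaries rather than real devices.

```python
class MirroredVolume:
    """Toy host-based mirror: every host write is duplicated to each healthy target."""

    def __init__(self, target_count=2):
        # Each disk target is modeled as a dict mapping block number -> data.
        self.targets = [{} for _ in range(target_count)]
        self.online = [True] * target_count

    def write(self, block, data):
        # One host write becomes one identical write per healthy target.
        issued = 0
        for index, target in enumerate(self.targets):
            if self.online[index]:
                target[block] = data
                issued += 1
        if issued == 0:
            raise IOError("no mirror targets online")
        return issued

    def read(self, block):
        # Any healthy target can satisfy a read.
        for index, target in enumerate(self.targets):
            if self.online[index]:
                return target[block]
        raise IOError("no mirror targets online")

    def fail(self, index):
        # Simulate a disk failure by taking the target offline.
        self.online[index] = False


volume = MirroredVolume()
volume.write(7, b"payroll")
volume.fail(0)
print(volume.read(7))  # still readable from the surviving target
```

The point of the sketch is that the host itself issues both writes, so if one target sits at the far end of a slow link, the host has to deal with that link directly.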
Disk mirroring can provide performance benefits by overlapping read I/Os across both local disk targets. For example, a group of read I/Os can be sent to one disk target while another group of read I/Os could be sent to the other in such a way as to minimize the total seek time, and hence latency, on both drives. The benefits of overlapped read I/Os depend on low latency, short-distance local storage connections. While overlapping reads could be deployed in a business continuity environment, it would be counter-productive to send any reads to a slow remote connection with much higher latency. It may be possible to configure the disk mirroring product to not use overlapped reads, depending on the vendor and product.
Besides the possibility of overlapped reads impacting performance, there are two other primary problems with using disk mirroring for business continuity. The first problem is that disk mirroring typically has been implemented to work over fast, short links. When longer, slower links are used, host system performance will be adversely affected, and it's possible that disk timeouts could occur, resulting in the mirroring operator taking the remote disk target offline.
The other shortcoming of disk mirroring is that I/Os are sent to both disk targets, as opposed to just the writes. For many applications, the ratio of reads to writes is approximately three to one. That means disk mirroring takes up a fair amount of the available, expensive bandwidth. Not only that, but for the purposes of business continuity, the only I/Os of interest are writes because reads create no new data to copy to remote storage. The situation is upside down--reads dominate I/O activity and take most of the bandwidth, although there's no requirement to transmit them over a long-distance connection.
|Advantages of store and forward for disaster recovery|
You'd want to use disk mirroring for business continuity when the long-distance connection resembles a local connection--when it's fast and has low latency. One scenario is when the distance is short enough for dark fiber (private fiber optic cable) between local and remote storage. For example, people often quote Fibre Channel's (FC) supported distance of 10 km over single-mode fiber optic cable. There are many FC users who implement remote storage this way because the I/O performance problems discussed above are alleviated. Access to the remote disk target would be practically the same as to local disk targets.
In addition to using dark fiber, you can use DWDM optical networking technology to span the distance between local and remote storage. Business continuity traffic over DWDM is similar to dark fiber by virtue of being a high bandwidth, low latency connection. The main advantage of DWDM is its ability to span greater distances than dark fiber and use services provided by public network service providers.
In a nutshell, DWDM provides the underlying optical network transport for carrying virtually any kind of physical network, including FC, FICON and Gigabit Ethernet. Business continuity is one of the fastest growing applications that take advantage of this technology. (See "DWDM can connect distant Fibre Channel nets.")
Another option for extending high bandwidth connectivity is to use special-purpose optical line drivers that support distances in excess of 30 km and can reach up to 100 km through the use of repeaters. This technology has ample bandwidth, but can be expensive to implement unless you happen to already own the right of way to string fiber cables over that distance. The latencies are higher due to the increased distances, but there are many applications that won't notice the impact of disk mirroring over this type of cable plant.
As high-speed optical networking capabilities are brought to the market along with improvements in switching latency, disk mirroring will become more viable for business continuity than it is today. Until then, most organizations will find it more advantageous to use a store-and-forward method.
|When two remote sites make sense|
Distant remote sites protect against major catastrophes, but are awkward for more frequent local disasters, such as a building fire or flooding. A nearby mirrored site is better for that, and it can be copied asynchronously to a distant site for full protection.
Store and forward
Disk mirroring was designed to solve failures in local disk drives; store-and-forward storage was designed specifically for business continuity applications. The basic operation of store-and-forward storage is for a disk subsystem to receive a write command, write the data to disk and then retransmit the data to a remote disk target.
Most of the store-and-forward products are proprietary and sold by the major subsystem vendors. EMC has Symmetrix Remote Data Facility (SRDF), Hitachi Data Systems has TrueCopy, and IBM sells Peer-to-Peer Remote Copy. The market has placed a high value on these solutions--and with good reason, because they have done what was expected of them under the most adverse conditions. The amazing business continuity successes in the aftermath of Sept. 11 are a tremendous testimony to how well these products perform.
With store-and-forward solutions, the host system only generates a single write I/O and the subsystem does the rest of the work. For example, assume a specific LUN address on a given target address is the data access point for a mission-critical line-of-business application. The subsystem commits write I/Os to its internal disk target and then creates an entry in a FIFO queue in memory or on reserved disk storage. The data is then forwarded from the local subsystem to the remote one in the order it was received. The sending subsystem manages the transmission details, including acknowledgements and any error recovery. Error recovery requires the subsystem to keep a detailed record of its forwarding operations, possibly for an extended period, while communications are down.
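The queueing behavior just described can be sketched in a few lines. This is a simplified illustration, not how any vendor's product works internally: `StoreAndForwardSubsystem`, `host_write`, `forward` and the `link_up` flag are hypothetical names, and the local and remote disks are modeled as dictionaries.

```python
from collections import deque


class StoreAndForwardSubsystem:
    """Toy local subsystem that commits writes locally, then forwards them in order."""

    def __init__(self):
        self.local_disk = {}
        self.pending = deque()  # FIFO of writes not yet applied at the remote site
        self.link_up = True     # stands in for the state of the MAN/WAN connection

    def host_write(self, block, data):
        # The host issues a single write; the subsystem commits it locally
        # and queues a copy for forwarding.
        self.local_disk[block] = data
        self.pending.append((block, data))

    def forward(self, remote_disk):
        # Send queued writes in arrival order. An entry is removed from the
        # queue only after it has been applied remotely, so nothing is lost
        # if the link goes down partway through.
        sent = 0
        while self.pending and self.link_up:
            block, data = self.pending[0]
            remote_disk[block] = data
            self.pending.popleft()
            sent += 1
        return sent


local = StoreAndForwardSubsystem()
remote_disk = {}
local.link_up = False            # communication failure
local.host_write(1, "old")
local.host_write(1, "new")
print(local.forward(remote_disk), len(local.pending))  # nothing sent, two writes queued
local.link_up = True             # link restored
print(local.forward(remote_disk), remote_disk)         # backlog drained in order
```

Because the backlog is replayed in the order the writes arrived, the remote copy ends up with the most recent data for each block once the link recovers.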
With store-and-forward storage, there's usually another network that isn't a storage area network (SAN), such as a MAN or WAN, between the sites. Gateway systems at both the local and remote sites connect the SAN storage equipment over the non-SAN network. The gateway serves two key purposes: It provides address transparency and performs detailed communication operations on the non-SAN network. Both storage subsystems use their native addressing and methods without needing to know anything about the foreign network's addressing and methods. Most store-and-forward implementations have used proprietary gateway technology, although it's expected that FCIP (Fibre Channel over IP) will become a standard with widespread deployments for business continuity in the near future.
The interrelated issues of distance, bandwidth and cost apply equally to store-and-forward implementations as they do for disk mirroring. However, with store and forward, there's no problem with read I/Os taking the lion's share of the available MAN or WAN bandwidth. This means a store-and-forward solution can be accomplished with a small percentage of the bandwidth needed for disk mirroring. If the read-to-write ratio is 3:1 and reads and writes are of similar size, writes make up one of every four I/Os, so store and forward only needs 25% of the bandwidth required for disk mirroring.
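The 25% figure is simple arithmetic, shown here as a short sketch. It follows the article's premise that mirroring puts all I/O on the link while store and forward sends only writes, and it assumes reads and writes are the same size. The function names and the 40 Mbit/s workload are made up for illustration.

```python
def write_fraction(reads_per_write):
    """Fraction of all I/Os that are writes, for a read-to-write ratio of N:1."""
    return 1 / (reads_per_write + 1)


def link_bandwidth_needed(total_io_mbps, reads_per_write):
    """Return (all-I/O bandwidth, writes-only bandwidth) in Mbit/s.

    Assumes reads and writes are the same size, so the share of
    bandwidth equals the share of I/O operations.
    """
    writes_only = total_io_mbps * write_fraction(reads_per_write)
    return total_io_mbps, writes_only


print(write_fraction(3))             # 0.25
print(link_bandwidth_needed(40, 3))  # (40, 10.0)
```

With a different mix, say nine reads for every write, the same calculation gives 10%, so the saving depends heavily on the application's actual I/O profile.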
|DWDM can connect distant Fibre Channel nets|
Dense wavelength division multiplexing provides the same kind of distance capabilities as dark fiber. But it also provides a variety of public network services on top of that.
In general, storage I/O latency needs to be sufficiently low to maintain acceptable system performance levels. After a system issues an I/O command, it waits for an acknowledgement to be returned from storage before issuing the next I/O command. If there are delays in the transmission of the I/O command or its acknowledgement from storage, the performance of the system can suffer.
This obviously is an important consideration for business continuity solutions that have to balance cost, performance and data protection. Store-and-forward solutions deal with this by managing the acknowledgements of storage I/Os as either synchronous or asynchronous.
Synchronous acknowledgements are issued from remote storage after the copy of the original write data has been received and written to disk. The main benefit of synchronous acknowledgements is knowing precisely what data has been received by remote storage. As each and every write I/O is acknowledged, there's no ambiguity as to the state of data on remote storage. The primary disadvantage of synchronous acknowledgements is that they can be relatively slow and introduce significant latency into I/O processes.
Asynchronous acknowledgements are issued from local storage after the I/O has been committed to local storage. Asynchronous has the opposite set of characteristics from synchronous acknowledgements. The main benefit of asynchronous acknowledgements is the lowest possible latency. The primary disadvantage of asynchronous acknowledgements is not knowing what data made it safely to the remote storage subsystem. This means that there's likely to be some amount of repair work to do with the data on remote storage, if it ever needs to be called into use after a disaster strikes the local site.
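A rough model shows how the two acknowledgement modes differ in the latency the host sees. The sketch below makes several assumptions that are not from the article: light travels through fiber at roughly 200 km per millisecond, the local commit and the remote transfer happen one after the other, switching and protocol overhead are ignored, and the 1 ms commit times are placeholder values. The function names are hypothetical.

```python
FIBER_KM_PER_MS = 200.0  # approximate speed of light in optical fiber (assumption)


def round_trip_ms(distance_km):
    """Propagation delay for a command and its acknowledgement, ignoring equipment delays."""
    return 2 * distance_km / FIBER_KM_PER_MS


def host_visible_write_latency_ms(distance_km, local_commit_ms, remote_commit_ms, synchronous):
    """Estimate the delay before the host receives its write acknowledgement."""
    if synchronous:
        # The acknowledgement comes back only after the remote copy is on disk.
        return local_commit_ms + round_trip_ms(distance_km) + remote_commit_ms
    # Asynchronous: acknowledged as soon as the local commit completes.
    return local_commit_ms


for km in (10, 100, 1000):
    sync = host_visible_write_latency_ms(km, 1.0, 1.0, synchronous=True)
    asyn = host_visible_write_latency_ms(km, 1.0, 1.0, synchronous=False)
    print(f"{km} km: synchronous {sync} ms, asynchronous {asyn} ms")
```

In this model the asynchronous figure stays flat regardless of distance, while the synchronous figure grows with every kilometer, which is why distance tends to decide which mode is practical.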
Creating multilevel solutions
To design a business continuity strategy, it may make sense to establish two different disaster radii. The first would be to recover from disasters that may take out a local site, but without impacting much of the surrounding area. The second would be a much larger radius that would place remote storage a large distance from local storage.
This approach of using two remote storage sites has been adopted by the financial services industry to meet its needs for responding to different types of disasters as quickly as possible. To differentiate between the remote storage sites, we'll refer to them as nearby remote storage and distant remote storage. The rationale for having nearby remote storage is to be able to react to the disaster with the team that lives and works in the area. The logistics of transporting key skilled IT workers hundreds of miles following a major disaster have their own set of problems that can be avoided with nearby remote storage.
Let's assume a business wants to create a two-level business continuity solution. The company has two buildings that are located approximately five miles apart and are connected by dark fiber. It plans to use disk mirroring between systems in both buildings to keep a copy of the data on nearby remote storage in the other building. In one of its buildings, it employs store-and-forward technology over an existing T-3 WAN to send data to a distant remote storage subsystem. Synchronous acknowledgements are used between nearby storage and distant storage subsystems to ensure complete data transmissions to the distant remote storage subsystem. Asynchronous acknowledgements would also work in this case, depending on whether or not the bandwidth savings would justify the change (see "When two remote sites make sense").