By W. Curtis Preston
The Storage Networking Industry Association (SNIA) defines continuous data protection (CDP) "as a methodology that continuously captures or tracks data modifications and stores changes independent of the primary data, enabling recovery points from any point in the past … data changes are continuously captured … stored in a separate location … [and RPOs] are arbitrary and need not be defined in advance of the actual recovery."
Please note that you don't see the word "snapshot" above. While it's true that much of today's continuous data protection software allows users to create known recovery points in advance, they're not required. To be considered continuous data protection, a system must be able to recover to any point in time, not just to when snapshots are taken.
CDP systems start with a data tap or write splitter. Writes destined for primary storage are "tapped" or "split" into two paths; each write is sent to its original destination and also to the continuous data protection system. The data tap may be an agent in the protected host or it can reside somewhere in the storage network. Running as an agent in a host, the data tap has little to no impact on the host system because all the "heavy lifting" is done elsewhere. Continuous data protection products that insert their data taps in the storage network can use storage systems designed for this purpose, such as Brocade Communications Systems Inc.'s Storage Application Services API, Cisco Systems' MDS line and its SANTap Service feature or EMC Corp. Clarion's built-in splitter functionality. Some CDP systems offer a choice of where their data tap is placed.
Users then need to define a consistency group of volumes and hosts that have to be recovered to the same point in time. Some continuous data protection systems allow the creation of a "group of groups" that contains multiple consistency groups, creating multiple levels of granularity without sacrifice. Users may also choose to perform application-level snapshots on the protected hosts, such as placing Oracle in backup mode or performing Volume Shadow Copy Service (VSS) snapshots on Windows. (Remember, snapshots aren't required.) Some CDP systems simply record these application-level snapshots when they happen, while others provide assistance to perform them. It's very helpful when the continuous data protection system maintains a centralized record of application-level snapshots, as they can be very useful.
Each write is transferred to the first recovery device, which is typically another appliance and storage array somewhere else within the data center. This proximity to the data being protected allows the writes to be either synchronously replicated or asynchronously replicated with a very short lag time. Even if a continuous data protection system supports synchronous replication, most users opt for asynchronous replication to avoid any performance impact on the production system. A CDP system may support an adaptive replication mode where it replicates synchronously when possible, but defaults to asynchronous during periods of high activity.
Even if a continuous data protection system supports synchronous replication, most users opt for asynchronous replication to avoid any performance impact on the production system.
The data is stored in two places: the recovery volume and the recovery journal. The recovery volume is the replicated copy of the volume being protected and will be used in place of the protected volume during a recovery. The recovery journal stores the log of all writes in the order they were performed on the protected volume; it's used to roll the recovery volume forward or backward in time during a recovery. It may also be used as a high-speed buffer where all writes are stored before they're applied to the recovery volume. This design allows the recovery volume to be on less-expensive storage as long as the recovery journal uses storage that is as fast as or faster than the protected volume.
Once data has been copied to the first recovery device it can then be replicated off-site. Due to the behavior of WAN links, the CDP system needs to deal with variances in the available bandwidth. So it has to be able to "get behind" and "catch up" when these conditions change. With some systems you can define an acceptable lag time (from a few seconds to an hour or more), which translates into the RPO of the replicated system. The CDP system sends all of the writes that happened as one large batch. If an individual block was modified several times during the time period, you can specify that only the last change is sent in a process known as "write folding." This obviously means that the disaster recovery copy won't have the same level of recovery granularity as the on-site recovery system, but it may also mean the difference between a system that works and one that doesn't.
Modern continuous data protection systems also offers a built-in, long-term storage alternative. You can pick a short time range (e.g., from 12:00:00 pm to 12:00:30 pm every day) and tell the CDP system to keep only the blocks it needs to maintain only those recovery points, and to delete the blocks that were changed in between. Users who take application-level snapshots typically coordinate them to coincide with their recovery points for consistency purposes. This deletion of extraneous changes allows the CDP system to retain data for much longer periods of time. For longer retention periods, it's also possible to back up one of these recovery points to tape and then expire it from disk. Many companies use all three approaches: retention of every change for a few days, hourly recovery points for a week or so, then daily recovery points after that, followed by tape copies after 90 days or so.
Continuous data protection and recoveries
The true wonder of continuous data protection is how it handles a recovery. A CDP system can instantaneously present a LUN to whatever application needs to use it for recovery or testing, rolled forward or backward to whatever point in time desired. (As noted, many users choose to roll the recovery volume back to a point in time when they created an application consistent image. Although this means they'll lose any changes between that point in time and the current time, many prefer rolling back to a known consistent image rather than going through the crash recovery process.)
The true wonder of continuous data protection is how it handles a recovery.
Depending on the product, the recovery LUN may be the actual recovery volume (rolled forward or backward), a virtual volume designed mainly for testing a restore, or something in the middle where the recovery volume is presented to the application as if it has already been rolled forward or backward, when in reality the actual rolling forward or backward is happening in the background. Some systems can simultaneously present multiple points in time from the same recovery volume.
Once the original production system has been repaired, the recovery process is reversed. The recovery volume is used to rebuild the original production volume by replicating the data back to its original location. (If the system was merely down and didn't need to be replaced, it's usually possible just to update it to the current point in time by sending over only the changes that have happened since the outage.) With the original volume brought up to date, the application can be moved back to its original location and the direction of replication reversed.
Compare that description of a typical CDP-based recovery scenario to the recovery process required by a traditional backup system, and you should get a good idea of why continuous data protection is the future of backup and recovery.
EDITOR'S TIP: Click here to read this next part of this guide on near-CDP.
W. Curtis Preston is an executive editor in TechTarget's Storage Media Group and an independent backup expert.
This article was previously published in Storage magazine.