Faster recovery point objective part of backup evolution

Most users want shorter recovery time objectives for their critical applications, but IT departments preach more realistic expectations.

A recent survey conducted by Veeam revealed that organizations expect a narrower recovery point objective (RPO) and recovery time objective (RTO) from their IT teams. According to the study, a majority of users expect to recover critical applications in less than two hours, but most IT staffers indicate they are not able to deliver recovery in less than four hours.

Backup products have evolved to meet tighter RPO and RTO windows, but the two objectives require different data protection capabilities.

Meeting tighter RTO windows requires positioning secondary data in such a way that the application can access it faster. This means reducing or eliminating the amount of data moved back to production storage. To meet a tighter recovery point objective window, you must capture data more frequently than the traditional once-a-day backup process.
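The link between backup frequency and RPO is simple arithmetic, and a minimal sketch may help make it concrete. It assumes the worst case is a failure just before the next backup runs, so the data at risk equals the backup interval. The function names and the two-hour target are illustrative assumptions, not figures from any product.

```python
import math
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo_target: timedelta) -> bool:
    """Worst-case data loss equals the interval between backups (assumption)."""
    return backup_interval <= rpo_target


def backups_per_day_needed(rpo_target: timedelta) -> int:
    """Minimum number of evenly spaced backups per day to stay within the RPO."""
    return math.ceil(timedelta(days=1) / rpo_target)


# Hypothetical target: lose no more than two hours of data.
target = timedelta(hours=2)
print(meets_rpo(timedelta(hours=24), target))  # once-a-day backup
print(backups_per_day_needed(target))
```

Under these assumptions, a once-a-day backup cannot satisfy a two-hour RPO; the schedule needs at least 12 evenly spaced backups per day.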

Better RPOs and RTOs start with better backups

Recovery is all about having good data backups. Without a clean, consistent backup, there is nothing to recover. And to tighten the RPO, those backups need to happen more frequently.

The good news is that modern backup applications have taken significant steps forward. Today's backup products use the block-level incremental backup APIs provided by many hypervisors and applications. A block-level incremental greatly reduces the amount of data that must be transferred across the network, which means less time interfacing with the application being protected and decreased use of the backup network. Because vendor-provided APIs are driving these block-level protection schemes, IT administrators end up with a clean, consistent backup.
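To illustrate why a block-level incremental moves so little data, here is a rough sketch. Real backup products ask the hypervisor or application API which blocks changed; this example swaps in a simpler stand-in technique, comparing per-block hashes of two in-memory snapshots. The block size, function names and sample data are all assumptions chosen for readability.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, for illustration only


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a volume image into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def changed_blocks(previous: bytes, current: bytes,
                   block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Return {block index: new contents} for blocks that differ or are new.

    Sketch only: it does not handle a volume that has shrunk.
    """
    old_hashes = [hashlib.sha256(b).digest()
                  for b in split_blocks(previous, block_size)]
    changed = {}
    for index, block in enumerate(split_blocks(current, block_size)):
        is_new = index >= len(old_hashes)
        if is_new or hashlib.sha256(block).digest() != old_hashes[index]:
            changed[index] = block
    return changed


yesterday = b"AAAABBBBCCCC"
today = b"AAAAXXXXCCCC"
increment = changed_blocks(yesterday, today)
print(increment)  # only the middle block needs to cross the network
```

In this toy case, one block out of three is transferred instead of the whole image.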

Many data backup platforms -- especially those focused on the virtualization market -- also provide a replication function. In these use cases, the virtual machine (VM) is "stunned," or put in a backup-ready state, and a copy of its changed blocks is made. If replication is enabled, the data is copied to a live state on another storage system. The changed blocks are then backed up as normal. The interaction with the VM is the same, but the recovery differences are significant.

Data protection and RPO/RTO

Recovery is the acid test of the data protection process. The ability to recover an application quickly, with as little data loss as possible, is obviously a critical factor. While frequent backups allow recovery at a more granular point in time, the way in which backup software instantiates data also affects the RTO.

The enemy of RTO is the movement of data back into production. It takes time to copy data back to the production system, and the techniques that many backup applications use to optimize backups, like deduplication and compression, are no help during a restore. In most cases, all the data has to be recovered.
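A back-of-the-envelope estimate shows why moving data dominates the RTO. The sketch below assumes restore time is simply data size divided by sustained throughput and ignores setup, verification and application restart. The 2,000 GB size and 200 MB/s rate are hypothetical figures, not measurements.

```python
def full_restore_seconds(data_gb: float, throughput_mb_per_s: float) -> float:
    """Time to copy the full data set back to production storage."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return data_gb * 1000 / throughput_mb_per_s  # 1 GB taken as 1,000 MB


# Hypothetical: a 2,000 GB application restored at a sustained 200 MB/s.
seconds = full_restore_seconds(2000, 200)
print(f"{seconds / 3600:.1f} hours")  # prints "2.8 hours"
```

With these assumed numbers, the copy alone already exceeds the two-hour recovery window that users in the survey expect.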

The exception is the few data protection products that perform changed block recoveries. These products can examine data that is already in place on production storage and recover only the data needed to set the clock back on the application. Of course, this assumes the primary storage system is up and running and that data corruption was not the reason for the failure.
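A changed block recovery can be pictured as follows. This is a simplified sketch, not any vendor's implementation: it assumes production storage is still readable, that the production and backup images are the same size, and it finds differences by comparing blocks directly. The function name and block size are illustrative.

```python
def restore_changed_blocks(production: bytearray, backup: bytes,
                           block_size: int = 4) -> int:
    """Rewrite only the production blocks that differ from the backup.

    Returns the number of blocks rewritten.
    """
    if len(production) != len(backup):
        raise ValueError("sketch assumes equal-sized images")
    rewritten = 0
    for offset in range(0, len(backup), block_size):
        wanted = backup[offset:offset + block_size]
        if production[offset:offset + block_size] != wanted:
            production[offset:offset + block_size] = wanted
            rewritten += 1
    return rewritten


live = bytearray(b"AAAAZZZZCCCC")   # middle block was damaged
known_good = b"AAAABBBBCCCC"        # image from the chosen recovery point
count = restore_changed_blocks(live, known_good)
print(count, bytes(live) == known_good)
```

Only the damaged block is written back; the blocks already in place on production storage are left untouched.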

If the data center uses a Tier-1 storage system, expecting the primary storage to remain up and running is reasonable, as these systems traditionally offer tremendous availability. Most failures instead come from data corruption caused by application code bugs or user error.

Restoring data in 15 minutes or less

Those looking for faster data restoration times have turned to recovery-in-place as an alternative to traditional backup, but it is not foolproof.

In-place recovery takes a good 15 minutes. If faster data restoration is required, replication allows critical applications to be recovered in less than 10 minutes. Failback is equally quick.

If the storage system fails or a changed block recovery is not available, an alternative may be in-place recovery. This increasingly popular technique allows backup data to be moved to a live state on the backup appliance. Recovery-in-place eliminates the need to move data across a network, and can protect against both a storage system failure and a server failure (assuming virtualization is being used).

There are three factors to consider before using recovery-in-place:

  • Performance of the disk backup appliance. Most disk-based backup appliances are "cheap and deep"; in other words, designed for high capacity, unlike production storage designed for performance. Backup systems typically lack the capability to deliver truly functional performance to the hosted VM's data store, so while the application can be quickly recovered, its performance may be poor or even unusable. However, some appliances offer a special staging area for data recovered in place that allows them to provide higher application performance.
  • Time needed to move data from a backup state to a live state. Most data protection products that leverage Changed Block Tracking (CBT) backups perform a full backup to create a master image of the server or VM. Subsequent CBT-based backup jobs are stored as separate files. After a finite number of these jobs (six to 10), a best practice from most vendors is to perform a consolidation job so the original master image can then be made current. This process takes time -- sometimes longer than the original job that created the master -- and affects recovery-in-place because the same process has to occur when a live data store is created. The master has to be merged with the appropriate number of CBT jobs, which takes time.
  • Failback. How does data get moved back into production when the storage system has been repaired or replaced? In a virtualized environment, functions like VMware's Storage vMotion can be used to address this problem by live migrating VM data from the secondary storage system back to the primary storage system. However, that process will negatively affect storage performance while the transfer occurs.
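The merge described in the second consideration can be sketched in a few lines. This is a toy model under stated assumptions: the master image and each CBT job are represented as dictionaries mapping a block index to its contents, and the function name is hypothetical. Real products work on on-disk backup files, which is why the merge takes far longer than it does here.

```python
def consolidate(master: dict[int, bytes],
                increments: list[dict[int, bytes]]) -> dict[int, bytes]:
    """Apply CBT increments to the master image, oldest first.

    A later increment overrides an earlier one for the same block.
    """
    image = dict(master)          # leave the original master untouched
    for increment in increments:  # order matters: oldest to newest
        image.update(increment)
    return image


master = {0: b"A0", 1: b"B0", 2: b"C0"}
cbt_jobs = [
    {1: b"B1"},             # first incremental after the full backup
    {1: b"B2", 2: b"C1"},   # second incremental
]
live_image = consolidate(master, cbt_jobs)
print(live_image)
```

The work grows with the number of increments in the chain, which is one way to see why a long chain of unconsolidated CBT jobs delays recovery-in-place.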
