Published: 10 Mar 2007
Are backups a waste of time?
New business requirements may mean your backup process won't adequately protect company data.
"It is not the strongest of the species that survives, nor the most intelligent, but the one most responsive to change." --Charles Darwin
"The future ain't what it used to be." --Yogi Berra
At the risk of becoming overly philosophical, every once in a while it's important to take a step back from day-to-day activities and reflect on why we do the things we do. What real purpose is served? Are the original requirements still applicable or have new ones cropped up? Have new or better solutions evolved?
In storage management, there's no better candidate for this kind of scrutiny than backup. It seems that nearly everyone is evaluating new technology options, from virtual tape libraries to continuous data protection and agentless backup. While it makes sense to seek out these improvements, when a mass push to adopt new technologies occurs, I become concerned that insufficient attention and thought may be given to that most fundamental of questions: What problem are we really trying to solve?
Data protection has changed not only in terms of technology, but with regard to business needs and expectations. As a result, there are data protection functions and services that are routinely provided within IT in addition to traditional backup. Yet many of the efforts underway to improve traditional backup don't adequately take these other data protection elements into account.
For example, if we're currently replicating top-tier application data to a remote site and taking regular split-mirror or snapshot copies of data volumes, why do we continue to do nightly backups? What's the purpose of a nightly backup and, given a demonstrated requirement, how do we design the right solution to adequately address it?
Covering the right risks
We've all seen ads from insurance companies promising protection from an assortment of oddball circumstances for "just pennies a day." You can't beat the insurance industry for knowing the real odds and pricing its products accordingly. On the other hand, you don't buy insurance because it's cheap; you buy it to protect yourself against a given set of risks. If the right risks aren't addressed, then price is irrelevant.
Like other IT functions, data protection has become much more specialized. This is the direct result of an expanded awareness of the variety of risks, increased levels of user expectations and the growing range of technology options available to address specific problems. As a result, the way we approach risk and risk-related services also needs to evolve. We need to think about the likelihood and impact of various risks and the kinds of "coverage" we need.
Let's consider a subset of the various types of data loss and availability risks that require protection:
- Detectable file deletion or corruption. In this scenario, data is accidentally deleted or overwritten, and there's an almost immediate (let's say less than a day) realization that an error has occurred. There are lots of ways this can occur, and the data loss can be logical or physical.
- Latent (lingering) data deletion or corruption. This risk is rarely quantified, so there often aren't any policies to address it. It occurs when data is deleted or, more likely, logically corrupted in some way, but the loss can lie undiscovered for days, weeks or months.
- Storage device failure. This is a type of physical loss, usually of a significant quantity of data.
- Interdependency failure. This can be thought of as "effective" data loss due to lack of synchronization or data inconsistency across multiple application components. It's the "weakest link" effect. If one level of service is provided to a portion of an interdependent environment and a different level to another part, the overall protection level is only as good as the lesser of the two levels of service.
- Compound failure. Similar to interdependency failure, this is the risk of any combination of the above data loss scenarios happening concurrently.
- Site failure. Loss of a site, such as a data center, is usually considered a disaster and falls under the realm of disaster recovery (DR). It's yet another class of scenario that needs to be differentiated from more localized operational situations.
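The "weakest link" effect described under interdependency failure can be illustrated with a small sketch. It assumes, purely for illustration, that each component's level of service is expressed as a recovery point in hours (lower is better); the component names and values are hypothetical.

```python
def effective_recovery_point_hours(component_hours):
    """Return the recovery point of an interdependent group of components.

    The group can only be restored to a consistent state as far forward as
    its least-protected member, so the worst (largest) value wins.
    """
    if not component_hours:
        raise ValueError("at least one component is required")
    return max(component_hours.values())


# Hypothetical application made up of three interdependent components.
components = {
    "database": 4,       # snapshot every four hours
    "app_server": 24,    # nightly backup only
    "file_share": 24,    # nightly backup only
}

print(effective_recovery_point_hours(components))
```

Even though the database is protected every four hours, the application as a whole is only as well protected as the components covered by the nightly backup.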
Where backup fits in a new data protection world
Not too long ago, the solution for addressing almost all of these issues would involve backup, either entirely or to some degree. But as we all know, recovery from backup takes time and is disruptive, so technologies were introduced to reduce the recovery time or eliminate downtime.
[Table: Typical protection methods]
The table "Typical protection methods" shows the types of protection usually applied in today's IT infrastructures to the kinds of data loss described above.
Consider a scenario in which a critical database is protected with RAID-10, has business continuance volumes (BCVs) created every four hours and retained for a day, and is remotely replicated to a hot site. What failure conditions would backup address in that case? Its primary value would seem to be as a kind of "failsafe" against latent, undetected data loss, where corruption may be discovered days or weeks later, long after the BCVs have been overwritten.
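To make the reasoning in this scenario concrete, here is a minimal sketch that asks which copies still hold a pre-corruption version of the data once the corruption is finally noticed. The one-day BCV retention comes from the scenario; the 30-day backup retention and the treatment of replication as having no history are assumptions made for illustration, and the sketch ignores the granularity of the four-hour BCV interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionCopy:
    name: str
    retention_hours: float  # how far back in time this method can restore


# BCV retention (one day) is from the scenario above.
# The other two values are illustrative assumptions.
COPIES = [
    ProtectionCopy("remote replication", 0),       # mirrors corruption, no history
    ProtectionCopy("BCV", 24),
    ProtectionCopy("nightly backup", 24 * 30),     # assumed 30-day retention
]


def usable_copies(detection_delay_hours, copies=COPIES):
    """Names of copies that still reach back to before the corruption."""
    return [c.name for c in copies if c.retention_hours > detection_delay_hours]


print(usable_copies(2))    # corruption noticed within hours
print(usable_copies(72))   # corruption noticed three days later
```

Under these assumptions, a corruption noticed within a couple of hours can be recovered from a BCV or from backup, while one noticed three days later leaves backup as the only option.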
Toward a layered data protection model
When we plan data protection services, it's generally agreed that the critical metrics are recovery time objective (RTO) and recovery point objective (RPO). But given the various failure conditions described here, it's highly improbable that a given RTO or RPO target could be successfully met across all failure scenarios.
For example, consider the typical BCV/replication/backup protection combo and assume a four-hour RTO and RPO. There are scenarios, such as latent data corruption, where backup is the only protection available and the four-hour recovery metrics would be completely blown away. Is there any way to mitigate this?
The answer is "Maybe." For instance, it's not uncommon for a diligent, risk-averse database administrator to perform database dumps to disk and keep days or even weeks of additional copies stowed away. If these were available, the recovery time could be considerably closer to the RTO target than it would be when restoring from backup. In either case, the RPO would likely be far exceeded.
When we plan for DR, we must take into account the types of risks and their probabilities. A similar approach should be considered when planning an overall data protection strategy. I'm suggesting that a layered services approach to data protection is needed in today's environments. Such an approach should identify the following for each layer:
- The risk to be mitigated
- The targeted level of protection to be provided based on probability and business impact
- The protection method required to address this need
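One way to record these three items per layer, and to see where a stated target can't be met, is sketched below. The four-hour RTO and RPO targets follow the earlier example; the layer names, methods and achievable recovery figures are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionLayer:
    risk: str                    # the risk to be mitigated
    target_rto_hours: float      # targeted level of protection
    target_rpo_hours: float
    method: str                  # protection method addressing the risk
    achievable_rto_hours: float  # what the method can realistically deliver
    achievable_rpo_hours: float

    def gaps(self):
        """List which targets ("RTO", "RPO") the method cannot meet."""
        missed = []
        if self.achievable_rto_hours > self.target_rto_hours:
            missed.append("RTO")
        if self.achievable_rpo_hours > self.target_rpo_hours:
            missed.append("RPO")
        return missed


def gap_report(layers):
    """Map each risk with at least one missed target to the missed targets."""
    return {layer.risk: layer.gaps() for layer in layers if layer.gaps()}


# Hypothetical figures for illustration only.
LAYERS = [
    ProtectionLayer("detectable corruption", 4, 4, "BCV", 1, 4),
    ProtectionLayer("latent corruption", 4, 4, "nightly backup", 12, 24),
    ProtectionLayer("site failure", 4, 4, "remote replication", 2, 0.25),
]

print(gap_report(LAYERS))
```

With these placeholder numbers, only the latent-corruption layer misses its targets, which is the kind of hole a layered review is meant to surface.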
Change is always difficult, but history is littered with examples of those who were unable to adapt. This approach may raise concerns about exposing weaknesses within IT's capabilities. If you fear such transparency, then there may be risks associated with this approach. For environments that believe they can recover instantaneously from any kind of data loss, the truth may be unpalatable. However, if an organization is sincere about addressing service-level objectives, this process can uncover holes in existing data protection strategies and ensure that when a major new technology direction is undertaken, it will be the right one.