Tech Talk: Backup and disaster recovery plan best practices
Date: Sep 13, 2013
What aspect of backup and DR [disaster recovery] can be dealt with together without messing up either process?
Jon Toigo: They're integral processes. They're joined together at the hip. I would say that the nexus that maybe isn't really understood between backup and disaster recovery is that disaster recovery plans have to be tested periodically to make sure that they are going to work and to make sure they're still up to date with what's required to recover business processes, which is what it's all about. A lot of people waste a lot of precious testing time doing data recovery testing -- seeing whether they can recover data successfully off of a set of backup tapes, or seeing whether data has been mirrored and whether they can switch to the mirrored copy of the data when the time comes.
We shouldn't have to do that as a part of disaster recovery testing. It increases the length of the test. It increases the costliness of the test. We should focus on a data protection scheme that gives us the ability to test the integrity of the data that we've captured -- that the right data were copied, that the data is restorable, that the data is recoverable. We should be able to do that on an ad hoc basis as part of the day-to-day operation. We shouldn't have to do that as part of a formal test.
So, I think the best thing is define a data protection strategy, like backup, that can be verified independently and test it on an ongoing basis, not as part of a formal test, which is a part of disaster recovery planning. But you keep trying to separate or differentiate backup from disaster recovery. The two are integrally related. There's no reason to do a backup unless it's for disaster recovery.
Does where you put your data copies change the way you approach DR?
Toigo: I would say, as long as you make a copy and store it off-site, you're about 99% of the way there. If you want to do that across a WAN, from one disk to another disk, that's an approach. It's an approach that some companies like to use. The trick is to make sure that the copy is put far enough away from the original in terms of distance, because as you know, we just had Hurricane Sandy that came ashore in the Northeast.
We've had other disasters that had a very broad geographical footprint. And if the data copy is in the same zone of destruction as the original disaster, you're hosed, putting it simply. So you want to go at least 50 miles to 100 miles away with your copy. If you're doing that on a WAN, a wide area network, that's prohibitively expensive if there is a lot of data to move.
Moving 10 terabytes [TB] of data across a wire using OC-192 technology available to core carrier network players, like AT&T, that'll take four hours. You do it across a T1 line, it will take over a year to move 10 TB of data. The fastest way ever invented to move data over distance is what's known as IPoAC -- IP, Internet Protocol, over avian carrier -- which is strapping a USB key to the leg of a passenger pigeon. The data gets there faster than it does going across a network. So, that's being kind of silly, but it shows some of the vicissitudes you run into with [WAN]-based disk-to-disk replication. Dubbing it to tape takes advantage of that avian carrier, if you will, but the avian carrier might be FedEx.
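The arithmetic behind those transfer times is easy to check with a quick sketch. The link speeds below are nominal line rates; real throughput after protocol overhead and contention is lower, which is why a 10 TB move over OC-192 stretches from the raw two-and-a-quarter hours toward the four hours quoted above:

```python
# Nominal line rates in bits per second (real-world throughput is lower).
LINKS = {"OC-192": 9.953e9, "T1": 1.544e6}

def transfer_time_seconds(bytes_to_move: float, bits_per_second: float) -> float:
    """Time to push a payload across a link at its full nominal rate."""
    return bytes_to_move * 8 / bits_per_second

ten_tb = 10e12  # 10 terabytes, decimal
for name, rate in LINKS.items():
    secs = transfer_time_seconds(ten_tb, rate)
    print(f"{name}: {secs / 3600:.1f} hours ({secs / 86400:.0f} days)")
```

The T1 figure comes out to roughly 600 days, which bears out the "over a year" claim.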
Clouds are the worst of both worlds. First of all, we have to go across a wire to park data up in a cloud, just like WAN-based replication. Also, if the cloud provider has a lot of customers in an area that was devastated by, say, Hurricane Sandy, they may never once have considered that everybody would ask for their data back at the same time. They didn't have the bandwidth to provide it, and we heard stories that some companies were waiting as long as a year to get their data back. Would you be able to go on with your business without your data for a year?
So, those are some of the gating factors on vetting these different solutions. Is it always the case? No. I mean, there are some clouds that have a little more alacrity than that, but then again, those clouds are probably within the same geographical area as your company, and both of you may get your clocks cleaned by a natural or man-made hazard with a broad geographical footprint. So, you've got to be aware, plan for the worst-case scenario, and figure out [how] to do it in the most cost-effective way.
If a company is trying to look at data protection holistically, where should they start? Backup, DR or archiving?
Toigo: I would say that where you start for holistic data protection is how you create an infrastructure. I want to only have gear that's manageable -- in other words, that I can see. Unfortunately, a lot of us haven't deployed infrastructure that's manageable. Now there are a couple of ways we can do this.
We can design it in and ask all of [our] vendors to build some management capability into their gear, like RESTful management, which is based on REST [representational state transfer], the architectural style underlying the World Wide Web. And we can wait for them to do that, which they may or may not ever do; or we can choose one of the management products that are out there, like Tivoli, CA, Symantec or whatever, and we deploy it and that's going to be our storage management framework from now on; and we tell every vendor we're not buying your gear unless it can be managed using this common management utility. Because we need to know what's going on with the infrastructure, because a disproportionate share of data disasters accrue to infrastructure failures of one type or another.
We want to see those situations burgeoning so that we can correct them before they hurt us. Now, beyond that is the data management problem. Data, when it's written by an application or an end user, should be afforded a set of services [that are] appropriate to that data. If it's data we're going to hold on to for a long time, maybe it needs to become a part of [a] tiering scheme. The data gets first written to really fast storage, where it's used during the time when it's accessed a lot, and [then] it migrates to slower storage, with greater capacity and lower cost, and then it migrates down to tape. If you do that in a rigorous way, you end up with about 60% of your data down at the tape layer, and that dramatically reduces the costs of your infrastructure. If you've got a 100 TB infrastructure, your total cost of ownership, if you're using tiering like I described, is going to be $350,000 on average. If you're not -- if you're using just two tiers of disk, fast disk and slow disk -- it's going to be somewhere in the neighborhood of $150 million.
These numbers are coming from one of the gods in this industry, a guy who runs Horizon Information Strategies. His name is Fred Moore, and he did the definitive work on storage tiering and what it costs. Now, that's a service applying a policy for data migration over its useful life. Another set of services might protect against different kinds of data disasters -- I call that defense in depth. There's a possibility that data can become corrupt when it's written to disk. So, maybe you want to do something like RAID or erasure coding, or one of these other techniques like continuous data protection where you're writing the data somewhere else on a continuous basis, so if you write it to disk and the disk fails or the data is corrupted when it's written, you can recover the data that's local.
What happens if somebody drops a can of Coke inside the cage, and it spills down the backplane of your array and fries it? What happens if a sewer pipe in a ceiling breaks and you can't go into it because it stinks or because it's unhealthy, it's a biohazard. Believe me, I've been through all of this. I've seen all of these things.
Then you have the geographical-footprint disasters, the hurricanes, the earthquakes, the CNN-style stuff, a terrorist attack, all of that. You're going to need different modalities to protect the data against different possibilities. The local one in the facility might be a mirroring scheme, where it replicates data on an ongoing basis from an array over here to an array over there, sitting right next to it or somewhere else on the corporate campus. It's all internal to the LAN of the company. Whereas the geographical disaster may require tape backup with off-site storage, may require WAN-based replication, something like that.
So, chances are you're going to be using an interlocking set of data protection strategies, all integrated as a set of services; and I should ideally be able to check a box as I write data, saying this data is going to get this service, this service and this service, and send it to that volume. I can do that if I virtualize my storage infrastructure. I can define virtual volumes that offer different combinations of services to the data written to them.
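A minimal sketch of that idea -- virtual volumes tagged with the protection services they apply -- where the service names are illustrative, not any vendor's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical catalog of data protection services a volume can bundle.
SERVICES = {"mirror", "cdp", "tape_backup", "wan_replication", "tiering"}

@dataclass
class VirtualVolume:
    """A virtual volume advertising the services applied to data written to it."""
    name: str
    services: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        unknown = self.services - SERVICES
        if unknown:
            raise ValueError(f"unknown services: {unknown}")

# Different volumes bundle different combinations of services; the writer
# just picks the volume whose bundle fits the data.
critical = VirtualVolume("critical", {"mirror", "cdp", "wan_replication"})
archive = VirtualVolume("archive", {"tiering", "tape_backup"})
```

The point of the design is that the application never manages protection directly; it only chooses a volume, and the volume's service bundle does the rest.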
What do you think the most frequent mistake is for companies trying to put together backup and DR plans?
Toigo: Two things: One, they tend to get too preoccupied with the threats. So, they end up writing a great script for a sci-fi movie. They spend a lot of time developing scenarios, because that's fun to do. I used to do that too. It's better for a Hollywood screenwriter than for a DR planner: You spend an exorbitant amount of time creating, 'Well, okay, terrorists break in, and they blow this up, and they cut these wires; and you know, how are we going to protect from something like that?'
Well, you know, you hire Bruce Willis. We all know that. Or an asteroid is coming from outer space, and how are we going to protect our facility and our data against that? Who cares? It's game-over at that point. Let's figure out the reality. So, I say, just assume you got a worst-case disaster and then build a plan in a modular way so that you can respond to any kind of an emergency that might come along -- that's the practical method. The second thing you do is to make sure that whatever you choose to deploy as a strategy for data protection, for disaster recovery, can be tested.
I want to make sure I've got a dashboard that shows me that a mirror is working, because mirrors themselves are inherently hard to test. You have to quiesce the application, flush the data out of the cache, write it to disk one and mirror it to disk two, shut the whole thing down, and then do a file-by-file comparison between disk one and disk two to make sure you're copying the data correctly. That takes a long time. Nobody ever does that. It's a hassle. So, we're exposed.
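The file-by-file comparison step he describes can be sketched in a few lines. This assumes the application has already been quiesced and both copies are mounted as ordinary file trees:

```python
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def compare_mirrors(primary: Path, mirror: Path) -> list[str]:
    """Return relative paths that are missing or differ on the mirror side."""
    problems = []
    for src in primary.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(primary)
        dst = mirror / rel
        if not dst.is_file() or digest(src) != digest(dst):
            problems.append(str(rel))
    return problems
```

Even a simple check like this illustrates why nobody runs it routinely: it reads every byte of both copies, which is exactly the "takes a long time" problem the dashboard is meant to replace.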
What we want to do is find a way to do that where we can test things readily. Last, don't call it disaster recovery. Management has this knee-jerk reaction not to want to spend money on protecting against a possibility that may never happen.
So, anyway, a lot of people tend to become so preoccupied about the threats that are associated with a disaster, they frame everything in, 'Oh well, we need to act immediately to safeguard against these disasters.' Management doesn't get that. So, what you need to do is call it something else. Lie, call it something completely different. Say, 'We're putting together data compliance. We're making sure the data we have to hold onto under HIPAA for the next seven to 10 years will be properly kept safe so that we're in compliance with the regulation and you won't go to jail.' If you call it something else -- compliance management, whatever -- you're likely to get funded, but if you call it disaster recovery, in my experience, nobody is going to give you any money for it.