Testing backup integrity and performance
GlassHouse Technologies storage consultant Jeff Harbert said backup administrators should test their backup infrastructures for data integrity, technology and performance. On the data integrity front, he warned that backup tools are not always reliable indicators of whether or not a backup is recoverable.
"We had a case with a client where an Oracle RMAN script didn't work, but [Symantec] NetBackup said all was OK," Harbert said. "The user didn't know that it didn't work until he went to restore."
"Thorough tests should include restoring the data without the deduplication device," he said. Some vendors like Data Domain offer snapshot backups of their systems for such a scenario. "Tests of encrypted backups should run through the scenario of the main key management system being down. The best case scenario would he that you have a backup of the key management system or replication of the key management server, test that."
When it comes to newer technologies, such as disk-based backup and data deduplication, Harbert recommended testing over a long period of time. "Some of the approaches to data deduplication can affect recovery performance in different ways over time," he said. "The last backup you did may be all pointers [to other data], that's when you need to test backup."
For performance, users should test new devices at the time of implementation. "Don't assume you're getting the rated throughput or that speed will follow functionality," he said.
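A rough way to check a device against its rated throughput is to time a sustained write and compare the result with the vendor's figure. The sketch below is illustrative only: `measure_write_throughput` and `fraction_of_rated` are hypothetical helpers, the target is assumed to be a file path on the device under test, and a single sequential write is far simpler than a real backup workload. Random data is used so that compression or deduplication on the target does not flatter the result.

```python
import os
import time
from pathlib import Path


def measure_write_throughput(
    target: Path,
    total_bytes: int = 64 * 1024 * 1024,
    block_size: int = 1024 * 1024,
) -> float:
    """Write random data to target and return the observed rate in MB/s."""
    block = os.urandom(block_size)  # incompressible test data
    written = 0
    start = time.perf_counter()
    with target.open("wb") as handle:
        while written < total_bytes:
            handle.write(block)
            written += block_size
        handle.flush()
        os.fsync(handle.fileno())  # include the time to reach the device
    elapsed = time.perf_counter() - start
    return written / elapsed / 1_000_000


def fraction_of_rated(measured_mb_s: float, rated_mb_s: float) -> float:
    """Return measured throughput as a fraction of the rated figure."""
    if rated_mb_s <= 0:
        raise ValueError("rated throughput must be positive")
    return measured_mb_s / rated_mb_s
```

Running the measurement at implementation time and again later gives a baseline for spotting the slow drift in performance that the rated number alone would hide.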
"The reality of testing is that it requires time, resources and management buy-in," Harbert acknowledged. Storage administrators at the show said these are the main challenges when it comes to testing backups, especially in newer Web 2.0 data centers that are moving away from traditional backups.
"We restore files that need to be recovered daily in our environment, but regular testing in addition to that takes resources we don't have," said Kris Knutson, director of infrastructure services for Carfax.com. Because the data center supports a website that can change up to 15 times daily, "Right now we're stressed to get product out," he said.
"Testing is important, but nobody does it," said Michael Passe, storage architect for Beth Israel Deaconess Medical Center. "Trying to select the right thing to test requires coordination and buy-in across the data center, which is never 100% certain." Passe noted that application owners in his shop share reporting data they get from their applications. "But we're not doing the reverse" with data from its EMC/WysDM reporting tool.
Setting realistic expectations with SLAs
Independent consultant Brian Greenberg gave users a structured template for setting SLAs for data protection in their environments. The template includes calculating the cost of each type of backup method available, ascertaining the relative importance of data based on disaster recovery objectives, and assessing backup policies according to retention periods, legal requirements and number of copies.
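The article does not reproduce Greenberg's template, so the following is only a sketch of how the inputs he lists (backup method and its cost, recovery objectives, retention period and number of copies) might be recorded per dataset. Every field name, the `DataProtectionSla` class and the simple cost formula are assumptions made for illustration; a real cost model would likely also account for retention, media and labor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProtectionSla:
    """One row of a hypothetical data protection SLA worksheet."""

    dataset: str
    backup_method: str
    rto_hours: float  # recovery time objective
    rpo_hours: float  # recovery point objective
    retention_days: int
    copies: int
    size_tb: float
    cost_per_tb_month: float  # assumed flat rate for the chosen method

    def monthly_cost(self) -> float:
        """Simplified cost: capacity times copies times the per-TB rate."""
        return self.size_tb * self.copies * self.cost_per_tb_month


def restore_order(slas: list[DataProtectionSla]) -> list[str]:
    """List dataset names with the tightest recovery time objective first."""
    return [sla.dataset for sla in sorted(slas, key=lambda sla: sla.rto_hours)]
```

Writing the numbers down per dataset, even in a form this simple, makes it possible to compare the cost of a backup method against how quickly the data actually has to come back.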
Users agreed that this is all a good idea, but indicated that, at best, they follow a far less formal version of Greenberg's template. Meanwhile, 19% of attendees at the conference's opening keynote answered, "What disaster recovery plan?" when asked how often they test disaster recovery.
"Our backup SLAs are not official – most are based around disaster recovery, but we've also developed SLAs internal to our own group based on data we know will be a loss to the business, even if it's not prioritized in the disaster recovery plan," said a systems administrator for a major utility who asked not to be named.
"Our company merged with another one about two-and-a-half years ago, and the disaster recovery plan is one of the last things being pulled together for the new organization," he said. But, "a series of informal conversations" with application owners have indicated that only about 10% of the data needs to be restored immediately in the event of a disaster, while the rest can wait for hours or, in some cases, days.
According to Passe, "We know the RTOs [Recovery Time Objectives] and RPOs [Recovery Point Objectives] on the disaster recovery side, and what trickles out of that is generally the right SLA for backup, but it's not concrete. Nobody can beat me over the head if I can't recover in X hours."