Recently, while preparing a TechTarget seminar for presentation in London on my favorite topic -- disaster recovery in the era of software-defined everything -- I happened upon a link to a video of another fellow's session that I had the pleasure of attending at the Fujifilm Global IT Executive Summit in New York City late last year. That session, which I found particularly illuminating and entertaining, featured Raymond Blum, a staff site reliability engineer at Google. I encourage everyone to go and watch it.
No, this isn't just another plug for Google or cloud backup or some related trope. With the start of a new hurricane season on the East Coast upon us, I find myself contemplating the serious business of data protection and disaster recovery. These business continuity functions require the implementation and rehearsal of processes, infrastructure and command-and-control structures in advance of any interruption event. There are no shortcuts.
Blum says all of this in a very interesting and entertaining way, recounting lessons learned from a disaster in 2011, when Google had "misplaced the email of over 130 million users" and needed to recover it -- fast! He observes that redundancy, the currently preferred approach of the software-defined storage/virtual SAN crowd, is woefully inadequate as a disaster recovery strategy that must guarantee data integrity or durability. Organizations need a "chief data integrity officer" who acknowledges the simple reality that replication and mirroring do not deliver disaster recovery: you need backups. Preferably several of them. Preferably on tape. Yes, you need tape backups.
"Redundancy doesn't do what people think," Blum said. "If you make five copies of data on disk mirrors, you've got five bad copies… [The 2011 Gmail outage] showed that redundancy is not a recovery strategy." In the case of a zero-day attack, according to Blum, "integrity may be jeopardized everywhere that data is replicated."
He is basically underscoring what I hope will become a well-understood point -- redundancy provides great protection against low-level component failures. But to guarantee recovery from a disaster, you need a backup -- preferably on tape -- that has been tested and found to be restorable.
Restorability is key to Google's tape strategy. Blum spends time describing how his company spreads data across five tapes in a way that enables any four of the tapes to restore the contents of the fifth should it fail. This is a hedge against media failures, which "happen to hundreds of tapes per year," he notes, quickly adding that this number is a tiny fraction of the tapes Google uses annually.
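To make the five-tape idea concrete, here is a minimal sketch of one way such a scheme can work: four data blocks plus one XOR parity block, much as in RAID 4. Blum's talk, as recounted here, does not say which encoding Google actually uses, so the XOR approach, the function names and the sample data are all assumptions for illustration only.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical contents of four data tapes (equal length for simplicity).
data_tapes = [b"mail", b"docs", b"logs", b"pics"]

# The fifth tape holds the XOR parity of the other four.
parity_tape = xor_blocks(data_tapes)
all_tapes = data_tapes + [parity_tape]

# Simulate losing any one tape, then rebuild it from the four survivors.
lost_index = 2
survivors = [tape for i, tape in enumerate(all_tapes) if i != lost_index]
recovered = xor_blocks(survivors)

print(recovered == all_tapes[lost_index])
```

Because XOR is its own inverse, the same function both builds the parity tape and rebuilds whichever single tape goes missing. Losing two tapes from the same set of five is not recoverable under this simple scheme.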
More interesting perhaps is the methodology he has implemented for backing up and restoring large volumes of data in a highly distributed server infrastructure. "It takes something like 112 years to back up an exabyte of data," he notes. But by divvying up the backup job among many servers using MapReduce, Google is able to back up an exabyte in seven days. His is a fascinating, if unintended, use of a technology better known for big data analytics of the Hadoop variety or high-performance cluster computing.
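The arithmetic behind those two figures is worth a quick look. The sketch below works out the single-stream throughput implied by "112 years per exabyte" and the parallel speedup needed to hit seven days. It assumes a decimal exabyte (10^18 bytes), a 365.25-day year and perfectly linear scaling, none of which comes from Blum's talk; the names are mine.

```python
EXABYTE_BYTES = 10**18      # assumption: decimal exabyte
DAYS_PER_YEAR = 365.25      # assumption: average year length
SECONDS_PER_DAY = 86_400

def implied_throughput(serial_years):
    """Bytes per second a single stream must sustain to finish in serial_years."""
    return EXABYTE_BYTES / (serial_years * DAYS_PER_YEAR * SECONDS_PER_DAY)

def speedup_needed(serial_years, target_days):
    """How many times faster the job must run to finish in target_days."""
    return serial_years * DAYS_PER_YEAR / target_days

rate = implied_throughput(112)
factor = speedup_needed(112, 7)

print(f"single stream: about {rate / 1e6:.0f} MB/s")
print(f"speedup needed: {factor:,.0f}x")
```

Under those assumptions, the job has to run roughly 5,844 times faster, which is to say across several thousand streams working in parallel. Real-world scaling is never perfectly linear, so the actual worker count would be higher.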
He concedes that no one likes to do backups, particularly tape backups. Still worse, no one wants to get the call that a disaster has occurred and a data restore is required. He recounts the process by which Google recovered from its email outage, noting that, at one point, one of the company's many tape libraries had a glitch in one of its robots, which picked a tape for use in the restore but kept placing it back on the shelf. Blum says he had to call a technician to go into the Oregon data center and resolve the problem. Fearing that it might break the library, the technician initially refused to shut down power to the tape library, pull the tape from the robotic claw and manually insert it into the maw of the tape drive. Blum's response was classic: "On my personal authority as Captain of the Enterprise, I order you to neutralize the hostile robot."
In the final analysis, nobody likes to do a backup, particularly a tape backup, or to spend time verifying the restorability of the data. However, it is the only way to be certain that the data you are stewarding has durability and integrity and that the applications and systems that are generating, processing, sharing, storing and networking the data are adequately protected by your strategy.
Was that a raindrop I just felt?
BIO: Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International, and chairman of the Data Management Institute.