Tech Talk: Classifying backup, disaster recovery and archiving
Date: Sep 13, 2013
Jon Toigo: Well, backup is one strategy for protecting data assets, and the protection of data assets is part of disaster recovery planning or business continuity planning. Without your data, you are out of business. So, backup is a way to safeguard those assets through what's known as the strategy of redundancy.
Most everything in your company can be protected in one of two ways. The first is the strategy of replacement. Say somebody drops a can of Coke over a server and fries it. You can go out and buy another server to replace it, or you can keep an extra copy of a server available so you can slide it into the rack and replace the one that died.
The problem with data is you can't replace it. Now, you'd be surprised how many companies still have plans dating back to the 1950s where they say, 'If there's ever an interruption or our building burns down, we'll get a whole bunch of people in and we'll retype all the invoices to bring our invoice system back up to speed.'
Interesting concept, but in the day of the Internet and the World Wide Web and 24/7 operations, you don't have time to do that. So, you can't replace data, you can only make it redundant. You can make a copy of it and get that copy out of harm's way. One of the most efficient ways ever invented to do that is backup. You'd back it up to tape. The tape is a removable medium: Stick it in a box, send it off to Iron Mountain and you've got a copy safeguarded off-site.
So, what would you say makes an archive an archive? A lot of companies treat their old backups as archives, so, what are they missing?
Toigo: Okay, there's archive with a small A, and there's archive with a capital A. I guess any collection of data is technically an archive with a small A. Some companies have thought that their backups, which represent a snapshot of all data at one particular point in time, [are] in fact an archive. They used to just save their old backups and treat those as though they were archives.
However, real archiving may require capabilities that aren't facilitated by a backup set. You may want to search an archive, find an instance of a particular kind of data or find a chain of information within all the data that you have. That's hard to do in a backup.
Archiving software responds to certain policies and groups things together in a logical way. It provides an index and metadata to facilitate search and discovery. And oftentimes it gives you a lot of flexibility in how you organize the data.
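As an illustration only (not any specific archiving product), the index-plus-metadata idea Toigo describes might be sketched like this; the entry fields and tag-based search are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ArchiveEntry:
    """Metadata recorded for each archived item -- this is what makes
    the archive searchable, unlike a raw backup container."""
    path: str      # where the item lives in the archive
    owner: str     # responsible department or user
    created: str   # creation date, for retention policies
    tags: list     # descriptive labels to support discovery

class Archive:
    def __init__(self):
        self.index = []          # the index that backups lack

    def add(self, entry: ArchiveEntry) -> None:
        self.index.append(entry)

    def search(self, tag: str) -> list:
        """Find items by tag -- the kind of discovery query that is
        hard to run against an opaque backup set."""
        return [e for e in self.index if tag in e.tags]
```

The point of the sketch is the contrast: a backup is one opaque container, while an archive carries per-item metadata that policies and searches can act on.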
Basically, an archive offers a greater degree of granularity in terms of the organization of the data and its discoverability or its searchability. That's very different from a backup. A backup is simply a copy of data at a given point. However, an archive should have these extra attributes to it, the archive with a capital A. By the way, that's called a deep archive.
With a capital A.
Toigo: Right. There's also something called [an] active archive, which further confuses the picture. That's using tape as a file system, using tape as a file server, via LTFS, the Linear Tape File System.
There are a number of products on the market today that claim to reduce the number of copies of data you need for data protection. Do they really do that?
Toigo: Yes, you're talking about deduplication, for one thing. Deduplication was [once] described by a very, very smart guy as a waste management system for backup. Backups traditionally were full volume, meaning you would back up everything. And then the next night, you would back up everything again, hoping to capture the data that has changed. And then the next night, you would capture the whole backup again, hoping to capture the new data that has been added or the data that's been changed.
And you do that over and over again. Let's assume that it's a 1-terabyte backup; at the end of the week, you would have to allocate somewhere along the lines of 5 TB to 6 TB of storage capacity for all those copies of the data.
The thing is, as much as 90% of each copy is in fact the same as the data you already have.
So, if we could take that data out, we shrink the requirement from 5 TB for five days of backups to maybe 1.5 TB, which of course makes sense because we're trying to economize on disk.
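The storage math Toigo walks through can be worked out directly; the 1 TB nightly full and the ~90% duplicate fraction are the figures from the discussion:

```python
# Worked version of the storage math above: five nightly 1 TB full
# backups, with roughly 90% of each night's copy identical to data
# already held.
full_backup_tb = 1.0                   # size of one full backup
days = 5                               # one full backup per weeknight
naive_tb = full_backup_tb * days       # keep every full copy: 5 TB

dup_fraction = 0.90                    # share of each copy that is duplicate
# Keep the first full copy, plus only the ~10% new/changed data each night.
deduped_tb = full_backup_tb + (days - 1) * full_backup_tb * (1 - dup_fraction)

print(naive_tb, round(deduped_tb, 2))  # 5.0 1.4 -- in line with the ~1.5 TB cited
```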
Deduplication products go out and find the same bits or the same files or the same whatever; and wherever they find duplicates, they take whatever the latest and greatest version is, save that and discard the copies. That way, you condense the amount of information. That technology works, but you worry about what it's going to take to 'rehydrate' the data, or take it out of its deduplicated state and read it again.
In some cases, this is a nonissue, but in many cases the algorithms actually compress the data in such a way that you need a reverse algorithm to pull it back out again to get at all the data that's there. That actually adds time to recovery in a disaster-recovery scenario. It also adds complexity and requires that you have access to the software that was used to deduplicate it in the first place.
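A minimal sketch of the mechanism being described, using fixed-size chunks and content hashing; the chunk size and function names are illustrative and not any vendor's actual implementation. Note how the `rehydrate` step depends on both the chunk store and the recipe, which is exactly the dependency Toigo is worried about:

```python
import hashlib

CHUNK = 4096  # illustrative fixed chunk size

def dedupe(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks, keep one copy of each unique
    chunk keyed by its SHA-256 hash, and return the ordered list of
    hashes (the 'recipe' needed to rebuild the original)."""
    recipe = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)   # duplicate chunks are stored only once
        recipe.append(h)
    return recipe

def rehydrate(recipe: list, store: dict) -> bytes:
    """Reassemble the original data. Without the store, the recipe and
    the software that understands both, the data is unreadable -- the
    recovery-time dependency discussed above."""
    return b"".join(store[h] for h in recipe)
```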
I don't want to deal with any of that stuff. My building just burned down, I want to be able to put my data back into a usable state without it relying on a whole lot of instrumentality to do that. I understand what it's for, and it has its place, but I'm not a huge advocate of dedupe.
Also, a lot of my clients that are financial institutions don't deduplicate their data, especially financial data that's required by the SEC [Securities and Exchange Commission], because they're concerned about what would happen if they were ever sued by an irate shareholder who asks, 'Did you ever deduplicate your data?' and the lawyer says, 'Well, I don't know what deduplication is. I'll find out.'
He goes to IT. IT says, 'Yeah, we use deduplication; sometimes we restore data from deduplication.' So, the shareholder will call a sidebar with the judge and say, 'Your honor, according to SEC rules, you're not supposed to provide anything but a full and unaltered copy of this financial data. It's a violation of SEC guidelines.' There's no proof that deduplication doesn't materially alter the data, so my clients say, 'Look, I don't want to be the poster child.'
It doesn't matter if it does or doesn't. In fact, it probably doesn't, but that doesn't stop there from being an $11 million bill in a lawsuit to prove it. And they don't want to spend that kind of money on something like validating that deduplication doesn't materially alter data.
Is dedupe effective? To a certain extent, yes, it is, but you can also do the same thing with incremental backups. That's where you go in and just take the changed files and you make a backup of those on night No. 2, No. 3, No. 4, No. 5.
It's the same thing except, instead of doing a full volume, then shrinking out the stuff that's already there, I just copy the changed data. In fact, I wonder these days if backups really are the preferred method to go at all. You know, in terms of the traditional mode of backup.
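A rough sketch of that incremental pass, assuming a hypothetical cutoff timestamp carried over from the previous night's run; real backup software would also track deletions and keep a catalog rather than relying on raw modification times:

```python
import os
import shutil

def incremental_backup(src: str, dst: str, last_run: float) -> list:
    """Copy only files modified since the previous backup (the
    incremental approach described above), preserving the directory
    layout under dst. Returns the relative paths that were copied."""
    copied = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) > last_run:   # changed since last run?
                rel = os.path.relpath(path, src)
                target = os.path.join(dst, rel)
                os.makedirs(os.path.dirname(target) or dst, exist_ok=True)
                shutil.copy2(path, target)          # copy2 preserves timestamps
                copied.append(rel)
    return copied
```

The effect matches Toigo's point: instead of writing a full volume and shrinking it afterward, only the changed data ever gets copied.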
Backup takes all of the data you have, assembles it into a container, a backup file, and stores that file out on whatever media you're going to store it on. To restore, you need the software you used to create the backup container to restore that container back onto disk and then to open that container up and unpack all that chewy goodness, the files that are inside. That's a lot of work, and it takes time to do it.
Today, with products like LTFS -- linear tape file system -- if I've got a file system over here that has a whole bunch of files that I need to back up, I can just copy them over to tape with the file system intact. I don't need a backup container to do that. That gives me a new and kind of innovative and much simplified way of making a copy of all that data.
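Because LTFS presents the tape as an ordinary mounted file system, the copy Toigo describes can be a plain recursive file copy; this sketch assumes a hypothetical mount point passed in as `tape_mount`:

```python
import shutil

def copy_to_tape(src: str, tape_mount: str) -> None:
    """With LTFS the tape appears as a normal directory (e.g. a mount
    like /mnt/ltfs), so an ordinary recursive copy lands the files on
    tape with the file system structure intact -- no backup container,
    and no proprietary restore software needed to read them back."""
    shutil.copytree(src, tape_mount, dirs_exist_ok=True)
```

Restoring is the same operation in reverse: the files on tape are just files, readable by anything that can mount the LTFS volume.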