Network-attached storage (NAS) backup can be a challenge for many data storage administrators because, in order to back up data, it has to be sent to a backup server across the network.
In this Q&A, W. Curtis Preston, an independent backup expert, discusses some of the most common challenges associated with NAS backups, the pros and cons of NDMP, and other technologies that help simplify NAS backup. His answers are also available as an MP3 below.
NAS Backup Table of Contents:
>> What are some of the most common challenges associated with NAS backup?
>> Is NDMP a good option for network-attached storage backup?
>> Are there any limitations with NDMP?
>> What other technologies exist to simplify NAS backup?
The challenge with NAS backup is the base protocol. When we're talking NAS, we're talking NFS or Server Message Block (SMB), also known as CIFS. The alternative to NAS is going to be a storage area network (SAN) or locally attached storage, where the file system is on the server itself. With NAS, in order for the backup software to access the files, it has to access them over that protocol -- over NFS or over SMB -- which means by default it's going over IP. The server on the other side, this NAS filer, might be able to meet the performance requirements. But that network could be just about anything; it could be a direct connection or a high-performance switch, or it could be garbage -- a very congested network or an older switch. In other words, the challenge is that you're forced to do that backup over the protocol.
This means that you can't do some other fancy things. For example, in the regular file system world, one of the data backup challenges that we run into is what is called the "million file problem," where millions of files are in a single file system. When you're doing backup at the file system level and you have millions of individual files, it tends to break down once you get into the double-digit millions. The backup ends up being slower and the restore is also massively slower. For example, I've seen a 20 GB file system that took 72 hours to restore, not because the backup software didn't know what it was doing, but because of the overhead of creating the hundreds of millions of inodes that particular file system needed. When you're not backing up data in a NAS scenario, you can do things called image backups, where you actually back up at the disk or LUN level and then just map the files. Then when you do a large restore, the restore is faster because it's operating at the LUN level. If you have to restore that large file system, you can just restore the entire LUN, and then all the files magically appear. This can be a hundred times faster than the alternative.
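To make the "million file problem" concrete, here is a rough back-of-envelope model in Python. It assumes restore time is simply per-file metadata overhead plus bulk data transfer. The file count, the 5 ms per-file cost, and the 100 MB/s throughput are hypothetical values picked for illustration, not measurements from the restore described above.

```python
def restore_hours(n_files, total_gb, per_file_ms, throughput_mb_s):
    """Estimate restore time as per-file overhead plus bulk data transfer."""
    per_file_seconds = n_files * per_file_ms / 1000.0
    transfer_seconds = total_gb * 1024.0 / throughput_mb_s
    return (per_file_seconds + transfer_seconds) / 3600.0


# Hypothetical inputs: 20 GB of data spread over 50 million small files,
# 5 ms of metadata work per file, 100 MB/s of raw throughput.
file_level = restore_hours(50_000_000, 20, 5.0, 100.0)

# Image-level restore: the same 20 GB moved as one LUN image,
# so there is no per-file work in this simplified model.
image_level = restore_hours(0, 20, 0.0, 100.0)

print(f"file-level restore:  {file_level:.1f} h")
print(f"image-level restore: {image_level:.2f} h")
```

Under these assumptions nearly all of the file-level restore time is spent creating files rather than moving data, which is why operating at the LUN level helps so much. Real systems will differ, so treat the exact ratio as illustrative only.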
On the NAS side, you do not have that alternative because you do not have image-based backups. The only way to get to the data is through NFS or SMB. So that's the core challenge: you're tied to the network, and networks can have all kinds of different levels of performance.
NDMP was invented about 10 years ago. At that time there was NetApp Inc., and NetApp was this weird box that you couldn't back up. People would ask, "How do you back that up?" The initial answer was, "It's NFS; it does NFS really well, so go ahead and back that up." However, many people didn't want to back up their data over NFS. That's how the concept of NDMP came about. There were a bunch of vendors that needed a way to back up NetApp, or other competitors similar to NetApp, so they created NDMP.
A lot of people refer to NDMP as a backup format or protocol, but it's really a management protocol through which a backup server can talk to a filer. Basically, the backup software does the job of putting a tape in the tape drive (or readying the virtual tape or whatever device you're going to back up to) and getting everything ready. Once that is set, the backup software tells the filer to do a backup. And how that filer does the backup is really up to the filer.
One thing that was left out of the spec was the backup format, because there were several competitors, each of which was using its own backup format to back up its filers. NetApp, for example, uses dump; others use cpio, some use tar, enhanced tar, etc. When the backup software connects to the filer over NDMP, it says, "I've put a tape over here, and I want you to do a backup to this device." And then there is something built into the protocol to communicate back to the backup software that says, "Yes, we did the backup and here are the files in that backup." The backup software can then access those files.
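The division of labor described above can be sketched as a toy model in Python. This is not the NDMP message set or any vendor's API: `ToyFiler`, `ToyBackupServer`, and their methods are hypothetical names, and the "device" is just a Python list standing in for a tape that has already been loaded and positioned.

```python
class ToyFiler:
    """Stands in for a NAS filer; the backup format is the filer's own choice."""

    def __init__(self, name, backup_format, files):
        self.name = name
        self.backup_format = backup_format  # e.g. "dump"; not dictated by the server
        self.files = list(files)

    def backup_to(self, device):
        # The filer writes the data stream in its native format...
        device.append((self.backup_format, list(self.files)))
        # ...and reports back which files went into the backup.
        return list(self.files)


class ToyBackupServer:
    """Stands in for the backup software: it controls the job and keeps the catalog."""

    def __init__(self):
        self.catalog = {}

    def run_backup(self, filer, device):
        # The server has prepared the device; now it tells the filer to back up.
        history = filer.backup_to(device)
        # The server records the file list, but never chose the format.
        self.catalog[filer.name] = history
        return history


tape = []  # hypothetical device, already loaded and positioned
server = ToyBackupServer()
filer = ToyFiler("filer1", "dump", ["/vol/a.txt", "/vol/b.txt"])
server.run_backup(filer, tape)
print(server.catalog, tape[0][0])
```

The point of the sketch is that the server only sees a file list come back, while the format written to the device is entirely the filer's decision.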
The backups can be done in three different ways: filer to self, filer to filer and filer to server. Filer to self uses a tape drive that connects directly to the filer that is being backed up. Filer to filer uses a tape drive that is locally attached to another filer: one filer backs up data across the network to the second filer, and then that filer transfers the data to tape. In filer to server, the server is a backup server that actually accepts the NDMP protocol. The filer can back up across the network to that server, which puts the data on tape or another device.
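The three configurations can be summarized in a small Python lookup. The type and function names here are hypothetical and exist only for illustration; the mapping itself just restates the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Topology(Enum):
    FILER_TO_SELF = "filer to self"
    FILER_TO_FILER = "filer to filer"
    FILER_TO_SERVER = "filer to server"


@dataclass(frozen=True)
class DataPath:
    tape_attached_to: str   # which system the tape (or other device) hangs off
    crosses_network: bool   # whether backup data travels over the network


def data_path(topology: Topology) -> DataPath:
    """Return where the backup data ends up for each configuration."""
    if topology is Topology.FILER_TO_SELF:
        return DataPath("source filer", False)
    if topology is Topology.FILER_TO_FILER:
        return DataPath("second filer", True)
    return DataPath("backup server", True)


for topology in Topology:
    path = data_path(topology)
    print(f"{topology.value}: device on {path.tape_attached_to}, "
          f"network hop: {path.crosses_network}")
```

Only filer to self keeps the backup data off the network entirely.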
NDMP provides this control protocol, and because it can do filer to self, it means that the backup software can put a tape in the right position and tell the filer to back up directly to a tape drive or virtual tape drive that is physically connected to the system, taking the network portion out of things. By doing that it can use a more efficient protocol for that server.
What's cool about NDMP is that instead of using it to do tape backups, you can actually use it to catalog snapshots. You can use it in your backup software so that the software is aware of the various snapshots that are out there and you can restore them through your backup software.
Essentially, NDMP allows people to do locally attached backups with their large filers directly to tape or virtual tape, which is what most people are interested in doing. This also significantly increases the performance of their backups and restores.
Not including the backup format as part of the NDMP protocol is one of its key limitations. For example, let's say you have a NetApp and an EMC Corp. Celerra and you want to move all your data over to the Celerra. In this scenario, you cannot take NDMP, back up the NetApp and then restore that NDMP backup to the Celerra. And this scenario is not limited to NetApp and Celerra; it applies to other pairings as well, such as Celerra and Hitachi Data Systems (HDS). Different vendors have different backup formats that are incompatible with each other.
Here's another scenario where NDMP shows its limitations: Let's say that EMC decides two years from now that they no longer want to use dump, and they want to move forward to a new and better backup format that's yet to be invented. So they change from dump to something else. But what happens to the previous tapes? There's a concern there. They'd have to build backwards support, and potentially, if the vendor in question did it wrong, you might have to keep older filers around just to do restores.
Another limitation is related to file-level formats. Most of the backup formats -- dump, tar, cpio -- are file-level formats. If you have that million file problem, you again have this problem with files, which is why some of the vendors offer an image-level option. Several backup products support this, and when you kick off NDMP, you say, "Back up this volume. Don't use dump, use this other format." When you do that, you end up with an image-level backup, which can only be restored back to a similar size volume, and you cannot do file-level restores.
I'm a fan of near-CDP, which is basically snapshots and data replication. If your filer vendor has snapshots and replication integrated into the filer, then that is the way you should go for backing it up, because both work at the delta level. You never again do a full backup. The replicated copy looks like a full backup, but each backup only transfers the bytes that changed. What you have is not only the data as it looks today, but also what it looked like yesterday, the day before and the week before -- all in the same place. Plus, the replicated copy is not in a backup format, meaning it doesn't need to be restored to be useful. You can actually point your application at the secondary filer and use it as your primary filer while the primary filer is being restored. Once your primary filer is available again, you can either restore back to it in full or, in some cases, do what I call a delta restore. For example, if just your power supply was down, and your filer is only a few hours out of sync with what's really going on, you can do a reverse delta replication, which copies back just the bytes that have changed in the past few hours.
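The delta idea can be sketched in a few lines of Python. This is a simplified model, not any vendor's replication engine: volumes are plain dictionaries mapping a block ID to its contents, the function names are hypothetical, and snapshot history is left out to keep the example short.

```python
def changed_blocks(old, new):
    """Return the blocks that differ in `new`, plus IDs that no longer exist."""
    delta = {bid: data for bid, data in new.items() if old.get(bid) != data}
    deleted = [bid for bid in old if bid not in new]
    return delta, deleted


def apply_delta(volume, delta, deleted):
    """Return a copy of `volume` with the delta applied."""
    result = dict(volume)
    result.update(delta)
    for bid in deleted:
        result.pop(bid, None)
    return result


# Hypothetical volumes: block ID -> block contents.
primary = {0: b"boot", 1: b"alpha", 2: b"beta"}
replica = dict(primary)  # the one and only full copy

# Next day: one block changes and one is added on the primary.
primary = {0: b"boot", 1: b"alpha-v2", 2: b"beta", 3: b"gamma"}
delta, deleted = changed_blocks(replica, primary)
replica = apply_delta(replica, delta, deleted)
forward_blocks_sent = len(delta)

# The primary goes offline; applications write to the replica in the meantime.
replica = apply_delta(replica, {2: b"beta-v2"}, [])

# Reverse delta replication: send back only what changed while it was down.
reverse_delta, reverse_deleted = changed_blocks(primary, replica)
primary = apply_delta(primary, reverse_delta, reverse_deleted)

print(forward_blocks_sent, len(reverse_delta), primary == replica)
```

In this toy run only the changed blocks cross the wire in either direction, yet the replica is always a complete, directly usable copy.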
So if you have a NAS product, I would highly recommend that you examine that functionality as the best way to do backups. And if your product doesn't have that functionality, then perhaps you should consider another NAS product, because in my opinion near-CDP is the best way to back up those systems.
This was first published in August 2010