|Disk-based backups are more forgiving|
When backups are allowed to run indefinitely until they are complete, the following bad things happen:
- Degraded performance of backup clients during business hours
- Inability to start the next night's backup because the previous night's backups are still running
- Loss of maintenance windows for the backup server and tape libraries
- Increased likelihood of encountering open files, which are frequently skipped by backup applications
There are two fundamental ways to shorten a backup window: reduction and elimination. Reducing the amount of data to be backed up can cut backup durations by 50% or more. The elimination approach uses snapshots and point-in-time (PIT) copies to shrink the backup window to just minutes.
|NDMP speeds backup traffic|
The Network Data Management Protocol (NDMP) began in 1996 as an initiative to create an open standard for network-attached storage (NAS) backup. Pioneered by Intelliguard and Network Appliance, NDMP is now supported by most major backup software and NAS hardware vendors, as well as operating system providers. In April 2000, control of the specification moved to a working group operating under the auspices of the Storage Networking Industry Association (SNIA). The protocol standard is currently in v.3, with v.4 being finalized by the NDMP Working Group.
NDMP defines a way for heterogeneous file servers on a network to be backed up. As with many initiatives, a stated goal of the organization's efforts has been to create a standard approach to network backup that reduces the amount of effort backup software vendors have to expend to add support for the list of NAS platforms and operating system releases. This standards-based approach is designed to ensure the availability of backup-ready products from a variety of vendors who only need to make a minimal investment.
According to NDMP.org, the protocol allows the creation of a common agent used by the central backup application to back up different file servers running different platforms and platform versions. An NDMP server essentially provides two services. The first is a data server, which either reads from disk and produces an NDMP data stream (in a specified format) or reads an NDMP data stream and writes to disk, depending upon whether a backup or restore is taking place. The second is a tape server, which either reads an NDMP data stream and writes it to tape, or reads from tape and writes an NDMP data stream, again depending upon whether a backup or restore is taking place. Tape-handling functions such as split-image issues are managed by the tape service.
So how does NDMP shorten the backup window? NDMP minimizes network congestion by separating the data path from the control path. Backups occur locally on file servers, writing directly to Fibre Channel- or SCSI-connected tape drives, while management is centralized. NDMP, as a standard protocol, is promoted and supported by server vendors, backup software vendors and backup device vendors. A variety of products that use the protocol have sprung up--backup software, messaging appliances, tape products and products supporting NAS filers. Visit www.ndmp.org for more information, or view a list of compliant products at http://www.ndmp.org/products/index.shtml#backup.
The most common ways to reduce backups include:
- Reducing the amount of data selected for backup
- Reducing the number of full backups
- Using hierarchical storage management (HSM)-like tools to migrate data off primary file systems and data stores
Exclude lists, which reduce the amount of data selected for backup, can be maintained globally on the backup server or locally on each client. Excluding files can be tricky because not all systems use the same naming or usage conventions. For example, some administrators may store files of value in a temp directory.
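A minimal sketch of how a global exclude list combined with a per-client list might filter a backup selection. The patterns, paths and function name are hypothetical; real backup products use their own exclude syntax. The example also shows the risk described above: a valuable file stored in a temp directory is silently skipped.

```python
from fnmatch import fnmatch

# Hypothetical global patterns maintained on the backup server.
GLOBAL_EXCLUDES = ["*/tmp/*", "*.swp", "*/cache/*"]


def select_for_backup(paths, global_excludes=GLOBAL_EXCLUDES, client_excludes=()):
    """Return the paths that match no global or client-level exclude pattern."""
    patterns = list(global_excludes) + list(client_excludes)
    return [p for p in paths if not any(fnmatch(p, pat) for pat in patterns)]


paths = [
    "/home/ann/report.doc",
    "/home/ann/tmp/budget.xls",      # a file of value kept in a temp directory
    "/var/cache/app/index.bin",
]
selected = select_for_backup(paths)
skipped = [p for p in paths if p not in selected]
```

Reviewing the skipped list on a few representative clients before rolling out a global pattern is one way to catch files that would otherwise be lost.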
If exclude lists don't appear to be a reliable option, the amount of data backed up each night can be reduced by performing full backups less frequently. A common approach is to perform a full backup once a week and incremental backups the other six days of the week. This means that every week large numbers of files are backed up over and over again, even though they haven't been touched since the last full backup.
Most configurations include a full backup at least once a week to reduce the number of incremental tape volumes that must be read to perform a restore. If full backups are performed only once a month, it's possible that up to 29 incremental backups would need to be restored to retrieve the full file system or directory. Traditional tape technologies--and loading and reading from a large number of tapes--are typically slow.
New technologies, such as backup to disk and synthetic fulls, are making full backups a less-frequent requirement. Backup to disk eliminates the overhead associated with the tape loading and seeking encountered with each individual incremental backup. Synthetic fulls create a full backup-like dataset by consolidating files from the last full backup and subsequent incrementals on various tapes into a single dataset that spans as few tape volumes as possible (the number of volumes equals the dataset size divided by the tape volume capacity, rounded up). Both backup to disk and synthetic fulls eliminate the need to load and read from individual incremental tape volumes, minimizing the restore time.
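The arithmetic behind this trade-off can be sketched as follows. The function name is illustrative, and the model assumes one incremental backup per day after each full.

```python
def backup_sets_to_restore(days_since_full):
    """Count the backup sets a full restore must read:
    the last full backup plus one incremental for each day since it ran."""
    if days_since_full < 0:
        raise ValueError("days_since_full must be non-negative")
    return 1 + days_since_full


# Weekly fulls: worst case is the day before the next full (6 incrementals).
weekly_worst_case = backup_sets_to_restore(6)

# Monthly fulls: worst case with 29 incrementals since the last full.
monthly_worst_case = backup_sets_to_restore(29)
```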
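As a rough illustration of the synthetic-full idea, the sketch below merges a full backup catalog with later incrementals so that the newest version of each file wins, then computes how many tape volumes the result needs. The catalog layout (path mapped to a version label and size) is an assumption made for the example, and file deletions are not modeled.

```python
def synthesize_full(full, incrementals):
    """Merge a full backup with incrementals given oldest to newest.
    Later entries replace earlier ones, so each path keeps its newest version."""
    merged = dict(full)
    for incremental in incrementals:
        merged.update(incremental)
    return merged


def volumes_needed(dataset_size, volume_capacity):
    """Dataset size divided by tape volume capacity, rounded up."""
    if volume_capacity <= 0:
        raise ValueError("volume_capacity must be positive")
    return (dataset_size + volume_capacity - 1) // volume_capacity


# Hypothetical catalogs: path -> (version label, size in GB).
full = {"/db/a.dat": ("v1", 100), "/db/b.dat": ("v1", 200)}
incrementals = [
    {"/db/a.dat": ("v2", 120)},   # a.dat changed after the full
    {"/db/c.dat": ("v1", 50)},    # c.dat created later
]

synthetic = synthesize_full(full, incrementals)
total_gb = sum(size for _version, size in synthetic.values())
tapes = volumes_needed(total_gb, volume_capacity=200)
```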
A third method of reducing the amount of data being backed up is to deploy HSM technology. HSM is a policy-based data migration tool that moves infrequently accessed files to a different storage target. In addition to reducing the amount of primary storage required, HSM also reduces the amount of storage backed up during full backups because only a file stub remains on the primary file system when a file is migrated.
In many organizations, much of the data backed up during each full backup includes files that haven't changed in months or years. But HSM can present its own set of challenges. Specifically, the impact of an HSM solution on the various applications that might be affected needs to be understood. Also, there should be an awareness of how the HSM and backup applications interact.
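A minimal sketch of an age-based HSM policy and its effect on full backup size. The 180-day threshold, the 4 KB stub size and all names here are assumptions for illustration; actual HSM products define their own policies and stub formats.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    days_since_access: int


def plan_migration(files, max_idle_days=180, stub_bytes=4096):
    """Pick files idle longer than max_idle_days and estimate the full backup
    size before and after they are replaced by stubs on primary storage."""
    migrate = [f for f in files if f.days_since_access > max_idle_days]
    keep = [f for f in files if f.days_since_access <= max_idle_days]
    before = sum(f.size_bytes for f in files)
    after = sum(f.size_bytes for f in keep) + stub_bytes * len(migrate)
    return migrate, before, after


files = [
    FileInfo("/proj/2001/archive.tar", 1_000_000, 400),
    FileInfo("/proj/current/plan.doc", 500_000, 3),
    FileInfo("/proj/2002/results.csv", 2_000_000, 200),
]
to_migrate, full_before, full_after = plan_migration(files)
```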
Reducing the bottleneck
After reducing the amount of data being backed up through file exclusions, less-frequent full backups or HSM technologies, it's time to look at moving the data from the source client to the backup target more quickly. This effort entails finding and eliminating bottlenecks along the data path. Common bottlenecks in the data path include:
- Client resources such as disk drives, CPU cycles, memory, network interface and file system attributes
- Network resources such as the IP LAN or ISLs in the Fibre Channel (FC) SAN fabric
- Backup server resources such as CPU cycles, memory, network interface and tape/disk backup target devices
Data can only be backed up as fast as the client can source the data. Running a backup creates a substantial load on the backup client because reads of nearly every file in the file system are required. Common client system components such as disk drives, memory, CPU cycles and network interface are all taxed when a backup runs.
Another bottleneck exists with the physical storage. It's not uncommon to have the same group of spindles accessed from two different servers or two file systems from the same server. Depending on the disk configuration, simultaneous backups of different clients or file systems could create disk contention, limiting the client's ability to read the data. Rescheduling backups that share common spindles to run at different times is one solution.
Another bottleneck may occur in organizations where the majority of clients still back up over an IP LAN. The processing overhead associated with pushing a large, sustained amount of data over an IP network interface card (NIC) can tax client CPUs. The CPU load created by the IP processing overhead is frequently associated with iSCSI performance issues, but also can impact backup performance. New backup architectures such as SAN-based backups send the data files over shared FC SAN-attached devices using a more efficient protocol optimized for the large sustained data movement associated with backups. SAN-based backups also reduce the overall load on the company LAN because the data files are copied to the backup devices using the FC SAN.
File system characteristics also can be a source of slow backup performance. File systems with millions of small files (which is becoming more common) usually back up more slowly because of the overhead associated with recording the metadata for each file on the backup server and the time it takes for the file system to look for changed files.
Typically, the overhead of recording metadata is negligible because the ratio of data to new files is very high. However, in systems with large numbers of small files, this ratio reverses and the overhead impacts overall performance. File systems also can cause bottlenecks during incremental backups, when the backup client needs to check the file system to identify which files have changed.
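A simple cost model makes the reversal concrete: total time is the data transfer time plus a fixed cost per file. The 50MB/sec throughput and 5 ms per-file overhead below are assumed figures chosen only to illustrate the effect, not measurements.

```python
def estimated_backup_seconds(total_bytes, file_count,
                             throughput_bytes_per_sec, per_file_overhead_ms):
    """Transfer time plus a fixed metadata cost for every file backed up."""
    transfer = total_bytes / throughput_bytes_per_sec
    metadata = file_count * per_file_overhead_ms / 1000
    return transfer + metadata


HUNDRED_GB = 100_000_000_000
THROUGHPUT = 50_000_000  # assumed 50MB/sec sustained

# The same 100GB, stored as a thousand large files or ten million small ones.
few_large = estimated_backup_seconds(HUNDRED_GB, 1_000, THROUGHPUT, 5)
many_small = estimated_backup_seconds(HUNDRED_GB, 10_000_000, THROUGHPUT, 5)
```

Under these assumptions the per-file cost adds about five seconds in the first case and roughly 14 hours in the second, even though the amount of data is identical.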
|Two-tiered backup architecture|
In a traditional client/server backup architecture, data sent from a client to a backup server moves through the backup server to the target devices (see "Two-tiered backup architecture"). In traditional IP-based architectures with a large number of clients, a tremendous amount of data will pass through a single backup server. Backup servers' CPUs, memory, NICs or internal I/O buses are frequently maxed out in larger environments.
The introduction of two-tier backup architectures allows much of the load associated with moving data from the client to the backup target to be offloaded to dedicated data movers (storage nodes/media servers). The centralized backup server is still responsible for managing all of the metadata and shared library/robot control. NDMP (see "NDMP speeds backup traffic") allows network-attached storage (NAS) appliances to act as data movers, minimizing backup-generated LAN traffic.
At the far end of the backup data path, tape drives are often the focus of backup bottlenecks. Newer tape drives are fast, with some exceeding 30MB/sec. It isn't uncommon for a tape drive to achieve higher throughput rates than disk drives. But achieving maximum tape drive throughput depends on sufficient amounts of data being sent to the tape drive to sustain data streaming. If insufficient data is sent to the tape drives, back-hitching will occur, greatly reducing overall throughput.
If too much data is sent to the drives, then the drive once again becomes a bottleneck. The amount of data written to a tape device is usually controlled by adjusting the number of simultaneous write sessions (also called multiplexing) to each tape device (see "Disk-based backups are more forgiving"). The downside to multiplexing is that restore performance is decreased because backup sets are interleaved on the tape.
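As a rough sizing sketch, the number of multiplexed sessions needed to keep a drive streaming can be estimated from the drive's rate and the rate each client can sustain. The 30MB/sec drive figure comes from the text above; the 8MB/sec per-client rate is an assumption, and the model ignores compression and network variability.

```python
import math


def streams_to_keep_drive_streaming(drive_mb_per_sec, client_mb_per_sec):
    """Smallest number of simultaneous write sessions whose combined rate
    meets or exceeds the drive's streaming rate."""
    if drive_mb_per_sec <= 0 or client_mb_per_sec <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(drive_mb_per_sec / client_mb_per_sec)


# A 30MB/sec drive fed by clients that each sustain about 8MB/sec (assumed).
sessions = streams_to_keep_drive_streaming(30, 8)
```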
A frequent backup mistake is letting backup clients temporarily mount tape drives through a shared target software option in an attempt to improve backup throughput. Typically, the throughput for that one client is improved because the tape drive is temporarily dedicated to it. This eliminates tape drive contention and allows data to be moved to the target tape drives through the FC SAN, while eliminating the processing-intensive IP overhead. But improving backup speed for one client may decrease overall throughput to the tape drive because only a single backup client is writing to the drive. In this situation, the greater good of all systems may be sacrificed to benefit a few systems.
Eliminate backup windows
A different approach to solving the backup window problem is to create a snapshot or PIT copy of data used for backup purposes. Once the PIT copy is created, normal data processing can resume because an image of the quiesced system has been captured.
Snapshots create a virtual copy of data. It's called "virtual" because the second copy is only created if blocks are changed after the copy was initiated. Because most data in a volume doesn't change daily, these snapshot copies don't take up a large amount of disk space. The additional disk space required is equal to the amount of changed data. Snapshot creation is typically scripted; it takes less than a minute to create the virtual copy of a volume. Once the copy is made, the primary data volume is available again for changes without any impact to the backups. Snapshots address two critical backup window issues:
- They provide the ability to resume data processing without fear of having open files skipped because a quiesced system is only required while the snapshot is created.
- You can start the next night's backup even though the backup from the night before is still running because each night's backup image is captured on a separate snapshot image.
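The copy-on-write behavior behind snapshots can be sketched in a few lines. This is a toy in-memory model with hypothetical class names, not how any particular array or volume manager implements it: a snapshot stores nothing at creation and saves a block's original contents only the first time that block is overwritten.

```python
class Snapshot:
    def __init__(self, volume):
        self.volume = volume
        self.saved = {}  # block index -> contents at snapshot time

    def preserve(self, index, old_data):
        # Keep only the first original; later overwrites must not replace it.
        self.saved.setdefault(index, old_data)

    def read(self, index):
        # Changed blocks come from the saved copy, unchanged ones from the volume.
        if index in self.saved:
            return self.saved[index]
        return self.volume.blocks[index]


class Volume:
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Copy the old block aside for every snapshot before overwriting it.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data


vol = Volume(["a", "b", "c", "d"])
monday = vol.snapshot()
vol.write(1, "B")
tuesday = vol.snapshot()       # a second night's image, independent of the first
vol.write(1, "B2")
```

In this model the extra space consumed by each snapshot is the number of entries in its `saved` dictionary, which matches the point that the additional disk space equals the amount of changed data.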
PIT copies, sometimes referred to as clones, are similar to snapshots because they also quickly create a copy of data that can be used for static backup images. The advantage of PIT copies is that a full copy of the data resides on a completely separate volume, whereas snapshots copy only the changed portion of a volume. This helps address the issue of resource contention caused by backups on production servers. The PIT copy volumes can then be mounted to surrogate clients. Disk drives and all client system resources driving the backups are separate from production disk drives and system resources. Like snapshots, PIT copy creation can be integrated into applications, ensuring clean copies of data quickly.
The size of the volumes typically has minimal impact on how long it takes to create snapshots or PIT copies; therefore, this method scales easily as storage capacity grows. The downside of PIT copies is the higher cost associated with the increased disk capacity needed to create full copies of data on different drives.
Although snapshots and PIT copies reduce backup duration, another big benefit is the effect on restores: If a user is looking to retrieve data from last night's backups, the data doesn't have to be retrieved from tape because all the information resides on disk.
Restore performance issues
In optimizing data transfer for quick backups, administrators may unknowingly create restore performance issues. This is particularly true when trying to reduce backup window durations by decreasing the ratio of fulls to incrementals, or by increasing the number of streams multiplexed to a tape drive target.
When decreasing the full-to-incremental ratio, restores may require more tape volumes to be mounted and read. In one incremental-forever scenario, we saw a customer recall more than 1,000 tape volumes to restore a single system. Most incremental-forever systems have controls to limit the number of tapes that a single file system can be spread across, but this requires additional configuration and data movement cycles. Fortunately, this data movement isn't typically associated with a backup window because it doesn't impact client operations. In addition, the proliferation of disk-based targets makes reading from a large number of incremental backups a non-issue (due to the random access nature of disk).
Multiplexing a large number of backup streams to a single tape drive causes restores to degrade. Because multiplexing intersperses data from one client with another, the sequential nature of tape often requires reading all data on the tape to retrieve the bits associated with a single file system. Depending on the tape technology and the priority of restore speed, it's not uncommon to have four or more streams multiplexed simultaneously to a single tape drive.
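The restore penalty can be illustrated with a toy model of an interleaved tape. The round-robin layout and the names are assumptions for the example; real products interleave in larger, variable-sized chunks.

```python
def multiplex(streams):
    """Interleave blocks from several client streams onto one tape, round-robin.
    Returns the tape as a list of (client, block) pairs in write order."""
    tape = []
    longest = max((len(blocks) for blocks in streams.values()), default=0)
    for i in range(longest):
        for client, blocks in streams.items():
            if i < len(blocks):
                tape.append((client, blocks[i]))
    return tape


def restore(tape, client):
    """Read the tape sequentially until the client's last block is reached.
    Returns the client's blocks and the number of tape blocks read."""
    wanted = []
    blocks_read = 0
    for position, (owner, block) in enumerate(tape, start=1):
        if owner == client:
            wanted.append(block)
            blocks_read = position
    return wanted, blocks_read


streams = {
    "c1": ["c1-0", "c1-1", "c1-2"],
    "c2": ["c2-0", "c2-1", "c2-2"],
    "c3": ["c3-0", "c3-1", "c3-2"],
    "c4": ["c4-0", "c4-1", "c4-2"],
}
tape = multiplex(streams)
data, blocks_read = restore(tape, "c1")
```

With four streams multiplexed, recovering three blocks for one client means passing over nine tape blocks in this model, and the gap widens as more streams share the drive.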
Where to start
If a large percentage of backups are running slowly, causing them to run beyond the desired backup window, look at components shared by all of the clients. This includes the backup network, backup servers and backup target device (tape or disk). If only a small percentage of clients are exceeding the desired backup window, look at client-side issues first. Start with just a few clients. What you learn from the first few will likely help troubleshoot the others.
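This triage rule can be sketched as a small helper. The 50% cutoff between "a large percentage" and "a small percentage" is an assumption, as are the function name and the sample durations; pick a threshold that fits the environment.

```python
def triage(durations_hours, window_hours, shared_threshold=0.5):
    """Suggest where to look first, based on the share of clients whose
    backups overrun the window. Returns (suggestion, late_clients)."""
    if not durations_hours:
        raise ValueError("no clients to evaluate")
    late = sorted(c for c, hours in durations_hours.items() if hours > window_hours)
    if not late:
        return "ok", late
    if len(late) / len(durations_hours) >= shared_threshold:
        return "shared", late   # backup network, backup servers, target devices
    return "client", late       # client-side issues on the late clients


durations = {"db1": 11, "web1": 6, "web2": 7, "file1": 5}  # hypothetical hours
eight_hour = triage(durations, window_hours=8)
tight = triage(durations, window_hours=5.5)
```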