GlassHouse Technologies, Inc.
Published: 12 Oct 2004
Backup applications vary considerably in their implementation, performance and management. Here are some tuning factors to consider for leading backup applications:
VERITAS NETBACKUP RESTORE CONSIDERATIONS
Fragment size. Some applications like NetBackup let you specify backup fragment size. For instance, a fragment size of 2GB means a 100GB backup will be broken into 50 separate fragments when written to tape. When restoring data, rather than scanning the entire 100GB, NetBackup forwards the tape to the specific fragment containing the requested data, resulting in faster restores.
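The fragment arithmetic above can be sketched in a few lines of Python. This is an illustration of the calculation only, not NetBackup code: the function names are hypothetical, and it assumes the offset of the requested data within the backup image is already known.

```python
import math

def fragment_count(backup_gb: float, fragment_gb: float) -> int:
    """Number of fragments a backup is split into when written to tape."""
    return math.ceil(backup_gb / fragment_gb)

def fragment_index(offset_gb: float, fragment_gb: float) -> int:
    """Zero-based index of the fragment that holds data at offset_gb."""
    return int(offset_gb // fragment_gb)

# The article's example: a 100GB backup with a 2GB fragment size.
print(fragment_count(100, 2))

# Hypothetical request: data that sits 73.5GB into the backup image.
# Only the fragment at this index has to be read, not the whole 100GB.
print(fragment_index(73.5, 2))
```

In this sketch, the most data a restore has to scan to reach a requested file is one fragment (2GB) rather than the full backup.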
TIVOLI STORAGE MANAGER (TSM) RESTORE CONSIDERATIONS
Full TSM client restores often take far too long because the client data is spread across too many tape volumes. As dozens, or hundreds, of volumes are mounted for a restore, critical restore time is lost to tape mounts, dismounts, mount wait settings and data seek time. The key to successfully leveraging tape for restore in a TSM environment boils down to application policies and data classification. Streaming high-volume/large-file client data directly to tape maximizes tape drive performance in most environments. The same holds true for data restores from tape. TSM application policies also play an integral role in optimizing tape performance. TSM policies that directly affect tape restore performance include collocation, maximum number of mount points and resource utilization.
Collocation is a storage pool configuration parameter that can be configured to collocate data by client, by group of clients (available in v. 5.2 only) or by file space. Collocation by client means that a particular client's backup data is stored on tapes only for that client. This reduces the number of mount points required for a large restore and lowers the restore time. The downside: lower tape utilization requires more library space, and space reclamation processing can become time-consuming with a greater distribution of client data across physical volumes.
A handful of server and client settings also dramatically affect the amount of data TSM can move to and from tape for a given client. Increasing the client's maximum number of tape drive mount points and also increasing the client's resource utilization setting allows a client to run multiple data sessions to or from multiple tape devices.
For critical clients, many TSM users also run the occasional selective (full) backup to further reduce the number of mounts required for restores. Other critical considerations include network infrastructure settings and policy so that your network doesn't become the bottleneck once you optimize the application's tape use.
LEGATO NETWORKER RESTORE CONSIDERATIONS
Pools. Most backup applications allow data classification based on various criteria such as data type, backup start time, etc. Within Legato NetWorker, pools are used to determine what data is sent to specific volume sets. By default, NetWorker sends all data to one pool. Once a volume is assigned to a pool, only data that meets the pool's criteria will be written to that volume. Most users split data based on data type, retention period or for off-site purposes when cloning is not in use. Pools should be used with caution. They provide a great way to separate mission-critical data, but can make NetWorker more difficult to administer.
Dedicated Storage Nodes. NetWorker provides the option to dedicate tape drives to specific clients, thereby upgrading them to storage nodes. The client will need to attach directly--via a SCSI or a storage area network (SAN) connection--to the library/tape drives. There are two ways to license a storage node. First, a full storage node license can be purchased, which allows the node to back up all of its local data directly to tape and to receive other clients' data for the library/tape drives directly attached to it. The second option is a dedicated storage node license, which doesn't allow other client machines to send their data to the storage node. The dedicated storage node license is considerably less expensive than a full storage node license, making it a wise investment when performing backups of large servers. Users also can implement dynamic drive sharing (via a SAN) to allow any system attached to the SAN to be upgraded to a dedicated storage node and to share multiple tape drives for backup and restore. The data is not multiplexed between clients--only one storage node can allocate the tape drive at a given time--allowing faster client restores.
Cloning and staging. Enterprise backup software can create duplicate copies of data. Cloning, combined with an off-site media rotation schedule, ensures that copies of data are available in a disaster-recovery situation while an onsite data set is retained for recovery. Cloning in NetWorker demultiplexes the data, thereby creating contiguous savesets on the cloned volume. In restore situations, using the clone volume allows NetWorker to spend less time forwarding through the tape and more time restoring the data. Users with large amounts of disk available also can stage data to disk first (where it is also demultiplexed) and then clone the same demultiplexed data to tape later. In NetWorker v. 7, using the advanced file type device option allows simultaneous read/write, which can greatly speed up the cloning process.
--Natalie Mead, with Nate Kosta
Exacerbating the tape restore problem is that few companies proactively monitor, report and remediate issues within their tape-based backup environments. Doing so requires a great deal of effort and manpower, and an understanding of the tape infrastructure. In many cases, time is the limiting factor, leaving risky restores as an unavoidable consequence. To minimize the chances of an unsuccessful or time-consuming restore, it's essential that you prepare by optimizing your backup infrastructure for recovering data. This involves developing best practices for the backup infrastructure and refining overall operational approaches.
Best practices for better restores
Critical factors that are related to restore performance include backup application configuration, network configuration, media management and the client environment during a restore. The following guidelines will help increase the likelihood of successful file restores.
It's important to reduce disk drive contention for restored data. During restores, you should disable applications that may be accessing the same disks to which the data is being restored. You should also disable packet-reading software. Then there's virus-protection software, which, when set to its highest protection level, scans every incoming and newly created file. During a restore, the recovered files appear as new files and are scanned, which significantly slows down the restore.
Some clients have too much data to back up over the network within an allocated backup window. For those hosts, backing up to dedicated tape drives can reduce the amount of time required to back up and recover data. Also, when possible, tune the network buffer size of the client's network card to match the tape drive buffer. This ensures that the recovered packets do not overrun or underfill the buffers. It also will help to modify the data transfer buffer sizes to match the tape drives. If data is sent in packets that are too small, the drives will end up spinning cycles waiting for data, and there will be empty space between data blocks on the tape. The further data is spread out on the tape, the longer it will take to restore.
You should also make sure you match the throughput of the host bus adapter to the drive throughput. If you attach 10 LTO-2 drives (30MB/sec each, for a total of 300MB/sec) to one 1Gb/sec host bus adapter (a theoretical maximum of 128MB/sec), data won't stream to the drives during backup. The sporadic nature of the data transfer will spread the data blocks across the tape, requiring even more time to restore the data.
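The mismatch in this example can be checked with simple arithmetic. The sketch below uses the article's figures (30MB/sec per LTO-2 drive, 128MB/sec theoretical maximum for the adapter); the function names are hypothetical, and it assumes the adapter is the only bottleneck and can actually sustain its theoretical rate, which real environments rarely achieve.

```python
def aggregate_drive_demand(drive_count: int, drive_mb_per_sec: float) -> float:
    """Total throughput (MB/sec) needed to keep every drive streaming."""
    return drive_count * drive_mb_per_sec

def max_streaming_drives(link_mb_per_sec: float, drive_mb_per_sec: float) -> int:
    """Largest number of drives one link can keep streaming at full speed."""
    return int(link_mb_per_sec // drive_mb_per_sec)

hba_mb_per_sec = 128   # theoretical maximum quoted in the article
drive_mb_per_sec = 30  # LTO-2 rate quoted in the article

demand = aggregate_drive_demand(10, drive_mb_per_sec)
print(f"Demand: {demand} MB/sec, adapter: {hba_mb_per_sec} MB/sec")
print(f"Oversubscribed: {demand > hba_mb_per_sec}")
print(f"Drives this adapter can stream: "
      f"{max_streaming_drives(hba_mb_per_sec, drive_mb_per_sec)}")
```

Under these assumptions, one adapter can keep at most four of the 10 drives streaming, so the remaining drives need additional adapters or should be left idle.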
It's also important to regularly expire your media. While the type of media you use does not typically affect restore times, the condition of the media does. As media experiences more and more read/write passes, its integrity begins to break down, which can cause media errors. It's possible that data will be written to tape successfully, but then won't be readable because of the media's degradation. Be sure to clean your drives, too. If the backup fails because of dirty tape drives, all the preparation in the world won't help you.
This next tip may sound obvious, but you should make certain that the drive is available. The throughput of new tape drives often exceeds the total throughput of the data sent to the drives (slower networks or too few host bus adapters are typical causes). The result may be that the 10 new LTO-2 drives you just purchased actually run slower than the DLT7000 drives they replaced. There are several ways you can remedy this problem, but most of them involve making additional expenditures. These include things like upgrading the network infrastructure or replacing the backup server with a higher end system.
There is an alternative way to create a balanced tape, network and backup server infrastructure: reduce the total number of drives being used during a backup. Not only will this improve backup performance, it also will leave some drives free for restores in case they are needed. If a restore request is received during the backup window and all the drives are actively backing up data, either the restore isn't performed until the backups finish, or active backup jobs are killed to handle the restore request.
If the data is critical to your business operations (and therefore has a short recovery time objective), you should consider implementing additional solutions, such as snapshot or raw-disk backups, to improve backup and recovery performance. Either of these processes may incur additional expenditures, but if the data is truly mission-critical, it becomes easier to justify the cost of going beyond traditional tape backup.
|Data and backup device mapping matrix|
Backup and restore times can be greatly affected by the number and size of the files you are backing up. Millions of small files pose a serious challenge for traditional tape backup products. When written to tape using typical backup products, a large number of small files can significantly impair the performance of even the fastest drives. It's not uncommon to see backup and restore speeds of 50KB/sec to 100KB/sec on tape drives rated to run at 15MB/sec to 30MB/sec. This dramatically increases backup and restore times and at the same time decreases the lifespan of the tape drives and media.
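To put those rates in perspective, the sketch below estimates how long a restore takes at each speed. The 100GB data set size is a hypothetical figure chosen for illustration, and the calculation assumes the rate is sustained for the whole transfer with no mount or seek overhead.

```python
KB_PER_GB = 1024 * 1024

def transfer_hours(data_gb: float, rate_kb_per_sec: float) -> float:
    """Hours needed to move data_gb at a sustained rate given in KB/sec."""
    seconds = data_gb * KB_PER_GB / rate_kb_per_sec
    return seconds / 3600

dataset_gb = 100  # hypothetical data set size

for label, rate_kb in [
    ("small files, 50KB/sec", 50),
    ("small files, 100KB/sec", 100),
    ("rated speed, 15MB/sec", 15 * 1024),
    ("rated speed, 30MB/sec", 30 * 1024),
]:
    print(f"{label}: {transfer_hours(dataset_gb, rate_kb):.1f} hours")
```

At 100KB/sec the hypothetical 100GB data set takes roughly 291 hours, compared with less than an hour at 30MB/sec.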
Knowing that these kinds of small files exist in your environment will help you prepare to recover your data successfully. If you want to make the best use of your primary disk and tape backup destinations, a good first step is to classify data based on its characteristics (file size and file volume) and its volatility, which affects incremental or cumulative incremental data movement. As a rule, disk subsystems provide optimal performance for large numbers of small files, while tape works best for small numbers of large files. The reason for this boils down to random vs. sequential access to data on the respective device types.
Classifying backup clients into groups based on data characteristics creates a logical basis for segregating backup workloads among different types of target storage devices (see "Data and backup device mapping matrix").
Creating a general matrix for mapping client types to device types is one practical way for backup administrators to optimize utilization of tape devices for backup and restore operations, in cases when disk is already part of the picture. Because every backup environment is unique in terms of its data and its hardware and software infrastructure, coming up with an ideal backup data classification system requires extensive planning, measurement and adjustment along the way.
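One way to start such a classification is a simple rule that follows the guideline above: large numbers of small files go to disk, small numbers of large files go to tape. The sketch below is only a starting point. The `ClientProfile` type, the function names and both thresholds are hypothetical and would need to be replaced with values measured in your own environment; volatility is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class ClientProfile:
    name: str
    file_count: int   # number of files backed up for this client
    total_gb: float   # total size of the client's backup data

def average_file_mb(profile: ClientProfile) -> float:
    """Average file size in MB; 0.0 for a client with no files."""
    if profile.file_count == 0:
        return 0.0
    return profile.total_gb * 1024 / profile.file_count

def suggest_target(profile: ClientProfile,
                   small_file_mb: float = 1.0,      # hypothetical threshold
                   many_files: int = 1_000_000) -> str:  # hypothetical threshold
    """Suggest 'disk', 'tape' or 'review' for a backup client."""
    small = average_file_mb(profile) < small_file_mb
    many = profile.file_count >= many_files
    if many and small:
        return "disk"    # many small files: random access helps
    if not many and not small:
        return "tape"    # few large files: sequential streaming helps
    return "review"      # mixed profile: measure before deciding

clients = [
    ClientProfile("fileserver", 5_000_000, 500),  # hypothetical clients
    ClientProfile("database", 200, 800),
]
for client in clients:
    print(client.name, suggest_target(client))
```

Clients that land in the "review" group are the ones where measurement and adjustment matter most.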
|Common causes for tape failures|
Tape devices are among the most active devices in the data center and, as such, are very prone to different types of mechanical problems. The majority of mechanical failures occur during startup or shutdown cycles rather than during steady-state operation. In a large tape library environment, thousands of start/stop actions occur daily among the library robotics, drives and tape media elements. Inevitably, things will go wrong (see "Common causes for tape failures"). However, the disastrous consequences of a failed restore can be avoided if you actively practice proper systems management. Properly managing tape infrastructure issues and controlling associated risks is key to this.
Too often, organizations don't realize that they will have a problem restoring data until it's already too late. "Too late" usually means that the data has been lost and now must be recovered from tape. Regular, random testing of your restore procedures can help you avoid this eleventh-hour problem. Restore testing will help you identify potential bottlenecks, problem clients/data sets or breakdowns in your process. Identifying these obstacles before you actually need the data gives you sufficient time to tune your environment and prepare for the inevitable restore request. By simply integrating restore testing into the application delivery process, you gain preproduction exposure to potential recovery pitfalls. As data volumes grow, ongoing recovery testing should be given the same weight as software upgrade testing in any mission-critical application environment.
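Picking the clients for each round of restore testing at random keeps the tests from gravitating toward the same well-behaved systems. A minimal sketch, assuming you can export a list of client names from your backup application; the client names and the function name are hypothetical.

```python
import random
from typing import Optional

def pick_restore_tests(clients: list[str], sample_size: int,
                       seed: Optional[int] = None) -> list[str]:
    """Return a random selection of clients to restore-test this cycle."""
    rng = random.Random(seed)  # pass a seed to make a selection repeatable
    count = min(sample_size, len(clients))
    return rng.sample(clients, count)

# Hypothetical client list exported from the backup application.
clients = ["mail01", "db01", "db02", "file01", "file02", "web01"]
print(pick_restore_tests(clients, 3))
```

Recording the results of each test (elapsed time, tapes mounted, errors) over successive cycles is what turns the random picks into a picture of where the bottlenecks are.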
Prepare for disk
Over the next several years, disk-based backup could displace tape as the primary backup medium. Until then, it is critical to make sure that your current tape infrastructure continues to support all of your recovery needs. Restores are not just about technology: Overall operational procedures are also essential to successful backup and restore operations.
Rapidly introducing disk to replace a poorly managed tape infrastructure is a bad idea because it may very well exacerbate existing issues in your environment, at a high cost and at a great risk to your organization. Inadequate management or weak operational practices are often the root cause of the instability of tape infrastructures. As inadequate practices are rolled into a disk technology base, the underlying management problems will manifest themselves not only as restore failures, but they may also affect your entire backup infrastructure, seriously threatening the integrity of your data.