Published: 12 Jul 2004
In the newspaper business, bad news sells. When it comes to backup, it's easy to focus on the bad news. There is simply so much of it: nightly failures, lost tapes and unrecoverable data.
But the news isn't all bad. There are shops where backups are completed successfully, where data is restored and backup operations actually run smoothly. The most evident common denominator in well-functioning backup infrastructures is effective process and control. Well-run environments have a clear understanding of the tasks to be performed and a consistent way to accomplish them.
How does your organization stand in regard to the basics of backup operations? Here's a checklist of 10 areas you should focus on in order to build a more effective backup practice.
|The backup operations lifecycle|
1. Include backup in the planning process. Your backup infrastructure needs to be factored into the planning process for application rollouts, new servers and primary storage growth. Too often, changes in the environment aren't taken into account until the eleventh hour. The result is disruption and a detrimental impact on the overall backup operation.
Proper planning also enables the backup team to fully understand an application's business requirements and design characteristics with respect to data protection. The backup policies and approach necessary for a database application that employs split mirrors and replication are considerably different from those needed for a file-based environment that has no additional data protection. Similarly, a large enterprise application deployed across multiple servers may have complex data interdependencies that require proper backup synchronization in order to enable a usable recovery.
2. Establish a lifecycle operations calendar. An effective backup operation requires that certain tasks be completed successfully every day. However, there are also weekly, monthly, quarterly and even annual tasks that are every bit as important as the daily tasks. While short-term tasks are highly tactical, long-term tasks tend to be more strategic. In an effective backup operations environment, all should be documented and performed on schedule (see "The backup operations lifecycle").
The daily tasks are the operational fundamentals that most backup administrators are familiar with and include items such as:
- Job monitoring
- Success/failure reporting
- Problem analysis and resolution
- Tape handling and library management
Longer-term tasks, performed weekly, monthly, quarterly or annually, include items such as:
- Performance analysis
- Capacity trending and planning
- Policy review and analysis
- Recovery testing and verification
- Architecture planning and validation
3. Review backup logs daily. A review of backup application error and activity logs is a key daily task that is often easier said than done. Log analysis is frequently time-consuming. However, it can pay extremely valuable dividends, and is essential to reliable backup.
Backup problems tend to manifest themselves in a cascading effect. One event results in a series of subsequent problems that don't have an immediately obvious linkage. For example: A backup job doesn't kick off because a required tape drive was never released from an earlier job. This prior job was backing up an application server that was executing an unscheduled batch process, which consumed system resources and caused the backup to finish late. The system administrator responsible never told the backup administrator about the batch process, so the backup was not rescheduled.
It takes considerable skill and detective work to determine whether one is observing a root cause or a symptom of some other problem. You also must establish good communications and working relationships with system administrators, DBAs, network administrators and others to effectively troubleshoot complex problems.
4. Protect your backup database or catalog. All backup applications maintain a database or catalog that is absolutely critical to the recovery of data that is backed up. Lose the catalog and you've lost your backups. While some backup applications have mechanisms for reading through tapes to recover the indexes, this can be an onerous--if not impossible--task.
The catalog should be treated like any other critical application database. It should preferably be mirrored--or at least RAID-protected--and you should verify successful multiple-copy backup of the database or catalog on a scheduled basis.
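The scheduled verification could be scripted along the following lines. This is a minimal sketch, not any backup product's API: it assumes you can obtain, for each location holding a catalog copy, the time that copy was last written successfully. The function name, the location names, the two-copy minimum and the 24-hour freshness limit are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def check_catalog_copies(copies, now, min_copies=2, max_age=timedelta(hours=24)):
    """Return (ok, fresh_locations) for a set of catalog backup copies.

    copies maps a location name (hypothetical) to the time its catalog
    copy was last written successfully. A copy counts as fresh when it
    is no older than max_age.
    """
    fresh = sorted(loc for loc, written in copies.items() if now - written <= max_age)
    return len(fresh) >= min_copies, fresh
```

A check like this only confirms that enough recent copies exist. It does not prove that a copy is restorable, which still calls for periodic recovery testing.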
5. Identify and resolve backup window failures daily. A backup window failure is a backup that completes successfully but runs past its expected backup window. Because the job itself completes, no errors are reported in the error log, so the problem is often overlooked. In addition to affecting production environments, jobs approaching or exceeding the backup window can be warning signs of impending capacity limits or performance bottlenecks. Recognizing and addressing these issues as early as possible can prevent future failures and avoid user dissatisfaction.
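Because the error log stays silent, window failures have to be found by comparing job end times with the window itself. The sketch below shows one way to do that. It assumes job records (name, start, end, status) have already been exported from the backup application; the BackupJob record, the status strings and the 30-minute warning margin are hypothetical, not taken from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BackupJob:
    name: str
    start: datetime
    end: datetime
    status: str  # assumed values: "success" or "failed"


def window_report(jobs, window_end, warn_margin=timedelta(minutes=30)):
    """Split successful jobs into window overruns and near misses."""
    overruns, near_misses = [], []
    for job in jobs:
        if job.status != "success":
            continue  # hard failures already appear in the error log
        if job.end > window_end:
            overruns.append(job.name)
        elif window_end - job.end <= warn_margin:
            near_misses.append(job.name)
    return overruns, near_misses
```

Tracking the near misses as well as the overruns gives early warning of the capacity and performance trends described above.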
6. Locate and back up orphan systems and volumes. Your backup software invariably provides you with some level of reporting information about daily backup success. If you depend on this as the authoritative source on backup, then you are likely still at risk.
The backup application reports only on the servers that it knows about. Large environments often have orphan systems--systems that have been brought into production but not incorporated into the backup schema. This can happen for a variety of reasons. Often a business unit purchased the system outside of IT's purview and backed it up independently at first, but over time it fell through the cracks. Usually these systems are discovered only when it is too late: Data loss occurs and a restore request comes to IT for a system it knows nothing about.
Addressing this problem can be challenging and time-consuming. It entails regularly discovering and mapping new network addresses to nodes, filtering out unrelated addresses (such as additional NICs, network devices and printers), identifying the locations and owners of these nodes and establishing policies for managing the addition of storage volumes. Regular reporting to system and application owners of exactly what's being backed up as well as what's not being backed up (by choice) is also critical.
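Once the discovery and filtering work is done, the comparison itself is a set difference. The following is a rough sketch that assumes three lists are available: host names found by network discovery, client names known to the backup application and hosts deliberately left out of backup. The function and key names are illustrative, and real environments would also have to reconcile aliases and multiple addresses per node.

```python
def _normalize(names):
    """Compare host names case-insensitively and ignore stray whitespace."""
    return {name.strip().lower() for name in names}


def classify_hosts(discovered, backup_clients, excluded_by_choice=()):
    """Sort discovered hosts into protected, excluded-by-choice and orphans."""
    discovered = _normalize(discovered)
    protected = _normalize(backup_clients)
    excluded = _normalize(excluded_by_choice)
    return {
        "protected": sorted(discovered & protected),
        "excluded_by_choice": sorted((discovered & excluded) - protected),
        "orphans": sorted(discovered - protected - excluded),
    }
```

The three resulting lists map directly onto the report for system and application owners: what is backed up, what is not backed up by choice and what has fallen through the cracks.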
7. Centralize and automate the backup management as much as possible. A key to successful data protection is consistency. This does not mean that all data must be treated in the same manner. However, it does mean that all data of equivalent value and importance to the organization should be managed in a similar fashion. The orphan problem I just described is an excellent example of an inconsistency that can result from non-centralized backup administration.
In many environments, backup operations for Unix and Windows servers are run independently. This organizational alignment may predate networked storage. Does this still make sense? Besides being inefficient, it suggests that a different set of policies and procedures should be applied to data based on its operating platform. Is there any line-of-business owner who would apply that measure to data valuation?
Individual functions within backup operations can be delegated, and geographic considerations may call for it, but given the communication capabilities and management tools available today, there is little justification for decentralized backup.
As the complexities of the backup infrastructure grow, automation can help by providing tools to facilitate repetitive processes. As discussed earlier, tasks such as checking logs on a scheduled basis are key. Automated alerts for previously identified errors in logs can make life easier. The inverse is also true--automation that aggregates repetitive error entries in a log can be helpful.
In an unadulterated log, if I see one SCSI error, I see 1,000 of them. Scanning through all the entries of the same error can be daunting--so much so that I may be tempted to not perform the necessary daily scan of the log. Automation tools can successfully facilitate various activities if you identify the task to be performed as well as define the expected result.
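Aggregating repeated entries can be as simple as collapsing the parts of a message that vary and counting what is left. The sketch below assumes a hypothetical plain-text log in which error lines contain the word ERROR followed by the message; real backup applications each have their own log formats, so the marker and the parsing would need adapting.

```python
import re
from collections import Counter

# Digits (drive numbers, tape labels, counters) vary between otherwise
# identical messages, so they are replaced with a placeholder.
_VOLATILE = re.compile(r"\d+")


def summarize_errors(lines, marker="ERROR"):
    """Return (message, count) pairs for error lines, most frequent first."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        message = line.split(marker, 1)[1].strip()
        counts[_VOLATILE.sub("#", message)] += 1
    return counts.most_common()
```

With this kind of summary, 1,000 identical SCSI errors become a single line with a count, which makes the daily scan far less daunting.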
8. Create and maintain an open issues report. Finding and fixing problems like the ones I've discussed are tactical activities critical to backup success. However, the process of managing those problems effectively and establishing appropriate metrics indicative of backup quality is essential to drive systemic improvement of backup infrastructures.
In larger environments, problems may be tracked through a formal ticketing system. If you don't use such a system, an open-issues log can be an important tool to help a backup operation evolve from fire-fighting mode and ensure an optimized steady-state operation. Either way, regular reports detailing open problems that indicate the rate at which new problems are added and existing problems are closed can speak volumes about the overall health of the backup operation. A simple trending report with appropriate supporting data can uncover fundamental operational problems and help reach an appropriate resolution.
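A trending report of this kind needs very little data. As a minimal sketch, assume each issue is recorded as a pair of dates, the day it was opened and the day it was closed (or None if it is still open). The function name and record layout are illustrative, whether the data comes from a ticketing system or a hand-kept log.

```python
from collections import Counter
from datetime import date


def weekly_trend(issues):
    """Return (iso_week, opened, closed, backlog) rows in week order.

    issues is an iterable of (opened_on, closed_on) dates, where
    closed_on is None for an issue that is still open. iso_week is a
    (year, week) pair.
    """
    opened, closed = Counter(), Counter()
    for opened_on, closed_on in issues:
        opened[opened_on.isocalendar()[:2]] += 1
        if closed_on is not None:
            closed[closed_on.isocalendar()[:2]] += 1
    backlog, rows = 0, []
    for week in sorted(set(opened) | set(closed)):
        backlog += opened[week] - closed[week]
        rows.append((week, opened[week], closed[week], backlog))
    return rows
```

A backlog column that keeps climbing week after week is the signal that the operation is still in fire-fighting mode.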
9. Ensure that backup is integrated with the change control process. Backup environments are by their nature highly dynamic. Unfortunately, within backup organizations, too often the change process for backup is equally dynamic. Just as backup must be part of the strategic planning process, on an operational level, backup must be part of an organization's formal change control process.
This implies a two-way relationship: Changes related to the backup infrastructure, whether directly or indirectly, must go through the notification, impact assessment and contingency planning process that is included within change control. Stories abound of unintended backup outages caused by storage area network (SAN) switch topology or zoning changes, and of system bottlenecks caused by backup configuration modifications. They can and should be avoided with the proper process in place.
If a monthly outage window is necessary for the backup infrastructure to facilitate upgrades or verification tests, then this outage window shouldn't overlap with outage windows for other production systems. There's an increased demand for restores when changes occur in those systems, because system files are upgraded and backing out of a change may be desired. If the backup infrastructure is down for maintenance at the same time, data can't be restored in a timely manner. The backup infrastructure is a production system, just like the most important application in an organization's environment, and it requires the same respect and support.
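The overlap rule is easy to check mechanically when outage windows are recorded as start and end times. The sketch below assumes that form; the function names and system names are hypothetical. Windows are treated as half-open, so one that begins exactly when another ends is not counted as a conflict.

```python
from datetime import datetime


def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open time windows share any time."""
    return a_start < b_end and b_start < a_end


def conflicting_windows(backup_window, production_windows):
    """Name the production outage windows that overlap the backup outage."""
    b_start, b_end = backup_window
    return sorted(
        name
        for name, (start, end) in production_windows.items()
        if overlaps(b_start, b_end, start, end)
    )
```

Running a check like this as part of change control review catches the conflict before the outage is approved rather than during a failed restore.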
10. Leverage your vendors effectively. Backup environments are complex and get more so with the introduction of new technologies. Hardware and software vendors are racing to add new features and functionality in the struggle to differentiate themselves from one another.
While much of this technology can be helpful, and certainly it all sounds good, there's a considerable challenge in understanding nuances of functionality of one technology option vs. another. For example, there are a significant number of different approaches to disk-based backup. Which one is right for your environment and what precisely is the impact?
A fundamental question that you must be able to answer is: Does your vendor have the right skills to support your needs? All technical problems get resolved--eventually. If your technical problem isn't being resolved in a reasonable amount of time, then you may not be working with the right vendor. This becomes extremely apparent when multiple products from multiple vendors are integrated.
These ten tasks may sound basic, but accomplishing them isn't always easy. They depend on a number of key elements: appropriate reporting and measurement capabilities, a high degree of staff competency within the backup organization and solid cross-functional communication. The impediments can be significant, including costs, resource availability and skill levels, organizational politics and a host of others.
If you can't accomplish all of these things, try to address the most critical. If time and resources are the issue, develop a plan to justify them. Against these hurdles you must weigh the risk of unrecoverable data and major outages. After all, the news is full of those kinds of stories.