What are the major causes of tape backup failures?
Data backup and recovery operations can fail for a wide variety of reasons, including people, hardware, software and media errors. There are essentially six major causes of backup and restore failures:
- Human errors
- Software errors (device busy, in use or other error conditions that may be recoverable)
- Resource contention (resources needed for the backup are already in use, such as open application or file data, I/O channels, or a tape library, drive or media)
- SAN or other connectivity problems
- Tape drive or media errors
- Hardware errors (disk drives, processors, memory, I/O controllers, etc.)
The most common cause of backup and recovery problems is human error. Examples include running multiple jobs simultaneously, failing to change tapes, forgetting to remove a cleaning tape or any of a number of similar mistakes.
Next on the list are software errors, typically in the scripts used to run backups, mount media or perform other operations. Resource contention can occur due to unforeseen or unpredictable events, or because of a lack of planning on the part of backup operators. Connectivity problems may occur due to SAN zone changes or other issues affecting connectivity. Finally, hardware-related items, including tape media, tape drives and other hardware involved, are the least likely to cause problems.
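Because some software errors, such as a busy device, may be recoverable, a backup script can retry the operation instead of failing outright. The following is only a minimal sketch of that idea: RecoverableBackupError and run_with_retry are hypothetical names invented for illustration, not part of any backup product, and the retry count and delay are arbitrary assumptions.

```python
import time


class RecoverableBackupError(Exception):
    """Hypothetical error type for transient conditions such as a busy device."""


def run_with_retry(job, attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Call job(); retry only on RecoverableBackupError, up to `attempts` tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except RecoverableBackupError as exc:
            last_error = exc
            if attempt < attempts:
                # Wait a little longer after each failed try (linear backoff).
                sleep(delay_seconds * attempt)
    # Every attempt failed: surface the last error to the operator.
    raise last_error
```

Any other exception propagates immediately, so errors that are not known to be recoverable still stop the job and get noticed.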
The perception portrayed by some vendors is that tape media is one of the most common causes of backup problems. The reality is that tape drive and media issues are typically not the major cause of failures. However, tape media failures do occur and can have a significant impact; in some cases tape media errors aren't recoverable.
Applications and tape drives may be unable to recover from media errors, even when the actual data loss affects only a small percentage of the tape. Thus, even small dropouts may result in loss of access to a large amount of data on tape. Moving from tape-based to disk-based backups can improve reliability and avoid this type of data loss. However, the other, more common problems remain, regardless of whether backups are going to disk or tape.
Addressing the causes of backup failures is prudent, but the issues should be addressed in order of frequency. Because human errors are the most likely cause of problems, automating backup processes can prevent a large number of errors. Ensuring that well-proven software and scripts are used to run backup operations is the next issue to address. Configuration management can help ensure that resource contention and connectivity issues don't lead to data backup problems. Only after the primary causes of backup failures have been addressed does it make sense to tackle tape media problems.
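One way to put "order of frequency" into practice is to tally the recorded cause of each failed job and work down the list from the top. The sketch below assumes each failure has already been labeled with one of the categories discussed above; the function name and the sample entries are hypothetical and purely illustrative, not real measurements.

```python
from collections import Counter


def rank_failure_causes(failures):
    """Return (cause, count) pairs, most frequent cause first."""
    return Counter(failures).most_common()


# Hypothetical, made-up failure log used only to demonstrate the ranking.
sample_log = [
    "human", "software", "human", "connectivity",
    "human", "software", "media",
]

ranking = rank_failure_causes(sample_log)
for cause, count in ranking:
    print(f"{cause}: {count}")
```

Running the same tally against a site's own failure records shows whether its actual ordering matches the general pattern described here.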
This was first published in June 2008