Let's assume that the most basic trouble areas have been verified and eliminated as probable causes -- there is power in the server room, the automated tape library (ATL) is connected to a power outlet, it is switched on and is hooked up to the backup server. Once this has been established, we can start looking at other possibilities.
What has changed? This is the first question to ask when determining whether you are facing a hardware failure or some type of configuration issue. If a problem surfaced and no changes were made, review the system's error log for possible error messages. Enterprise-class tape libraries also have error logging capabilities, and good backup software will usually log device-type errors. A failure that appears without any change usually points to hardware and is typically resolved through hardware vendor support.
Bad drive or bad tape?
Before assuming a drive is down or a tape is bad, the tape in question should be mounted in another drive and/or a tape known to be good should be read from the suspected faulty drive.
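The cross-test described above can be written down as a small decision table. The following Python sketch is only an illustration: the function name `diagnose` and its return labels are hypothetical, and the two inputs are the results you observe by hand after swapping the tape and the drive.

```python
def diagnose(suspect_tape_reads_in_other_drive: bool,
             good_tape_reads_in_suspect_drive: bool) -> str:
    """Interpret the two swap tests and name the most likely culprit.

    Both arguments are observations made by the operator; this function
    does not talk to any hardware.
    """
    if suspect_tape_reads_in_other_drive and not good_tape_reads_in_suspect_drive:
        # The tape is fine elsewhere, and the drive fails even a good tape.
        return "drive"
    if not suspect_tape_reads_in_other_drive and good_tape_reads_in_suspect_drive:
        # The drive handles a good tape, and the tape fails elsewhere too.
        return "tape"
    if not suspect_tape_reads_in_other_drive and not good_tape_reads_in_suspect_drive:
        # Both tests fail: both parts may be bad, or the cause lies elsewhere
        # (cabling, drivers, configuration).
        return "both-or-other"
    # Both tests pass: the fault was not reproduced and may be intermittent.
    return "not-reproduced"
```

For instance, `diagnose(True, False)` returns `"drive"`, which is the case where the tape should not be discarded.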
Drivers and firmware
Upgrading to a newer version of a tape device driver can sometimes introduce a conflict with the firmware on the tape device. The vendor's "read-me" should actually be read for the latest information.
For those installations still using SCSI devices, a number of items should be verified, such as:
Fibre Channel libraries
Fibre Channel attached libraries add new complexity and potential for configuration errors, such as:
Some libraries use one of the tape drive connections to issue commands to the robotic arm (gripper) to mount or eject tape media. This is usually configured from the library menu, and the selected tape drive number must match when configuring the devices to the backup server. For example, if the first drive is selected as the control path (e.g., /dev/rmt0), the control device number should match (e.g., /dev/smc0).
Element numbers are used to identify specific physical library components. Each tape slot and drive has an assigned element number. When configuring tape drives to a backup server, the element number must match the logical device defined to the operating system. Otherwise, the robotics may load a tape in one physical drive (element number) while the operating system tries to write to another drive (logical device). ACSLS libraries use drive ID numbers, which are a similar concept.
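To make the element-number match concrete, here is a minimal Python sketch that compares two mappings from element number to logical device. Everything in it is an assumption for illustration: the function name `find_mismatches`, the element numbers 256 and 257, and the device names. The "verified" mapping is one you would establish by hand, for example by loading a tape into a known physical drive and seeing which logical device reports it.

```python
def find_mismatches(verified, configured):
    """Return {element: (verified_device, configured_device)} for every
    element whose configured logical device differs from the verified one.

    verified:   element number -> device confirmed by hand
    configured: element number -> device defined in the backup software
    """
    mismatches = {}
    for element, expected in sorted(verified.items()):
        actual = configured.get(element)  # None if the element is not configured
        if actual != expected:
            mismatches[element] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    # Hypothetical example: the two drives are swapped in the configuration.
    verified = {256: "/dev/rmt0", 257: "/dev/rmt1"}
    configured = {256: "/dev/rmt1", 257: "/dev/rmt0"}
    for element, (expected, actual) in find_mismatches(verified, configured).items():
        print(f"element {element}: expected {expected}, configured {actual}")
```

A swapped pair like the one in the example is exactly the case where the robotics load one drive while the operating system writes to the other.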
Some vendors supply utilities to test ATLs (e.g., IBM tapeutil and Veritas robtest). These utilities come in handy to isolate certain configuration errors that might be difficult to diagnose at the backup software level. Such utilities allow you to view the device configuration at the operating system level and issue basic commands, such as mount, rewind, unload and eject. Unix operating systems also allow some basic communication with devices through the dd command and ioctl calls, which can help determine if the device is operational or simply configured incorrectly to the backup software.
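As an illustration of the kind of basic check that dd performs, the Python sketch below opens a device file and tries to read one block from it. It is a rough sketch under stated assumptions, not a vendor utility: the function names and the 64 KB default block size are hypothetical, and a failed read can also mean that no tape is loaded or that the block size is smaller than the record on tape.

```python
import os


def read_first_block(device_path, block_size=65536):
    """Open the device read-only and return up to block_size bytes."""
    fd = os.open(device_path, os.O_RDONLY)
    try:
        return os.read(fd, block_size)
    finally:
        os.close(fd)


def device_responds(device_path, block_size=65536):
    """Return True if the operating system can open and read the device.

    An OSError (missing device file, I/O error, no media) is reported and
    treated as a failed check; it does not by itself identify the cause.
    """
    try:
        read_first_block(device_path, block_size)
        return True
    except OSError as err:
        print(f"{device_path}: {err}")
        return False
```

A call such as `device_responds("/dev/rmt0")` (device name taken from the example earlier in this article) that succeeds while the backup software still cannot use the drive suggests a configuration problem rather than dead hardware.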
As with any other hardware issue, a methodical approach must be taken when troubleshooting ATLs:
About the author: Pierre Dorion is a certified business continuity professional for Mainland Information Systems Inc.
This was first published in October 2007