Published: 12 Apr 2007
The outdoor apparel and equipment outfitter moves IBM Tivoli Storage Manager from its mainframe to Linux servers and adds SATA disk to its backup environment.
Sandy Rideout, a storage engineer at L.L.Bean Inc. in Freeport, ME, had become increasingly unhappy with the outdoor gear retailer's storage backup and restore environment. The system was built around IBM Corp.'s Tivoli Storage Manager (TSM) and ran on the mainframe under z/OS, but was backing up more than 240 open-systems nodes--Windows, AIX, Linux, Sun and NetWare.
Rideout had long suspected mainframe constraints were slowing down the backup, and things came to a head in the summer of 2005. "We had instances of excruciating, long restores," she recalls. "In some cases, not all the data was recovered." Open-systems backups were a particular concern.
The problems propelled L.L.Bean's IT group into a three-phase initiative to overhaul its entire approach to mainframe and open-systems backup and recovery. The project took a team of four storage and systems engineers (with some early help from an outside consultant) less than a year and came in at considerably less than $1 million.
On average, the company was backing up 560GB each night (890GB at peak). It was running 35TB of disk storage and 1,550 Magstar tapes under TSM. And going forward, L.L.Bean expected to grow considerably.
L.L.Bean's mainframe backup had its own problems. "There were no consistent point-in-time backups," says Rideout, adding that "we had different backup processes in place and ended up running multiple backups of the same data." Compounding the problems, applications were running backups on their own. Critical production data was being replicated synchronously via EMC Corp.'s Symmetrix Remote Data Facility (SRDF) to remote data centers for business continuity--but that was the only data being fully protected in that way.
The company was running two Amdahl Corp. plug-compatible mainframes that were subsequently replaced with the IBM eServer zSeries 900 (z/900). Constrained mainframe resources weren't the only culprit. The problem, Rideout suspected, was a fragmented, disjointed and uncoordinated backup and recovery process.
Rideout proposed a project to establish an enterprise-wide approach to backup and recovery of all centrally stored corporate data. At the least, it would improve efficiency, allow greater consistency and reliability, and produce cost savings. With the summer of 2005 experience fresh in their minds, IT management agreed. After immediately securing the sponsorship of L.L.Bean's vice president of information systems, Rideout began a three-phase project--assessment, policy and recommendations, and implementation--to revamp the backup and recovery system.
|Fixing mainframe Tivoli Storage Manager backup problems|
Phase 1: Assessment
L.L.Bean's direction with open systems is to move to Linux, but the mainframe is there to stay. The new backup and recovery approach will encompass all company data, on open systems and the mainframe alike.
"The assessment phase was key to the whole project," says Rideout. It involved evaluating the existing backup and recovery infrastructure with a focus on mainframe backup operations and the open-systems TSM environment. "We spent several months just gathering data," she says.
This entailed interviewing mainframe and open-systems people, using the TSM reporting tool to capture data at different time periods and analyzing mainframe SMF records for various time slices. The team was surprised by what it learned. "One problem we had in the summer was running multiple TSM instances on the mainframe to back up portions of the open-systems environment," says Rideout. That created contention for CPU. If it was a problem in August, it would be far worse when business peaks in November and December. "CPU becomes a hot commodity then and there's a lot of contention," she adds.
The team found other troubling issues. For example, all file and application servers were treated the same in terms of recovery time objectives (RTOs). The company was backing up and retaining data for the same amount of time whether it came from a production system or a development system. The team also discovered that TSM backups averaged only 8MB/sec, whereas the native speed of the IBM TotalStorage 3590 enterprise tape drive should be 12MB/sec. Restoring data was equally slow.
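To put those throughput figures in perspective, a rough calculation shows how long one data stream would need to move a night's backup at the observed and native rates. The sketch below is illustrative only: it uses the 560GB average and 890GB peak nightly volumes reported for L.L.Bean, assumes decimal units (1GB = 1,000MB), and introduces a hypothetical `streams` parameter for the number of drives writing in parallel, a figure the team did not report.

```python
def hours_to_move(volume_gb: float, rate_mb_per_sec: float, streams: int = 1) -> float:
    """Hours needed to move volume_gb at rate_mb_per_sec per stream.

    Assumes decimal units (1 GB = 1000 MB) and perfectly parallel streams;
    both are simplifying assumptions, not measured values.
    """
    if rate_mb_per_sec <= 0 or streams < 1:
        raise ValueError("rate must be positive and streams must be >= 1")
    seconds = (volume_gb * 1000) / (rate_mb_per_sec * streams)
    return seconds / 3600


# Figures taken from the article
NIGHTLY_GB = 560       # average nightly backup volume
PEAK_GB = 890          # peak nightly backup volume
OBSERVED_MB_S = 8      # average TSM backup rate the team measured
NATIVE_MB_S = 12       # native speed of the 3590 tape drive

for label, volume in (("average", NIGHTLY_GB), ("peak", PEAK_GB)):
    observed = hours_to_move(volume, OBSERVED_MB_S)
    native = hours_to_move(volume, NATIVE_MB_S)
    print(f"{label}: {observed:.1f} h at observed rate, {native:.1f} h at native rate")
```

At 8MB/sec, a single stream would need roughly 19 hours for an average night, which suggests the load was spread across several drives. Whatever the drive count, running at two-thirds of native speed stretches the backup window by half.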
Although the team generally liked how EMC's SRDF performed in synchronously replicating mainframe data, they had concerns about SRDF's tendency to propagate mistakes in the data and its inability to provide logically consistent data. In the event of a rolling disaster such as a security breach or code error, SRDF would replicate bad data to the remote site.
|Decision matrix: Virtual tape library vs. SATA tapeless backup|
Phase 2: Policies and recommendations
The assessment indicated new data retention policies were needed. "We realized we had to have separate policies for production and development," says Rideout. The team reviewed a year's worth of restore history to create its new retention and recovery time policies.
For example, they recommended an onsite RTO of two hours to eight hours for individual files, eight hours for a file server OS, 12 hours for a database server and 24 hours for an email server. They set data retention policies of 14 days for a database server (because the data changes so frequently), 30 days for development systems, and 90 days for email and production systems.
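The recommended tiers can be pictured as a simple lookup table. The following is a minimal sketch, not L.L.Bean's actual TSM configuration: the dictionary keys and the helper functions `is_expired` and `meets_rto` are hypothetical names chosen for illustration, while the hour and day values are the ones the team recommended.

```python
from datetime import date

# Onsite RTO in hours as (best case, worst case), per the team's recommendations
RTO_HOURS = {
    "individual_file": (2, 8),
    "file_server_os": (8, 8),
    "database_server": (12, 12),
    "email_server": (24, 24),
}

# Retention in days, per the team's recommendations
RETENTION_DAYS = {
    "database_server": 14,   # data changes frequently
    "development": 30,
    "email_server": 90,
    "production": 90,
}


def is_expired(system_class: str, backup_date: date, today: date) -> bool:
    """True if a backup is older than the retention period for its class."""
    return (today - backup_date).days > RETENTION_DAYS[system_class]


def meets_rto(item: str, restore_hours: float) -> bool:
    """True if a restore finished within the worst-case RTO for the item."""
    return restore_hours <= RTO_HOURS[item][1]


# Example: a 15-day-old backup has aged out for a database server
# but is still retained for a development system.
today = date(2006, 1, 16)
taken = date(2006, 1, 1)
print(is_expired("database_server", taken, today))
print(is_expired("development", taken, today))
```

Keeping retention and recovery targets in separate tables mirrors the way the policies are stated: some classes, such as development systems, have a retention period but no stated RTO.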
To establish a consistent backup and recovery process, the team recommended such things as consistent configuration of all TSM clients, use of an enterprise job scheduler, and dedicated staff resources to manage and maintain the backup and recovery environment.
The team also decided to stay with TSM. "At the beginning, we talked about other backup utilities. TSM had worked well. Our problems weren't due to TSM," says Rideout. "Also, we had a lot of knowledge of TSM in-house." L.L.Bean will use TSM for open-systems backup and recovery; it handles mainframe backup through IBM's hierarchical storage management (HSM) utility.
Two major decisions revolved around disk backup and the choice of SATA or a virtual tape library (VTL). If they opted for disk backup over tape, they'd have to choose between a low-cost SATA array running in a tapeless TSM environment and a VTL product.
Previously, L.L.Bean had always done backup to tape using an IBM 3494 ATL connected via ESCON to the mainframe. The ATL used Magstar tape drives and the company had eight ATL frames holding a few thousand tapes, some dedicated to HSM data and others to TSM data. "We were already running out of room," says Rideout. "We had gotten to the point where we would have to eject tapes to make room for new tapes." If they expanded the ATL, however, they would have to add frames and upgrade the connection to FICON.
The tape space crunch simplified the choice. "We could take TSM off the mainframe and gain all the ATL space for HSM. It would extend the life of the ATL," says Rideout. The team recommended moving the TSM open-systems backup to disk. But the question remained: Which disk backup format should they choose: VTL or SATA disk array? The difference came down to VTL, which spills over to tape automatically, or a totally tapeless approach that relies solely on SATA disk (see "Fixing mainframe Tivoli Storage Manager backup problems").
"This was a very tough decision, but we had to be able to grow," says Rideout. To make the choice, the team put together a decision matrix identifying the advantages and disadvantage of VTL and SATA disk. For example, VTL could be used for development if recovery wasn't going to be a high priority. When you run out of capacity with VTL, however, "you need to bring in a new frame," says Rideout. That brought the team back to its desire to avoid adding frames.
SATA arrays require more monitoring and active management to make sure file spaces don't fill up and to identify hot spots. But they're more flexible than a VTL in terms of adding disk and increasing capacity, and they support recovery at the volume level and provide a faster restore time (see "Decision matrix: Virtual tape library vs. SATA tapeless backup").
The team recommended, and management accepted, moving open-systems backup off the existing ATL to tapeless TSM in conjunction with SATA arrays. At the same time, L.L.Bean opted to go with Linux over AIX. It would deploy two TSM servers, one for production backup and one for testing/development.
With the open-systems backup decided, the team turned its attention to mainframe backup. Here the team's main concern was its reliance solely on SRDF to mirror the data to its remote backup site. Under conditions like rolling disasters caused by a security breach or human error, the company could experience data integrity issues due to the synchronous mirroring of corrupted data.
"We realized we could add EMC TimeFinder/Clone to eliminate rolling disasters," notes Rideout. The company would continue using HSM to back up data, even with SRDF and TimeFinder pumping out copies of the data to the remote site. TimeFinder's consistency group capability would also protect the production SRDF DASD against a rolling DASD disaster, as well as maintain data coherency across TimeFinder-protected application and database volumes (see "L.L.Bean's backup objectives," this page).
"The combination of SRDF and TimeFinder will work very well in L.L.Bean's situation," says Greg Schulz, founder and senior analyst at The StorageIO Group in Stillwater, MN. "If they want even more protection, they could add something like one of the GoldenGate Software [Inc.] products." GoldenGate captures changes to database data before they hit data buffers, logs or disk.
|L.L.Bean's backup objectives|
Phase 3: Implementation
The implementation of the new backup process proved to be the simplest part. It consisted of acquiring the new hardware and then migrating the nodes to be backed up.
The migration task was simplified by the initial assessment. "During the assessment, we listed all the nodes that needed to be backed up," says Rideout. The team assigned each node--a total of 178--to a system administrator. Once the hardware was in place, the team proceeded to move three to five nodes each day. The whole process took approximately three months.
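The three-month figure is consistent with the pace of the cutover, as a quick calculation shows. This is a back-of-the-envelope sketch: the node count and the daily rate come from Rideout's account, while the figure of 21 working days per month is an assumption added here.

```python
import math


def migration_days(nodes: int, nodes_per_day: int) -> int:
    """Working days needed to move all nodes at a steady daily rate."""
    if nodes_per_day < 1:
        raise ValueError("nodes_per_day must be >= 1")
    return math.ceil(nodes / nodes_per_day)


NODES = 178                    # nodes assigned to system administrators
WORKING_DAYS_PER_MONTH = 21    # assumption, not from the article

fastest = migration_days(NODES, 5)   # five nodes a day
slowest = migration_days(NODES, 3)   # three nodes a day

print(f"{fastest} to {slowest} working days")
print(f"{fastest / WORKING_DAYS_PER_MONTH:.1f} to "
      f"{slowest / WORKING_DAYS_PER_MONTH:.1f} months")
```

At three to five nodes a day, 178 nodes take between 36 and 60 working days, or a little under two to a little under three months of steady work.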
L.L.Bean ended up doing very little actual moving of previously backed up data. "We would switch a node and it would back up to the new system," says Rideout. Based on its previous retention policies, it could remove the old backup data in 90 days or 180 days. As that data was removed, L.L.Bean freed the tape in the ATL for mainframe HSM backup. "Most of the migration was just attrition," notes Rideout. Simply by redirecting nodes to the SATA disk and backing them up as the team normally would, over time, they would have all the backed up data in the new system.
The mainframe implementation required the most attention; the team had to implement new tools like TimeFinder, but the implementation wasn't disruptive or difficult. By August 2006, the new backup and recovery process was fully implemented and operating without a hitch, 11 months from start to finish, according to Rideout. More importantly, she says, the "backup and recovery performance exceeded the criteria we had set." This success contrasted with a previous effort. "We attempted to do something like this on a smaller scale three to four years ago and we just couldn't get it off the ground," says Rideout.
Looking back, it seemed so easy, but Rideout had learned a few lessons from the previous effort that ensured success this time around. "The first time we lacked a corporate sponsor," she says. This time, she had the vice president of information systems lined up from the beginning. She also insisted that a project manager be dedicated to the project.
Finally, the success of the assessment phase--the team's ability to gather detailed data--paved the way for everything that followed. The assessment data enabled the team to develop better retention policies and alerted them to the looming tape crunch.
In terms of performance, L.L.Bean has reduced retail database backup from 7.5 hours to two hours. In addition, file-server restore has been reduced from 10 hours to 1.5 hours.
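Expressed as ratios, those improvements are easier to compare. The short sketch below simply restates the before-and-after times reported above; the function names are illustrative.

```python
def speedup(before_hours: float, after_hours: float) -> float:
    """How many times faster the job runs after the change."""
    return before_hours / after_hours


def reduction_pct(before_hours: float, after_hours: float) -> float:
    """Percentage of the original elapsed time that was eliminated."""
    return (before_hours - after_hours) / before_hours * 100


# (before, after) in hours, as reported for L.L.Bean
results = {
    "retail database backup": (7.5, 2.0),
    "file-server restore": (10.0, 1.5),
}

for job, (before, after) in results.items():
    print(f"{job}: {speedup(before, after):.2f}x faster, "
          f"{reduction_pct(before, after):.0f}% less time")
```

The database backup runs 3.75 times faster, and the file-server restore, at roughly 6.7 times faster, shows the larger gain.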
Buoyed by the project's success, the L.L.Bean storage team doesn't intend to stop at backup. "We have all this SATA capacity. I'd love to look at archiving," says Rideout. L.L.Bean currently doesn't archive email, but with 34TB of SATA capacity, Rideout thinks it should.