Backup-to-disk performance tuning

Disk-based backup can lower costs, reduce complexity and add scalability. But to achieve top performance, you'll need to do lots of benchmarking and watch for poorly configured production storage.

The key to achieving top backup-to-disk performance is to fine-tune the storage configuration and conduct benchmarking tests to determine the best combination of production and backup storage configurations.

Not long ago, a company asked me to help redesign its backup system. What prompted the request was a corrupt file system that took more than 36 hours to restore from the company's four-year-old tape backup system. Its data had been growing between 100% and 200% per year.

The company wanted to look at new technology to protect its data. It asked the usual questions: How much will it cost to upgrade our current tape resources? How many more tape drives will we need to stay within the backup window as data grows? We looked at all of the company's data protection requirements, including the length of the backup windows, restore speeds and internal service-level agreements. We presented several solutions, ranging from entirely new backup technologies like Axion from Irvine, Calif.-based Avamar Technologies Inc., to upgrading or replacing existing DLT tape libraries with faster LTO, to leveraging existing backup software and using disk for backup storage.

Because the company had invested hundreds of thousands of dollars in its existing backup software and hardware, it wanted to see that investment leveraged in some way. There was also a concern about replacing tape drives that still worked and the painful process of upgrading from one tape drive technology to another. Alternative backup solutions such as Avamar's Axion with commonality factoring, although great for the right environment, didn't help this firm. After looking at all of the existing backup software solutions on the market, we found little justification for displacing the firm's existing Veritas NetBackup software from Symantec Corp.

At that point in the review process, the focus changed to a single question: Should we upgrade the company's tape drives or replace them with a SATA disk system?

After looking at products from several tape and disk vendors, we determined that upgrading the tape libraries would cost more than a comparable disk-based solution. The company had two specific goals: Eliminate the need to vault tapes offsite, and restore up to 1TB of data in less than eight hours. Our proposed solution was to back up all data at the company's primary data center and then send backup copies to a second site for disaster recovery protection.
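The restore goal translates directly into a minimum sustained throughput. The short sketch below works that out; it assumes decimal units (1TB = 1,000,000MB), and the function name is illustrative rather than part of any backup product.

```python
def required_mb_per_sec(data_tb: float, window_hours: float) -> float:
    """Sustained throughput (MB/sec) needed to move data_tb within window_hours.

    Assumes decimal units: 1TB = 1,000,000MB.
    """
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return data_tb * 1_000_000 / (window_hours * 3600)


if __name__ == "__main__":
    # The stated goal: restore up to 1TB in less than eight hours.
    rate = required_mb_per_sec(1, 8)
    print(f"Minimum sustained restore rate: {rate:.1f}MB/sec")
```

Under these assumptions, the goal works out to roughly 35MB/sec of sustained restore throughput.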

Disk advantages
Disk-based backup offers several advantages over tape:

  • Lower cost
  • Increased reliability and speed
  • Reduced complexity
  • Scalability

Because of this company's high rate of data growth, the scalability offered by disk was crucial. With its Unix backup server using Symantec's Veritas Storage Foundation, it could grow its backup storage on the fly with zero impact on backup operations. By analyzing existing level-zero and incremental backups, we were able to size a backup system to accommodate data-retention requirements. The final system design included two 20TB Hitachi Data Systems (HDS) 9570V modular storage systems (one at each site), with McData Corp. Eclipse 1620 SAN routers (McData is in the process of being acquired by Brocade Communications Systems Inc.). The backup server, media servers and primary site backup storage were added to an existing SAN, while the McData Eclipse 1620 SAN routers connected the second array at the remote site to the primary production SAN. The existing tape resources will now be used for backups with retention past three months and for archiving (see "New disk-based backup system," below).
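The article doesn't spell out the sizing calculation, so the following is only a rough sketch of one common approach: multiply the size of each backup type by the number of copies kept on disk, then add headroom for growth. The function name, the weekly-full/daily-incremental schedule and every input value are hypothetical placeholders, not figures from this project.

```python
def backup_capacity_tb(full_tb, incr_tb, fulls_retained, incrs_retained,
                       growth_headroom=0.0):
    """Estimate disk capacity (TB) needed to hold retained backups.

    full_tb / incr_tb      -- average size of one level-zero / incremental backup
    fulls_retained         -- number of level-zero backups kept on disk
    incrs_retained         -- number of incremental backups kept on disk
    growth_headroom        -- extra fraction reserved for growth (1.0 = 100%)
    """
    if min(full_tb, incr_tb, fulls_retained, incrs_retained, growth_headroom) < 0:
        raise ValueError("inputs must be non-negative")
    retained = full_tb * fulls_retained + incr_tb * incrs_retained
    return retained * (1 + growth_headroom)


if __name__ == "__main__":
    # Hypothetical inputs: 1TB fulls weekly, 0.05TB incrementals six days a week,
    # about three months (13 weeks) kept on disk, 100% headroom for growth.
    needed = backup_capacity_tb(1.0, 0.05, fulls_retained=13,
                                incrs_retained=78, growth_headroom=1.0)
    print(f"Estimated capacity: {needed:.1f}TB")
```

A real sizing exercise would also account for factors this sketch ignores, such as RAID overhead and per-server differences in change rate.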

Because backup to disk was new technology for this company, it wanted to evaluate any proposed solutions before making a purchase. This was a particularly difficult requirement because most storage vendors won't allow customers to evaluate disk systems in the customers' own data centers. Most vendors will offer the use of their integration labs, which are usually quite extensive, but it's difficult to reproduce a customer environment exactly, down to the block level. So HDS offered a "try and then buy" option.

Lessons learned
We learned the following lessons during implementation, testing and debugging:

  • Watch for poorly configured production storage.
  • Disperse I/O across the SATA array with more RAID groups and LUNs.
  • Plan for adequate downtime for production servers when adding new backup storage.
  • Use striped storage volumes at the host layer for backup storage.
  • Enable active/active for pathing software for backup storage.
  • Benchmark several different storage configurations before pushing the backup solution into production to validate performance.
  • For planning purposes, estimate disk restore speeds at 1.5 times the backup speed vs. 0.5 times the backup speed for tape.

The primary hurdle for this project was the existing production storage configuration. There were bottlenecks--several large file systems composed of just a couple of very large LUNs--in the way production storage was allocated to some of the servers on the SAN. I anticipated this would be only a minor drag on overall backup performance. Instead, it proved to be the top performance limitation. Even servers that had high-end storage on state-of-the-art Fibre Channel drives couldn't push more than 6MB/sec for multiple-terabyte file systems on one or two LUNs. As a result, these file systems were reengineered with more LUNs for each file system. The new file systems went from pushing a mere 6MB/sec to pushing approximately 60MB/sec to 100MB/sec.

For optimal performance, LUNs created on the HDS 9570V for backups and allocated to backup servers would have to be spread over different RAID groups across the entire array to maximize I/O performance. By creating more RAID groups and LUNs, we doubled performance. We tested configurations using five RAID groups with five large LUNs, and 10 RAID groups with 10 LUNs. We found that more LUNs yield better I/O performance (see "How the number of LUNs affects performance," below).

(Figure: Legacy tape backup environment)

Implementation headaches
Integrating the storage on existing production servers required more downtime than the company expected. Plan for a minimum of three reboots per server and approximately two hours of downtime for each reboot, depending on the size of the server. I allowed for additional time on the larger back-end servers, which had more than 12 CPUs and exceeded 2TB of existing storage.

We also paid attention to the performance differences between striped and concatenated volumes. Once the LUNs--sometimes as many as 10--are presented from storage, a volume is created using the host volume manager. As a rule, striped volumes deliver the best I/O performance because sequential I/O is spread across all of the LUNs in the volume.
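To illustrate why the two layouts behave differently, here is a simplified model of how a volume manager might map a logical block to a LUN under each scheme. It is a sketch only: the function names, block counts and stripe size are invented for illustration and don't reflect how any particular volume manager is implemented.

```python
from typing import Tuple


def concat_locate(block: int, lun_blocks: int, lun_count: int) -> Tuple[int, int]:
    """Concatenated volume: fill LUN 0 completely, then LUN 1, and so on.

    Returns (lun index, block offset within that LUN).
    """
    if not 0 <= block < lun_blocks * lun_count:
        raise IndexError("block outside volume")
    return divmod(block, lun_blocks)


def stripe_locate(block: int, stripe_blocks: int, lun_count: int) -> Tuple[int, int]:
    """Striped volume: move to the next LUN every stripe_blocks blocks.

    Returns (lun index, block offset within that LUN).
    """
    if block < 0:
        raise IndexError("block must be non-negative")
    stripe, within = divmod(block, stripe_blocks)
    row, lun = divmod(stripe, lun_count)
    return lun, row * stripe_blocks + within


if __name__ == "__main__":
    # A sequential run of eight blocks on a hypothetical four-LUN volume.
    run = range(8)
    concat_luns = sorted({concat_locate(b, 1000, 4)[0] for b in run})
    stripe_luns = sorted({stripe_locate(b, 2, 4)[0] for b in run})
    print("Concatenated volume touches LUNs:", concat_luns)
    print("Striped volume touches LUNs:     ", stripe_luns)
```

In this model, a sequential backup stream on the concatenated volume stays on a single LUN until that LUN is full, while the striped volume keeps all four LUNs busy.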

Dual pathing on the SAN was also closely examined during the benchmarking phase. Most pathing software defaults to an active/passive configuration. With Symantec's Veritas Dynamic Multi-Pathing (DMP), for instance, specific settings are required to operate in active/active mode. Using the vxdmpadm command, you can observe I/O on each path and determine the current mode of operation. Other pathing software isn't quite this sophisticated. Veritas DMP is also the easiest product to administer and has the widest range of support for disk arrays from various vendors.

During the benchmarking phase, the most labor-intensive task was changing the way the backup storage array was laid out for each round of benchmarks. Each new configuration on the storage array required the existing RAID groups and LUNs to be deleted and then initialized from scratch. This required detailed planning and a lot of patience. Each time the array was reconfigured, it took one day to reconfigure and then one day for the LUNs to finish formatting. Fortunately, the HDS 9570V lets you allocate and use the LUNs at the OS level before they're finished formatting on the array. However, it's not a good practice to perform benchmarks or write production data until the formatting is complete.

Benchmarking was critical to the project's success. Each benchmark included backup times from legacy tape backups and various backup-to-disk configurations. Some servers performed better with certain configurations than others. After examining all of the benchmark data, we designed backups for each server based on its particular results. For example, one server that backed up to disk at only 6MB/sec with one storage configuration reached 100MB/sec with another. This gain wouldn't have been realized without benchmarking different storage configurations on the backup disk. With disk backups, restore speeds are normally about twice backup speeds; our fastest restore for a single server was 180MB/sec.

For a file server, it's common to have data stored on random blocks across a storage array. During a backup to disk, the data is written sequentially. As a result, reading that data back during a restore is substantially faster than restoring from tape. Even though data is also written to tape sequentially during a backup, tape seek times during a restore add overhead that results in restore speeds about 50% slower than backup speeds. Multiplexing or multistreaming adds further processing time to a tape restore and causes additional delays.
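The planning factors from the lessons-learned list (restore at 1.5 times backup speed for disk, 0.5 times for tape) can be turned into a quick restore-time estimate. The sketch below does that; the function and constant names are illustrative, decimal units are assumed (1GB = 1,000MB), and the identical backup rate for both media is a simplification made only to compare the factors.

```python
# Planning factors from the lessons-learned list: restore speed as a
# multiple of backup speed.
RESTORE_FACTOR = {"disk": 1.5, "tape": 0.5}


def estimated_restore_hours(data_gb: float, backup_mb_per_sec: float,
                            media: str) -> float:
    """Estimate restore time in hours from the measured backup rate.

    Assumes decimal units (1GB = 1,000MB) and a steady restore rate.
    """
    if backup_mb_per_sec <= 0:
        raise ValueError("backup_mb_per_sec must be positive")
    restore_rate = backup_mb_per_sec * RESTORE_FACTOR[media]
    return data_gb * 1000 / restore_rate / 3600


if __name__ == "__main__":
    # Restore 1TB that was backed up at 100MB/sec; the same backup rate is
    # assumed for both media purely for comparison.
    for media in ("disk", "tape"):
        hours = estimated_restore_hours(1000, 100, media)
        print(f"{media}: about {hours:.1f} hours")
```

These are planning estimates only; as the benchmarks in this project showed, measured restore rates vary with the storage configuration.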

The benchmarking testbed included a Web server, file server and database servers for Windows, Solaris and HP-UX. It's impossible to say that a particular server or OS performed better than another because the data on each server is unique. For instance, a Windows file server will back up more slowly than a Unix database server because file server data is written randomly while database data is laid out sequentially. It's therefore more beneficial to explain the results we achieved on each OS and server type, and how we managed to improve various results.

The Solaris and HP-UX database servers performed the best for standard file-system backups. Our best Solaris server peaked at approximately 100MB/sec or 360GB/hour for backups, and restored at close to 180MB/sec or 648GB/hour. It was an Oracle database server with nine Veritas file systems on nine LUNs, each getting its own backup job. On the same server, however, there was a 4TB mount point that couldn't be made to run faster than 6MB/sec because of the underlying storage configuration. This was the case across the environment: The larger the file system and the fewer the LUNs, the slower the backup and restore.
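The unit conversions behind these figures are easy to check, and the same arithmetic shows why a 6MB/sec file system is such a problem. The helper names below are illustrative, and decimal units are assumed (1GB = 1,000MB, 1TB = 1,000,000MB).

```python
def gb_per_hour(mb_per_sec: float) -> float:
    """Convert a throughput in MB/sec to GB/hour (decimal units)."""
    return mb_per_sec * 3600 / 1000


def hours_to_move(data_tb: float, mb_per_sec: float) -> float:
    """Hours needed to move data_tb at a steady mb_per_sec (decimal units)."""
    if mb_per_sec <= 0:
        raise ValueError("mb_per_sec must be positive")
    return data_tb * 1_000_000 / mb_per_sec / 3600


if __name__ == "__main__":
    print(f"100MB/sec = {gb_per_hour(100):.0f}GB/hour")
    print(f"180MB/sec = {gb_per_hour(180):.0f}GB/hour")
    # The 4TB mount point that couldn't exceed 6MB/sec:
    hours = hours_to_move(4, 6)
    print(f"4TB at 6MB/sec: about {hours:.0f} hours ({hours / 24:.1f} days)")
```

At a steady 6MB/sec, a full pass over the 4TB mount point would take more than a week, which is why reengineering the underlying LUN layout mattered more than any tuning on the backup side.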

The Windows file servers had the poorest performance. The worst of them used built-in Windows volume management with no striped volumes for data storage. As a result, any I/O was single-threaded and often competed with user activity. Our fastest Windows file server ran at approximately 5MB/sec no matter how many streams were used. Next up was the Solaris file server, which came in at a whopping 7MB/sec. A good way to increase those numbers is to implement Veritas NetBackup FlashBackup or some other form of snapshot technology.

The bottom line is that the key to faster backup to disk is how the production and backup storage are configured. This company wanted a restore capability of 1TB within eight hours; it can now restore a terabyte in less than two hours on its fastest server.
