Published: 26 Sep 2008
Identify and destroy bottlenecks with the right speeds and feeds.
Before you start throwing hardware or software at a problem, a crucial first step is identifying the potential bottlenecks. Doesn't that sound simple? Maybe, but identifying and removing bottlenecks requires a deep understanding of how data flows through a backup environment, and there are numerous points at which the backup process can slow to a crawl. Your mission is to identify and eliminate bottlenecks from your environment.
Consider the following company struggling with slow backup performance. Despite having a virtual tape library (a dual-headed VTL unit with aggregate performance up to 800MB/sec), multiple tape drives (25 drives with a native throughput of 30MB/sec) and several media servers, its aggregate backup performance was less than 200MB/sec. That may be acceptable for many environments, but this firm had nearly 75TB of data it wanted to back up in a 24-hour window. To do that, it needed to quadruple its throughput. Its million-dollar infrastructure was performing slower than a single LTO-4 drive. Not good.
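The gap between what this company had and what it needed can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, decimal units (1TB = 1,000,000MB) are assumed, and 200MB/sec is used as the upper bound of the observed rate.

```python
SECONDS_PER_HOUR = 3600

def required_throughput_mb_s(data_tb, window_hours, mb_per_tb=1_000_000):
    """Sustained MB/sec needed to move data_tb within window_hours.

    mb_per_tb assumes decimal units; pass 1024 * 1024 for binary units.
    """
    return data_tb * mb_per_tb / (window_hours * SECONDS_PER_HOUR)

needed = required_throughput_mb_s(75, 24)
current = 200  # observed aggregate rate in MB/sec (upper bound from the example)

print(f"needed: {needed:.0f} MB/sec, shortfall: {needed / current:.1f}x")
# prints "needed: 868 MB/sec, shortfall: 4.3x"
```

The result, a little over four times the current rate, lines up with the need to roughly quadruple throughput.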
The customer was told that it was time to investigate four key areas, in order of importance, to find the bottlenecks.
It's unlikely a backup administrator will have the ability to modify the client hardware, but it's critical to understand what you're dealing with from the client side. Specifically, it's important to know how many clients exist, and to have a clear picture of their network connections to the backup servers. While most environments today run gigabit networks, some individual servers still use 100Mb/sec NICs. Depending on the amount of data these slower hosts protect, these outdated connections may not be adequate. For example, this customer discovered that several of the clients were still running 100Mb/sec NICs. Most of the hosts were small enough that this didn't matter, but one problematic client had more than 700GB of data.
While it's possible to back up this host in less than 24 hours, a general rule of thumb when examining clients is that any system with more than 500GB of data is a potential SAN client. Rather than backing up over the IP network, a SAN client is configured to back up across the SAN fabric directly to the backup media. Leveraging SAN-based backups for large clients eliminates the IP network and media servers as potential bottlenecks. While it may increase the CPU load on the client servers, most users are willing to sacrifice that additional CPU overhead for faster backups and, more importantly, faster restores. In this particular case, the firm installed an additional host bus adapter (HBA) in this problem client, and shortened the backup time from 20 hours to two. This meant restore times for this host were dramatically reduced.
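A rough estimate shows why the 700GB client was a problem on a 100Mb/sec NIC. This is a sketch under stated assumptions: the function name is hypothetical, decimal units are used, and the 80% efficiency figure is an illustrative guess at real-world overhead, not a measured value.

```python
def backup_hours(data_gb, link_mbit_s, efficiency=1.0):
    """Hours needed to move data_gb over a link rated at link_mbit_s.

    efficiency is the assumed fraction of line rate actually achieved.
    """
    mb_per_sec = link_mbit_s / 8 * efficiency  # megabits -> megabytes
    return data_gb * 1000 / mb_per_sec / 3600

def implied_rate_mb_s(data_gb, hours):
    """Average MB/sec implied by finishing data_gb in the given hours."""
    return data_gb * 1000 / (hours * 3600)

print(f"{backup_hours(700, 100):.1f} h at full line rate")
print(f"{backup_hours(700, 100, efficiency=0.8):.1f} h at 80% of line rate")
print(f"{implied_rate_mb_s(700, 2):.0f} MB/sec needed to finish in two hours")
```

Even at full line rate the 100Mb/sec link needs more than 15 hours, and at the assumed 80% it lands close to the 20 hours the firm saw. Finishing in two hours implies nearly 100MB/sec sustained, which is well beyond what that NIC can deliver.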
Once you've identified and eliminated bottlenecks at the client level, it's time to focus on the actual backup infrastructure, starting with the backup servers (including master and media servers). I'm often asked how much memory and how many CPUs these servers need. While CPU and memory are critical components of the servers, they're rarely the source of bottlenecks (that said, if you measure RAM in megabytes or CPUs in MHz, it's time for a change). More often, the bottlenecks can be traced to the number and speed of the network connections.
At the front end of the server we have the IP connections. Insufficient IP connectivity is the most common bottleneck in the backup infrastructure. Many customers look at media servers as just another piece of hardware. Most often, they'll purchase a standard build that may include a dual-ported gigabit NIC. While this may be sufficient for most production servers, it can create serious throughput issues in a backup environment.
For example, if you've remediated your biggest client-side bottlenecks by implementing a dedicated backup VLAN with gigabit connections, it's possible that just two backup clients could consume all the network bandwidth into the media server. Admittedly, these would have to be heavy-duty client machines to max out the incoming data pipe, but it's easy to imagine a scenario where 10 clients, all with gigabit connections, back up to a single media server with two gigabit connections. In that scenario, the clients will be competing for bandwidth at the media server level. In environments with hundreds or even thousands of backup clients, you might need several dozen media servers with multiple trunked network cards just to provide enough network bandwidth to meet the backup window. Alternatively, as 10Gb/sec networks become more common, the media servers should be among the first systems upgraded. Upgrading the front-end connectivity for these systems shifts the bottleneck away from the IP network.
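The 10-client scenario above can be expressed as an oversubscription ratio. The following is a minimal sketch with hypothetical function names; it assumes every client can drive its link at full rate at the same time, which is a worst case rather than a typical load.

```python
def oversubscription(clients, client_gbit, server_ports, port_gbit):
    """Ratio of potential client demand to media server ingress capacity."""
    demand = clients * client_gbit
    capacity = server_ports * port_gbit
    return demand / capacity

def fair_share_mb_s(clients, server_ports, port_gbit):
    """MB/sec each client gets if server ingress is split evenly."""
    capacity_mb_s = server_ports * port_gbit * 1000 / 8
    return capacity_mb_s / clients

# Ten gigabit clients into one media server with two gigabit ports.
print(oversubscription(10, 1, 2, 1))   # 5.0
print(fair_share_mb_s(10, 2, 1))       # 25.0

# The same clients into a single 10Gb/sec port.
print(oversubscription(10, 1, 1, 10))  # 1.0
```

A ratio above 1.0 means the media server's network ports, not the clients, set the ceiling. In the two-port case each client is limited to about a fifth of its own link speed.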
After resolving the end-to-end IP network bottlenecks, it's time to examine the backup targets and connections. Everyone's talking about backup-to-disk technology, but tape still plays a critical part in most backup environments. While many environments hope to go tapeless in the near future, many struggle to meet this goal due to extended retention times or tremendous amounts of backup data.
Most backup architectures still contain a sizable tape infrastructure, but hitting the performance benchmarks for these drives becomes a challenge. Consider this: LTO-4 drives, at 120MB/sec native (up to 240MB/sec at 2:1 compression), are faster than a single 1Gb/sec Fibre Channel (FC) connection. While most SAN environments are moving toward 4Gb/sec or 8Gb/sec FC, many are still running 2Gb/sec, especially on older backup servers. In those environments, deploying LTO-4 drives will hurt performance unless special attention is paid to the fan-in ratio (the number of drives per HBA). To determine the appropriate fan-in ratio for your environment, it's critical to understand the speeds and feeds of the components. For instance, a 2Gb/sec HBA has a maximum throughput of 256MB/sec, which is roughly what a single LTO-4 drive can consume with compression. Based on that figure, you would need a 2Gb/sec HBA for each LTO-4 drive you deploy. Even with 4Gb/sec HBAs, the fan-in ratio should not exceed 3:1 (drives:HBAs).
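The fan-in arithmetic can be sketched as follows. The function name is hypothetical, and the 512MB/sec figure for a 4Gb/sec HBA is an assumption obtained by doubling the 256MB/sec quoted for 2Gb/sec; usable throughput on real links is typically lower, so treat the results as upper bounds.

```python
import math

# Throughput figures in MB/sec. The 4Gb/sec value is an assumed doubling
# of the 256MB/sec quoted for a 2Gb/sec HBA.
HBA_MB_S = {"2Gb/sec": 256, "4Gb/sec": 512}
LTO4_MB_S = {"native": 120, "2:1 compression": 240}

def max_drives_per_hba(hba_mb_s, drive_mb_s):
    """Largest number of drives one HBA can feed at full drive speed."""
    return math.floor(hba_mb_s / drive_mb_s)

for hba_name, hba_rate in HBA_MB_S.items():
    for mode, drive_rate in LTO4_MB_S.items():
        drives = max_drives_per_hba(hba_rate, drive_rate)
        print(f"{hba_name} HBA, LTO-4 {mode}: {drives} drive(s)")
```

With these numbers a 2Gb/sec HBA supports one drive at full compressed speed, while a 4Gb/sec HBA supports between two (fully compressed data) and four (uncompressed data). The 3:1 ceiling falls inside that range.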
Many purchasing departments cringe when they see a server order that contains numerous 4Gb/sec HBAs just for tape plus another set of cards for disk connectivity. But if you want your new tape drives to function at their intended speeds, you must provide sufficient connectivity. The customer discussed earlier was planning to purchase 20 LTO-4 drives for its media servers, but hadn't planned to upgrade the servers or the HBAs. As a result, the media servers would have had a fan-in ratio of 10:1 on 2Gb/sec ports. When data was written to tape, the media servers would be unable to keep the drives streaming, resulting in what's known as the shoeshine effect, in which a starved drive repeatedly stops, repositions the tape and restarts. By upgrading to 4Gb/sec and reducing the number of drives on each HBA, that customer was able to maximize the performance of the new drives.
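To see how far the planned configuration fell short, divide each HBA's bandwidth among the drives attached to it. This sketch uses hypothetical names; the 512MB/sec figure for 4Gb/sec is an assumed doubling of the 2Gb/sec number, and the 3:1 ratio for the revised layout is taken from the rule of thumb above, since the customer's final ratio isn't stated.

```python
LTO4_NATIVE_MB_S = 120  # native LTO-4 rate

def per_drive_share(hba_mb_s, drives_on_hba):
    """MB/sec available to each drive if all drives on the HBA write at once."""
    return hba_mb_s / drives_on_hba

configs = [
    ("planned: 2Gb/sec HBA, 10:1", 256, 10),
    ("revised: 4Gb/sec HBA, 3:1 (assumed)", 512, 3),
]

for label, hba_rate, drives in configs:
    share = per_drive_share(hba_rate, drives)
    verdict = "can stream" if share >= LTO4_NATIVE_MB_S else "starved"
    print(f"{label}: {share:.1f} MB/sec per drive, {verdict}")
```

At 10:1 each drive would receive about 25.6MB/sec, roughly a fifth of its native rate, which is the condition that leads to shoeshining.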
Performance tuning is an art, but identifying infrastructure bottlenecks is more of a strict mathematical exercise once you know the important numbers. Understanding the source of existing and potential bottlenecks makes it easier to find and remove them. Sure, it may require an investment in hardware to remediate them, but improving the speed of backups in an IT environment that's growing and changing is critical to meet the needs of the business. Do the math, and start hitting your backup windows. Then you won't have to worry so much when your phone rings.