Backup SLAs: The art of diplomacy

Negotiating backup service-level agreements (SLAs) can be one of the toughest elements of transitioning IT from a mere technology competence center to a real part of the business. Detailing what a backup service will provide, and figuring out how to measure and report on these promises, will greatly improve enterprise storage operations.

You need more than technical acumen to negotiate an SLA with your business units.

Negotiating service-level agreements (SLAs) can be one of the trickiest elements of transitioning IT from a mere technology competence center to a real part of the business. Detailing what a backup service will provide to its customers, and figuring out how to measure and report on these promises, will improve satisfaction and alignment; however, many IT professionals lose their cool when discussing service levels with business managers.

SLAs are a hot IT topic (see "The essential elements of an SLA"). They're a key element in transitioning IT from merely "something that has to be taken care of" to a critical business advantage. SLAs hinge on the proposition that proper feedback to a customer will yield better products. In other words, if IT can clearly articulate what it will deliver, computer systems users will be better able to ask for what they need. And when people who know the business have their needs met more appropriately and efficiently by internal groups, the whole enterprise benefits.

Of course, this isn't how IT operates at most companies. Indeed, the recent high-profile stories of companies that failed to protect their data brought the discipline of data backup into the limelight like perhaps no other element of IT infrastructure. Judges are subpoenaing data restores, lost tapes are embarrassing companies and possibly compromising their customers, and natural disasters are exposing the poor data protection practices at many organizations. With all of these outside pressures on the business, IT staffers are subjected to daily demands to increase backup service levels.

The essential elements of an SLA
A service-level agreement (SLA) is a contract that codifies the requirements and expectations of all parties, both supplier and customer, for delivery of a class of service. It should specify the service tier required and detail the costs of providing that tier, and it can include penalties for failing to deliver the service. In general, SLAs include the following sections:
  1. Preamble, prologue or summary: Identifies and describes the document and governing policy
  2. Service: Codifies the terms of the agreement by identifying the class of service to be delivered
  3. Responsibilities: Outlines the requirements for implementation and identifies the division of labor
  4. Operations: Specifies support level, including provisioning, monitoring, escalation paths and response time
  5. Compliance and reporting: Identifies the method and frequency of reporting
  6. Appendix: Contains supplemental material, including any servers specifically excluded or with nonstandard backup windows, as well as other requirements

Transforming backup into a service
No discussion of SLAs can begin until the bigger issue of "service" is settled. You'll find that once IT is transformed into a real service provider, constructing SLAs becomes simple (see "The four parts of a service provider model"). The pieces of the contract naturally fall together once all parties agree on what should be included. But before you can agree to a service level for backup, you must transform backup into a service.

This is more easily said than done because backup is unusual among IT disciplines in that it's not in the critical path of operations. This means backup can fail day after day with no impact on the app's availability or performance, at least not until a restore request is made. This unusual aspect is shared with archiving, capacity planning and a few other disciplines, but it sets backup apart from the availability- and performance-obsessed data storage and server disciplines.

Today, ownership of backup responsibilities at most companies falls to either the storage or server groups. Bundling backup into storage seems to be the most prevalent direction, but that isn't necessarily any more appropriate than leaving backup responsibilities in the server group.

What not to say
Keep these points in mind when discussing service levels and service-level agreements:
  • Avoid denigrating names. No one buys a "small" coffee or travels in "steerage" anymore, and no one wants "class D" or "Tier-4" storage. Be like Starbucks and sell "big," "bigger" and "biggest," or "silver," "gold" and "platinum."
  • Focus on service. They're "customers," not "users."
  • Keep it positive. Focus on the ways in which the service will meet requirements rather than on what isn't included.
  • Be honest. Make sure customers are aware of the limitations of the service and are realistic about what to expect, especially in a disaster recovery situation.

Staff dedicated to backup
The single most critical requirement for good backup service is that it be given adequate resources and focus, no matter where it falls in the corporate hierarchy. Because backup isn't in the critical path, shared backup resources will always be distracted by production-impacting issues and can never be expected to deliver a high level of service. Therefore, the first requirement for backup SLAs is the assignment of dedicated backup management and staff.

In the early 20th century, automobile manufacturers discovered that it was impossible to deliver a completely customized vehicle to all customers, so mass production with minimum customization was born. The same thought process applies to all IT services: A minimal set of standard infrastructure services must be defined to support end users. Think tiered storage and you're on the right track, but backup is slightly different. Instead of offering various technologies for backup, you'll offer various service levels. I'll describe the key service levels shortly, but the main point is that backup service offerings must be standardized as much as is practical.

For the service provider model to work, the backup service group must establish successful management practices and metrics to prove success. Backup is a highly repeatable discipline, much more so than most elements of storage management. Standard backup procedures must be uncomplicated; simply write down what your management team does on a good day, week or month, and attempt to follow these procedures every day thereafter. Once this documentation exists, fine-tuning can begin, which is an opportunity for everyone on the team to say how their job can be improved.

Although there's a wide variety of technical backup metrics, don't forget to create a set of key performance indicators for management processes. How quickly must you respond to user requests? How will you account for partial backup job success? Is a backup job completed outside the desired window still a success? Technical issues like these must be resolved, and the answers can vary widely depending on what end users expect from the service.
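The questions about partial success and late completion can be made concrete by writing the policy down as code. The following is a minimal sketch, not a prescribed method: the `BackupJob` fields, the `Outcome` names, and the `min_complete` and `late_counts` policy knobs are all hypothetical, and the real values would come from your backup tool and from what your customers agree to.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional, Sequence


class Outcome(Enum):
    SUCCESS = "success"
    LATE = "completed outside window"
    PARTIAL = "partial"
    FAILED = "failed"


@dataclass
class BackupJob:
    # Hypothetical fields; map them to whatever your backup tool reports.
    client: str
    finished: Optional[datetime]  # None if the job never completed
    files_total: int
    files_backed_up: int


def classify(job: BackupJob, window_end: datetime,
             min_complete: float = 1.0) -> Outcome:
    """Classify one job under an agreed policy.

    min_complete is the fraction of files that must be captured for the
    job to count as complete (an assumed policy knob, default 100%).
    """
    if job.finished is None:
        return Outcome.FAILED
    ratio = job.files_backed_up / job.files_total if job.files_total else 1.0
    if ratio < min_complete:
        return Outcome.PARTIAL
    if job.finished > window_end:
        return Outcome.LATE
    return Outcome.SUCCESS


def success_rate(outcomes: Sequence[Outcome], late_counts: bool) -> float:
    """Share of jobs counted as successful; late_counts answers the
    question of whether a job finished outside the window still counts."""
    if not outcomes:
        raise ValueError("no jobs to report on")
    good = {Outcome.SUCCESS, Outcome.LATE} if late_counts else {Outcome.SUCCESS}
    return sum(o in good for o in outcomes) / len(outcomes)
```

The point of the two knobs is that the same night of backups can be reported as 50% or 25% successful depending on the policy, so the policy has to be agreed on before the numbers are published.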

Separate what from how
Perhaps the most important line to draw when documenting service levels is an unequivocal demarcation of who gets to make which decisions. Put simply, your job as the service provider is to determine how to meet your customers' needs; in return, you must allow your customers to decide what those needs are without second-guessing them. This doesn't mean you can't guide them through the decision process and provide feedback on the impact of their choices, but the end users of the service must have the final say in what their service will look like.

Chaos ensues when these roles are confused. Giving a customer too much say will make the IT infrastructure unmanageable. Conversely, IT staffers who are out of touch with users are unlikely to build a system that will meet the real needs of the business.

When it comes to backup, the default (and incorrect) decision is to radically overprotect the business unit's data. Most IT staffers will overestimate the value of data, often citing nonexistent compliance policies. In one large company, for example, the outsourced IT staff decided to keep all backups forever to provide "better service," even though this was in direct opposition to the strategic decision of the company's policy chief. To their own detriment, IT members even tend to overspecify backup by building systems that can't function--leading them "back to the well" for additional hardware to meet requirements that never existed in the first place.

These pitfalls can be eliminated, or substantially reduced, when the "What vs. How" position is declared. IT can serve as a guide for the business, describing the merits and costs of the various options plainly but allowing the final decision to rest with those who know the data best. When this process takes place, the infrastructure team is invariably surprised by just how lax the actual requirements for data protection are on the whole. The majority of applications often need little more than regular incremental backups, and long-term retention of backup images may not be necessary (see "What not to say").

There are exceptions, however. Certain applications may need transaction-level data protection and long-term retention. But the default backup scheme applied across the board at most companies (daily backups, weekly fulls and daily, monthly and annual retention of backups) fails to meet this requirement, just as it overprotects the majority of data. Currently, most data isn't protected correctly. If IT can get valid requirements from the business, existing backup resources could be reallocated and everyone could get what they need.

Simple steps to create a backup SLA
Creating service-level agreements (SLAs) can be a nightmare, especially when it comes to backup. Follow this simple recipe and you won't go wrong:
  1. Organize backup as a service, focusing on standard offerings that will meet the needs of the majority of the business.
  2. Declare that you'll be responsible for determining how to meet your customers' requirements and won't interfere with what they need.
  3. Document your service offerings in concrete, relevant terms and avoid technical jargon.
  4. Suggest a service level for each business area, discuss its merits and costs, and accept your customer's decision.
  5. Proactively report on your compliance with the SLA.

Speeds and feeds mindset
One of the greatest failings in IT is its focus on technology. Too often, we get so wrapped up in "speeds and feeds" that we lose sight of what the expensive technology is supposed to do for the business. This mindset is common in service negotiations: IT-written SLAs become ultra-technical, losing their relevance to the business. When developing an SLA, be sure to write the service requirements in terms that are less technical so more business-focused people will understand them.

Two terms always emerge when trying to rephrase data protection as business-speak: recovery point objective (RPO) and recovery time objective (RTO). The former refers to the amount of data, usually expressed as a span of time, that can be lost in a recovery, while the latter is the amount of time allowed to restore data. But these terms are inadequate for the SLA discussion with business units. Instead of relying on imprecise jargon like RPO and RTO, it's much more beneficial to bring the data protection discussion down to earth (see "Simple steps to create a backup SLA").

The primary purpose of backup is to recover data after an outage, while its secondary purpose is to recover archived historical data after an extended time. Your discussion of service levels should focus on the following questions without veering into the realm of jargon--simply talk about the data.

  • What applications do you use?
  • Are there any parts of the day, week, quarter or year when these applications are unused or particularly busy?
  • How important are these applications to your job and to the company as a whole?
  • What impact would there be if a moment, minute, hour, day or week of this data was lost?
  • How long could you or the company wait for this data to be restored in the event of a serious computer problem?
  • Is there any benefit in having an historical copy of the data from a day, week, month, quarter or year ago?
  • How much money can the company realistically afford to spend to achieve these goals?

It's best to have your own ideas about the answers to each point before having this conversation. You could even phrase these questions in the form of a statement, such as "I understand that e-mail and the CRM database are particularly important to your department." Just make sure you don't pre-empt an honest dialog or let your own notions of the answers interfere with the progression of this negotiation.

Once this discussion takes place, most of the elements of your SLA will have fallen into place. Take your notes from the meeting and fill in the SLA form yourself. Then return with the completed SLA and verify the elements in plain language such as, "We decided that a week's worth of content in the recruiting database could be re-entered if needed."
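Filling in the form after the meeting amounts to matching the customer's plain-language answers against your standard offerings. The sketch below shows one way to do that under stated assumptions: the tier names echo the "silver," "gold" and "platinum" naming suggested earlier, but the catalogue numbers, the `Answers` fields and the `suggest_tier` helper are all hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str
    max_data_loss: timedelta     # most data, as a span of time, that could be lost
    max_restore_wait: timedelta  # longest wait for a restore
    retention: timedelta         # how far back historical copies go


# Hypothetical catalogue, cheapest first. The numbers are illustrative only.
CATALOGUE = [
    Tier("silver", timedelta(days=7), timedelta(days=3), timedelta(days=30)),
    Tier("gold", timedelta(days=1), timedelta(days=1), timedelta(days=90)),
    Tier("platinum", timedelta(hours=1), timedelta(hours=4), timedelta(days=365)),
]


@dataclass(frozen=True)
class Answers:
    # What the customer said in the meeting, in their own terms.
    application: str
    tolerable_loss: timedelta
    tolerable_wait: timedelta
    history_needed: timedelta


def suggest_tier(answers: Answers,
                 catalogue: Sequence[Tier] = CATALOGUE) -> Optional[Tier]:
    """Return the cheapest standard tier that meets every stated need,
    or None if the application needs a nonstandard arrangement."""
    for tier in catalogue:
        if (tier.max_data_loss <= answers.tolerable_loss
                and tier.max_restore_wait <= answers.tolerable_wait
                and tier.retention >= answers.history_needed):
            return tier
    return None


recruiting = Answers("recruiting database", timedelta(days=7),
                     timedelta(days=3), timedelta(days=30))
suggestion = suggest_tier(recruiting)
print(suggestion.name if suggestion else "needs a custom arrangement")
```

The function only produces a suggestion to bring back to the customer; the decision about which tier to buy remains theirs. A result of `None` flags one of the exceptions, such as an application that needs transaction-level protection, that the standard offerings don't cover.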

The four parts of a service provider model
There are four essential elements to the creation of a service provider model for IT:

Service. A service focus separates the "what" from the "how," giving IT the latitude necessary to make architecture decisions while ensuring the business gets the service it needs. A service-level agreement between IT and technology users provides a pragmatic basis for aligning IT capabilities with business objectives.

Standards. Standard services are critical for the scalability and supportability of an IT environment. A stratification of service offerings allows different service-level requirements to be satisfied at appropriate cost levels.

Practices. Mature management practices (that include the processes, policies and organizational model) are critical in creating a dependable service. As processes mature, they become repeatable, documented and measured, and are continuously reviewed for improvement.

Metrics. External and internal metrics define the progress of the service model. These metrics can be used to develop a cost model to help business units understand the true cost of service delivery.

Let users know what's happening
Once the business unit signs off on the SLA, IT must meet the objectives laid out. This will often require some reengineering and reconfiguration, especially when the service is introduced. But as noted earlier, an equally important element is determining key metrics and reporting the success of the service.

The best advice I was ever given as an IT infrastructure manager was to proactively create metrics to show the world how well everything was working. This feedback is critical to creating a lasting partnership between IT and the rest of the business. Once again, feedback to the business units doesn't have to be more than a few plain-spoken, easily understandable metrics--reserve the technical jargon for consumption by IT staff. While a backup system operator will be keenly interested in call turnaround, escalation time, job success rates and the rest, your customers only care that the service is working.

Remember, backup is out of the critical path and issues can be swept under the rug. This makes proactive reporting even more critical--unless everyone is constantly made aware of the backup system, lingering issues can become critical service lapses.
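A customer-facing report can be as simple as one plain sentence per application. Here is a minimal sketch, assuming you already have a per-application list of nightly pass/fail results; the `customer_summary` name and the input shape are hypothetical. Applications with missed nights are listed first so that lingering issues stay visible.

```python
from typing import List, Mapping, Sequence


def customer_summary(nightly_results: Mapping[str, Sequence[bool]]) -> List[str]:
    """Turn per-application nightly results (True = protected that night)
    into plain-language lines, worst-performing applications first."""
    def missed(item):
        _, nights = item
        return len(nights) - sum(nights)

    # Sort by number of missed nights (descending), then by name.
    ordered = sorted(nightly_results.items(),
                     key=lambda item: (-missed(item), item[0]))
    lines = []
    for application, nights in ordered:
        total = len(nights)
        protected = sum(nights)
        line = f"{application}: protected on {protected} of {total} nights"
        if protected < total:
            line += f" ({total - protected} missed)"
        lines.append(line)
    return lines


report = customer_summary({
    "E-mail": [True] * 30,
    "CRM database": [True] * 28 + [False] * 2,
})
print("\n".join(report))
```

The detailed job-level metrics still exist behind this summary, but they stay with the backup team.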

One exception to the simple feedback metrics proposed here is when a restore is requested. Instead of merely bringing the data back, accompany it with a report showing how the request was handled, as well as some details on the data. Your customers will want to know the date and time of the recovery source and may need an explanation as to why they had to wait for their data, especially if it stretches into a multiday process.
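The restore report described above needs only a handful of fields. The following sketch assumes a hypothetical `RestoreRecord` with timestamps taken from your request-tracking and backup systems; none of these names come from a real product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RestoreRecord:
    # Hypothetical fields for illustration.
    application: str
    requested: datetime
    completed: datetime
    source_backup: datetime  # when the backup used for the restore was taken
    delay_reason: str = ""   # optional plain-language explanation


def restore_report(record: RestoreRecord) -> str:
    """Build a short plain-language report to accompany restored data."""
    hours = (record.completed - record.requested).total_seconds() / 3600
    lines = [
        f"Restore of {record.application}",
        f"Data restored as of: {record.source_backup:%Y-%m-%d %H:%M}",
        f"Requested: {record.requested:%Y-%m-%d %H:%M}",
        f"Completed: {record.completed:%Y-%m-%d %H:%M} ({hours:.1f} hours later)",
    ]
    if record.delay_reason:
        lines.append(f"Why it took this long: {record.delay_reason}")
    return "\n".join(lines)
```

Leading with the date and time of the recovery source tells customers exactly which version of their data they are looking at before they start working with it.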
