Service Level Agreements (SLAs) are used to define what’s an acceptable level of service delivered by IT to the business unit. In theory, and too often in practice, if the SLA numbers are good, then by definition the business user has to be happy. If not, he’s being unreasonable, because he’s getting what was agreed. As a result, the SLA is viewed by both the business unit and IT as the way to demonstrate that IT is doing its job, regardless of the actual business service performance. If any one of your business users would agree with this, then it’s time to re-examine your SLA process.
Most SLAs are too detailed and complicated, with dozens or hundreds of metrics, mostly technical measures taken at the IT provisioning end: application uptime, memory faults, network availability, etc. These are meaningless to the business user, who sees IT services like a car: “When I turn the key, it should start and take me where and when I want to go. And did I say now?” For example, a business unit with a critical real-time data collection application experienced network congestion for most of a day. The day’s work was lost and the business user was understandably upset. According to the SLA, the network was available because the packet discard level was within acceptable limits, so the incident was classified as medium-severity. As IT only follows up on high-severity incidents, there was no root cause analysis and nothing was done to avoid a recurrence. The business user is unhappy: his e-mail and other applications stayed up, but a day’s worth of work was lost, and IT is doing nothing to prevent it from happening again. IT is also upset: the user is complaining about a network outage that didn’t happen. Services were delivered and met the SLA targets.
The result is anything but desirable. The user views IT as failing to deliver the value promised and as not caring about the business or about cleaning up its act. Rebuffed by IT, the business user will pursue his concerns through other channels: the CFO, the CEO, or the annual budget planning cycle. IT views the client as an unreasonable complainer: nothing will make him happy, so why try?
The incident is a composite, but it highlights how difficult it is to develop and maintain a good working relationship between the business user and IT, one with the trust and give-and-take needed to keep communications open through the inevitable mistakes and problems. That’s what an SLA should provide: it defines the services and the expectations for how those services are to be delivered, and it sets out the process to be used when something isn’t working or there is a disagreement about what “working” means. Both parties have to be confident that the SLA measures the things that matter to the business, in the business’s terms, and confident in the processes for dealing with disagreements and problems.
The challenge is in how to define the SLA in business terms observable by the business user, but easily measurable, usually by IT from the data centre. That’s where less is more. Too many measures create one of two impressions: IT can’t ever get everything right, or IT got 99 of 100, so what’s the problem?
And what and how to measure? The accounts payable application is only needed during business hours, but the corporate Web store is needed 24/7. If accounts payable fails for 15 minutes, it’s an inconvenience, but it can probably be managed through normal processes in AP. If IT is using a charge-back system, that service level will be reflected in the cost of the services. However, if the Web store is unavailable for an hour, especially at peak times, it is a lost sales opportunity, with the risk that both existing and new customers may move on, never to be seen again. SLA measures have to be sensitive to the need, the business risk, the reputational risk, and the costs to manage the service.
IT often contributes to the problem by agreeing to host critical applications on infrastructure that can’t support them because the business unit (or the IT central budget) couldn’t afford to do it right. Instead of adjusting the SLA, it’s “hope for the best”, which is not a long-term success strategy. An SLA with 99.9 per cent availability allows about 10 minutes of downtime every week, and that downtime will occur at the worst possible time (Murphy’s Law). If that’s not going to meet the business needs, then the SLA needs to have different standards for different hours, and the infrastructure must be able to deliver them during those times.
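The 10-minute figure follows directly from the availability percentage. As a minimal sketch of that arithmetic (the function name and the list of targets are illustrative assumptions, not part of any particular SLA tool), an availability target can be turned into a weekly downtime allowance like this:

```python
# Sketch: convert an availability target into allowed downtime per week.
# Assumes a 24/7 measurement window; names here are hypothetical.

MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def weekly_downtime_minutes(availability_pct: float) -> float:
    """Return the minutes of downtime per week an availability target allows."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return MINUTES_PER_WEEK * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    # Illustrative targets only; 99.9 is the one discussed in the text.
    for target in (99.0, 99.9, 99.99):
        minutes = weekly_downtime_minutes(target)
        print(f"{target}% availability allows {minutes:.1f} minutes of downtime per week")
```

Putting the allowance in minutes, rather than in "nines", is one way to make the target something the business user can weigh against a peak trading hour.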
Another SLA “feature” that erodes business user trust is carving out maintenance times from the SLA measurements. If an application must be unavailable for four hours each week, then the SLA should be transparent and include it. Four hours is about 2.4 per cent of a 168-hour week, so the 99.9 per cent becomes roughly 97.5 per cent per week. That looks terrible when 24/7 is the expected norm, but it clearly shows the business unit owner the limits of the infrastructure she’s willing to pay for.
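The drop from 99.9 to roughly 97.5 per cent can be checked with a short calculation. The sketch below assumes the 99.9 per cent target applies only to the hours outside the four-hour maintenance window, and then reports availability against the full 168-hour week; the function name is hypothetical.

```python
# Sketch: availability over the whole week when the maintenance window
# is counted as downtime instead of being carved out of the measurement.

HOURS_PER_WEEK = 7 * 24  # 168 hours


def transparent_availability_pct(target_pct: float, maintenance_hours: float) -> float:
    """Availability against the full week, with maintenance counted as downtime.

    Assumption: target_pct is met during the hours outside the maintenance window.
    """
    if not 0.0 <= maintenance_hours <= HOURS_PER_WEEK:
        raise ValueError("maintenance_hours must fit within one week")
    scheduled_hours = HOURS_PER_WEEK - maintenance_hours
    available_hours = scheduled_hours * target_pct / 100.0
    return 100.0 * available_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    weekly = transparent_availability_pct(99.9, 4)
    print(f"Availability including maintenance: {weekly:.1f}% per week")
```

Simply subtracting the maintenance share (about 2.4 points) from 99.9 gives nearly the same answer; either way, the headline number lands near 97.5 per cent.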
The bottom line is that SLAs are not budgets, numbers games, or IT’s CYA. SLAs define a shared set of values and agreed processes that establish and maintain a “no surprises” relationship between the business unit and IT. Both must have a shared understanding of what the services are and how they should behave. They also need to understand that they won’t always agree and that they need to work out their differences within an established framework. IT has to be bilingual, speaking both the language of business and IT jargon, sensitive to the differences in culture where the languages are spoken.
The hard part for many of us is that SLAs are already in place, and they’re complex, technical, and difficult to relate to the actual business processes. Where to start the change? With that cranky client that you know you can’t satisfy. You can find solutions to the technology problems that are hurting his business. It’s only a question of time, money, and/or focus. The business user needs to be reassured that IT is being responsive within its resource constraints, and needs to understand that the fix may require him to change his expectations based on cost. While not a panacea, rebuilding the SLA offers the opportunity to establish a framework for working together in both good and bad times, starting by resolving long-standing frustrations and moving on to making the business run as well as it can.