This post is a continuation of my previous one on The Four Foundations of SaaS. Here, I'll discuss an approach that has proven effective in improving Stability for SaaS products. This metric is tailored to your business, its systems and its customers, and it works across all functions to reduce disruption and dissatisfaction amongst your staff, customers, and shareholders.
This metric was born during a period of instability with the service platform at a business I worked at some years ago. The CEO was “bewildered” (his words) by the range, variety and apparent “gelatinous” nature of what was causing numerous, frustrating outages. To his perception the system kept failing, and each failure felt and looked the same. For each, though, there was always a specific cause: some were repeats, some were novel, and others were seemingly out of our control, having been perpetrated by third party providers. From the CEO’s perspective something more transparent was needed to gauge progress towards eliminating these issues, and he was right.
I sat down with the team to thrash out a measure that would capture the myriad sources of these incidents but portray them as one, easy to understand metric. After some mental wrestling, and more than a few expletives, the Impact Score was born.
The Impact Score is a single metric, per product, service, or component (take your pick, or mix them) that captures the essence of an incident, and scores it. The score is determined according to the timing, duration, breadth and level of overall disruption the incident creates. It’s hard to make this score generic as every product is different and every cohort of users has a different peak and different sensitivities. Some products have no peak; it is always peak.
To create an impact score model, you will need to be able to understand the key elements of how your product is used.
These can be obtained from your ticket analysis via your helpdesk or through direct feedback from customers and/or your Product team. You may already know when your sensitive times are in any 24-hour period, and when your quiet times are. Outages or incidents that occur during these prime time windows need to be weighted more heavily, to drive up the score of the incident. Incidents that happen at quieter times will score lower. Obviously, if you have no peak, this is pretty straightforward.
Categorising your incidents like this is not a way to diminish their impact, but to ensure the correct amount of weighting is added to the score. If the incident is a full outage where the system is down, then it will be weighted to the maximum. If it is a partial outage or performance degradation, it will be weighted less heavily but still based on time of day. Also, some service components may be down without impacting customers per se. For example, a data sync or backup might not run. This exposes risk rather than impact and can usually be recovered, but it shouldn’t be overlooked and must be managed.
The profile of the incident is how it impacts your customers and your business. If it’s an outage but it is only annoying or limits productivity for a little while, then the incident might score less than another that prevents your customers, or your own business, from generating or collecting revenue. Furthermore, if your service is relied upon over a wide user base, things like the threat of press coverage may drive the profile of the incident.
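To make the weighting idea concrete, here is a minimal Python sketch of one way the three components could be combined with duration. Everything in it is an assumption for illustration: the names `Incident` and `impact_score`, the weight tables, the labels in them and the choice to simply multiply the factors together are all made up, and your own model will almost certainly look different.

```python
from dataclasses import dataclass

# Illustrative weights only: every number and label here is made up and
# should be replaced with values that reflect your own product and customers.
TIMING_WEIGHT = {"peak": 3.0, "shoulder": 2.0, "quiet": 1.0}
CATEGORY_WEIGHT = {
    "full_outage": 1.0,     # system down: weighted to the maximum
    "partial_outage": 0.6,
    "degraded": 0.4,        # performance degradation
    "latent_risk": 0.1,     # e.g. a missed backup: risk rather than impact
}
PROFILE_WEIGHT = {"annoyance": 1.0, "productivity": 1.5, "revenue": 3.0}


@dataclass(frozen=True)
class Incident:
    duration_minutes: float
    timing: str    # key into TIMING_WEIGHT
    category: str  # key into CATEGORY_WEIGHT
    profile: str   # key into PROFILE_WEIGHT


def impact_score(incident: Incident) -> float:
    """Duration in hours multiplied by the three weights (one possible model)."""
    hours = incident.duration_minutes / 60
    score = (
        hours
        * TIMING_WEIGHT[incident.timing]
        * CATEGORY_WEIGHT[incident.category]
        * PROFILE_WEIGHT[incident.profile]
    )
    return round(score, 2)


peak_outage = Incident(90, "peak", "full_outage", "revenue")
quiet_slowdown = Incident(90, "quiet", "degraded", "annoyance")
print(impact_score(peak_outage))     # 13.5
print(impact_score(quiet_slowdown))  # 0.6
```

Two incidents of identical duration end up with very different scores, which is the point of the weighting. Breadth (how much of the user base is affected) could be added as a further multiplier in the same way.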
These three core components (timing, category and profile) make up the immediate assessment of the incident and, in the moment of an incident, will generally be condensed by your incident management team to a Priority assessment: P1, P2, P3 or P4. This is useful during the incident for people to understand how to prioritise focus, but there will be additional elements that need to be added to create the Impact Score analysis once the incident is over and the retrospective or root cause has been explored. For the impact score to work it is essential that you capture some metadata about the incident itself.
What went wrong and where? Was it a code release, a storage issue, a database problem, a network problem, or something else? The range here is highly specific to your business and your service infrastructure. Don’t be lured into a false sense of security if you’re using public cloud infrastructure either: this is your platform, and how it operates is your responsibility. Relying blindly on cloud-native protections is not a good idea when it’s your code running on that infrastructure.
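As a rough sketch of what that metadata might look like when captured in code, the Python below defines a hypothetical `IncidentRecord`. The field names and the `Origin` values are my assumptions for illustration, drawn from the kinds of causes mentioned in this post; your own list of origins should come from your own infrastructure and incident history.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Origin(Enum):
    # Hypothetical origin categories; extend or replace to suit your platform.
    CODE_RELEASE = "code release"
    STORAGE = "storage"
    DATABASE = "database"
    NETWORK = "network"
    THIRD_PARTY = "third party"
    HUMAN_ERROR = "human error"
    SECURITY = "security"


@dataclass(frozen=True)
class IncidentRecord:
    incident_id: str
    started_at: datetime
    resolved_at: datetime
    priority: str     # P1 to P4, as assigned during the incident
    origin: Origin    # where it went wrong, confirmed in the retrospective
    component: str    # which product, service or component was hit
    root_cause: str   # free-text summary from the root cause analysis

    def __post_init__(self) -> None:
        # Reject records that would silently corrupt later reporting.
        if self.priority not in {"P1", "P2", "P3", "P4"}:
            raise ValueError(f"unknown priority: {self.priority}")
        if self.resolved_at < self.started_at:
            raise ValueError("resolved_at is earlier than started_at")

    @property
    def duration_minutes(self) -> float:
        return (self.resolved_at - self.started_at).total_seconds() / 60
```

Keeping the origin as a fixed set of values rather than free text is a deliberate choice in this sketch, since it makes grouping and reporting by origin much easier later on.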
So how does this score look and how do you go about setting up a meaningful number and target for month to month tracking? Consider this table of metrics. The numbers only matter in the context of your business and these are made up for illustration. What is important is to ensure you generate an algorithm that truly reflects the nature of the impact for each incident.
Here you can see that April, May and June were all pretty bad; some significant infrastructure wobbles with third party and human error playing interference. But the highlight is the event in November which eclipsed all other problems.
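|Month|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|Jan|Feb|Mar|FY Total|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Impact Score|150|139|145|109|86|95|35|265|45|40|55|25|1,189|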
If we assume a target impact of 50 per month, the management report graph for this product would look like this.
A declining trend towards target, time for the target to be re-baselined.
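For anyone who wants to reproduce this kind of tracking, here is a small Python sketch using the illustrative monthly scores from this post and the target of 50. The month labels assume a financial year running April to March, which is my reading of the commentary above, and comparing the two half-year averages is just one simple, assumed way to express the trend.

```python
from statistics import mean

# Illustrative monthly impact scores from this post; an April-to-March
# financial year is assumed for the month labels.
MONTHS = ["Apr", "May", "Jun", "Jul", "Aug", "Sep",
          "Oct", "Nov", "Dec", "Jan", "Feb", "Mar"]
SCORES = [150, 139, 145, 109, 86, 95, 35, 265, 45, 40, 55, 25]
TARGET = 50

monthly = dict(zip(MONTHS, SCORES))

fy_total = sum(SCORES)
on_target = [month for month, score in monthly.items() if score <= TARGET]
worst_month = max(monthly, key=monthly.get)

# A crude trend check: compare the average of each half of the year.
first_half_avg = mean(SCORES[:6])
second_half_avg = mean(SCORES[6:])

print(fy_total)                   # 1189
print(on_target)                  # ['Oct', 'Dec', 'Jan', 'Mar']
print(worst_month)                # Nov
print(round(first_half_avg, 1))   # 120.7
print(round(second_half_avg, 1))  # 77.5
```

Even with November’s spike included, the second-half average is well below the first, which is the declining trend the management graph is meant to show.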
How you generate a score that is relevant and meaningful for your products is not easy to explain. There will be a certain amount of adjustment, over time, refining the calculations and weightings so that they begin to “feel” right to your teams. But once embedded and routinely produced, you will have a metric that allows you to understand impact across a wide range of problem types, origins and products.
Using this score to set goals and to provide evidence of improvement is good for your teams, your customers and your shareholders.
<aside> 💬 Footnote: I have used this metric successfully over the last 10 years and it has been refined significantly from its original format. I have used it to improve the Stability of some large and high performing platforms. The approach is technology agnostic and can be used outside of software platforms to fit most production or quality monitoring situations. The best return on this approach is obtained when it is part of the day-to-day operating model of the business to collect this data, preferably through the system used for incident management and reporting.
Clearly there is work to be done across all categories, but the high impact ones are Infrastructure, Security and Third Party. This helps with the narrative and justification for effort to be expended on these areas, as resolving the problems, assuming they’re systemic, will drive down impact overall.
This view is also good as it demonstrates a declining trend in impact and highlights how latent risks can create big problems: in this instance, poor readiness to respond to a security incident, and a rushed software patch creating another incident.
There are many ways to cut an impact score and with more than one product it can be used to show increasing availability and therefore Stability for each.
Given that an incident can occur for a variety of reasons and originate from various sources, this provides a good cross-functional measure across your supply side teams and builds a framework in which everyone works towards one goal: in this illustration, a sub-50 month.
Finally, the necessary analysis and technique used for root cause analysis for each incident will generate a wealth of knowledge and understanding amongst your technical teams about how and why things go wrong. Habitualising these behaviours will ignite a valuable continual improvement culture.
Being able to show clearly where problems originate is important, as it helps justify the expenditure or prioritisation needed to reduce impact overall.
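As a sketch of how impact might be cut by origin, the Python below totals scores per category and ranks them. The incident list is entirely hypothetical sample data, not the figures behind this post, and `impact_by_origin` is a name I have made up for the illustration.

```python
from collections import defaultdict

# Hypothetical (origin, impact score) pairs, invented for illustration only.
INCIDENTS = [
    ("Infrastructure", 40),
    ("Security", 120),
    ("Third Party", 60),
    ("Code Release", 15),
    ("Infrastructure", 35),
    ("Human Error", 10),
    ("Third Party", 20),
]


def impact_by_origin(incidents):
    """Total the impact per origin and rank from highest to lowest."""
    totals = defaultdict(int)
    for origin, score in incidents:
        totals[origin] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked = impact_by_origin(INCIDENTS)
grand_total = sum(score for _, score in ranked)

for origin, score in ranked:
    share = 100 * score / grand_total
    print(f"{origin:<15}{score:>5}{share:>7.1f}%")
```

A ranked view like this makes it easier to argue for where effort should go first, since the top few origins usually account for most of the impact.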
<aside> 🏢 Company Information
© Wolfen Consulting Ltd. 2023 All rights reserved.