How to Use Error Budgets to Protect Service Reliability

An "error budget" describes the amount of time a system can be offline before the downtime has tangible consequences for your business. Error budgets are used alongside service level agreements (SLAs) and service level objectives (SLOs) to tell organizations when a system's unavailability has tipped into a breach of contract.

Incorporating error budgets into your application reliability strategy provides a methodical approach for balancing risk-taking with stability. Error budgets acknowledge that occasional outages, buggy deployments, and simple mistakes are inevitable. Their role is to tell you how many of these incidents you can endure. The available error budget also decides whether your next task is building a new feature or tackling another bug fix.

What Is an Error Budget?

A service's error budget is simply a measure of the maximum time it can be in a failed state without incurring contractual, financial, or regulatory penalties. The available error budget is derived from the uptime figure you commit to in the SLAs you send to customers. You could be more stringent by basing your error budget on an SLO instead.

SLA - The uptime you publicly commit to, such as 99.95%. Most organizations using SLAs will be contractually obliged to recompense customers if the service's actual uptime drops below this figure.
SLO - The uptime you aim for internally, such as 99.99%. This means an uptime figure between 99.95% and 99.99% is undesirable and provides an indication that reliability improvements are required. It doesn't make you liable to recompense customers, however.
Error budget - A calculation of the amount of downtime permissible by an SLA or SLO.

You can calculate your error budget using simple multiplication: multiply the length of the measurement period by the fraction of downtime the SLA permits (100% minus the uptime target). As an example, an SLA that states your service will have 99.99% availability over the course of a year (taken here as 365.25 days) gives you a total error budget of 52 minutes and 35 seconds. An outage that lasts 30 minutes won't directly affect your business. One that lasts an hour will exceed the error budget and necessitate compensation for customers.

Here are a few other examples:

SLA %	Annual Error Budget	Monthly Error Budget
99.99%	52 minutes, 35 seconds	4 minutes, 23 seconds
99.95%	4 hours, 23 minutes	21 minutes, 54 seconds
99.90%	8 hours, 46 minutes	43 minutes, 49 seconds
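
The arithmetic behind the table can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `error_budget` is hypothetical, and the code assumes a 365.25-day year with a month taken as one twelfth of that, which matches the table's figures to within a second.

```python
from datetime import timedelta

# Assumption: a year is 365.25 days; a "month" is one twelfth of that year.
YEAR = timedelta(days=365.25)


def error_budget(target_percent: float, period: timedelta = YEAR) -> timedelta:
    """Return the downtime permitted by an availability target over a period."""
    # The permitted downtime is the part of the period not covered by the target.
    allowed_fraction = 1 - target_percent / 100
    return timedelta(seconds=period.total_seconds() * allowed_fraction)


for target in (99.99, 99.95, 99.90):
    annual = error_budget(target)
    monthly = error_budget(target, YEAR / 12)
    print(f"{target}%: annual {annual}, monthly {monthly}")
```

Passing a different `period` gives the budget for any other measurement window, such as a quarter or a 30-day rolling window.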

Error budgets can be derived from any kind of SLA, not just uptime. Successful request counts, performance measurements, and resource utilization metrics are often used as SLAs and SLOs too. An SLA that states 99% of requests will be successfully handled each day allows 100 failures in every 10,000 requests, so its error budget is exceeded if 10,000 requests have been made and fewer than 9,900 of them have succeeded.
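
A request-based budget can be checked the same way. The sketch below is illustrative only: `allowed_failures` and `budget_exhausted` are hypothetical names, and the target is passed as a string so that `Fraction` can parse it exactly, which avoids floating point rounding at the boundary.

```python
import math
from fractions import Fraction


def allowed_failures(total_requests: int, target_percent: str) -> int:
    """Number of requests that may fail before the target is missed."""
    # Fraction("99.9") is exact, unlike the float 99.9.
    target = Fraction(target_percent)
    return math.floor(total_requests * (100 - target) / 100)


def budget_exhausted(total_requests: int, successful_requests: int,
                     target_percent: str) -> bool:
    """True when more requests have failed than the target permits."""
    failed = total_requests - successful_requests
    return failed > allowed_failures(total_requests, target_percent)


# The 99% example from the text: 10,000 requests permit 100 failures.
print(allowed_failures(10_000, "99"))
print(budget_exhausted(10_000, 9_900, "99"))  # exactly on target
print(budget_exhausted(10_000, 9_899, "99"))  # one failure too many
```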

Error Budgets and Engineers

Error budgets aren't just an easier way of working out when your SLA's been breached. They're also used to set the priorities of your development teams. An error budget is a control mechanism that determines the kind of work to focus on.

When your error budget is full, developers can work without restriction. They can tackle new features, make sweeping changes to systems, and apply risky migrations to production environments. These actions have the potential to introduce bugs and flaky behavior, depleting the error budget. The error budget is "spent" through this innovation.

When the available error budget reaches an agreed threshold, developers have to take action to stop it falling any further. Engineering efforts should pivot towards bug fixes and optimizations that will improve reliability and stabilize the service. This lessens the risk that another problem will occur and exhaust the error budget entirely.
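
One way to turn this policy into something a team can automate is a small function that maps the remaining budget to a recommended focus. This is a sketch under stated assumptions: the `Focus` categories, the `engineering_focus` name, and the 25% warning threshold are all hypothetical, and the real threshold is whatever your team agrees on.

```python
from enum import Enum


class Focus(Enum):
    FEATURES = "ship features and take risks"
    RELIABILITY = "prioritize bug fixes and stability work"
    FREEZE = "freeze non-essential changes"


def engineering_focus(budget_seconds: float, consumed_seconds: float,
                      warning_threshold: float = 0.25) -> Focus:
    """Suggest a focus from the fraction of error budget still available.

    warning_threshold is an assumed example value, not a recommendation.
    """
    if budget_seconds <= 0:
        raise ValueError("budget_seconds must be positive")
    remaining = 1 - consumed_seconds / budget_seconds
    if remaining <= 0:
        return Focus.FREEZE
    if remaining <= warning_threshold:
        return Focus.RELIABILITY
    return Focus.FEATURES


# A 99.99% annual budget is roughly 3,156 seconds.
print(engineering_focus(3156, 600).value)
print(engineering_focus(3156, 2500).value)
print(engineering_focus(3156, 3200).value)
```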

It's important to recognize that error budgets are supposed to be consumed, up to the warning threshold. They promote developer autonomy by allowing engineers to take risks and innovate on their own initiative. Error budgets simultaneously provide guard rails that prevent developers from fixating on forward progress at the expense of the service's reliability. A draining error budget protects the business by telling developers when they need to refocus on stability.

What Happens When an Error Budget Is Spent?

A fully spent error budget can occur because you've moved through a period of high innovation or you've experienced a succession of long outages. There are many chains of events which could lead to an error budget being depleted; what matters is how you respond when it happens.

Running out of error budget shouldn't be taken lightly. You've got no spending power left so you shouldn't invest in further innovation. An error budget can be likened to a credit line from your customers: spending beyond your limit will worsen the situation and could severely harm your brand's outlook.

Freezing all non-essential work should be your first response to going over budget. This needs to happen as soon as the budget is exhausted. Block new deployments from reaching production, reallocate developers who are building new features, and evaluate the quickest way to restore the service. Your error budget will naturally recover as time passes after the incident is resolved, because older downtime eventually drops out of the measurement period.
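
The recovery of a budget over time is easiest to see with a rolling measurement window. The sketch below assumes that model, which your SLA may or may not use; the function names, the 30-day window, and the 20-minute budget are hypothetical example values.

```python
from datetime import datetime, timedelta


def downtime_in_window(incidents, now: datetime, window: timedelta) -> timedelta:
    """Sum the part of each (start, end) incident inside the window ending at now."""
    window_start = now - window
    total = timedelta()
    for start, end in incidents:
        overlap_start = max(start, window_start)
        overlap_end = min(end, now)
        if overlap_end > overlap_start:
            total += overlap_end - overlap_start
    return total


def remaining_budget(incidents, now: datetime, window: timedelta,
                     budget: timedelta) -> timedelta:
    """Budget left at a given moment; a negative value means it is overspent."""
    return budget - downtime_in_window(incidents, now, window)


# Assumed example: one 15-minute outage against a 20-minute, 30-day budget.
incidents = [(datetime(2024, 1, 10, 12, 0), datetime(2024, 1, 10, 12, 15))]
window = timedelta(days=30)
budget = timedelta(minutes=20)

# The day after the outage, most of the budget is gone.
print(remaining_budget(incidents, datetime(2024, 1, 11), window, budget))
# Once the outage is older than the window, the full budget is available again.
print(remaining_budget(incidents, datetime(2024, 2, 20), window, budget))
```

With a fixed calendar period instead of a rolling window, the budget would reset at the start of each period rather than recovering gradually.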

You should complete a retrospective upon resolution to analyze what happened. There could be opportunities to increase reliability by changing tools or improving your process. Enforcing more stringent code reviews, automatically running your test suite in CI pipelines, and using static analysis to spot common gotchas are three effective ways of quickly increasing code quality.

The Business Impacts of Regularly Spent Error Budgets

Regularly using up your error budget is a sign that your application's unstable and needs to be more resilient. A continual stream of SLA-breaching incidents will create a poor perception of your product. Users expect software to be reliably available when they need it. Customer confidence will be harmed when this isn't the case, which could cause you to lose out to competitors.

Although exceeding an error budget can happen for countless reasons, doing so repeatedly can hint at bigger problems in your organization. You could be trying to move too fast with an overly ambitious roadmap. This can put undue pressure on engineers and create an environment that's conducive to errors.

Error budgets might feel like blockers in naturally fast-paced organizations. Remembering the intention behind error budgets should help to keep everybody on board. They're a form of risk management that provides actionable metrics for deciding engineering priorities. Error budgets are there to protect your business from the negative impacts of incidents by telling you when to step back and slow down. Attempting to override or ignore them can jeopardize your service's future.

Summary

The most successful software solutions combine continual innovation with dependable stability. Many development teams struggle to balance these two competing concerns. Developers are often naturally forward-looking, whereas users want a familiar solution that they can depend on.

Error budgets are an effective mechanism for resolving this dilemma. They allow developers to innovate freely within fixed constraints that preserve service reliability. Error budgets protect the business from the impacts of SLA breaches by instructing engineers to refocus on stability as the amount of downtime increases.

You can implement error budgets by establishing an SLA or SLO and then calculating the amount of unavailability it permits. You'll also need to track the durations of new incidents so you know when your error budget is being consumed. Incident management platforms such as Opsgenie, PagerDuty, and Blameless can automatically capture this information and provide real-time alerts for error budget depletion events.

Using error budgets lets you build more reliable applications that consistently meet user expectations. Error budgets provide data to inform engineering decisions and balance innovation with stable operation. This creates the consistency that's missing in many of today's services.