SRE & Error Budgets - Reliability Is a Feature, Not an Afterthought
Helpful context:
- Monitoring & Observability - Knowing What Your System Is Doing Without Looking at Every Line
- Fault Tolerance - Building Systems That Survive Their Own Failures
- Microservices - Small Services, Large Coordination Problems
In 2003, Google was growing fast enough that the gap between what software engineers built and what the operations team could run was becoming a crisis. The operations team - the people responsible for keeping systems running - was drowning in manual work: provisioning servers, responding to alerts, managing deployments. The software engineers, meanwhile, were shipping features that made the operations problem harder. The two groups had different incentives and no shared language for the tension between them.
Ben Treynor Sloss, a software engineer at Google, was handed a small operations team and told to figure it out. His answer was to staff the team entirely with software engineers, give them an explicit mandate to automate themselves out of repetitive work, and hold them to engineering standards rather than operations conventions. He called this Site Reliability Engineering (SRE).
The insight was deceptively simple: if the people running systems are engineers who hate repetitive work, they will automate repetitive work. And if you measure reliability in the same rigorous way you measure feature delivery, you can make rational decisions about how much reliability you need and what you are willing to sacrifice to get more features faster.
The Three Layers of Reliability Agreements
Before SRE had language for reliability, agreements about uptime were fuzzy. “The system should be reliable” meant different things to different people, and when something went wrong there was no objective basis for deciding whether a response was adequate.
SRE introduced three increasingly specific levels:
SLI (Service Level Indicator): a quantitative measurement of some aspect of the service’s behavior. Not a target, just a measurement. Examples: the fraction of HTTP requests that return a successful response (availability), the fraction of requests that respond within 200ms (latency), the fraction of storage writes that persist correctly (durability). The SLI is a number you compute from your monitoring data.
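As a rough illustration of "a number you compute from your monitoring data", the sketch below derives an availability SLI and a latency SLI from raw counts. The function names, the sample numbers, and the choice to report 1.0 when there is no traffic are assumptions made for the example, not part of any particular monitoring system.

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests that succeeded: a measurement, not a target."""
    if total == 0:
        # Assumption: with no traffic, no failures were observed, so report 1.0.
        return 1.0
    return successful / total


def latency_sli(latencies_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Fraction of requests that responded within the threshold."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms)


print(availability_sli(successful=999_200, total=1_000_000))  # 0.9992
print(latency_sli([120.0, 90.0, 250.0, 180.0]))               # 0.75
```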
SLO (Service Level Objective): a target for an SLI over a time window. “99.9% of requests should succeed over any 28-day rolling window.” “The 95th percentile latency should be under 200ms over any calendar week.” The SLO is what you actually aim for. It is an internal target, not a customer commitment.
SLA (Service Level Agreement): a contractual commitment to customers, usually with financial consequences if violated. The SLA is set lower than the SLO - the SLO is what you aim for internally, and you only commit externally to something you are confident you can exceed. If your SLO is 99.9%, your SLA might be 99.5%. This gives you room to miss your internal target without breaching a contract.
The hierarchy matters: you measure SLIs, you target SLOs, and you contractually commit to SLAs that are looser than your SLOs.
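One way to picture the hierarchy is as a pair of thresholds applied to a measured SLI. The sketch below is illustrative only: `ReliabilityTargets` and `classify` are hypothetical names, and the 99.9% and 99.5% values are the example figures used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReliabilityTargets:
    slo: float  # internal objective, e.g. 0.999
    sla: float  # contractual commitment, e.g. 0.995

    def __post_init__(self) -> None:
        # The SLA is meant to be looser than the SLO, leaving a safety margin.
        if not self.sla < self.slo:
            raise ValueError("SLA should be looser than the SLO")


def classify(sli: float, targets: ReliabilityTargets) -> str:
    """Place a measured SLI relative to the internal and contractual targets."""
    if sli >= targets.slo:
        return "meeting SLO"
    if sli >= targets.sla:
        return "SLO missed, SLA intact"
    return "SLA breached"


targets = ReliabilityTargets(slo=0.999, sla=0.995)
print(classify(0.9995, targets))  # meeting SLO
print(classify(0.997, targets))   # SLO missed, SLA intact
print(classify(0.99, targets))    # SLA breached
```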
Error Budgets: Reliability as a Resource
The SLO sets a target. The error budget is what you can afford to spend before breaching it.
If your SLO is 99.9% availability over 28 days, you are allowed 0.1% of requests to fail. For a service handling 1 million requests per day over 28 days - 28 million requests total - that is 28,000 failed requests before you breach the SLO. That is the error budget.
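The arithmetic above can be reproduced in a few lines. This is a minimal sketch: `error_budget` and `budget_remaining` are hypothetical helper names, and the 7,000 failed requests in the usage lines are an invented figure for illustration.

```python
from fractions import Fraction


def error_budget(slo: str, total_requests: int) -> int:
    """Requests allowed to fail in the window without breaching the SLO."""
    # Fraction parses the decimal string exactly, avoiding float rounding.
    allowed_failure = 1 - Fraction(slo)
    return int(allowed_failure * total_requests)


def budget_remaining(budget: int, failed_so_far: int) -> float:
    """Share of the budget still unspent; negative once the SLO is breached."""
    return (budget - failed_so_far) / budget


total = 1_000_000 * 28                 # 1 million requests per day over 28 days
budget = error_budget("0.999", total)
print(total, budget)                   # 28000000 28000
print(budget_remaining(budget, failed_so_far=7_000))  # 0.75
```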
The critical insight is that error budgets make reliability a shared resource with a clear accounting. They answer the question that was previously a perpetual argument: how much reliability do we need, and whose fault is it when we do not have enough?
With error budgets:
- If the budget is healthy (mostly unspent), the development team can ship faster and take more risk. New deployments, infrastructure changes, experimental features - all of these burn error budget, but that is acceptable when there is budget to burn.
- If the budget is nearly exhausted, the team shifts priorities toward reliability: deferring risky deployments, fixing known sources of failures, increasing test coverage. Not because someone mandated it, but because deploying now would breach the SLO.
- If a postmortem reveals that most budget was burned by a single class of incident, the error budget provides objective evidence that fixing that class of incident should take priority over new features.
The error budget converts the argument “are we reliable enough?” into an accounting problem. It also aligns incentives: neither software engineers nor SREs want to burn the budget, because exhausting it means no new features until reliability is restored.
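The three cases above can be sketched as a small policy function plus a helper that finds the incident class responsible for most of the burn. Everything here is an assumption made for illustration: the names, the 25% caution threshold, and the incident categories. Real error budget policies are negotiated per team.

```python
from enum import Enum


class ReleasePolicy(Enum):
    SHIP_FREELY = "ship freely"
    PRIORITIZE_RELIABILITY = "prioritize reliability work"
    FREEZE = "freeze risky changes"


def release_policy(budget: int, spent: int, caution_threshold: float = 0.25) -> ReleasePolicy:
    """Map the remaining error budget to a release posture (thresholds are assumed)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    remaining = (budget - spent) / budget
    if remaining <= 0:
        return ReleasePolicy.FREEZE
    if remaining < caution_threshold:
        return ReleasePolicy.PRIORITIZE_RELIABILITY
    return ReleasePolicy.SHIP_FREELY


def dominant_incident_class(burn_by_class: dict[str, int]) -> tuple[str, float]:
    """Return the incident class that burned the most budget and its share of the burn."""
    total = sum(burn_by_class.values())
    if total == 0:
        raise ValueError("no budget has been burned")
    name = max(burn_by_class, key=lambda k: burn_by_class[k])
    return name, burn_by_class[name] / total


print(release_policy(budget=28_000, spent=5_000))   # ReleasePolicy.SHIP_FREELY
print(release_policy(budget=28_000, spent=25_000))  # ReleasePolicy.PRIORITIZE_RELIABILITY
print(dominant_incident_class(
    {"bad deploys": 18_000, "dependency timeouts": 4_000, "capacity": 2_000}
))  # ('bad deploys', 0.75)
```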
Choosing the Right SLIs
Not every measurement makes a useful SLI. The SLIs that matter are the ones that track how users actually experience the service.
Google’s SRE book identifies a small set of useful SLI categories:
Availability: what fraction of requests succeed? The definition depends on what “succeed” means for your service: for an HTTP service, a 2xx response; for a batch pipeline, a job that completes without errors; for storage, a read that returns correct data.
Latency: how long do requests take? Almost always measured at a percentile (P99, P95) rather than an average. Averages are misleading because they hide tail latency: a service where 99% of requests take 10ms and 1% take 10 seconds has an average of ~110ms, which sounds reasonable. The 10-second tail, which only a high percentile such as P99.9 reveals in this example, is what users actually experience when they are unlucky.
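The gap between the mean and the tail is easy to check numerically. The sketch below builds the distribution from the example (990 requests at 10ms, 10 at 10 seconds) and uses a nearest-rank percentile, which is one of several common percentile definitions and an assumption here. With exactly 1% slow requests, a nearest-rank P99 sits right on the boundary, so the sketch also reports P99.9.

```python
import math
from fractions import Fraction


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    # Fraction(str(p)) keeps the rank exact for inputs such as 99.9.
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = [10.0] * 990 + [10_000.0] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                         # 109.9
print(percentile(latencies_ms, 50))    # 10.0
print(percentile(latencies_ms, 99))    # 10.0
print(percentile(latencies_ms, 99.9))  # 10000.0
```

The mean describes no request that actually happened: every request took either 10ms or 10 seconds, and only the tail percentile shows the second group.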
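Throughput: for data processing systems, how many records or bytes are processed per second? Pipelines often pair this with a freshness measurement, such as the fraction of events processed within 5 minutes of ingestion. An SLO on that SLI might be “at least 95% of events are processed within 5 minutes of ingestion.”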
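A pipeline measurement of this kind can be computed from ingestion and completion timestamps. The sketch below is a minimal example under assumed inputs: each event is a hypothetical `(ingested, processed)` pair, with `None` standing for an event that has not been processed yet.

```python
from datetime import datetime, timedelta
from typing import Optional

Event = tuple[datetime, Optional[datetime]]  # (ingested_at, processed_at or None)


def freshness_sli(events: list[Event], deadline: timedelta = timedelta(minutes=5)) -> float:
    """Fraction of events processed within the deadline after ingestion."""
    if not events:
        return 1.0
    on_time = sum(
        1
        for ingested, processed in events
        if processed is not None and processed - ingested <= deadline
    )
    return on_time / len(events)


t0 = datetime(2024, 1, 1, 12, 0, 0)
events: list[Event] = [
    (t0, t0 + timedelta(minutes=1)),
    (t0, t0 + timedelta(minutes=4)),
    (t0, t0 + timedelta(minutes=6)),  # late
    (t0, None),                       # never processed
]
sli = freshness_sli(events)
print(sli)          # 0.5
print(sli >= 0.95)  # False: a 95% target would be missed
```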
Correctness: does the service return the right answer? Harder to measure than the others, because you need a reference to compare against. Sometimes approximated by testing a sample of requests against a known-good oracle.
The common mistake is measuring too many things. A service tracked by seventeen SLIs produces noise, not signal. Start with the one or two measurements that most directly reflect the user experience, and add more only when evidence shows that the existing ones miss important failure modes.
Toil: The Enemy of Progress
SRE gives a specific name to a specific kind of work: toil. Toil is manual, repetitive, automatable work that scales with service growth and produces no lasting improvement: once the task is done, the service is no better than it was before.
Examples: manually restarting a service when it hangs, manually adding capacity when traffic spikes, manually triaging the same class of alert every week, manually rotating credentials on a schedule.
The distinction between toil and other operational work matters. Engineering work - designing a system, writing automation, building observability - produces lasting improvements that compound. Toil produces nothing lasting; the same work must be done again next time. If an SRE team spends most of its time on toil, it cannot invest in the engineering work that would reduce future toil.
SRE teams at Google aim to keep toil below 50% of their time. The rest is engineering: automation, reliability improvements, architectural changes that eliminate classes of incident. The 50% figure is a policy, not a natural law, but it enforces the principle that an SRE team should be getting better at its job over time, not just running faster on a treadmill.
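Tracking toil against the 50% policy only needs a rough time log. The sketch below assumes a hypothetical weekly breakdown of hours by category; the category names and numbers are invented for illustration.

```python
def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Share of logged hours spent on categories classified as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for name, h in hours_by_category.items() if name in toil_categories)
    return toil / total


week = {
    "manual restarts": 6.0,
    "alert triage": 8.0,
    "credential rotation": 2.0,
    "automation": 14.0,
    "design": 10.0,
}
toil = {"manual restarts", "alert triage", "credential rotation"}

fraction = toil_fraction(week, toil)
print(fraction)        # 0.4
print(fraction < 0.5)  # True: within the 50% policy
```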
The automation target also explains why SRE is staffed with engineers who could be writing product code. The opportunity cost of their time is high enough that keeping them on manual operations work is obviously wasteful. If a task is worth automating, put someone capable of automation on it.
Incident Management
When a significant incident occurs - a widespread outage, major data loss, a sustained SLO breach - SRE uses a structured response.
The incident commander is a single person responsible for coordinating the response. Not the most senior engineer in the room, but whoever is designated. Having a single coordinator prevents the chaos of everyone making independent decisions about what to try next.
The incident commander delegates: one person investigates the symptoms, one person handles communications (to stakeholders, to customers), one person manages the runbook if there is one. The commander keeps a live incident document, updated in real time, that records what is known and what is being tried.
The guiding principle during an incident is mitigate first, diagnose second. If you can stop the bleeding by rolling back a recent deployment, do that before you fully understand why the deployment caused a problem. Understanding why is for the postmortem. The incident response is about restoring service.
Blameless Postmortems
After an incident is resolved, the team writes a postmortem - a structured document that explains what happened, why it happened, and what will be done to prevent it from happening again.
The word blameless is deliberate and important. When an incident is caused by a human error - someone clicked the wrong button, someone deployed untested code, someone misread a dashboard - the human error is the proximate cause, not the root cause. The root cause is the system design that made it possible for that human error to occur.
If you fire or blame the engineer who made the error, you learn nothing. The same error will happen again, committed by a different engineer, because the system still permits it. If instead you ask “why was it possible to click the wrong button?” you might discover that the interface does not ask for confirmation, or that staging and production look identical, or that there is no automated check before deployment. These are fixable. They will prevent the error from recurring regardless of which engineer is on call next time.
Blameless postmortems create an environment where engineers report incidents honestly and completely, because they know the purpose is learning, not punishment. This is not soft management. It is the only way to get accurate data about what is actually going wrong in a complex system.
A good postmortem includes: a timeline of events, the contributing factors (not just “human error” but the system conditions that enabled it), the impact (SLO burn, user impact, financial cost), action items with owners and due dates, and lessons learned that apply beyond this specific incident.
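The checklist above maps naturally onto a small record type that can flag incomplete documents. This is only a sketch: the `Postmortem` and `ActionItem` classes, their field names, and the sample incident are hypothetical, not a standard template.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date


@dataclass
class Postmortem:
    timeline: list[str]
    contributing_factors: list[str]
    impact: str
    action_items: list[ActionItem]
    lessons_learned: list[str]

    def problems(self) -> list[str]:
        """List empty sections and factors that stop at 'human error'."""
        sections = ("timeline", "contributing_factors", "impact",
                    "action_items", "lessons_learned")
        found = [f"missing: {name}" for name in sections if not getattr(self, name)]
        if any(f.strip().lower() == "human error" for f in self.contributing_factors):
            found.append("contributing factor stops at 'human error'")
        return found


draft = Postmortem(
    timeline=["14:02 deploy started", "14:05 error rate rose", "14:11 rollback"],
    contributing_factors=["human error"],
    impact="6,000 failed requests, about 21% of the 28-day error budget",
    action_items=[ActionItem("Add confirmation to deploy tool", "alice", date(2024, 2, 1))],
    lessons_learned=[],
)
print(draft.problems())
# ['missing: lessons_learned', "contributing factor stops at 'human error'"]
```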
Summary
| Term | What it is |
|---|---|
| SLI | A measurement: the fraction of requests that succeeded |
| SLO | A target: 99.9% of requests should succeed over 28 days |
| SLA | A contract: 99.5% guaranteed, with financial penalties |
| Error budget | The allowed failure quota before breaching the SLO |
| Toil | Manual, repetitive work that scales with service growth |
| Postmortem | A blameless analysis of what caused an incident |
The key shift SRE introduced is treating reliability as an engineering problem with measurable objectives, rather than a vague aspiration maintained by human vigilance. Error budgets create shared accountability between teams. Toil budgets force investment in automation. Blameless postmortems produce honest data. Together, these practices make it possible to run systems that actually meet their reliability commitments, at scale, without burning out the engineers responsible for them.