SRE stands for Site Reliability Engineering. It builds upon the principles of DevOps to bring an engineering-led approach to IT operations. SRE uses software to automate system operation, identify problems, and implement resolutions.

The concept of SRE developed at Google. It’s based on the idea that code and software are the most effective way to manage large-scale systems. Manual procedures initiated by a separate team carry a risk of oversight and inconsistency.

In this article, you’ll learn what SRE is and how it helps to streamline cloud operations. We’ll also explain where SRE overlaps with DevOps, as well as the ways in which it differs.

Where Does SRE Fit Into Software Delivery?

SRE concerns operations management. It enters the software delivery process after code has been developed, reviewed, and deployed. Site reliability engineers usually observe, maintain, and optimize those deployed services, taking over the responsibilities of administrators.

The distinguishing characteristic of SRE compared to traditional operations is the emphasis it places on automation. Infrastructure controls, change management, audits, and incident response should all be automated within the model. The SRE practitioner focuses on provisioning and running software tools that achieve these tasks, instead of directly interacting with the system themselves.

SRE unifies disparate aspects of the operations management experience. Using a tools-driven process means there are fewer places for problems to occur. This helps to increase stability as systems grow, even if the size of the SRE team remains static.

What Do SRE Engineers Actually Do?

SRE engineers are usually software developers who are also experienced with operating production services. This gives them a holistic awareness of the delivery process, from code commit to incident resolution. They’ll use this knowledge to design and implement mechanisms for deploying and monitoring live environments.

As “reliability” is literally in the name, SRE teams are also responsible for measuring uptime and devising ways to improve it. SRE engineers set the service-level objectives (SLOs) that provide reliability targets for the organization. They’ll establish and observe the service-level indicators (SLIs) that inform whether the objectives are being met, such as error rate, request throughput, and ticket count. SREs will be involved in writing the service-level agreements (SLAs) that are shared with customers too.
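As a rough illustration of how an SLI is checked against an SLO, the sketch below computes an availability SLI from request counts and compares it with a target. The `RequestWindow` type, the function names, and the sample numbers are all hypothetical, chosen for this example; real teams usually pull these figures from their monitoring system.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts observed over one measurement window (hypothetical type)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: RequestWindow) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return 1 - window.failed_requests / window.total_requests


def meets_slo(window: RequestWindow, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return availability_sli(window) >= slo_target


# Illustrative numbers only: 700 failures out of one million requests.
window = RequestWindow(total_requests=1_000_000, failed_requests=700)
print(f"SLI: {availability_sli(window):.4%}")
print("Meets 99.9% SLO:", meets_slo(window, 0.999))
```

The same shape works for other indicators, such as latency, by swapping the calculation inside `availability_sli` for a different measurement.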

SRE engineers are the effective gatekeepers around new deployments. Their focus on preserving stability means they’ll sometimes instigate deployment freezes if an SLO or SLA is about to be breached. The SRE team can direct developers to focus on addressing the cause of incidents, instead of continuing to roll out new work.
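One way such a gate might be automated is sketched below. It assumes the team freezes deployments whenever the measured SLI drops to within a safety margin of the SLO; the function name, the margin value, and the sample figures are assumptions made for illustration, not a policy described by any particular organization.

```python
def should_freeze_deployments(
    current_sli: float,
    slo_target: float,
    safety_margin: float = 0.0005,  # assumed buffer above the SLO; tune per service
) -> bool:
    """Return True when the SLI is close enough to the SLO that releases should pause."""
    return current_sli < slo_target + safety_margin


# Illustrative values: a 99.9% SLO with the default margin.
print(should_freeze_deployments(current_sli=0.9993, slo_target=0.999))
print(should_freeze_deployments(current_sli=0.9999, slo_target=0.999))
```

A check like this could run as a step in a deployment pipeline, so that the freeze is applied consistently instead of relying on someone noticing a dashboard.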

No service can expect to run with 100% reliability. SRE recognizes this by granting developers an “error budget” which they’re allowed to “spend.” Once that budget’s been exceeded by new bugs, tickets, or outages, addressing the problems becomes everyone’s priority until the error budget and SLOs are restored.
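The arithmetic behind an error budget is simple enough to sketch. The example below assumes a request-based budget, where the SLO is a target success ratio and the budget is the number of failed requests that ratio tolerates in a window. The function names and figures are hypothetical; budgets can equally be expressed in minutes of downtime.

```python
def error_budget(total_requests: int, slo_target: float) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return total_requests * (1 - slo_target)


def budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Failures still available to 'spend'; negative means the budget is exceeded."""
    return error_budget(total_requests, slo_target) - failed_requests


def budget_exhausted(total_requests: int, failed_requests: int, slo_target: float) -> bool:
    """True once observed failures exceed what the SLO allows."""
    return budget_remaining(total_requests, failed_requests, slo_target) < 0


# Illustrative numbers: a 99.9% SLO over one million requests allows roughly 1,000 failures.
total, slo = 1_000_000, 0.999
for failed in (400, 1_200):
    left = budget_remaining(total, failed, slo)
    print(f"{failed} failures -> {left:.0f} left, exhausted: {budget_exhausted(total, failed, slo)}")
```

Once `budget_exhausted` returns true, the team would shift from shipping new work to fixing the causes of the failures, as described above.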

It could be an SRE engineer who completes this remedial work by writing new code. Because the SRE team has a background in software engineering, they’re equipped to deal with problems on their own initiative. When the service is running well, people in SRE roles revert to being regular developers. Google’s SRE engineers are expected to spend at least half their time on development work.

This unique balance of development and operations helps to preserve the SRE engineer’s ability to oversee the delivery process. Their level of visibility is invaluable when it comes to spotting risks that could cause an incident. It also encourages engineers to minimize the time spent on operations tasks by implementing new tools and automated procedures. This can create a self-sustaining cycle: a greater degree of automation usually makes the service more reliable, reducing the ops workload for the SRE team. In turn, engineers are freed up to return to development, increasing throughput.

How Does SRE Align With DevOps?

DevOps is a far-reaching term that describes using modern technologies and methodologies to deliver higher quality software more quickly. This is achieved by narrowing the gap between development and operations teams, then layering automation over the software delivery process.

So far, this sounds similar to SRE. However, SRE has a single objective in mind – reliability – whereas DevOps considers tangential concerns too, such as developer efficiency and delivery speed. It’s noteworthy that DevOps is often approached as a bridge between development and operations, while SRE fuses them together. In SRE, dev and ops tasks are completed by the same people, with development gaining the bulk of the attention.

For these reasons, SRE can be seen as a specific implementation of DevOps. Although the overall objectives are similar and strongly aligned, SRE describes a concrete method of achieving them: use error budgets, SLOs, and SLIs to guard services against errors, then implement protections that allow the work bias to return towards development.

Benjamin Treynor Sloss, the Google engineer who coined the term SRE, states that SRE can be seen as “a specific implementation of DevOps with some idiosyncratic extensions.” Alternatively, you can invert the model and approach DevOps “as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel.”

One significant way in which SRE differs from DevOps is its reliance on data. DevOps is often seen as a set of principles for efficiently moving code from developer workstations to production environments. This means working in terms of commits, merge requests, pipelines, and containers. SRE is a strategy for deploying changes with maximum reliability and reduced chance of regression. Effective SRE requires continual observation and analysis to work out where errors have occurred and how they might repeat in the future. It’s more investigative and self-aware than a typical DevOps implementation.

Is SRE a Good Career Move?

SRE has only recently begun to attract mainstream attention. It can be challenging to find an SRE role because many organizations are yet to recognize the model’s benefits. In some cases, a form of SRE may be present inside an organization without being reflected in the roles it advertises.

Despite its specialized nature, SRE is typically a good career move. It demands an intersection of skills, spanning from software development through to service operation and incident response, with a good degree of depth in each. Few candidates can offer this, which means SRE roles tend to be lucrative positions.

An analysis by GitLab in April 2022 found only 21,000 SRE openings, compared with 104,000 DevOps positions. However, data from Glassdoor indicated a salary range of up to $300,000 for SRE work, as opposed to $234,000 for DevOps.

Moving into an SRE role could be a rewarding opportunity for individuals who want to remain in the development field while gaining hands-on experience of service operation. It’s especially suited to people who find traditional administrator roles too repetitive and hands-on. As an SRE, you’ll be expected to automate operations, look for opportunities to enhance service quality, and contribute to regular development efforts after the incident pager’s gone quiet.

Conclusion

Site Reliability Engineering uses methods commonly associated with software development to automate service operations. SRE engineers are experienced developers who are also familiar with the challenges of running and scaling services in production. They establish a toolchain for measuring and optimizing reliability, taking over the tasks formerly handled by dedicated system administrators.

SRE can be seen as an implementation of DevOps principles. Appointing SRE engineers should result in a more resilient service which can accept rapid change. This achieves the DevOps goal of accelerating software deployment without impacting quality. SRE sets out a specific strategy that works towards this by emphasizing data measurement, as well as unification of dev and ops talent.

Whereas DevOps is now broadly understood in the community, SRE remains an emerging focus area for many organizations. Openings can be harder to find but they tend to be more lucrative when they appear. This reflects the varied set of skills that SRE engineers need to possess. Demand is likely to grow rapidly over the next couple of years, so now’s the time for candidates and organizations to start paying attention to the shift towards SRE.

James Walker is a contributor to How-To Geek DevOps. He is the founder of Heron Web, a UK-based digital agency providing bespoke software development services to SMEs. He has experience managing complete end-to-end web development workflows, using technologies including Linux, GitLab, Docker, and Kubernetes.