What does a site reliability engineer (SRE) do?

Site Reliability Engineers, or SREs, are primarily responsible for how code is deployed and run within an organization’s IT infrastructure. It’s their job to deploy, configure, and monitor new code, as well as handle change management and emergency response for services in production.

As software engineers with experience in IT operations, SREs develop automated processes to handle operational tasks like creating test environments, analyzing logs, and responding to errors and other issues while on call.
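As a rough illustration of the kind of log analysis an SRE might automate, the sketch below counts log lines by severity level. The log format, the `count_levels` helper, and the sample lines are all assumptions made up for this example; real logs vary widely between systems.

```python
import re
from collections import Counter

# Assumed log format: "<date> <time> <LEVEL> <message>" (hypothetical).
LEVEL_PATTERN = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b")


def count_levels(lines):
    """Count how many log lines mention each severity level."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample data, for illustration only.
SAMPLE_LOG = [
    "2024-01-01 10:00:00 INFO service started",
    "2024-01-01 10:00:05 ERROR database connection refused",
    "2024-01-01 10:00:06 WARNING retrying connection",
    "2024-01-01 10:00:09 ERROR database connection refused",
]

if __name__ == "__main__":
    # Print levels from most to least frequent.
    for level, count in count_levels(SAMPLE_LOG).most_common():
        print(f"{level}: {count}")
```

In practice, a script like this would usually read from a log file or a monitoring system rather than a hard-coded list.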

The role typically involves working across teams, usually development and operations, to provide support and deliver reliable systems to both. This is a key part of the development ecosystem, as it allows development and operations to focus on work such as building new features while SREs ensure the right infrastructure is in place to keep downtime and errors to a minimum.

Ideally, Site Reliability Engineers will split their time between project work and operations tasks. According to Google, the time SREs spend on operations should be monitored and kept to no more than 50% of their available time. Otherwise, tasks like implementing automated processes, creating new features, and scaling the system will not be completed.
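The 50% guideline is simple arithmetic, and a minimal sketch might look like the following. The function names and the example hours are hypothetical, and it assumes a team already tracks hours in two buckets (operations and project work).

```python
OPS_CAP = 0.50  # the 50% ceiling on operations work described above


def operations_share(ops_hours, project_hours):
    """Return the fraction of tracked time spent on operations work."""
    total = ops_hours + project_hours
    if total <= 0:
        raise ValueError("no tracked hours to evaluate")
    return ops_hours / total


def exceeds_cap(ops_hours, project_hours, cap=OPS_CAP):
    """Return True if operations work takes more than the allowed share."""
    return operations_share(ops_hours, project_hours) > cap


# Hypothetical week: 24 hours of operations work, 16 hours of project work.
share = operations_share(24, 16)
print(f"Operations share: {share:.0%}")       # Operations share: 60%
print(f"Over the cap: {exceeds_cap(24, 16)}")  # Over the cap: True
```

Here 24 of 40 tracked hours went to operations, so the share is 60% and the week is over the cap. An even 20/20 split would sit exactly at the limit and would not be flagged.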

SREs are also responsible for handling infrastructure incidents and, depending on the employer, may be required to be part of an on-call rotation outside of typical office hours. Incident management for SREs usually includes a review of the incident after it has been resolved, to determine whether any automated processes, documentation, or other tools need to be built or updated to prevent it from happening again.

Similarly, SREs are typically expected to keep documentation and runbooks up to date so that their knowledge is shared across the organization.

What skills does a site reliability engineer need?

SREs typically need to have experience with both operations and development. In some cases, you may be considered for an SRE role if you have a background in one area and foundational knowledge in the other, but that will depend on the organization and its requirements. 

Tech stacks can also vary between companies, but SREs are usually expected to have in-depth knowledge of:

  • Back-end programming languages like Python, Java, Go, or Ruby
  • DevOps tools like Kubernetes, Docker, or Git
  • Operating systems like Windows or Linux
  • Cloud infrastructure configuration management and deployment, particularly tools like Terraform
  • Cloud computing services like AWS and Google Cloud
  • Databases, particularly SQL
  • Monitoring systems like Datadog and Prometheus

These technologies usually form the bulk of most corporate IT infrastructure, so a working knowledge of them is vital.

It’s also important for SREs to have strong interpersonal skills. You’ll regularly be working with other software engineers, operations engineers, management, and in some cases, the chief technology officer. You may also be required to be on call for external customers. Good communication means SREs can report incidents as clearly as possible and support other engineers with their work.

Some organizations may also require previous experience working in incident management for customer-facing applications. You may also need experience working as part of a 24/7 on-call team. This may not apply to every SRE role, but if you have this experience, it can be a significant advantage.

Average salary ($, US based): $117,992 (Payscale)