What is Site Reliability Engineering?
Site Reliability Engineering is what happens when you ask a software engineer to do operations.
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, security, and capacity planning of the services it supports.
First and foremost, SREs are engineers. We apply the principles of computer science and engineering to the design and development of computing systems: generally, large distributed ones.
Core Principles
SRE is about creating reliability through engineering. That usually means:
- Service Level Objectives (SLOs) to define what “good” looks like
- Error budgets (the amount of unreliability an SLO permits) to balance reliability with the pace of change
- Automation to reduce toil (manual, repetitive operational work) and make systems predictable
- Blameless culture that focuses on learning and prevention
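The relationship between an SLO and its error budget comes down to simple arithmetic: an SLO of 99.9% leaves 0.1% of requests (or of time) that are allowed to fail. Here is a minimal Python sketch, assuming a request-based SLO. The names error_budget, ErrorBudget, and allowed_downtime_minutes, along with the sample numbers, are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    allowed_failures: float    # failures the SLO permits in the window
    consumed_fraction: float   # share of the budget already spent (1.0 = all of it)
    remaining_failures: float  # failures left before the SLO is breached


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute a request-based error budget (illustrative helper, not a standard API)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # The budget is the fraction of requests the SLO allows to fail.
    allowed = (1.0 - slo_target) * total_requests
    return ErrorBudget(
        allowed_failures=allowed,
        consumed_fraction=failed_requests / allowed,
        remaining_failures=allowed - failed_requests,
    )


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Time-based view of the same idea: minutes of full outage the SLO tolerates."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    # Assumed sample numbers: 99.9% SLO, one million requests, 250 failures.
    budget = error_budget(0.999, 1_000_000, 250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget consumed:  {budget.consumed_fraction:.0%}")
    print(f"downtime allowed over 30 days: {allowed_downtime_minutes(0.999, 30):.1f} min")
```

With these sample numbers the budget is about 1,000 failed requests, of which roughly a quarter has been spent. A 99.9% target over 30 days tolerates about 43 minutes of full outage. A team might slow down releases as the consumed fraction approaches 1, though the exact policy is a choice each organization makes.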
Common Responsibilities
Day‑to‑day work varies by company, but the shape is consistent:
- Designing and maintaining resilient systems
- Observability (metrics, logs, tracing) and actionable alerting
- Incident response and post‑incident reviews
- Capacity planning and performance optimization
- Improving developer experience and deployment safety