jpena.dev

What is Site Reliability Engineering?

Site Reliability Engineering is what happens when you ask a software engineer to do operations.

In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, security, and capacity planning of the services it supports.

First and foremost, SREs are engineers. We apply the principles of computer science and engineering to the design and development of computing systems: generally, large distributed ones.

Core Principles

SRE is about creating reliability through engineering. That usually means:

Service Level Objectives (SLOs) to define what “good” looks like
Error budgets to balance reliability with the pace of change
Automation to reduce toil and make systems predictable
Blameless culture that focuses on learning and prevention
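
To make the link between SLOs and error budgets concrete, here is a minimal sketch in Python. It assumes a request-based availability SLO, where the budget is simply the number of failed requests the target allows over a window. The class name, field names, and the numbers in the usage lines are all illustrative assumptions, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float      # assumed target, e.g. 0.999 = 99.9% of requests succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO does not promise: (1 - target) * traffic.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Positive: room left for risky changes. Negative: budget overspent.
        return self.allowed_failures - self.failed_requests

    @property
    def exhausted(self) -> bool:
        return self.remaining < 0


# Illustrative numbers only: a 99.9% target over one million requests.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(round(budget.allowed_failures), round(budget.remaining), budget.exhausted)
```

With these made-up numbers the service may fail about 1,000 requests in the window and has used 400 of them. A team might treat a positive remainder as room to ship changes and an exhausted budget as a signal to slow down and prioritize reliability work.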

Common Responsibilities

Day‑to‑day work varies by company, but the shape is consistent:

Designing and maintaining resilient systems
Observability (metrics, logs, tracing) and actionable alerting
Incident response and post‑incident reviews
Capacity planning and performance optimization
Improving developer experience and deployment safety
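
One common way to make alerting actionable is to page on how fast the error budget is being spent rather than on raw error counts. The sketch below assumes a request-based SLO and a single hypothetical paging threshold. The function names and the threshold of 10 are illustrative assumptions, and real setups often combine several time windows.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A value of 1.0 means the budget would be spent exactly at the end of the
    SLO window; higher values mean it runs out sooner.
    """
    budget_ratio = 1.0 - slo_target
    if budget_ratio <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget_ratio


def should_page(error_ratio: float, slo_target: float, threshold: float = 10.0) -> bool:
    # threshold is a hypothetical value chosen for illustration only.
    return burn_rate(error_ratio, slo_target) >= threshold


# Illustrative inputs: 2% and 0.5% of recent requests failing against a 99.9% SLO.
print(should_page(0.02, 0.999))
print(should_page(0.005, 0.999))
```

The design choice here is that the alert scales with the SLO: the same 0.5% error ratio that is tolerable for a 99% target burns budget five times too fast for a 99.9% target.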
