Site Reliability Engineering (SRE)
Building and operating reliable, scalable systems
SRE is a discipline that applies software engineering principles to infrastructure and operations problems.
Core Principles
1. Service Level Objectives (SLOs)
SLO = Target reliability level for a service
interface SLO {
name: string
target: number // e.g., 99.9%
window: string // e.g., "30d"
indicator: SLI
}
const apiSLO: SLO = {
name: 'API Availability',
target: 99.9,
window: '30d',
indicator: {
type: 'availability',
measurement: 'successful_requests / total_requests'
}
}
2. Service Level Indicators (SLIs)
SLI = Quantitative measure of service level
// Common SLIs
const slis = {
// Availability
availability: {
measurement: 'successful_requests / total_requests',
target: 0.999 // 99.9%
},
// Latency
latency: {
measurement: 'requests_under_200ms / total_requests',
target: 0.95 // 95th percentile < 200ms
},
// Throughput
throughput: {
measurement: 'requests_per_second',
target: 1000 // min 1000 rps
},
// Error rate
errorRate: {
measurement: 'error_requests / total_requests',
target: 0.001 // < 0.1%
}
}
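For illustration, checking a measured value against these targets might look like the following sketch (the meetsTarget helper and the sample numbers are assumptions, not part of the catalogue above):
// Hypothetical helper: does a measured SLI value satisfy its target?
// For error rate the target is a ceiling, so lower is better.
function meetsTarget(measured: number, target: number, lowerIsBetter = false): boolean {
  return lowerIsBetter ? measured <= target : measured >= target
}
console.log(meetsTarget(0.9995, slis.availability.target)) // true: 99.95% >= 99.9%
console.log(meetsTarget(0.002, slis.errorRate.target, true)) // false: 0.2% exceeds the 0.1% ceiling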
3. Service Level Agreements (SLAs)
SLA = Contractual agreement with consequences
interface SLA {
slo: SLO
consequences: {
breach: string
penalty: string
}
}
const apiSLA: SLA = {
slo: apiSLO,
consequences: {
breach: '< 99.9% uptime in 30 days',
penalty: '10% service credit'
}
}
Error Budgets
Concept
class ErrorBudget {
constructor(
private slo: number, // e.g., 99.9 (percent)
private window: number // window length in seconds
) {}
// Total downtime allowed within the window, in seconds
calculateBudget(): number {
return (1 - this.slo / 100) * this.window
}
// Budget left after subtracting observed downtime (uptime given in seconds)
getRemainingBudget(actualUptime: number): number {
const actualDowntime = this.window - actualUptime
return this.calculateBudget() - actualDowntime
}
}
// Example: 99.9% SLO over a 30-day window
const budget = new ErrorBudget(99.9, 30 * 24 * 60 * 60)
// Allowed downtime: ~43 minutes per month
console.log(budget.calculateBudget()) // 2,592 seconds (~43 min)
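For illustration (the downtime figure is assumed), getRemainingBudget then reports how much of that budget is left:
// Suppose 600 seconds (10 minutes) of downtime have been observed so far in the window
const observedUptime = 30 * 24 * 60 * 60 - 600
console.log(budget.getRemainingBudget(observedUptime)) // 1,992 seconds (~33 min) remaining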
Error Budget Policy
interface ErrorBudgetPolicy {
onBudgetExhausted: string[]
onBudgetWarning: string[]
onBudgetHealthy: string[]
}
const policy: ErrorBudgetPolicy = {
onBudgetExhausted: [
'Freeze feature releases',
'Focus on reliability',
'Cancel non-critical deployments',
'Conduct incident review'
],
onBudgetWarning: [
'Increase monitoring',
'Review upcoming changes',
'Prepare rollback plans'
],
onBudgetHealthy: [
'Continue normal operations',
'Consider new features',
'Controlled experiments OK'
]
}
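One way such a policy could be applied is to select the action list from the fraction of error budget remaining. This is only a sketch; the actionsFor helper and its 25% warning threshold are assumptions for illustration:
// Hypothetical: map remaining error budget (as a fraction of the total) to the policy's actions
function actionsFor(p: ErrorBudgetPolicy, remainingFraction: number): string[] {
  if (remainingFraction <= 0) return p.onBudgetExhausted
  if (remainingFraction < 0.25) return p.onBudgetWarning // assumed warning threshold
  return p.onBudgetHealthy
}
console.log(actionsFor(policy, 0.10)) // ['Increase monitoring', ...]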
Toil Reduction
What is Toil?
Toil is manual, repetitive, automatable work that doesn't provide lasting value.
// Toil characteristics
interface Toil {
isManual: boolean
isRepetitive: boolean
isAutomatable: boolean
hasNoValue: boolean // No enduring value
scalesLinearly: boolean // O(n) with service growth
}
// Example: Manual deployment toil
const manualDeployment: Toil = {
isManual: true, // Someone SSH's into servers
isRepetitive: true, // Same steps every time
isAutomatable: true, // Could be scripted/automated
hasNoValue: true, // Doesn't improve the service
scalesLinearly: true // More services = more manual work
}
// Solution: Automate!
async function automatedDeploy(service: string, version: string) {
await runTests()
await buildImage(version)
await pushToRegistry()
await updateKubernetes(service, version)
await waitForRollout()
await runSmokeTests()
}
Toil Budget
// SRE teams should spend < 50% time on toil
interface ToilTracking {
totalHours: number
toilHours: number
engineeringHours: number
getToilPercentage(): number
}
const teamToil: ToilTracking = {
totalHours: 160, // 1 month
toilHours: 60,
engineeringHours: 100,
getToilPercentage() {
return (this.toilHours / this.totalHours) * 100 // 37.5%
}
}
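A quick check of the tracked numbers against the < 50% guideline:
// 37.5% toil is within the 50% ceiling
const withinToilBudget = teamToil.getToilPercentage() < 50
console.log(withinToilBudget) // true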
Monitoring & Alerting
The Four Golden Signals
interface GoldenSignals {
// 1. Latency
latency: {
p50: number
p95: number
p99: number
}
// 2. Traffic
traffic: {
requestsPerSecond: number
activeUsers: number
}
// 3. Errors
errors: {
errorRate: number
errorCount: number
}
// 4. Saturation
saturation: {
cpuUtilization: number
memoryUtilization: number
diskUtilization: number
}
}
// Implementing metrics collection
class MetricsCollector {
async collectGoldenSignals(): Promise<GoldenSignals> {
return {
latency: await this.getLatencyMetrics(),
traffic: await this.getTrafficMetrics(),
errors: await this.getErrorMetrics(),
saturation: await this.getSaturationMetrics()
}
}
private async getLatencyMetrics() {
// Query Prometheus: histogram_quantile operates on the _bucket series, aggregated by `le`
return {
p50: await this.query('histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'),
p95: await this.query('histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'),
p99: await this.query('histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))')
}
}
}
Alert Design
interface Alert {
name: string
condition: string
severity: 'critical' | 'warning' | 'info'
notify: string[]
runbook: string
}
// Good alert example
const goodAlert: Alert = {
name: 'High Error Rate',
condition: 'error_rate > 0.01 for 5 minutes', // Clear threshold
severity: 'critical',
notify: ['pagerduty', 'slack-incidents'],
runbook: 'https://wiki.example.com/runbooks/high-error-rate'
}
// Alert best practices
const alertRules = {
// Use symptom-based alerts
symptom: {
good: 'error_rate > 1%',
bad: 'server_down' // This is a cause, not a symptom
},
// Include duration to reduce noise
duration: {
good: 'latency > 1s for 5 minutes',
bad: 'latency > 1s' // Too noisy
},
// Link to runbooks
actionable: {
good: 'runbook: https://wiki.example.com/runbooks/...',
bad: 'No runbook provided'
}
}
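As a sketch of the "include duration" rule (the DurationAlert class is illustrative and not tied to any particular alerting tool): fire only once the condition has held continuously for the configured window, which suppresses short spikes.
// Hypothetical evaluator: returns true only after the condition has been
// continuously true for durationMs
class DurationAlert {
  private breachedSince: number | null = null
  constructor(private durationMs: number) {}
  evaluate(conditionTrue: boolean, now = Date.now()): boolean {
    if (!conditionTrue) {
      this.breachedSince = null
      return false
    }
    this.breachedSince ??= now
    return now - this.breachedSince >= this.durationMs
  }
}
const highErrorRate = new DurationAlert(5 * 60 * 1000) // 5 minutes
// highErrorRate.evaluate(currentErrorRate > 0.01) — call once per evaluation interval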
Incident Management
Incident Lifecycle
type IncidentStatus =
| 'detected'
| 'acknowledged'
| 'investigating'
| 'identified'
| 'resolved'
| 'closed'
interface Incident {
id: string
title: string
severity: 'sev1' | 'sev2' | 'sev3'
status: IncidentStatus
startedAt: Date
resolvedAt?: Date
commander: string
responders: string[]
timeline: TimelineEvent[]
impactedServices: string[]
customerImpact: string
dataLoss?: boolean // referenced by the severity assessment below
}
// Incident roles
interface IncidentRoles {
commander: {
responsibilities: [
'Coordinate response',
'Make decisions',
'Communicate status',
'Delegate tasks'
]
}
communicator: {
responsibilities: [
'Update stakeholders',
'Post status updates',
'Manage external comms'
]
}
scribe: {
responsibilities: [
'Document timeline',
'Record decisions',
'Track action items'
]
}
responders: {
responsibilities: [
'Investigate issues',
'Implement fixes',
'Monitor impact'
]
}
}
Incident Response Process
class IncidentResponse {
async handleIncident(incident: Incident) {
// 1. Detect & Alert
await this.page(incident.severity)
// 2. Triage
await this.assessSeverity(incident)
await this.assembleTeam(incident)
// 3. Investigate
await this.gatherData(incident)
await this.formHypothesis(incident)
// 4. Mitigate
await this.implementFix(incident)
await this.verifyResolution(incident)
// 5. Resolve
await this.declareResolved(incident)
// 6. Follow-up
await this.schedulePostMortem(incident)
await this.trackActionItems(incident)
}
private async assessSeverity(incident: Incident) {
// SEV1: Critical user impact, all hands on deck
// SEV2: Significant impact, page on-call
// SEV3: Minor impact, can wait for business hours
const severity = this.calculateSeverity({
userImpact: incident.impactedServices.length,
revenueImpact: await this.estimateRevenueLoss(incident),
dataLoss: incident.dataLoss
})
incident.severity = severity
}
}
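calculateSeverity is only referenced above; a possible shape, following the SEV1–SEV3 comments, might be the following (the thresholds are assumptions, not a standard):
// Hypothetical severity mapping based on the criteria sketched in the comments above
type Severity = 'sev1' | 'sev2' | 'sev3'
function calculateSeverity(input: {
  userImpact: number // e.g., number of impacted services
  revenueImpact: number // estimated loss in $/minute
  dataLoss?: boolean
}): Severity {
  if (input.dataLoss || input.revenueImpact > 1000 || input.userImpact >= 5) return 'sev1'
  if (input.revenueImpact > 100 || input.userImpact >= 2) return 'sev2'
  return 'sev3'
}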
Capacity Planning
Demand Forecasting
interface CapacityPlan {
service: string
currentCapacity: number
projectedDemand: number
timeHorizon: string
recommendations: string[]
}
class CapacityPlanner {
async forecastDemand(service: string): Promise<CapacityPlan> {
// Historical data
const historicalGrowth = await this.getGrowthRate(service)
// Current usage
const currentUsage = await this.getCurrentUsage(service)
// Project future demand
const projection = this.projectDemand(
currentUsage,
historicalGrowth,
{ months: 6 }
)
// Calculate headroom
const currentCapacity = await this.getCapacity(service)
const utilizationTarget = 0.70 // 70% utilization target
// Return a plan in every case; only recommend scaling when the projected peak
// would exceed the utilization target
const recommendations =
projection.peak > currentCapacity * utilizationTarget
? [
'Scale horizontally: Add 3 more instances',
'Estimated cost: $500/month',
'Implementation timeline: 2 weeks'
]
: ['Current capacity is sufficient for the forecast horizon']
return {
service,
currentCapacity,
projectedDemand: projection.peak,
timeHorizon: '6 months',
recommendations
}
}
private projectDemand(
current: number,
growthRate: number,
horizon: { months: number }
) {
// Simple exponential growth model
const projected = current * Math.pow(1 + growthRate, horizon.months)
// Add seasonal variations
const seasonal = this.getSeasonalFactors(horizon.months)
return {
average: projected,
peak: projected * seasonal.peakFactor,
low: projected * seasonal.lowFactor
}
}
}
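A worked example of the growth model with assumed numbers: 1,000 rps growing 10% per month over a 6-month horizon.
// Illustrative figures only
const projectedAverage = 1000 * Math.pow(1 + 0.10, 6)
console.log(Math.round(projectedAverage)) // ~1772 rps
// At a 70% utilization target, capacity should cover roughly 1772 / 0.7 ≈ 2530 rps at peak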
Load Testing
// k6 load test script
import http from 'k6/http'
import { check, sleep } from 'k6'
import { Rate } from 'k6/metrics'
const errorRate = new Rate('errors')
export const options = {
stages: [
{ duration: '2m', target: 100 }, // Ramp up
{ duration: '5m', target: 100 }, // Stay at 100 users
{ duration: '2m', target: 200 }, // Ramp to 200
{ duration: '5m', target: 200 }, // Stay at 200
{ duration: '2m', target: 0 }, // Ramp down
],
thresholds: {
http_req_duration: ['p(95)<500'], // 95% under 500ms
errors: ['rate<0.1'], // Error rate < 10%
}
}
export default function() {
const res = http.get('https://api.example.com/users')
const success = check(res, {
'status is 200': (r) => r.status === 200,
'response time < 500ms': (r) => r.timings.duration < 500
})
errorRate.add(!success)
sleep(1)
}
Reliability Patterns
Circuit Breaker
class CircuitBreaker {
private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED'
private failures = 0
private lastFailureTime = 0
private threshold = 5
private timeout = 60000 // 1 minute
async call<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'OPEN') {
if (Date.now() - this.lastFailureTime > this.timeout) {
this.state = 'HALF_OPEN'
} else {
throw new Error('Circuit breaker is OPEN')
}
}
try {
const result = await fn()
this.onSuccess()
return result
} catch (error) {
this.onFailure()
throw error
}
}
private onSuccess() {
this.failures = 0
if (this.state === 'HALF_OPEN') {
this.state = 'CLOSED'
}
}
private onFailure() {
this.failures++
this.lastFailureTime = Date.now()
// A failed probe while HALF_OPEN re-opens the circuit immediately
if (this.state === 'HALF_OPEN' || this.failures >= this.threshold) {
this.state = 'OPEN'
}
}
}
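Example usage (the endpoint is a placeholder): once the breaker trips, calls fail fast until the timeout elapses and a half-open probe is allowed through.
const breaker = new CircuitBreaker()
const user = await breaker.call(() =>
  fetch('https://api.example.com/users/42').then(res => res.json())
)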
Retry with Exponential Backoff
async function retryWithBackoff<T>(
fn: () => Promise<T>,
maxRetries = 3,
baseDelay = 1000
): Promise<T> {
for (let i = 0; i < maxRetries; i++) {
try {
return await fn()
} catch (error) {
if (i === maxRetries - 1) throw error
// Exponential backoff with jitter
const delay = baseDelay * Math.pow(2, i)
const jitter = Math.random() * 1000
await new Promise(resolve => setTimeout(resolve, delay + jitter))
}
}
throw new Error('Max retries exceeded')
}
// Usage
const data = await retryWithBackoff(() =>
fetch('https://api.example.com/data')
)
Bulkhead Pattern
// Isolate resources to prevent cascading failures
class Bulkhead {
private pools: Map<string, ResourcePool> = new Map()
createPool(name: string, maxSize: number) {
this.pools.set(name, new ResourcePool(maxSize))
}
async execute<T>(
poolName: string,
fn: () => Promise<T>
): Promise<T> {
const pool = this.pools.get(poolName)
if (!pool) throw new Error(`Pool ${poolName} not found`)
return pool.execute(fn)
}
}
class ResourcePool {
private active = 0
constructor(private maxSize: number) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.active >= this.maxSize) {
throw new Error('Pool exhausted')
}
this.active++
try {
return await fn()
} finally {
this.active--
}
}
}
// Usage: Separate pools for different operations
const bulkhead = new Bulkhead()
bulkhead.createPool('database', 10)
bulkhead.createPool('external-api', 5)
await bulkhead.execute('database', () => db.query('...'))
await bulkhead.execute('external-api', () => fetch('...'))
On-Call Practices
On-Call Rotation
interface OnCallSchedule {
primary: string
secondary: string
startTime: Date
endTime: Date
escalationPolicy: EscalationPolicy
}
interface EscalationPolicy {
levels: Array<{
waitMinutes: number
notify: string[]
}>
}
const schedule: OnCallSchedule = {
primary: 'alice@example.com',
secondary: 'bob@example.com',
startTime: new Date('2024-01-01T00:00:00Z'),
endTime: new Date('2024-01-08T00:00:00Z'),
escalationPolicy: {
levels: [
{ waitMinutes: 5, notify: ['primary'] },
{ waitMinutes: 10, notify: ['secondary'] },
{ waitMinutes: 15, notify: ['manager'] },
{ waitMinutes: 30, notify: ['director'] }
]
}
}
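A sketch of walking this escalation policy (the whoToNotify helper is an assumption, not a PagerDuty/Opsgenie API) to find who should be notified a given number of minutes after an unacknowledged page:
// Hypothetical: collect everyone whose escalation level has been reached
function whoToNotify(policy: EscalationPolicy, minutesSincePage: number): string[] {
  return policy.levels
    .filter(level => minutesSincePage >= level.waitMinutes)
    .flatMap(level => level.notify)
}
console.log(whoToNotify(schedule.escalationPolicy, 12)) // ['primary', 'secondary']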
Runbooks
# Runbook: High API Error Rate

## Alert
`error_rate > 1% for 5 minutes`

## Impact
- Users unable to complete purchases
- Revenue loss: ~$1000/minute

## Investigation Steps
1. Check the error dashboard: https://grafana.example.com/d/api-errors
2. Identify the error type:
   ```bash
   kubectl logs -l app=api --tail=100 | grep ERROR
   ```
3. Check recent deployments:
   ```bash
   kubectl rollout history deployment/api
   ```

## Common Causes

### 1. Database Connection Pool Exhausted
Symptoms: Timeout errors, connection refused
Fix:
```bash
# Scale up connection pool
kubectl set env deployment/api DB_POOL_SIZE=50
```

### 2. Downstream Service Failure
Symptoms: 503 errors, gateway timeouts
Fix:
```bash
# Enable circuit breaker
kubectl set env deployment/api CIRCUIT_BREAKER=true
```

### 3. Bad Deployment
Symptoms: Errors started after a deployment
Fix:
```bash
# Rollback to previous version
kubectl rollout undo deployment/api
```

## Escalation
If not resolved in 15 minutes, escalate to #incident-critical.

## Post-Incident
Schedule a post-mortem within 48 hours.
Chaos Engineering
// Chaos experiment framework
interface ChaosExperiment {
name: string
hypothesis: string
method: () => Promise<void>
rollback: () => Promise<void>
steadyState: () => Promise<boolean>
}
const latencyExperiment: ChaosExperiment = {
name: 'API Latency Injection',
hypothesis: 'System remains stable with 200ms added latency',
async method() {
// Inject latency
await fetch('http://chaos-mesh/inject-latency', {
method: 'POST',
body: JSON.stringify({
target: 'api-service',
latency: '200ms',
percentage: 50
})
})
},
async rollback() {
// Remove latency injection
await fetch('http://chaos-mesh/remove-latency', {
method: 'POST'
})
},
async steadyState() {
// Check if system is healthy
const metrics = await fetch('http://metrics/health')
const data = await metrics.json()
return (
data.errorRate < 0.01 &&
data.p95Latency < 500
)
}
}
// Run experiment
async function runChaosExperiment(experiment: ChaosExperiment) {
// 1. Verify steady state
const initialState = await experiment.steadyState()
if (!initialState) throw new Error('System not in steady state')
try {
// 2. Inject failure
await experiment.method()
// 3. Monitor system
await new Promise(resolve => setTimeout(resolve, 60000)) // 1 minute
// 4. Verify hypothesis
const finalState = await experiment.steadyState()
if (finalState) {
console.log('✅ Hypothesis validated')
} else {
console.log('❌ Hypothesis rejected - system degraded')
}
} finally {
// 5. Always rollback
await experiment.rollback()
}
}
SRE Metrics
interface SREMetrics {
// Availability
uptime: number // percentage
// Performance
p50Latency: number
p95Latency: number
p99Latency: number
// Reliability
mtbf: number // Mean Time Between Failures
mttr: number // Mean Time To Recovery
// Toil
toilPercentage: number
// Error budget
errorBudgetRemaining: number
}
class SREDashboard {
async generateReport(): Promise<SREMetrics> {
return {
uptime: await this.calculateUptime(),
p50Latency: await this.getLatency(0.50),
p95Latency: await this.getLatency(0.95),
p99Latency: await this.getLatency(0.99),
mtbf: await this.calculateMTBF(),
mttr: await this.calculateMTTR(),
toilPercentage: await this.calculateToil(),
errorBudgetRemaining: await this.getRemainingBudget()
}
}
}
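As an illustration of how mttr could be derived from incident records (a simplified sketch; the field names follow the Incident interface above):
// Hypothetical MTTR calculation: mean of (resolvedAt - startedAt) over resolved incidents, in minutes
function meanTimeToRecovery(incidents: Incident[]): number {
  const resolved = incidents.filter(i => i.resolvedAt)
  const totalMs = resolved.reduce(
    (sum, i) => sum + (i.resolvedAt!.getTime() - i.startedAt.getTime()),
    0
  )
  return resolved.length ? totalMs / resolved.length / 60000 : 0
}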
Best Practices
Do
✅ Define clear SLOs and SLIs
✅ Maintain error budgets
✅ Automate toil away
✅ Write comprehensive runbooks
✅ Practice incident response
✅ Conduct blameless post-mortems
✅ Invest in monitoring and observability
✅ Plan for capacity
Don't
❌ Ignore error budgets
❌ Alert on everything
❌ Skip runbook documentation
❌ Blame individuals for incidents
❌ Plan capacity only reactively
❌ Let manual toil exceed 50% of time
❌ Deploy without rollback plans
❌ Ignore SLO violations
Tools
- Monitoring: Prometheus, Grafana, Datadog
- Alerting: PagerDuty, Opsgenie, AlertManager
- Incident Management: Incident.io, Jira
- Chaos Engineering: Chaos Mesh, Gremlin
- Load Testing: k6, Gatling, Locust
- Observability: OpenTelemetry, Jaeger, Zipkin