
Site Reliability Engineering (SRE)

Building and operating reliable systems at scale


SRE is a discipline that applies software engineering principles to infrastructure and operations problems.

Core Principles

1. Service Level Objectives (SLOs)

SLO = Target reliability level for a service

interface SLI {
  type: string // e.g., 'availability', 'latency'
  measurement: string
}

interface SLO {
  name: string
  target: number // e.g., 99.9%
  window: string // e.g., "30d"
  indicator: SLI
}

const apiSLO: SLO = {
  name: 'API Availability',
  target: 99.9,
  window: '30d',
  indicator: {
    type: 'availability',
    measurement: 'successful_requests / total_requests'
  }
}

2. Service Level Indicators (SLIs)

SLI = Quantitative measure of service level

// Common SLIs
const slis = {
  // Availability
  availability: {
    measurement: 'successful_requests / total_requests',
    target: 0.999 // 99.9%
  },
  
  // Latency
  latency: {
    measurement: 'requests_under_200ms / total_requests',
    target: 0.95 // 95th percentile < 200ms
  },
  
  // Throughput
  throughput: {
    measurement: 'requests_per_second',
    target: 1000 // min 1000 rps
  },
  
  // Error rate
  errorRate: {
    measurement: 'error_requests / total_requests',
    target: 0.001 // < 0.1%
  }
}
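
A quick way to make these targets actionable is a comparison helper; a minimal sketch (the `meetsTarget` helper is illustrative, not from a specific library):

// Most SLIs are "higher is better" (availability, latency ratio); error rate is
// "lower is better", so the comparison direction must be explicit.
function meetsTarget(measured: number, target: number, lowerIsBetter = false): boolean {
  return lowerIsBetter ? measured <= target : measured >= target
}

meetsTarget(0.9995, slis.availability.target)   // true: 99.95% measured vs 99.9% target
meetsTarget(0.002, slis.errorRate.target, true) // false: 0.2% errors vs 0.1% target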

3. Service Level Agreements (SLAs)

SLA = Contractual agreement with consequences

interface SLA {
  slo: SLO
  consequences: {
    breach: string
    penalty: string
  }
}

const apiSLA: SLA = {
  slo: apiSLO,
  consequences: {
    breach: '< 99.9% uptime in 30 days',
    penalty: '10% service credit'
  }
}

Error Budgets

Concept

class ErrorBudget {
  constructor(
    private slo: number,    // target in percent, e.g., 99.9
    private window: number  // window length in seconds
  ) {}

  calculateBudget(): number {
    // Allowed downtime (in seconds) within the window
    return (1 - this.slo / 100) * this.window
  }

  getRemainingBudget(actualUptime: number): number {
    const allowedDowntime = this.calculateBudget()
    const actualDowntime = this.window - actualUptime
    return allowedDowntime - actualDowntime
  }
}

// Example: 99.9% SLO for 30 days
const budget = new ErrorBudget(99.9, 30 * 24 * 60 * 60)

// Allowed downtime: ~43 minutes per month
console.log(budget.calculateBudget()) // 2,592 seconds (~43 min)
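
The same calculation gives the allowed downtime for other common targets over a 30-day window (43,200 minutes):

// 99%    -> 432 minutes (~7.2 hours)
// 99.9%  -> 43.2 minutes
// 99.95% -> 21.6 minutes
// 99.99% -> 4.32 minutes
for (const target of [99, 99.9, 99.95, 99.99]) {
  const minutes = new ErrorBudget(target, 30 * 24 * 60 * 60).calculateBudget() / 60
  console.log(`${target}%: ${minutes} minutes of downtime allowed`)
}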

Error Budget Policy

type Action = string

interface ErrorBudgetPolicy {
  onBudgetExhausted: Action[]
  onBudgetWarning: Action[]
  onBudgetHealthy: Action[]
}

const policy: ErrorBudgetPolicy = {
  onBudgetExhausted: [
    'Freeze feature releases',
    'Focus on reliability',
    'Cancel non-critical deployments',
    'Conduct incident review'
  ],
  
  onBudgetWarning: [
    'Increase monitoring',
    'Review upcoming changes',
    'Prepare rollback plans'
  ],
  
  onBudgetHealthy: [
    'Continue normal operations',
    'Consider new features',
    'Controlled experiments OK'
  ]
}
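
The policy only matters if something enforces it; a minimal sketch that picks an action list from the remaining budget fraction (the 25% warning threshold and the `selectActions` helper are illustrative assumptions):

// Hypothetical enforcement: choose the action list based on how much budget is left.
function selectActions(policy: ErrorBudgetPolicy, remainingFraction: number): Action[] {
  if (remainingFraction <= 0) return policy.onBudgetExhausted
  if (remainingFraction < 0.25) return policy.onBudgetWarning // assumed warning threshold
  return policy.onBudgetHealthy
}

// Example: 10% of the monthly budget left -> warning actions
console.log(selectActions(policy, 0.10))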

Toil Reduction

What is Toil?

Toil is manual, repetitive, automatable work that doesn't provide lasting value.

// Toil characteristics
interface Toil {
  isManual: boolean
  isRepetitive: boolean
  isAutomatable: boolean
  hasNoValue: boolean // No enduring value
  scalesLinearly: boolean // O(n) with service growth
}

// Example: Manual deployment toil
const manualDeployment: Toil = {
  isManual: true, // Someone SSH's into servers
  isRepetitive: true, // Same steps every time
  isAutomatable: true, // Could be scripted/automated
  hasNoValue: true, // Doesn't improve the service
  scalesLinearly: true // More services = more manual work
}

// Solution: Automate!
async function automatedDeploy(service: string, version: string) {
  await runTests()
  await buildImage(version)
  await pushToRegistry()
  await updateKubernetes(service, version)
  await waitForRollout()
  await runSmokeTests()
}

Toil Budget

// SRE teams should spend < 50% of their time on toil
interface ToilTracking {
  totalHours: number
  toilHours: number
  engineeringHours: number
  
  getToilPercentage(): number
}

const teamToil: ToilTracking = {
  totalHours: 160, // 1 month
  toilHours: 60,
  engineeringHours: 100,
  
  getToilPercentage() {
    return (this.toilHours / this.totalHours) * 100 // 37.5%
  }
}

Monitoring & Alerting

The Four Golden Signals

interface GoldenSignals {
  // 1. Latency
  latency: {
    p50: number
    p95: number
    p99: number
  }
  
  // 2. Traffic
  traffic: {
    requestsPerSecond: number
    activeUsers: number
  }
  
  // 3. Errors
  errors: {
    errorRate: number
    errorCount: number
  }
  
  // 4. Saturation
  saturation: {
    cpuUtilization: number
    memoryUtilization: number
    diskUtilization: number
  }
}

// Implementing metrics collection
class MetricsCollector {
  async collectGoldenSignals(): Promise<GoldenSignals> {
    return {
      latency: await this.getLatencyMetrics(),
      traffic: await this.getTrafficMetrics(),
      errors: await this.getErrorMetrics(),
      saturation: await this.getSaturationMetrics()
    }
  }
  
  private async getLatencyMetrics() {
    // Query Prometheus: histogram_quantile works on the _bucket series, typically over a rate() window
    return {
      p50: await this.query('histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'),
      p95: await this.query('histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'),
      p99: await this.query('histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))')
    }
  }
}

Alert Design

interface Alert {
  name: string
  condition: string
  severity: 'critical' | 'warning' | 'info'
  notify: string[]
  runbook: string
}

// Good alert example
const goodAlert: Alert = {
  name: 'High Error Rate',
  condition: 'error_rate > 0.01 for 5 minutes', // Clear threshold
  severity: 'critical',
  notify: ['pagerduty', 'slack-incidents'],
  runbook: 'https://wiki.example.com/runbooks/high-error-rate'
}

// Alert best practices
const alertRules = {
  // Use symptom-based alerts
  symptom: {
    good: 'error_rate > 1%',
    bad: 'server_down' // This is a cause, not a symptom
  },
  
  // Include duration to reduce noise
  duration: {
    good: 'latency > 1s for 5 minutes',
    bad: 'latency > 1s' // Too noisy
  },
  
  // Link to runbooks
  actionable: {
    good: 'runbook: https://wiki.example.com/runbooks/...',
    bad: 'No runbook provided'
  }
}
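
To illustrate why the duration clause matters, an evaluator can require the condition to hold for every sample in the window before firing; a minimal sketch (the `firesAfterDuration` helper is illustrative):

// Hypothetical evaluator: fire only if every sample in the window breaches the threshold,
// mirroring the "for 5 minutes" clause in the alert condition above.
function firesAfterDuration(samples: number[], threshold: number): boolean {
  return samples.length > 0 && samples.every(value => value > threshold)
}

// A single spike does not fire; a sustained breach does.
console.log(firesAfterDuration([0.002, 0.03, 0.004], 0.01)) // false
console.log(firesAfterDuration([0.02, 0.03, 0.04], 0.01))   // true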

Incident Management

Incident Lifecycle

type IncidentStatus = 
  | 'detected'
  | 'acknowledged'
  | 'investigating'
  | 'identified'
  | 'resolved'
  | 'closed'

interface TimelineEvent {
  timestamp: Date
  description: string
}

interface Incident {
  id: string
  title: string
  severity: 'sev1' | 'sev2' | 'sev3'
  status: IncidentStatus
  startedAt: Date
  resolvedAt?: Date
  commander: string
  responders: string[]
  timeline: TimelineEvent[]
  impactedServices: string[]
  customerImpact: string
  dataLoss?: boolean
}

// Incident roles
interface IncidentRoles {
  commander: {
    responsibilities: [
      'Coordinate response',
      'Make decisions',
      'Communicate status',
      'Delegate tasks'
    ]
  }
  
  communicator: {
    responsibilities: [
      'Update stakeholders',
      'Post status updates',
      'Manage external comms'
    ]
  }
  
  scribe: {
    responsibilities: [
      'Document timeline',
      'Record decisions',
      'Track action items'
    ]
  }
  
  responders: {
    responsibilities: [
      'Investigate issues',
      'Implement fixes',
      'Monitor impact'
    ]
  }
}

Incident Response Process

class IncidentResponse {
  async handleIncident(incident: Incident) {
    // 1. Detect & Alert
    await this.page(incident.severity)
    
    // 2. Triage
    await this.assessSeverity(incident)
    await this.assembleTeam(incident)
    
    // 3. Investigate
    await this.gatherData(incident)
    await this.formHypothesis(incident)
    
    // 4. Mitigate
    await this.implementFix(incident)
    await this.verifyResolution(incident)
    
    // 5. Resolve
    await this.declareResolved(incident)
    
    // 6. Follow-up
    await this.schedulePostMortem(incident)
    await this.trackActionItems(incident)
  }
  
  private async assessSeverity(incident: Incident) {
    // SEV1: Critical user impact, all hands on deck
    // SEV2: Significant impact, page on-call
    // SEV3: Minor impact, can wait for business hours
    
    const severity = this.calculateSeverity({
      userImpact: incident.impactedServices.length,
      revenueImpact: await this.estimateRevenueLoss(incident),
      dataLoss: incident.dataLoss
    })
    
    incident.severity = severity
  }
}
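
The `calculateSeverity` call above is left abstract; one possible mapping from impact to severity, with illustrative thresholds:

// Hypothetical severity mapping based on the SEV1-SEV3 criteria in the comments above.
function calculateSeverity(impact: {
  userImpact: number      // number of impacted services
  revenueImpact: number   // estimated loss in dollars
  dataLoss?: boolean
}): 'sev1' | 'sev2' | 'sev3' {
  if (impact.dataLoss || impact.revenueImpact > 10_000) return 'sev1' // critical: all hands
  if (impact.userImpact > 1 || impact.revenueImpact > 0) return 'sev2' // significant: page on-call
  return 'sev3' // minor: can wait for business hours
}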

Capacity Planning

Demand Forecasting

interface CapacityPlan {
  service: string
  currentCapacity: number
  projectedDemand: number
  timeHorizon: string
  recommendations: Action[]
}

class CapacityPlanner {
  async forecastDemand(service: string): Promise<CapacityPlan> {
    // Historical data
    const historicalGrowth = await this.getGrowthRate(service)
    
    // Current usage
    const currentUsage = await this.getCurrentUsage(service)
    
    // Project future demand
    const projection = this.projectDemand(
      currentUsage,
      historicalGrowth,
      { months: 6 }
    )
    
    // Calculate headroom
    const currentCapacity = await this.getCapacity(service)
    const utilizationTarget = 0.70 // 70% utilization target
    
    if (projection.peak > currentCapacity * utilizationTarget) {
      return {
        service,
        currentCapacity,
        projectedDemand: projection.peak,
        timeHorizon: '6 months',
        recommendations: [
          'Scale horizontally: Add 3 more instances',
          'Estimated cost: $500/month',
          'Implementation timeline: 2 weeks'
        ]
      }
    }

    // Enough headroom: no scaling action needed
    return {
      service,
      currentCapacity,
      projectedDemand: projection.peak,
      timeHorizon: '6 months',
      recommendations: []
    }
  }
  
  private projectDemand(
    current: number,
    growthRate: number,
    horizon: { months: number }
  ) {
    // Simple exponential growth model
    const projected = current * Math.pow(1 + growthRate, horizon.months)
    
    // Add seasonal variations
    const seasonal = this.getSeasonalFactors(horizon.months)
    
    return {
      average: projected,
      peak: projected * seasonal.peakFactor,
      low: projected * seasonal.lowFactor
    }
  }
}
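
As a worked example of the growth model (numbers are illustrative): 500 rps today at 5% monthly growth projects to roughly 670 rps in six months, before seasonal peaks.

// 500 rps today, 5% monthly growth, 6-month horizon
const projected = 500 * Math.pow(1 + 0.05, 6)
console.log(Math.round(projected)) // ~670 rps, before applying seasonal peak factors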

Load Testing

// k6 load test script
import http from 'k6/http'
import { check, sleep } from 'k6'
import { Rate } from 'k6/metrics'

const errorRate = new Rate('errors')

export const options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp up
    { duration: '5m', target: 100 }, // Stay at 100 users
    { duration: '2m', target: 200 }, // Ramp to 200
    { duration: '5m', target: 200 }, // Stay at 200
    { duration: '2m', target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% under 500ms
    errors: ['rate<0.1'],             // Error rate < 10%
  }
}

export default function() {
  const res = http.get('https://api.example.com/users')
  
  const success = check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500
  })
  
  errorRate.add(!success)
  sleep(1)
}

Reliability Patterns

Circuit Breaker

class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED'
  private failures = 0
  private lastFailureTime = 0
  private threshold = 5
  private timeout = 60000 // 1 minute
  
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime > this.timeout) {
        this.state = 'HALF_OPEN'
      } else {
        throw new Error('Circuit breaker is OPEN')
      }
    }
    
    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }
  
  private onSuccess() {
    this.failures = 0
    if (this.state === 'HALF_OPEN') {
      this.state = 'CLOSED'
    }
  }
  
  private onFailure() {
    this.failures++
    this.lastFailureTime = Date.now()
    
    if (this.failures >= this.threshold) {
      this.state = 'OPEN'
    }
  }
}
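
Usage mirrors the retry and bulkhead examples below; the endpoint is illustrative:

// Route calls to a flaky dependency through the breaker
const breaker = new CircuitBreaker()

try {
  const res = await breaker.call(() => fetch('https://api.example.com/users'))
  console.log(res.status)
} catch (error) {
  // Either the call failed or the breaker is OPEN and fails fast
  console.error(error)
}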

Retry with Exponential Backoff

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn()
    } catch (error) {
      if (i === maxRetries - 1) throw error
      
      // Exponential backoff with jitter
      const delay = baseDelay * Math.pow(2, i)
      const jitter = Math.random() * 1000
      
      await new Promise(resolve => setTimeout(resolve, delay + jitter))
    }
  }
  
  throw new Error('Max retries exceeded')
}

// Usage
const data = await retryWithBackoff(() => 
  fetch('https://api.example.com/data')
)

Bulkhead Pattern

// Isolate resources to prevent cascading failures
class Bulkhead {
  private pools: Map<string, ResourcePool> = new Map()
  
  createPool(name: string, maxSize: number) {
    this.pools.set(name, new ResourcePool(maxSize))
  }
  
  async execute<T>(
    poolName: string,
    fn: () => Promise<T>
  ): Promise<T> {
    const pool = this.pools.get(poolName)
    if (!pool) throw new Error(`Pool ${poolName} not found`)
    
    return pool.execute(fn)
  }
}

class ResourcePool {
  private active = 0
  
  constructor(private maxSize: number) {}
  
  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxSize) {
      throw new Error('Pool exhausted')
    }
    
    this.active++
    try {
      return await fn()
    } finally {
      this.active--
    }
  }
}

// Usage: Separate pools for different operations
const bulkhead = new Bulkhead()
bulkhead.createPool('database', 10)
bulkhead.createPool('external-api', 5)

await bulkhead.execute('database', () => db.query('...'))
await bulkhead.execute('external-api', () => fetch('...'))

On-Call Practices

On-Call Rotation

interface OnCallSchedule {
  primary: string
  secondary: string
  startTime: Date
  endTime: Date
  escalationPolicy: EscalationPolicy
}

interface EscalationPolicy {
  levels: Array<{
    waitMinutes: number
    notify: string[]
  }>
}

const schedule: OnCallSchedule = {
  primary: 'alice@example.com',
  secondary: 'bob@example.com',
  startTime: new Date('2024-01-01T00:00:00Z'),
  endTime: new Date('2024-01-08T00:00:00Z'),
  escalationPolicy: {
    levels: [
      { waitMinutes: 5, notify: ['primary'] },
      { waitMinutes: 10, notify: ['secondary'] },
      { waitMinutes: 15, notify: ['manager'] },
      { waitMinutes: 30, notify: ['director'] }
    ]
  }
}
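
A paging system walks these levels until someone acknowledges; a minimal sketch, treating `waitMinutes` as the time to wait for an acknowledgement before escalating further (the `notify` and `isAcknowledged` callbacks are hypothetical):

// Hypothetical escalation loop: notify each level, wait, stop on acknowledgement.
async function escalate(
  policy: EscalationPolicy,
  notify: (targets: string[]) => Promise<void>,
  isAcknowledged: () => Promise<boolean>
) {
  for (const level of policy.levels) {
    await notify(level.notify)
    await new Promise(resolve => setTimeout(resolve, level.waitMinutes * 60_000))
    if (await isAcknowledged()) return
  }
  throw new Error('Incident not acknowledged after final escalation level')
}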

Runbooks

# Runbook: High API Error Rate

## Alert
`error_rate > 1% for 5 minutes`

## Impact
- Users unable to complete purchases
- Revenue loss: ~$1000/minute

## Investigation Steps

1. Check error dashboard

https://grafana.example.com/d/api-errors


2. Identify error type

kubectl logs -l app=api --tail=100 | grep ERROR

3. Check recent deployments

kubectl rollout history deployment/api

## Common Causes

1. Database Connection Pool Exhausted

Symptoms: Timeout errors, connection refused

Fix:

# Scale up connection pool
kubectl set env deployment/api DB_POOL_SIZE=50

2. Downstream Service Failure

Symptoms: 503 errors, gateway timeout

Fix:

# Enable circuit breaker
kubectl set env deployment/api CIRCUIT_BREAKER=true

3. Bad Deployment

Symptoms: Errors started after deployment

Fix:

# Rollback to previous version
kubectl rollout undo deployment/api

## Escalation

If not resolved in 15 minutes, escalate to #incident-critical

## Post-Incident

Schedule post-mortem within 48 hours


Chaos Engineering

// Chaos experiment framework
interface ChaosExperiment {
  name: string
  hypothesis: string
  method: () => Promise<void>
  rollback: () => Promise<void>
  steadyState: () => Promise<boolean>
}

const latencyExperiment: ChaosExperiment = {
  name: 'API Latency Injection',
  
  hypothesis: 'System remains stable with 200ms added latency',
  
  async method() {
    // Inject latency
    await fetch('http://chaos-mesh/inject-latency', {
      method: 'POST',
      body: JSON.stringify({
        target: 'api-service',
        latency: '200ms',
        percentage: 50
      })
    })
  },
  
  async rollback() {
    // Remove latency injection
    await fetch('http://chaos-mesh/remove-latency', {
      method: 'POST'
    })
  },
  
  async steadyState() {
    // Check if system is healthy
    const metrics = await fetch('http://metrics/health')
    const data = await metrics.json()
    
    return (
      data.errorRate < 0.01 &&
      data.p95Latency < 500
    )
  }
}

// Run experiment
async function runChaosExperiment(experiment: ChaosExperiment) {
  // 1. Verify steady state
  const initialState = await experiment.steadyState()
  if (!initialState) throw new Error('System not in steady state')
  
  try {
    // 2. Inject failure
    await experiment.method()
    
    // 3. Monitor system
    await new Promise(resolve => setTimeout(resolve, 60000)) // 1 minute
    
    // 4. Verify hypothesis
    const finalState = await experiment.steadyState()
    
    if (finalState) {
      console.log('✅ Hypothesis validated')
    } else {
      console.log('❌ Hypothesis rejected - system degraded')
    }
  } finally {
    // 5. Always rollback
    await experiment.rollback()
  }
}
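
Running the experiment above is then a single call:

// Validate the latency-injection hypothesis (ideally in a staging environment first)
await runChaosExperiment(latencyExperiment)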

SRE Metrics

interface SREMetrics {
  // Availability
  uptime: number // percentage
  
  // Performance
  p50Latency: number
  p95Latency: number
  p99Latency: number
  
  // Reliability
  mtbf: number // Mean Time Between Failures
  mttr: number // Mean Time To Recovery
  
  // Toil
  toilPercentage: number
  
  // Error budget
  errorBudgetRemaining: number
}

class SREDashboard {
  async generateReport(): Promise<SREMetrics> {
    return {
      uptime: await this.calculateUptime(),
      p50Latency: await this.getLatency(0.50),
      p95Latency: await this.getLatency(0.95),
      p99Latency: await this.getLatency(0.99),
      mtbf: await this.calculateMTBF(),
      mttr: await this.calculateMTTR(),
      toilPercentage: await this.calculateToil(),
      errorBudgetRemaining: await this.getRemainingBudget()
    }
  }
}
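
The MTBF and MTTR calculations above are left abstract; a minimal sketch over a list of outage records (the record shape is an assumption):

// Hypothetical outage records: when each outage started and when it was resolved.
interface OutageRecord {
  startedAt: Date
  resolvedAt: Date
}

// MTTR: average time from detection to recovery, in milliseconds
function meanTimeToRecovery(outages: OutageRecord[]): number {
  const downtime = outages.reduce(
    (sum, o) => sum + (o.resolvedAt.getTime() - o.startedAt.getTime()), 0)
  return outages.length ? downtime / outages.length : 0
}

// MTBF: observed period minus downtime, divided by the number of failures
function meanTimeBetweenFailures(outages: OutageRecord[], periodMs: number): number {
  const downtime = outages.reduce(
    (sum, o) => sum + (o.resolvedAt.getTime() - o.startedAt.getTime()), 0)
  return outages.length ? (periodMs - downtime) / outages.length : periodMs
}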

Best Practices

Do

✅ Define clear SLOs and SLIs
✅ Maintain error budgets
✅ Automate toil away
✅ Write comprehensive runbooks
✅ Practice incident response
✅ Conduct blameless post-mortems
✅ Invest in monitoring and observability
✅ Plan for capacity

Don't

❌ Ignore error budgets
❌ Alert on everything
❌ Skip runbook documentation
❌ Blame individuals for incidents
❌ Plan capacity only reactively
❌ Let manual toil exceed 50% of time
❌ Deploy without rollback plans
❌ Ignore SLO violations

Tools

  • Monitoring: Prometheus, Grafana, Datadog
  • Alerting: PagerDuty, Opsgenie, AlertManager
  • Incident Management: Incident.io, Jira
  • Chaos Engineering: Chaos Mesh, Gremlin
  • Load Testing: k6, Gatling, Locust
  • Observability: OpenTelemetry, Jaeger, Zipkin