
A/B Testing

Experiment-driven product development through controlled testing

A/B testing is a method of comparing two versions of a product to determine which performs better based on measurable outcomes.

Core Concepts

What is A/B Testing?

A/B testing (split testing) randomly shows users different versions of a feature to measure which performs better against a goal metric.

Key Components (see the sketch after this list):

  • Control (A): Current version
  • Variant (B): New version
  • Metric: What you're measuring
  • Sample Size: Number of users needed
  • Statistical Significance: Confidence in results
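
One way to capture these pieces in code is sketched below; the field names are illustrative, not tied to any particular tool.

// Minimal shape tying the key components of an experiment together
interface Experiment {
  id: string
  hypothesis: string
  control: string             // identifier of the current version (A)
  variant: string             // identifier of the new version (B)
  primaryMetric: string       // the metric the experiment is judged on
  minimumSampleSize: number   // users needed per variant
  significanceLevel: number   // e.g. 0.05 for 95% confidence
}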

Experiment Design

1. Define Hypothesis

## Hypothesis Template

**Current Situation:** 
Users abandon checkout at a 45% rate

**Proposed Change:**
Add trust badges to checkout page

**Expected Outcome:**
Reduce abandonment by 10%

**Success Metric:**
Checkout completion rate

2. Choose Metrics

interface ExperimentMetrics {
  // Primary metric (one only)
  primary: {
    name: 'conversion_rate'
    target: 0.15 // 15% improvement
  }
  
  // Secondary metrics
  secondary: [
    { name: 'average_order_value', target: null },
    { name: 'time_to_checkout', target: null }
  ]
  
  // Guardrail metrics (shouldn't decrease)
  guardrails: [
    { name: 'page_load_time', threshold: 2000 }, // ms
    { name: 'error_rate', threshold: 0.01 } // 1%
  ]
}

3. Calculate Sample Size

function calculateSampleSize(
  baselineRate: number,
  minimumDetectableEffect: number,
  significance: number = 0.05,
  power: number = 0.8
): number {
  // Simplified calculation: the z-scores below are hardcoded for the default
  // arguments (two-sided alpha = 0.05, power = 0.8); other significance/power
  // values would require an inverse-normal lookup
  const p1 = baselineRate
  const p2 = baselineRate * (1 + minimumDetectableEffect)
  
  const z_alpha = 1.96 // z for two-sided alpha = 0.05 (95% confidence)
  const z_beta = 0.84  // z for 80% power
  
  const pooled = (p1 + p2) / 2
  
  const n = Math.pow(
    (z_alpha * Math.sqrt(2 * pooled * (1 - pooled)) + 
     z_beta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))),
    2
  ) / Math.pow(p1 - p2, 2)
  
  return Math.ceil(n)
}

// Example
const sampleSize = calculateSampleSize(
  0.10,  // 10% baseline conversion
  0.15,  // detect a 15% relative improvement (10% → 11.5%)
  0.05,  // 95% confidence
  0.8    // 80% power
)
// Result: ~6,700 users per variant

Implementation

Feature Flag Integration

import { useFeatureFlag } from './flags'

function CheckoutPage() {
  // `user` is assumed to be available from an auth context or hook
  const showTrustBadges = useFeatureFlag('checkout-trust-badges', {
    userId: user.id,
    attributes: {
      country: user.country,
      plan: user.plan
    }
  })
  
  return (
    <div>
      {showTrustBadges ? (
        <CheckoutWithBadges />
      ) : (
        <CheckoutOriginal />
      )}
    </div>
  )
}

Event Tracking

interface ExperimentEvent {
  experimentId: string
  variant: 'control' | 'treatment'
  userId: string
  timestamp: Date
  event: string
  value?: number
}

class ExperimentTracker {
  track(event: ExperimentEvent) {
    // Forward to the app's analytics client (assumed to exist)
    analytics.track('experiment_event', {
      experiment_id: event.experimentId,
      variant: event.variant,
      user_id: event.userId,
      event_name: event.event,
      value: event.value,
      timestamp: event.timestamp
    })
  }
  
  trackConversion(experimentId: string, variant: string, value: number) {
    this.track({
      experimentId,
      variant: variant as 'control' | 'treatment',
      userId: getCurrentUserId(), // assumed app-level helper for the current user
      timestamp: new Date(),
      event: 'conversion',
      value
    })
  }
}

Statistical Analysis

Calculate Results

interface ExperimentResults {
  control: {
    users: number
    conversions: number
    rate: number
  }
  treatment: {
    users: number
    conversions: number
    rate: number
  }
  improvement: number
  pValue: number
  significant: boolean
}

function analyzeExperiment(data: ExperimentData): ExperimentResults {
  const controlRate = data.control.conversions / data.control.users
  const treatmentRate = data.treatment.conversions / data.treatment.users
  
  const improvement = (treatmentRate - controlRate) / controlRate
  
  // Two-sided p-value from a two-proportion z-test
  // (one possible implementation is sketched below)
  const pValue = calculatePValue(data)
  
  return {
    control: {
      users: data.control.users,
      conversions: data.control.conversions,
      rate: controlRate
    },
    treatment: {
      users: data.treatment.users,
      conversions: data.treatment.conversions,
      rate: treatmentRate
    },
    improvement,
    pValue,
    significant: pValue < 0.05
  }
}
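
One way the calculatePValue call above could be implemented is as a two-proportion z-test; the sketch below, including the assumed ExperimentData shape and the polynomial normal-CDF approximation, is one option rather than a prescribed implementation.

// Shape assumed for ExperimentData throughout this page
interface ExperimentData {
  control: { users: number; conversions: number }
  treatment: { users: number; conversions: number }
}

// Two-proportion z-test: pooled standard error under the null hypothesis,
// then a two-sided p-value from the standard normal distribution
function calculatePValue(data: ExperimentData): number {
  const n1 = data.control.users
  const n2 = data.treatment.users
  const p1 = data.control.conversions / n1
  const p2 = data.treatment.conversions / n2
  
  const pooled = (data.control.conversions + data.treatment.conversions) / (n1 + n2)
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
  const z = Math.abs(p2 - p1) / se
  
  return 2 * (1 - normalCdf(z))
}

// Abramowitz & Stegun polynomial approximation of the standard normal CDF
function normalCdf(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z))
  const density = Math.exp(-z * z / 2) / Math.sqrt(2 * Math.PI)
  const poly = t * (0.319381530 + t * (-0.356563782 +
    t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))))
  const upperTail = density * poly
  return z >= 0 ? 1 - upperTail : upperTail
}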

Common Pitfalls

1. Peeking at Results

// ❌ Bad: Check results daily and stop early
if (pValue < 0.05 && daysSinceStart < 7) {
  stopExperiment() // Don't do this!
}

// ✅ Good: Wait for planned duration
if (daysSinceStart >= plannedDuration && sampleSizeReached) {
  const results = analyzeExperiment()
  if (results.significant) {
    rolloutWinner()
  }
}

2. Multiple Testing

// ❌ Bad: Test many metrics without correction
const significantMetrics = metrics.filter(m => m.pValue < 0.05)

// ✅ Good: Bonferroni correction
const adjustedAlpha = 0.05 / metrics.length
const significantMetrics = metrics.filter(m => m.pValue < adjustedAlpha)

Testing Tools

LaunchDarkly Example

import { useEffect } from 'react'
import { useLDClient } from 'launchdarkly-react-client-sdk'

function ProductPage() {
  const ldClient = useLDClient()
  const newLayout = ldClient?.variation('product-page-layout', false)
  
  useEffect(() => {
    if (newLayout) {
      ldClient?.track('product-page-viewed', {
        variant: 'new-layout'
      })
    }
  }, [newLayout, ldClient])
  
  return newLayout ? <NewLayout /> : <OldLayout />
}

Optimizely Example

// The component tree must be wrapped in <OptimizelyProvider> for the hook to work
import { OptimizelyProvider, useExperiment } from '@optimizely/react-sdk'

function CheckoutButton() {
  const [variation, clientReady] = useExperiment('checkout-button-color')
  
  const buttonColor = variation === 'red' ? '#ff0000' : '#0070f3'
  
  return (
    <button style={{ backgroundColor: buttonColor }}>
      Checkout
    </button>
  )
}

Best Practices

Do

✅ Have a clear hypothesis
✅ Calculate sample size beforehand
✅ Run for full business cycle
✅ Test one thing at a time
✅ Wait for statistical significance
✅ Consider seasonality
✅ Document everything
✅ Segment results

Don't

❌ Peek at results early
❌ Stop test too soon
❌ Change test mid-flight
❌ Test too many things
❌ Ignore guardrail metrics
❌ Run without power analysis
❌ Deploy without verification
❌ Forget about novelty effect

Multivariate Testing

interface MultivariateTest {
  factors: {
    buttonColor: ['blue', 'red', 'green']
    buttonText: ['Buy Now', 'Add to Cart', 'Purchase']
    trustBadge: [true, false]
  }
  // Total combinations: 3 × 3 × 2 = 18 variants
}

// Requires a much larger sample size (calculateMVTSampleSize is a hypothetical helper)
const mvtSampleSize = calculateMVTSampleSize({
  variants: 18,
  baselineRate: 0.10,
  effect: 0.15
})

Advanced Techniques

Sequential Testing

// SPRT (Sequential Probability Ratio Test)
function shouldStopExperiment(data: ExperimentData): {
  stop: boolean
  winner?: 'control' | 'treatment'
} {
  const ratio = calculateLikelihoodRatio(data)
  
  if (ratio > upperBound) {
    return { stop: true, winner: 'treatment' }
  }
  
  if (ratio < lowerBound) {
    return { stop: true, winner: 'control' }
  }
  
  return { stop: false }
}
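
A sketch of how the undefined pieces above might be filled in, treating this as a simple SPRT on the treatment arm against the control's observed rate; the Wald boundaries, the log-space formulation, and the default minimum detectable effect are assumptions.

// Wald boundaries for the log-likelihood ratio (alpha = 0.05, power = 0.8)
const ALPHA = 0.05
const BETA = 0.2
const upperBound = Math.log((1 - BETA) / ALPHA)  // crossed upward: accept H1 (treatment wins)
const lowerBound = Math.log(BETA / (1 - ALPHA))  // crossed downward: accept H0 (control wins)

// Log-likelihood ratio of the treatment observations under
// H1: p = baseline * (1 + mde) versus H0: p = baseline
function calculateLikelihoodRatio(data: ExperimentData, mde = 0.15): number {
  const p0 = data.control.conversions / data.control.users
  const p1 = Math.min(p0 * (1 + mde), 0.999)
  const successes = data.treatment.conversions
  const failures = data.treatment.users - successes
  
  return successes * Math.log(p1 / p0) + failures * Math.log((1 - p1) / (1 - p0))
}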

Bayesian A/B Testing

function bayesianAnalysis(data: ExperimentData) {
  // Calculate probability that B beats A
  const probBBeatsA = calculatePosterior(data)
  
  return {
    probabilityOfSuperiority: probBBeatsA,
    expectedLift: calculateExpectedLift(data),
    credibleInterval: calculateCredibleInterval(data)
  }
}
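
A sketch of how calculatePosterior might work under uniform Beta(1, 1) priors, approximating each Beta posterior with a normal distribution; it reuses the normalCdf helper from the analysis section above, and calculateExpectedLift / calculateCredibleInterval are left abstract here.

// Probability that treatment beats control, assuming Beta(1, 1) priors and a
// normal approximation to both posteriors (reasonable for large sample sizes)
function calculatePosterior(data: ExperimentData): number {
  const betaPosterior = (conversions: number, users: number) => {
    const a = 1 + conversions           // posterior alpha
    const b = 1 + users - conversions   // posterior beta
    return {
      mean: a / (a + b),
      variance: (a * b) / ((a + b) ** 2 * (a + b + 1))
    }
  }
  
  const control = betaPosterior(data.control.conversions, data.control.users)
  const treatment = betaPosterior(data.treatment.conversions, data.treatment.users)
  
  // P(p_treatment > p_control) ≈ Phi((mean_t - mean_c) / sqrt(var_t + var_c))
  const z = (treatment.mean - control.mean) /
    Math.sqrt(treatment.variance + control.variance)
  return normalCdf(z)
}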

Tools & Platforms

  • Optimizely: Enterprise A/B testing
  • Google Optimize: Free A/B testing (sunset by Google in 2023)
  • VWO: Conversion optimization
  • LaunchDarkly: Feature flagging + experiments
  • Split.io: Feature delivery platform
  • Statsig: Experimentation platform

Resources

  • "Trustworthy Online Controlled Experiments" - Kohavi et al.
  • Evan Miller's A/B testing tools
  • GrowthBook (open source)