
Case Study: Scaling E-commerce Analytics 40% While Reducing EC2 Costs 36.8%

How we improved resource efficiency from 24% to 61% CPU utilization for a seasonal SaaS platform, enabling growth without proportional infrastructure costs.

  • Monthly AWS Spend: $32,000
  • Cost Reduction: 36.8%
  • Timeline: 2 weeks
  • Published: January 20, 2025

At a Glance

Client Profile

  • Industry: B2B SaaS serving e-commerce businesses
  • Company Stage: Series B, $8M ARR, 85 employees
  • Tech Stack: 120 EC2 instances (m5.xlarge to m5.4xlarge)
  • Timeline: 2-week engagement, November 2024

Key Challenge

Infrastructure over-provisioned for Black Friday peak traffic was running at 24% CPU utilization for 10 months of the year, draining runway unnecessarily.

Primary Pain Point: CFO concerned about burn rate while CTO worried about maintaining sub-200ms p95 API latency during seasonal spikes.

  • Monthly cost reduction: 36.8% ($11,800 → $7,460/month on EC2)
  • Efficiency improvement: 58% (average CPU utilization 24% → 61%)
  • Traffic growth enabled: 40%, without infrastructure expansion

The Situation

This e-commerce analytics platform provides real-time inventory insights and demand forecasting for mid-market retailers. Following their Series B funding round, they were experiencing 15% month-over-month customer growth but burning through runway 30% faster than projected.

The engineering team had scaled infrastructure aggressively for Black Friday 2023, successfully handling a 12x traffic spike with zero downtime. In the year since, however, that over-provisioned fleet had sat mostly idle while still costing $11,800 per month.

Their CFO reached out after AWS became their second-largest operational expense after payroll. The CTO's primary concern: "We can't sacrifice performance. Our SLA guarantees sub-200ms API response times, and we're competing against companies with 10x our resources."

Business Context

  • Revenue Model: Usage-based pricing ($0.02 per API call) with 3,500 paying customers
  • Traffic Pattern: 2-3x daily peaks during business hours, 12x seasonal spikes (Q4)
  • Funding Runway: 14 months remaining at current burn rate
  • Engineering Team: 12 developers, 2 DevOps engineers
  • Competition: Competing with well-funded Series C/D companies spending 3x on infrastructure

Discovery Phase

Week 1: Data Collection & Analysis

We deployed read-only audit access and analyzed 90 days of CloudWatch metrics, Cost Explorer data, and AWS Compute Optimizer recommendations.
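
For reference, pulling those per-instance utilization numbers can be scripted with boto3 along these lines. This is a minimal sketch (region and instance ID are placeholders), not the exact audit tooling used in the engagement.

```python
# Sketch: 90-day average CPU utilization per instance from CloudWatch.
# Region and instance IDs are illustrative placeholders.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(days=90)

def avg_cpu(instance_id: str) -> float:
    """Return the 90-day average CPUUtilization for one EC2 instance."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=86400,              # daily datapoints, stays under the API's 1,440-point limit
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else 0.0

print(avg_cpu("i-0123456789abcdef0"))  # placeholder instance ID
```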

Infrastructure Inventory

| Instance Type | Count | Monthly Cost | Avg CPU |
|---------------|-------|--------------|---------|
| m5.xlarge     | 85    | $8,840       | 22%     |
| m5.2xlarge    | 30    | $2,520       | 28%     |
| m5.4xlarge    | 5     | $440         | 18%     |
| Total         | 120   | $11,800      | 24%     |

Key Findings

  • CPU Utilization: Average 24%, peak 38% (outside Black Friday week)
  • Memory Utilization: Average 41%, never exceeded 62%
  • Architecture: Static 120 instances, no autoscaling configured
  • Commitment: Zero Reserved Instances or Savings Plans (100% On-Demand)
  • Performance: p95 API latency consistently at 145ms (well within SLA)

Traffic Pattern Analysis

Analyzed 90 days of CloudWatch metrics to understand traffic patterns:

  • Baseline Traffic: 40-50 instances needed (33% of current capacity)
  • Daily Peak (9am-5pm EST): 80-90 instances needed (75% of current capacity)
  • Quarterly Peak (Black Friday): 120+ instances needed (100%+ of current capacity)
  • Off-hours (11pm-6am EST): 30-35 instances sufficient (25% of current capacity)

Key Insight: Static provisioning for peak meant 75% of capacity sat idle most of the time. Autoscaling could reduce baseline to 40 instances while maintaining peak capacity.
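
To make that arithmetic concrete, here is a back-of-the-envelope sizing heuristic. It is a simplification that assumes a CPU-bound workload spread evenly across the fleet; the actual baseline also accounted for memory and headroom.

```python
import math

def instances_needed(current_count: int, avg_cpu_pct: float, target_cpu_pct: float = 60.0) -> int:
    """Estimate the fleet size that carries the same CPU load at a target utilization."""
    return math.ceil(current_count * avg_cpu_pct / target_cpu_pct)

# Using the 90-day figures above: 120 instances at 24% average CPU,
# peaking at 38% outside of Q4.
print(instances_needed(120, 24))   # -> 48, in line with the 40-50 instance baseline
print(instances_needed(120, 38))   # -> 76, near the 80-90 instance daily peak
```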

Unexpected Discovery: Compute Optimizer recommended Graviton-based m6g instances with 48% better price-performance. However, one critical legacy service had an x86-only dependency we'd need to address.

The Challenge: Graviton Migration Rollback

Our initial optimization plan targeted aggressive right-sizing plus migration to Graviton-based instances. On paper, this would deliver 43% savings. Reality had other plans.

What Happened

Day 3 (Tuesday 14:00): Migrated 40 web tier instances to m6g.large without issues. Monitoring showed improved performance and a 35% cost reduction on this subset.

Day 4 (Wednesday 09:30): Attempted migration of remaining 80 instances, including the analytics processing service. Within 15 minutes, customer dashboards showed data processing delays. The legacy Python data pipeline used NumPy compiled for x86 architecture — it ran on ARM64 (Graviton) but with 4x slower performance.

Day 4 (Wednesday 10:15): Immediate rollback decision. Reverted 12 instances back to m5.xlarge within 45 minutes. Customer impact: 30-minute data delay, no SLA breach, proactive notification sent.

Week 2: Rebuilt the data pipeline container with ARM-optimized NumPy, tested in staging, successfully migrated those 12 instances on Day 11.

Root Cause Analysis

Why did this happen despite Compute Optimizer's recommendation?

  • The Issue: A NumPy binary compiled for x86 (Intel/AMD) was running on ARM64 (Graviton) through an emulation layer
  • Performance Impact: Matrix operations (core to their analytics) were 4x slower due to emulation overhead
  • The Fix: Rebuilt the container with ARM-native NumPy from PyPI (pip install numpy on an ARM instance pulls the aarch64 wheel); a quick verification sketch follows this list
  • Post-Fix Performance: ARM-native NumPy actually 18% faster than x86 version on Graviton
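
A quick check we could run inside the rebuilt container to confirm it is native aarch64 NumPy rather than an emulated x86 build might look like this (the matrix size is an arbitrary illustration value):

```python
# Sanity check for ARM-native NumPy on a Graviton instance.
import platform
import time
import numpy as np

print("machine:", platform.machine())   # expect 'aarch64' on Graviton, not an emulated 'x86_64'
print("numpy:", np.__version__)

# Crude matrix-multiply benchmark: an emulated x86 build is dramatically slower here.
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
t0 = time.perf_counter()
np.dot(a, b)
print(f"2000x2000 matmul: {time.perf_counter() - t0:.2f}s")
```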

Lesson Learned: "Compute Optimizer says it will work" ≠ "It will work for your specific workload." Always validate architecture-specific dependencies in staging first, even when AWS tools recommend the change.

Implementation Approach

Phase 1: Autoscaling Configuration (Days 1-3)

Application Load Balancer Setup

  • Target Groups: Created separate target groups for API (port 8080) and Admin (port 8081) services
  • Health Checks: Configured /health endpoint checks every 10 seconds, marking targets healthy after 2 consecutive successes (configuration sketch after this list)
  • Deregistration Delay: Set to 60 seconds to allow in-flight requests to complete during scale-down
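
A minimal boto3 sketch of the health check and deregistration settings above; the target group ARN is a placeholder, and this assumes the target groups already exist.

```python
# Apply the health check and connection draining settings to an existing target group.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")
tg_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/api/abc123"  # placeholder

# /health checked every 10 seconds; healthy after 2 consecutive successes
elbv2.modify_target_group(
    TargetGroupArn=tg_arn,
    HealthCheckPath="/health",
    HealthCheckIntervalSeconds=10,
    HealthyThresholdCount=2,
)

# Allow 60 seconds for in-flight requests to complete before deregistration
elbv2.modify_target_group_attributes(
    TargetGroupArn=tg_arn,
    Attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "60"}],
)
```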

Auto Scaling Group Configuration

  • Baseline Capacity: Min 40 instances, desired 40, max 120
  • Target Tracking: Maintain 60% average CPU utilization (see the policy sketch after this list)
  • Cooldown Period: 300 seconds scale-up, 600 seconds scale-down (prevent thrashing)
  • Health Check Grace Period: 180 seconds to allow instance initialization
  • Termination Policy: OldestInstance + ClosestToNextInstanceHour (cost optimization)
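
A minimal boto3 sketch of the target tracking policy above; the ASG and policy names are placeholders.

```python
# Target-tracking policy that keeps average fleet CPU near 60%.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-tier-asg",   # placeholder ASG name
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
    EstimatedInstanceWarmup=180,  # aligned with the 180-second health check grace period
)
```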

Scheduled Scaling for Dev/Staging

  • Business Hours (7am-7pm EST): Min 10 instances (scheduled actions sketched after this list)
  • Off-Hours (7pm-7am EST): Min 2 instances
  • Weekends: Min 1 instance (monitoring only)
  • Monthly Savings: ~$800 on non-production environments
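
The weekday schedule above maps to two scheduled actions per environment; a sketch with placeholder ASG and action names:

```python
# Scheduled scaling for a non-production ASG; names are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Scale staging up for business hours (7am local, weekdays)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="staging-asg",
    ScheduledActionName="business-hours-up",
    Recurrence="0 7 * * MON-FRI",
    TimeZone="America/New_York",
    MinSize=10,
    DesiredCapacity=10,
)

# Scale staging down for off-hours (7pm local, weekdays)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="staging-asg",
    ScheduledActionName="off-hours-down",
    Recurrence="0 19 * * MON-FRI",
    TimeZone="America/New_York",
    MinSize=2,
    DesiredCapacity=2,
)
```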

Phase 2: Instance Migration (Days 4-8)

Graviton Migration Strategy

Phased approach to minimize risk:

  • Day 3-4: Migrated 40 web tier instances to m6g.large (success)
  • Day 4: Attempted migration of 80 remaining instances (rollback required for 12 instances)
  • Days 5-7: Rebuilt data pipeline container with ARM-native dependencies
  • Day 8: Load tested rebuilt container in staging (passed with 18% performance improvement)
  • Day 11: Successfully migrated final 12 instances to m6g.xlarge

Final Instance Configuration

| Service            | Instance Type | Count (Baseline)  | Monthly Cost |
|--------------------|---------------|-------------------|--------------|
| Web Tier           | m6g.large     | 28 (scales 28-90) | $1,680       |
| API Tier           | m6g.large     | 34 (scales 34-60) | $2,040       |
| Analytics Pipeline | m6g.xlarge    | 12 (static)       | $1,440       |
| Legacy Service     | m5.2xlarge    | 6 (static)        | $2,300       |
| Total (Baseline)   |               | 40-120 (ASG)      | $7,460       |

Phase 3: Monitoring & Validation (Days 9-14)

CloudWatch Dashboards

  • Cost Tracking: Daily EC2 spend, month-to-date tracking vs. forecast
  • Scaling Metrics: ASG desired/running/pending capacity, scaling activity timeline
  • Performance: API latency (p50/p95/p99), error rates, throughput
  • Resource Utilization: CPU/memory by instance type, network I/O

Load Testing Results

Simulated seasonal peak traffic at up to 8x baseline using Apache JMeter:

  • Baseline Load: 2,000 requests/second → p95 latency 148ms
  • 2x Load: 4,000 requests/second → p95 latency 152ms (ASG scaled to 62 instances)
  • 5x Load: 10,000 requests/second → p95 latency 165ms (ASG scaled to 95 instances)
  • 8x Load: 16,000 requests/second → p95 latency 178ms (ASG scaled to 118 instances)

Result: ASG successfully maintained sub-200ms p95 latency at 8x traffic. Well within 200ms SLA target.
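
For reference, the percentile figures above were reduced from per-request latency samples; a minimal sketch of that summarization, assuming a JMeter CSV export with the default elapsed column (the file name is hypothetical):

```python
# Summarize per-request latencies (ms) into p50/p95/p99.
import csv
import numpy as np

with open("results.csv", newline="") as f:                 # hypothetical JMeter CSV export
    latencies_ms = [float(row["elapsed"]) for row in csv.DictReader(f)]

for p in (50, 95, 99):
    print(f"p{p}: {np.percentile(latencies_ms, p):.0f}ms")
```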

Runbook Documentation

Created operational playbooks for seasonal events:

  • Pre-Event Preparation: Increase max capacity, validate health checks, warm up cache
  • During Event: Monitor scaling activity, track error rates, manual override procedures
  • Post-Event: Scale-down monitoring, cost analysis, incident retrospective

Results in Detail

Cost Savings Breakdown

Before Optimization

  • EC2 fleet: 120 m5 instances (static), $11,800/mo
  • Average CPU utilization: 24%
  • Commitment discounts: 0%
  • Autoscaling: none

After Optimization

  • EC2 fleet: 40-120 m6g instances (ASG), $7,460/mo
  • Average CPU utilization: 61%
  • Graviton price advantage: 20%
  • Autoscaling: active
  • Savings: $4,340/month, $52,080/year (a 36.8% cost reduction)

Performance Impact

| Metric                  | Before | After  | Change |
|-------------------------|--------|--------|--------|
| Average CPU Utilization | 24%    | 61%    | +154%  |
| p50 API Latency         | 82ms   | 78ms   | -5%    |
| p95 API Latency         | 145ms  | 148ms  | +2%    |
| p99 API Latency         | 198ms  | 192ms  | -3%    |
| Error Rate              | 0.03%  | 0.02%  | -33%   |
| Availability (uptime)   | 99.97% | 99.98% | +0.01% |

Business Value

Immediate Financial Impact

  • $52,080 annual savings = 4.2 months additional runway at current burn rate
  • Funds 1.3 additional mid-level engineering FTEs
  • Reduced AWS from 37% to 23% of operational expenses
  • Zero upfront capital investment required

Growth Enablement

  • Traffic Capacity: Can now handle 40% traffic growth with only 15% cost increase (vs. 40% under old static model)
  • Seasonal Readiness: Autoscaling proven to handle 8x traffic spikes without over-provisioning year-round
  • Customer Acquisition: Lower infrastructure costs improve unit economics, enabling more aggressive customer acquisition

Operational Improvements

  • DevOps Efficiency: Eliminated manual instance management, freeing 8 hours/week
  • Incident Response: Documented rollback procedures reduced MTTR by 60%
  • Monitoring: CloudWatch dashboards provide real-time cost visibility
  • Documentation: Seasonal event playbooks enable junior engineers to manage scaling

Lessons Learned

✓ What Worked

  • Phased Migration: Starting with 40 instances caught the NumPy issue before full rollout
  • Read-Only Audit: Building trust with engineering team from day 1 enabled fast approvals
  • Graviton Price-Performance: 20% cost reduction + 18% performance improvement on data pipeline
  • Autoscaling Strategy: Delivered both cost savings AND growth capacity
  • Load Testing: Validated 8x capacity before Black Friday, giving CTO confidence

✗ What Didn't Work

  • Staging Validation: Should have tested data pipeline in staging with ARM before production
  • Migration Batch Size: 80 instances was too aggressive; 20-instance batches better
  • Customer Communication: Should have set expectations up front about the potential 15-minute read-only window during migration
  • Dependency Analysis: Relied too heavily on Compute Optimizer without validating architecture dependencies

Key Takeaways

  • Compute Optimizer is a starting point, not gospel: Always validate recommendations in staging, especially for architecture changes
  • Autoscaling = Cost + Capacity: Don't frame it as just cost savings; it also enables growth without infrastructure expansion
  • NumPy and ARM: Python scientific libraries often have x86-compiled binaries; rebuild containers with ARM-native packages
  • Load testing builds confidence: CTO was initially skeptical of autoscaling; 8x load test results were decisive
  • Document rollback procedures BEFORE changes: Our 45-minute rollback was only possible because we had documented procedures

Applicability to Similar Scenarios

This approach works best for:

  • Seasonal/variable workloads where traffic patterns are predictable but infrastructure is static
  • SaaS companies between seed and Series B where every dollar of runway matters
  • Over-provisioned environments with CPU utilization below 40%
  • Teams willing to invest 2-3 days in load testing and monitoring setup
  • Modern architectures already using containerization or easily containerizable workloads

Not recommended for: Compliance-heavy workloads requiring instance-level certification, monolithic applications with hard dependencies on specific instance families, or teams without staging environments.

Similar Challenge?

If your AWS infrastructure is over-provisioned for peak traffic but idle most of the time, we can help you optimize for both cost and performance.

Schedule a Free Assessment

2-week engagement • Read-only audit • Reversible changes • No commitment