Case Study: Scaling E-commerce Analytics 40% While Reducing EC2 Costs 36.8%
How we improved resource efficiency from 24% to 61% CPU utilization for a seasonal SaaS platform, enabling growth without proportional infrastructure costs.
At a Glance
Client Profile
- Industry: B2B SaaS serving e-commerce businesses
- Company Stage: Series B, $8M ARR, 85 employees
- Tech Stack: 120 EC2 instances (m5.xlarge to m5.4xlarge)
- Timeline: 2-week engagement, November 2024
Key Challenge
Infrastructure over-provisioned for Black Friday peak traffic was running at 24% CPU utilization for 10 months of the year, draining runway unnecessarily.
Primary Pain Point: CFO concerned about burn rate while CTO worried about maintaining sub-200ms p95 API latency during seasonal spikes.
The Situation
This e-commerce analytics platform provides real-time inventory insights and demand forecasting for mid-market retailers. Following their Series B funding round, they were experiencing 15% month-over-month customer growth but burning through runway 30% faster than projected.
The engineering team had scaled infrastructure aggressively for Black Friday 2023, successfully handling a 12x traffic spike with zero downtime. However, in the months since, that over-provisioned fleet sat mostly idle while still costing $11,800 per month.
Their CFO reached out after seeing AWS as their second-largest operational expense after payroll. The CTO's primary concern: "We can't sacrifice performance. Our SLA guarantees sub-200ms API response times, and we're competing against companies with 10x our resources."
Business Context
- Revenue Model: Usage-based pricing ($0.02 per API call) with 3,500 paying customers
- Traffic Pattern: 2-3x daily peaks during business hours, 12x seasonal spikes (Q4)
- Funding Runway: 14 months remaining at current burn rate
- Engineering Team: 12 developers, 2 DevOps engineers
- Competition: Competing against well-funded Series C/D companies spending 3x as much on infrastructure
Discovery Phase
Week 1: Data Collection & Analysis
We deployed read-only audit access and analyzed 90 days of CloudWatch metrics, Cost Explorer data, and AWS Compute Optimizer recommendations.
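For reference, the analysis behind those findings rested on read-only boto3 calls along these lines; the instance ID, region, and account details below are illustrative placeholders, not the client's actual values.

```python
import datetime
import boto3

REGION = "us-east-1"                      # placeholder region
end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(days=90)

# 90 days of CPU utilization for one instance (6-hour datapoints keep us under
# the get_metric_statistics per-call datapoint limit).
cloudwatch = boto3.client("cloudwatch", region_name=REGION)
cpu = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=21600,
    Statistics=["Average", "Maximum"],
)
points = cpu["Datapoints"]
print(f"avg CPU {sum(p['Average'] for p in points) / max(len(points), 1):.1f}%")

# Monthly EC2 spend grouped by instance type from Cost Explorer (also read-only).
ce = boto3.client("ce", region_name="us-east-1")
costs = ce.get_cost_and_usage(
    TimePeriod={"Start": start.strftime("%Y-%m-%d"), "End": end.strftime("%Y-%m-%d")},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE",
                           "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "INSTANCE_TYPE"}],
)
```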
Infrastructure Inventory
| Instance Type | Count | Monthly Cost | Avg CPU |
|---|---|---|---|
| m5.xlarge | 85 | $8,840 | 22% |
| m5.2xlarge | 30 | $2,520 | 28% |
| m5.4xlarge | 5 | $440 | 18% |
| Total | 120 | $11,800 | 24% |
Key Findings
- CPU Utilization: Average 24%, peak 38% (outside Black Friday week)
- Memory Utilization: Average 41%, never exceeded 62%
- Architecture: Static 120 instances, no autoscaling configured
- Commitment: Zero Reserved Instances or Savings Plans (100% On-Demand)
- Performance: p95 API latency consistently at 145ms (well within SLA)
Traffic Pattern Analysis
Analyzed 90 days of CloudWatch metrics to understand traffic patterns:
- Baseline Traffic: 40-50 instances needed (33% of current capacity)
- Daily Peak (9am-5pm EST): 80-90 instances needed (75% of current capacity)
- Quarterly Peak (Black Friday): 120+ instances needed (100%+ of current capacity)
- Off-hours (11pm-6am EST): 30-35 instances sufficient (25% of current capacity)
Key Insight: Static provisioning for peak meant that roughly 70-80 of the 120 instances sat idle outside the daily business-hours peak. Autoscaling could cut the baseline to 40 instances while still reaching full peak capacity on demand.
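The baseline and peak estimates above follow from simple arithmetic on the fleet-wide utilization numbers; a small sketch of that calculation, using the 60% CPU target we later configured for autoscaling:

```python
import math

def instances_needed(current_count: int, observed_avg_cpu: float, target_cpu: float = 60.0) -> int:
    """Estimate the instance count that carries the same aggregate load at a target utilization.

    Assumes load is spread evenly and scales roughly linearly, which held well
    enough for this stateless web/API fleet; it is a sizing heuristic, not a guarantee.
    """
    aggregate_load = current_count * observed_avg_cpu   # total "CPU-percent" of work
    return math.ceil(aggregate_load / target_cpu)

print(instances_needed(120, 24.0))   # fleet average 24% -> ~48 instances at baseline
print(instances_needed(120, 38.0))   # observed peak 38% -> ~76 instances at daily peak
```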
Unexpected Discovery: Compute Optimizer recommended Graviton-based m6g instances with 48% better price-performance. However, one critical legacy service had an x86-only dependency we'd need to address.
The Challenge: Graviton Migration Rollback
Our initial optimization plan targeted aggressive right-sizing plus migration to Graviton-based instances. On paper, this would deliver 43% savings. Reality had other plans.
What Happened
Day 3 (Tuesday 14:00): Migrated 40 web tier instances to m6g.large without issues. Monitoring showed improved performance and a 35% cost reduction on this subset.
Day 4 (Wednesday 09:30): Attempted migration of remaining 80 instances, including the analytics processing service. Within 15 minutes, customer dashboards showed data processing delays. The legacy Python data pipeline used NumPy compiled for x86 architecture — it ran on ARM64 (Graviton) but with 4x slower performance.
Day 4 (Wednesday 10:15): Immediate rollback decision. Reverted 12 instances back to m5.xlarge within 45 minutes. Customer impact: 30-minute data delay, no SLA breach, proactive notification sent.
Week 2: Rebuilt the data pipeline container with ARM-optimized NumPy, tested in staging, successfully migrated those 12 instances on Day 11.
Root Cause Analysis
Why did this happen despite Compute Optimizer's recommendation?
- The Issue: A NumPy binary compiled for x86 (Intel/AMD) ran on ARM64 (Graviton) through an emulation layer
- Performance Impact: Matrix operations (core to their analytics) were 4x slower due to emulation overhead
- The Fix: Rebuilt the container with ARM-native NumPy from PyPI (pip install numpy on an ARM instance); a validation sketch follows this list
- Post-Fix Performance: ARM-native NumPy actually 18% faster than x86 version on Graviton
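The check that would have caught this is cheap to run on any Graviton box before cutting traffic over. A minimal sketch; the matrix size and loop count are arbitrary illustrations, not the client's actual benchmark:

```python
import platform
import time

import numpy as np

# On a Graviton instance this should print "aarch64"; seeing "x86_64" means the
# container or interpreter is running under emulation, which is what bit us.
print("machine:", platform.machine())

# NumPy can report its build configuration, including which BLAS it links against.
np.show_config()

# Time a representative matrix multiply; under emulation the client's pipeline
# ran these roughly 4x slower than the ARM-native build.
a = np.random.rand(512, 512)
b = np.random.rand(512, 512)
start = time.perf_counter()
for _ in range(20):
    a @ b
print(f"20 matmuls: {time.perf_counter() - start:.2f}s")
```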
Lesson Learned: "Compute Optimizer says it will work" ≠ "It will work for your specific workload." Always validate architecture-specific dependencies in staging first, even when AWS tools recommend the change.
Implementation Approach
Phase 1: Autoscaling Configuration (Days 1-3)
Application Load Balancer Setup
- Target Groups: Created separate target groups for API (port 8080) and Admin (port 8081) services
- Health Checks: Configured /health endpoint checks every 10 seconds with a healthy threshold of 2 consecutive successes (see the sketch after this list)
- Deregistration Delay: Set to 60 seconds to allow in-flight requests to complete during scale-down
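For reference, a minimal boto3 sketch of those health-check and drain settings; the target group name, VPC ID, and region are placeholders.

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")   # placeholder region

# API target group with the /health check described above.
resp = elbv2.create_target_group(
    Name="api-tg",                        # placeholder name
    Protocol="HTTP",
    Port=8080,
    VpcId="vpc-0123456789abcdef0",        # placeholder VPC
    TargetType="instance",
    HealthCheckPath="/health",
    HealthCheckIntervalSeconds=10,
    HealthyThresholdCount=2,
)
tg_arn = resp["TargetGroups"][0]["TargetGroupArn"]

# Give in-flight requests 60 seconds to drain before deregistration completes.
elbv2.modify_target_group_attributes(
    TargetGroupArn=tg_arn,
    Attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "60"}],
)
```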
Auto Scaling Group Configuration
- Baseline Capacity: Min 40 instances, desired 40, max 120
- Target Tracking: Maintain 60% average CPU utilization
- Cooldown Period: 300 seconds scale-up, 600 seconds scale-down (prevent thrashing)
- Health Check Grace Period: 180 seconds to allow instance initialization
- Termination Policy: OldestInstance + ClosestToNextInstanceHour (cost optimization); a sketch of this configuration follows
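The same settings expressed as a condensed boto3 sketch; the launch template, subnets, target group ARN, and names are placeholders, and the values mirror the bullets above.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")   # placeholder region

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="api-asg",                                          # placeholder name
    LaunchTemplate={"LaunchTemplateName": "api-m6g", "Version": "$Latest"},  # placeholder
    MinSize=40,
    MaxSize=120,
    DesiredCapacity=40,
    HealthCheckType="ELB",
    HealthCheckGracePeriod=180,
    DefaultCooldown=300,
    TerminationPolicies=["OldestInstance", "ClosestToNextInstanceHour"],
    VPCZoneIdentifier="subnet-aaaaaaaa,subnet-bbbbbbbb",                     # placeholder subnets
    TargetGroupARNs=["arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/api-tg/abc123"],  # placeholder
)

# Target tracking keeps fleet-average CPU near 60%; the warmup keeps new
# instances out of the aggregate metric until they are ready to serve.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-asg",
    PolicyName="cpu-60-target",
    PolicyType="TargetTrackingScaling",
    EstimatedInstanceWarmup=180,
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 60.0,
    },
)
```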
Scheduled Scaling for Dev/Staging
- Business Hours (7am-7pm EST): Min 10 instances
- Off-Hours (7pm-7am EST): Min 2 instances
- Weekends: Min 1 instance (monitoring only)
- Monthly Savings: ~$800 on non-production environments; the scheduled actions are sketched below
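Those schedules map directly onto scheduled scaling actions; a sketch with placeholder group and action names, using the TimeZone parameter so the cron expressions read in Eastern time.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")   # placeholder region

# (action name, cron in local time, min/desired instances)
schedule = [
    ("staging-business-hours", "0 7 * * 1-5", 10),   # weekday 7am: scale up
    ("staging-off-hours",      "0 19 * * 1-5", 2),   # weekday 7pm: scale down
    ("staging-weekend",        "0 0 * * 6", 1),      # Saturday 00:00: monitoring only
]
for name, cron, size in schedule:
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="staging-asg",           # placeholder name
        ScheduledActionName=name,
        Recurrence=cron,
        TimeZone="America/New_York",
        MinSize=size,
        DesiredCapacity=size,
    )
```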
Phase 2: Instance Migration (Days 4-8)
Graviton Migration Strategy
Phased approach to minimize risk:
- Day 3-4: Migrated 40 web tier instances to m6g.large (success)
- Day 4: Attempted migration of 80 remaining instances (rollback required for 12 instances)
- Days 5-7: Rebuilt data pipeline container with ARM-native dependencies
- Day 8: Load tested rebuilt container in staging (passed with 18% performance improvement)
- Day 11: Successfully migrated final 12 instances to m6g.xlarge
Final Instance Configuration
| Service | Instance Type | Count (Baseline) | Monthly Cost |
|---|---|---|---|
| Web Tier | m6g.large | 28 (scale 28-90) | $1,680 |
| API Tier | m6g.large | 34 (scale 34-60) | $2,040 |
| Analytics Pipeline | m6g.xlarge | 12 (static) | $1,440 |
| Legacy Service | m5.2xlarge | 6 (static) | $2,300 |
| Total | | 40-120 (ASG) | $7,460 |
Phase 3: Monitoring & Validation (Days 9-14)
CloudWatch Dashboards
- Cost Tracking: Daily EC2 spend, month-to-date tracking vs. forecast
- Scaling Metrics: ASG desired/running/pending capacity, scaling activity timeline
- Performance: API latency (p50/p95/p99), error rates, throughput
- Resource Utilization: CPU/memory by instance type, network I/O; an example dashboard definition follows
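As an illustration, one of those dashboards boils down to a definition like the one below; the load balancer and ASG names are placeholders, and the real dashboards carried more widgets than shown here.

```python
import json

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")   # placeholder region

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "API p95 latency",
                "region": "us-east-1",
                "stat": "p95",
                "period": 60,
                "metrics": [
                    ["AWS/ApplicationELB", "TargetResponseTime",
                     "LoadBalancer", "app/api-alb/0123456789abcdef"],   # placeholder ALB
                ],
            },
        },
        {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "ASG capacity",
                "region": "us-east-1",
                "stat": "Average",
                "period": 300,
                "metrics": [
                    ["AWS/AutoScaling", "GroupDesiredCapacity", "AutoScalingGroupName", "api-asg"],
                    ["AWS/AutoScaling", "GroupInServiceInstances", "AutoScalingGroupName", "api-asg"],
                ],
            },
        },
    ]
}

cloudwatch.put_dashboard(
    DashboardName="ec2-cost-and-scaling",   # placeholder name
    DashboardBody=json.dumps(dashboard_body),
)
```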
Load Testing Results
Simulated Black Friday traffic (8x baseline) using Apache JMeter:
- Baseline Load: 2,000 requests/second → p95 latency 148ms
- 2x Load: 4,000 requests/second → p95 latency 152ms (ASG scaled to 62 instances)
- 5x Load: 10,000 requests/second → p95 latency 165ms (ASG scaled to 95 instances)
- 8x Load: 16,000 requests/second → p95 latency 178ms (ASG scaled to 118 instances)
Result: The ASG maintained p95 latency under the 200ms SLA target even at 8x baseline traffic.
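We used JMeter for the engagement; for readers who prefer Python tooling, a roughly equivalent load profile can be sketched with Locust. The endpoints and task weights below are assumptions for illustration, not the client's actual API.

```python
# locustfile.py -- illustrative stand-in for the JMeter plan, not the original test.
from locust import HttpUser, task, between


class AnalyticsApiUser(HttpUser):
    # Short think time so a few thousand simulated users can approach ~16k req/s.
    wait_time = between(0.1, 0.5)

    @task(3)
    def inventory_insights(self):
        # Hypothetical read-heavy endpoint.
        self.client.get("/v1/inventory/insights?sku=demo-sku")

    @task(1)
    def demand_forecast(self):
        # Hypothetical forecasting endpoint.
        self.client.get("/v1/forecast/demand?horizon=7d")
```

Run headless with something like `locust -f locustfile.py --headless -u 4000 -r 200 --host https://api.example.com`, stepping the user count up to approximate the 2x/5x/8x tiers above.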
Runbook Documentation
Created operational playbooks for seasonal events:
- Pre-Event Preparation: Increase max capacity, validate health checks, warm up cache
- During Event: Monitor scaling activity, track error rates, manual override procedures
- Post-Event: Scale-down monitoring, cost analysis, incident retrospective
Results in Detail
Cost Savings Breakdown
Before Optimization
- Fleet: 120 m5 instances (static) at $11,800/mo
- Average CPU Utilization: 24%
- Commitment Discounts: 0%
- Autoscaling: None
After Optimization
- Fleet: 40-120 m6g instances (ASG) at a $7,460/mo baseline
- Average CPU Utilization: 61%
- Graviton Price Advantage: 20%
- Autoscaling: Active
Performance Impact
| Metric | Before | After | Change |
|---|---|---|---|
| Average CPU Utilization | 24% | 61% | +154% |
| p50 API Latency | 82ms | 78ms | -5% |
| p95 API Latency | 145ms | 148ms | +2% |
| p99 API Latency | 198ms | 192ms | -3% |
| Error Rate | 0.03% | 0.02% | -33% |
| Availability (uptime) | 99.97% | 99.98% | +0.01% |
Business Value
Immediate Financial Impact
- $52,080 annual savings = 4.2 months additional runway at current burn rate
- Funds 1.3 additional mid-level engineering FTEs
- Reduced AWS from 37% to 23% of operational expenses
- Zero upfront capital investment required
Growth Enablement
- Traffic Capacity: Can now handle 40% traffic growth with only 15% cost increase (vs. 40% under old static model)
- Seasonal Readiness: Autoscaling proven to handle 8x traffic spikes without over-provisioning year-round
- Customer Acquisition: Lower infrastructure costs improve unit economics, enabling more aggressive customer acquisition
Operational Improvements
- DevOps Efficiency: Eliminated manual instance management, freeing 8 hours/week
- Incident Response: Documented rollback procedures reduced MTTR by 60%
- Monitoring: CloudWatch dashboards provide real-time cost visibility
- Documentation: Seasonal event playbooks enable junior engineers to manage scaling
Lessons Learned
✓ What Worked
- Phased Migration: Migrating in batches contained the NumPy issue to 12 instances instead of the entire fleet
- Read-Only Audit: Building trust with engineering team from day 1 enabled fast approvals
- Graviton Price-Performance: 20% cost reduction + 18% performance improvement on data pipeline
- Autoscaling Strategy: Delivered both cost savings AND growth capacity
- Load Testing: Validated 8x capacity before Black Friday, giving CTO confidence
✗ What Didn't Work
- Staging Validation: Should have tested the data pipeline on ARM in staging before touching production
- Migration Batch Size: 80 instances at once was too aggressive; 20-instance batches would have been safer
- Customer Communication: Should have set expectations up front for a potential 15-minute read-only period
- Dependency Analysis: Relied too heavily on Compute Optimizer without validating architecture dependencies
Key Takeaways
- Compute Optimizer is a starting point, not gospel: Always validate recommendations in staging, especially for architecture changes
- Autoscaling = Cost + Capacity: Don't frame it as just cost savings; it also enables growth without infrastructure expansion
- NumPy and ARM: Python scientific libraries often have x86-compiled binaries; rebuild containers with ARM-native packages
- Load testing builds confidence: CTO was initially skeptical of autoscaling; 8x load test results were decisive
- Document rollback procedures BEFORE changes: Our 45-minute rollback was only possible because we had documented procedures
Applicability to Similar Scenarios
This approach works best for:
- Seasonal/variable workloads where traffic patterns are predictable but infrastructure is static
- SaaS companies between seed and Series B where every dollar of runway matters
- Over-provisioned environments with CPU utilization below 40%
- Teams willing to invest 2-3 days in load testing and monitoring setup
- Modern architectures already using containerization or easily containerizable workloads
Not recommended for: Compliance-heavy workloads requiring instance-level certification, monolithic applications with hard dependencies on specific instance families, or teams without staging environments.
Similar Challenge?
If your AWS infrastructure is over-provisioned for peak traffic but idle most of the time, we can help you optimize for both cost and performance.
Schedule a Free Assessment
2-week engagement • Read-only audit • Reversible changes • No commitment