Case Study: Optimizing $156K Monthly Compute Spend with Hybrid Commitment Strategy
How we reduced compute costs by 23.8% (about $37,100/year) by rebalancing Reserved Instances and Savings Plans for a hybrid workload with steady-state and burst patterns
At a Glance
Client Profile
- Industry: E-commerce marketplace platform
- Company Stage: Series B, $312,000/month AWS spend
- Scale: 3.2M buyers and 48,000 sellers
- Timeline: 2-week engagement, January 2025
Business Context
Series B capital efficiency focus: Board wants 30% reduction in AWS spend before Series C. Compute is largest cost category (50% of total AWS spend). Quick wins needed to demonstrate operational discipline.
Primary Pain Point: Inconsistent commitment strategy. Coverage stood at 58%, from a mix of 1-year and 3-year Reserved Instances purchased ad hoc over 4 years, leaving the remaining 42% of compute at on-demand rates.
The Situation
The client's compute infrastructure included:
- Database tier: 18 RDS instances (db.r5.2xlarge to db.r5.8xlarge)
- Application tier: 85 EC2 instances (m5.xlarge to c5.4xlarge)
- Search infrastructure: 12 Elasticsearch nodes (r5.2xlarge)
- Background jobs: 24 spot instances + 12 on-demand instances (variable)
Business Context
- Revenue Model: Transaction-based (2.5% fee) with 3.2M buyers and 48,000 sellers = $28M ARR
- Growth Stage: Series B ($42M raised), 24 months to Series C
- Team Structure: 240 total employees (38 engineering, 12 DevOps/SRE, 45 seller success)
- Key Business Metrics: 99.9% marketplace uptime, <500ms search latency, 4.2% GMV take rate
- Critical Constraints: Must maintain 24/7 database availability, search downtime = immediate GMV loss
- Strategic Pressure: Board demands 30% AWS cost reduction before Series C — compute is 50% of AWS spend ($156K/month)
Current Commitment Strategy
Inherited from 4 years of ad-hoc purchasing:
The Inefficiency
Analyzing 12 months of billing data revealed:
- RI waste: 6 Standard RIs unused (instances downsized, but RIs locked)
- Under-commitment: $5,420/month on-demand spend on steady-state workloads
- No flexibility: Standard RIs locked to specific instance types/sizes
- Fragmented strategy: Purchased reactively whenever someone remembered
- Missing Savings Plans: Lambda, Fargate, and cross-region EC2 not covered by RIs
AWS Cost Explorer recommendations suggested a $3,090/month savings opportunity.
Discovery Phase
Week 1: Commitment Audit & Workload Analysis
We analyzed 12 months of EC2, RDS, and Lambda usage:
Infrastructure Inventory
Workload Type | Instance Count | Instance Type | Monthly Cost | Current Coverage |
---|---|---|---|---|
RDS Primary | 6 | db.r5.4xlarge | $3,240 | 4 RIs (67%) |
RDS Read Replicas | 12 | db.r5.2xlarge | $2,040 | 8 RIs (67%) |
Application servers | 42 | m5.xlarge | $2,520 | 28 RIs (67%) |
Search (Elasticsearch) | 12 | r5.2xlarge | $1,440 | 12 RIs (100%) |
Web servers (burst) | 28-65 | m5.2xlarge | $1,680 | 12 RIs (43%) |
Background workers | 12-36 | c5.xlarge | $840 | On-demand (0%) |
Lambda | -- | Various | $840 | On-demand (0%) |
Fargate | -- | Various | $620 | On-demand (0%) |
Total | 112+ | -- | $13,220 | 58% coverage |
Note: Monthly costs shown are on-demand equivalent. Actual spend with current RIs: $13,020/month
Commitment Utilization
- Over-committed: 6 RIs unused (downsized from r5.4xlarge → r5.2xlarge)
- Under-committed: 42% of compute on-demand
- No coverage: Lambda ($840/mo) and Fargate ($620/mo)
The Challenge: Over-Commitment on Burst Workload
What Went Wrong
Based on Cost Explorer recommendations, we initially modeled a strategy:
- Purchase Compute Savings Plan covering 95% of baseline usage
- Let expiring RIs roll off (don't renew)
- Use Savings Plan discount for both steady-state and burst workloads
We purchased a $6,200/month Compute Savings Plan (1-year, All Upfront = $74,400 upfront payment).
Two weeks later, a problem emerged:
During off-peak season (post-holidays, January-February):
- Flash sales frequency dropped 60% (2-3x per week → 1x per week)
- Burst workload scaled down to baseline
- Savings Plan over-committed: Using only $4,800/month of $6,200/month commitment
- Wasted commitment: $1,400/month not utilized = $16,800/year waste
Root Cause: Modeled commitment based on November-December billing (peak season), not annual average.
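The over-commitment arithmetic above can be reproduced with a short calculation. This is a minimal sketch using only the figures quoted in this section; the function name `savings_plan_waste` is illustrative, and the annual figure assumes the off-peak usage level persists for all 12 months, as the text does.

```python
def savings_plan_waste(commitment_per_month, used_per_month):
    """Return utilization and unused commitment for a Savings Plan."""
    unused = max(commitment_per_month - used_per_month, 0)
    covered = min(used_per_month, commitment_per_month)
    return {
        "utilization_pct": round(100 * covered / commitment_per_month, 1),
        "unused_per_month": unused,
        # Assumption: the same shortfall repeats every month of the year.
        "unused_per_year": unused * 12,
    }


# Figures from the case study: $6,200/month committed, $4,800/month used.
result = savings_plan_waste(6200, 4800)
print(result)
```

At roughly 77% utilization, this plan sits well below the 90% alert threshold used later in the engagement.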
The Reversal
Can't reverse Savings Plan purchase (non-refundable), but can optimize around it:
- Accept the sunk cost: $74,400 already paid, can't recover
- Scale up workloads to use commitment: Moved some on-demand workloads to use Savings Plan discount (migrated Lambda functions to 24/7 Fargate, kept background workers running longer)
- Adjust future commitment strategy: Model based on P10 usage (low season), not peak
The Fix
After analysis, determined optimal commitment level:
Lesson: Model commitments based on minimum usage (P10), not average or peak. Commit to the floor, not the ceiling. Use on-demand for variability.
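One way to find "the floor" is to take a low percentile of hourly commitment-eligible spend. The sketch below uses a nearest-rank percentile over a list of hourly values; the sample data and the function names are hypothetical and are not the client's billing data.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def commitment_floor(hourly_spend, pct=10):
    """Hourly commitment level that usage exceeds about (100 - pct)% of the time."""
    return percentile(hourly_spend, pct)


# Hypothetical hourly spend in dollars, including two burst hours.
hourly_spend = [6.0, 6.5, 7.0, 8.5, 9.0, 6.2, 6.1, 12.0, 15.0, 6.4]

p10 = commitment_floor(hourly_spend, 10)
p50 = commitment_floor(hourly_spend, 50)
p90 = commitment_floor(hourly_spend, 90)
print(p10, p50, p90)
```

Committing at the P10 value means the commitment is fully used in about 90% of hours; committing at P90 leaves it idle most of the time.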
Implementation Approach
Phase 1: Audit Existing Commitments (Week 1)
Step 1: Inventory All Commitments
Used AWS CLI and Cost Explorer to generate comprehensive commitment inventory:
Results: 62 Reserved Instances identified (46 EC2, 16 RDS), zero Savings Plans, total committed spend: $7,880/month
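The inventory step can be sketched as a small aggregation over RI records. The record shape below mirrors the fields returned by the EC2 `describe_reserved_instances` call (`InstanceType`, `InstanceCount`, `State`, `OfferingClass`); the sample records are illustrative, not the client's data, and fetching them from AWS is left out so the example runs offline.

```python
from collections import defaultdict


def summarize_reserved_instances(records):
    """Count active RIs per (instance type, offering class)."""
    totals = defaultdict(int)
    for ri in records:
        if ri["State"] != "active":
            continue  # skip retired or pending reservations
        key = (ri["InstanceType"], ri.get("OfferingClass", "standard"))
        totals[key] += ri["InstanceCount"]
    return dict(totals)


# Illustrative records only (hypothetical counts).
records = [
    {"InstanceType": "m5.xlarge", "InstanceCount": 2, "State": "active", "OfferingClass": "standard"},
    {"InstanceType": "m5.xlarge", "InstanceCount": 3, "State": "active", "OfferingClass": "standard"},
    {"InstanceType": "r5.2xlarge", "InstanceCount": 1, "State": "active", "OfferingClass": "convertible"},
    {"InstanceType": "c5.xlarge", "InstanceCount": 4, "State": "retired", "OfferingClass": "standard"},
]

summary = summarize_reserved_instances(records)
for (instance_type, offering_class), count in sorted(summary.items()):
    print(f"{instance_type:12} {offering_class:12} {count}")
```

RDS reservations come from a separate API with different field names, so a real inventory would likely need a second, similar pass.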
Step 2: Analyze RI Utilization
Pulled 12 months of RI utilization data from Cost Explorer:
- High Utilization (>95%): 40 RIs on r5.2xlarge (Elasticsearch) and m5.xlarge (app servers)
- Medium Utilization (70-95%): 4 RIs on db.r5.4xlarge (RDS primary)
- Low Utilization (<70%): 6 RIs on r5.4xlarge (instances since downsized to r5.2xlarge)
- Wasted Commitment: $840/month on the 6 unused RIs (instances downsized 8 months ago, RIs still active)
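The three utilization bands above translate directly into a classification rule. A minimal sketch, assuming utilization is available as a percentage per RI group; the function name and the sample percentages are hypothetical.

```python
def utilization_bucket(utilization_pct):
    """Map an RI utilization percentage to the bands used in the audit."""
    if utilization_pct > 95:
        return "high"
    if utilization_pct >= 70:
        return "medium"
    return "low"


# Hypothetical utilization figures per instance type.
observed = {"m5.xlarge": 99.0, "db.r5.4xlarge": 82.0, "r5.4xlarge": 0.0}
buckets = {name: utilization_bucket(pct) for name, pct in observed.items()}
print(buckets)
```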
Step 3: Map Expiration Timeline
Created expiration calendar to identify renewal windows:
Expiration Date | Instance Type | Count | Monthly Value | Action Needed |
---|---|---|---|---|
Feb 2025 | db.r5.4xlarge | 4 | $1,080 | Renew or replace |
Apr 2025 | r5.2xlarge | 12 | $1,440 | Don't renew |
Jun 2025 | db.r5.2xlarge | 8 | $1,360 | Renew 3-year |
Aug 2025 | m5.xlarge | 28 | $1,680 | Replace with SP |
Jun 2026 | m5.2xlarge | 12 | $1,440 | No action (18mo) |
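
An expiration calendar like the one above is easy to query for upcoming renewals. The sketch below filters commitments that expire within a given window. The table only gives months, so the day-of-month values here are an assumption (the 1st), and the function name is illustrative.

```python
from datetime import date, timedelta


def expiring_within(commitments, today, days=90):
    """Return commitments whose expiry falls between today and today + days."""
    cutoff = today + timedelta(days=days)
    return [c for c in commitments if today <= c["expires"] <= cutoff]


# Rows from the expiration table; exact days are assumed.
calendar = [
    {"instance_type": "db.r5.4xlarge", "count": 4, "expires": date(2025, 2, 1)},
    {"instance_type": "r5.2xlarge", "count": 12, "expires": date(2025, 4, 1)},
    {"instance_type": "db.r5.2xlarge", "count": 8, "expires": date(2025, 6, 1)},
    {"instance_type": "m5.xlarge", "count": 28, "expires": date(2025, 8, 1)},
    {"instance_type": "m5.2xlarge", "count": 12, "expires": date(2026, 6, 1)},
]

due = expiring_within(calendar, today=date(2025, 1, 15))
for item in due:
    print(item["expires"].isoformat(), item["instance_type"], item["count"])
```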
Step 4: Usage Pattern Classification
Analyzed 12-month CloudWatch and Cost & Usage Reports to classify workloads:
- Steady-state (commit with RIs): 72 instances running 24/7 at >95% uptime — ideal for 3-year RIs
- Variable (commit with Savings Plans): roughly 40 instances in autoscaled pools (for example, background workers ranging from 12 to 36), which need the flexibility of a Compute SP
- Serverless (commit with Compute SP): Lambda + Fargate $1,460/month — can't use RIs, need Compute SP
- Burst (on-demand only): Peak scaling beyond baseline — keep on-demand for elasticity
Key Finding: $840/month wasted on unused RIs + $5,420/month on-demand spend that should be committed = $6,260/month opportunity
Phase 2: Purchase Strategy (Week 2)
Step 1: Model Commitment Scenarios
Used Cost Explorer Savings Plans Recommendations and custom modeling:
Modeling Approach: Modeled 3 scenarios (conservative, moderate, aggressive) based on P10, P50, P90 usage over 12 months.
- Conservative (P10): $4,200/month Compute SP + $1,800/month EC2 SP, about 97% expected utilization
- Moderate (P50): $5,600/month Compute SP + $2,400/month EC2 SP, about 90% expected utilization
- Aggressive (P90): $6,800/month Compute SP + $3,000/month EC2 SP, utilization at risk of falling to about 75%
Decision: Selected Conservative model to avoid over-commitment (learned from initial mistake — see Challenge section).
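Expected utilization for a candidate commitment can be estimated by replaying historical usage against it: in each hour the plan covers usage up to the committed amount, and anything unused is lost. This is a simplified sketch with hypothetical hourly figures, not the custom model used in the engagement.

```python
def expected_utilization(hourly_usage, hourly_commitment):
    """Fraction of the commitment that historical usage would have consumed."""
    used = sum(min(u, hourly_commitment) for u in hourly_usage)
    return used / (hourly_commitment * len(hourly_usage))


# Hypothetical hourly commitment-eligible spend in dollars.
usage = [6, 6, 8, 10]

scenarios = {"conservative": 6, "moderate": 8, "aggressive": 10}
results = {name: expected_utilization(usage, commit) for name, commit in scenarios.items()}
for name, utilization in results.items():
    print(f"{name}: {utilization:.1%}")
```

The pattern matches the scenarios above: the higher the commitment relative to the usage floor, the lower the utilization.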
Step 2: Execute Purchases in Priority Order
Purchased commitments starting with highest certainty:
Priority | Commitment Type | Monthly Commit | Discount | Rationale |
---|---|---|---|---|
1 | RDS RIs (3yr, All Up) | $4,320 | 62% | 100% uptime guarantee |
2 | Compute SP (1yr, No Up) | $4,200 | 52% | Flexible, covers Lambda/Fargate |
3 | EC2 Instance SP (1yr, Partial) | $1,800 | 48% | EC2-specific, higher discount |
4 | RI Exchange (Convertible) | $840 | -- | Recover wasted Standard RIs |
Total | New commitments (priorities 1-3) | $10,320 | 54% avg | -- |
Step 3: Exchange Unused Standard RIs
Recovered value from 6 unused Standard RIs by converting to Convertible RIs:
- Problem: 6x r5.4xlarge Standard RIs unused (instances downsized to r5.2xlarge 8 months ago)
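- Limitation: Standard RIs cannot be exchanged for a different configuration. Only limited modifications are supported (such as Availability Zone, scope, or, for Linux RIs, instance size within the same family), and only Convertible RIs can be exchanged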
- Workaround: AWS Support allowed one-time exchange to Convertible RIs due to documented operational need
- Exchange: 6x r5.4xlarge → 12x r5.2xlarge Convertible RIs (same total compute capacity)
- Recovery: $840/month wasted commitment now 100% utilized
Note: Standard RI exchange to Convertible requires AWS Support case. Not guaranteed. In future, only purchase Convertible RIs for flexibility.
Step 4: Validate Purchase Impact
After purchases completed (24-48 hour activation):
- Upfront Payment: $155,520 (RDS 3-year All Upfront) + $10,800 (EC2 SP Partial Upfront) = $166,320 total
- Monthly Commitment: $10,320/month ($4,320 RDS RI + $4,200 Compute SP + $1,800 EC2 SP)
- Coverage Improvement: 58% → 89% commitment coverage
- On-Demand Remaining: $1,000/month (11% of compute, reserved for burst scaling)
Result: $3,090/month savings (23.8% reduction) with 98% commitment utilization (no waste)
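The upfront and monthly figures above can be checked with a few lines of arithmetic. The sketch assumes that the 3-year All Upfront payment equals 36 months of the monthly commitment and that "Partial Upfront" means half of the 1-year total is paid upfront; both assumptions reproduce the numbers quoted in this section.

```python
# Monthly commitment per purchase, from the priority table.
MONTHLY = {"rds_ri_3yr": 4320, "compute_sp_1yr": 4200, "ec2_sp_1yr": 1800}

# All Upfront, 3-year term: the full 36 months are paid at purchase.
rds_upfront = MONTHLY["rds_ri_3yr"] * 36

# Partial Upfront, 1-year term: assumed to be 50% of the 12-month total.
ec2_sp_upfront = MONTHLY["ec2_sp_1yr"] * 12 // 2

# The No Upfront Compute SP contributes nothing at purchase time.
total_upfront = rds_upfront + ec2_sp_upfront
monthly_commitment = sum(MONTHLY.values())

print(rds_upfront, ec2_sp_upfront, total_upfront, monthly_commitment)
```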
Phase 3: Monitoring & Ongoing Optimization
CloudWatch Dashboard Setup
Created real-time commitment tracking dashboard:
- Utilization Metrics:
- Savings Plans utilization % (target: >97%)
- Reserved Instances utilization % by instance type
- On-demand spend as % of total compute (target: <15%)
- Weekly utilization trend (7-day moving average)
- Coverage Metrics:
- Total commitment coverage % (steady-state workloads)
- Coverage gap identification (on-demand spend that could be committed)
- Commitment expiration calendar (next 90 days)
- Cost Metrics:
- Daily compute spend (committed vs. on-demand)
- Month-to-date savings vs. all on-demand
- Savings Plans vs. Reserved Instances cost comparison
Automated Alerts
Configured CloudWatch Alarms and AWS Budgets for commitment monitoring:
- Low Utilization Alert: Triggers if Savings Plans utilization < 90% for 3 consecutive days
- High On-Demand Alert: Triggers if on-demand spend > 15% of total compute (indicates under-commitment)
- Expiration Reminder: Triggers 90 days before any RI/SP expires (allows renewal planning)
- Cost Anomaly: Triggers if daily compute cost deviates > 20% from 7-day average
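The alert conditions above reduce to simple predicates over daily metrics. The sketch below expresses three of them as plain functions so the thresholds can be tested before they are wired into CloudWatch Alarms or AWS Budgets; the function names and sample series are hypothetical, and the anomaly check compares the latest day against the mean of the preceding seven days.

```python
def low_utilization_alert(daily_utilization_pct, threshold=90, days=3):
    """True if utilization was below threshold on each of the last `days` days."""
    recent = daily_utilization_pct[-days:]
    return len(recent) == days and all(u < threshold for u in recent)


def high_on_demand_alert(on_demand_spend, total_spend, threshold=0.15):
    """True if on-demand exceeds the given share of total compute spend."""
    return on_demand_spend > threshold * total_spend


def cost_anomaly(daily_costs, deviation=0.20, window=7):
    """True if the latest day deviates more than `deviation` from the prior window's mean."""
    if len(daily_costs) < window + 1:
        return False  # not enough history to form a baseline
    baseline = sum(daily_costs[-window - 1:-1]) / window
    return abs(daily_costs[-1] - baseline) > deviation * baseline


print(low_utilization_alert([97, 89, 88, 85]))
print(high_on_demand_alert(1000, 9930))
print(cost_anomaly([330] * 7 + [400]))
```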
Weekly Commitment Report
Automated Slack report every Monday morning:
Quarterly Commitment Review
Scheduled quarterly reviews to optimize commitment strategy:
- Q1 2025 (Feb): Review RDS RI expirations, renew 12x db.r5.2xlarge for 3 years (confirmed 99.9% utilization)
- Q2 2025 (May): Evaluate Compute SP increase from $4,200 → $4,800 if Lambda usage grows >15%
- Q3 2025 (Aug): Decision point: Renew expiring m5.xlarge RIs or replace with Savings Plans?
- Q4 2025 (Nov): Annual review: Model next year's commitment strategy based on growth trajectory
Ongoing Optimization: Commitment utilization is monitored weekly. In the first 60 days post-implementation, utilization stayed at 97% or higher, with no repeat of the initial over-commitment mistake.
Results in Detail
Cost Savings Breakdown
Component | Before | After | Monthly Savings |
---|---|---|---|
RDS (on-demand → RIs) | $5,280 | $2,016 | −$3,264 (61.8%) |
EC2 steady-state (on-demand → Compute SP) | $4,840 | $2,904 | −$1,936 (40.0%) |
EC2 burst (on-demand → EC2 Instance SP) | $1,680 | $1,260 | −$420 (25.0%) |
Lambda/Fargate (on-demand → Compute SP) | $1,460 | $1,050 | −$410 (28.1%) |
Unused RIs recovered | −$840 | $0 | +$840 |
Total Compute | $13,020 | $9,930 | −$3,090 (23.8%) |
Annual savings: $3,090/month × 12 = $37,080/year (about $37,100)
Commitment Coverage Improvement
Before
- Total compute spend: $13,020/month
- Committed spend: $7,880/month (60.5%)
- On-demand spend: $5,140/month (39.5%)
- Coverage: 58%
After
- Total compute spend: $9,930/month
- Committed spend: $8,930/month (89.9%)
- On-demand spend: $1,000/month (10.1%)
- Coverage: 89%
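The committed and on-demand percentages above follow from the spend figures. A minimal sketch of that calculation; note that it computes the committed share of spend, which is not necessarily how the separately reported coverage figures (58% and 89%) were measured.

```python
def committed_share(committed, on_demand):
    """Committed spend as a percentage of total compute spend."""
    total = committed + on_demand
    return round(100 * committed / total, 1)


before = committed_share(7880, 5140)  # total $13,020/month
after = committed_share(8930, 1000)   # total $9,930/month
print(before, after)
```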
Business Value
Immediate Impact
- $3,090/month, or about $37,100 in annual savings (23.8% reduction)
- Improved gross margins by 1.2%
- Demonstrated Series B capital efficiency for Board
Long-term Value
- Predictable costs: 89% of compute now fixed-price (easier forecasting)
- Flexibility: Compute SP covers Lambda/Fargate (supports serverless migration)
- Commitment calendar: Proactive renewal strategy prevents future waste
- Scalability: 11% on-demand buffer handles growth without re-commitment
Real Example: Flash Sale Economics
Before optimization: Flash sale (2-hour event) required 73 additional EC2 instances at $124/event (all on-demand). 8 sales per month = $992/month.
After optimization: Same 73 instances cost $58/event (40% covered by unused Compute SP capacity, 60% on-demand). 8 sales per month = $464/month.
53% reduction in flash sale infrastructure cost — commitment strategy improves burst economics too.
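The flash sale figures can be verified the same way. This sketch uses only the per-event costs and event count quoted above.

```python
EVENTS_PER_MONTH = 8

cost_per_event_before = 124  # all on-demand
cost_per_event_after = 58    # partly covered by Compute SP capacity

monthly_before = cost_per_event_before * EVENTS_PER_MONTH
monthly_after = cost_per_event_after * EVENTS_PER_MONTH
reduction_pct = round(100 * (monthly_before - monthly_after) / monthly_before, 1)

print(monthly_before, monthly_after, reduction_pct)
```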
Lessons Learned
What Worked
- Hybrid commitment strategy: RDS RIs (3-year) for predictable, Savings Plans (1-year) for flexible
- P10 usage modeling: Committing to minimum usage (not average) prevented over-commitment
- Convertible RI exchanges: Recovered $840/month wasted on unused Standard RIs
- Serverless coverage: Compute SP covered Lambda/Fargate (RIs can't)
What Didn't Work
- Initial over-commitment: Purchased $6,200/month SP based on peak season, wasted $1,400/month off-season
- Cost Explorer recommendations: Blindly following AWS recommendations led to over-commitment
- All Upfront for variable workloads: Locked in $74,400 with no flexibility
Key Takeaways
- Commit to the floor, not the ceiling: Model commitments on P10 usage, not average or peak
- Understand your workload: Steady-state vs. burst vs. serverless require different strategies
- Savings Plans ≠ Reserved Instances: SP more flexible but also more dangerous (easier to over-commit)
- Payment flexibility matters: No Upfront costs more per hour, but provides flexibility to adjust
- Monitor utilization obsessively: 97%+ utilization is optimal, < 90% means over-commitment