Case Study: From EKS Cost Blindness to 31.2% Savings with Team-Level Attribution
How we reduced Kubernetes costs from $11,340 to $7,800/month while establishing cost ownership across 8 engineering teams running 47 microservices
At a Glance
Client Profile
- Industry: Marketing automation platform
- Company Stage: Series B, $86,000/month AWS spend
- Infrastructure: 47 microservices across 3 EKS clusters
- Timeline: 2-week engagement, January 2025
Business Context
Series B pressure to demonstrate unit economics. Board asking "what does it cost to serve each customer?" — engineering couldn't answer. Need to establish cost culture before Series C discussions.
Primary Pain Point: Complete lack of cost visibility. No team knew what their services cost. EKS spend growing 18% quarterly with no accountability mechanism.
The Situation
The client's Kubernetes infrastructure had evolved from a single monolith to 47 microservices over 3 years. Their architecture spanned 3 EKS clusters:
- Production cluster: 28 services, 18 nodes (m5.2xlarge)
- Staging cluster: 19 services, 8 nodes (m5.xlarge)
- Shared services cluster: Database proxies, monitoring, logging
Their engineering team operated in a squad model with 8 product squads (each owning 4-8 microservices), a platform team (infrastructure, CI/CD, monitoring), and a data team (analytics pipelines).
The Cost Blindness Problem
Nobody knew what anything cost. The platform team knew the total EKS spend ($11,340/month), but there was no breakdown by team, service, or environment, no visibility into which services were expensive versus cheap, and no mechanism to charge costs back to product teams. Resource requests were set arbitrarily ("let's ask for 2 CPU cores to be safe").
Result: Tragedy of the commons — every team over-requested resources because there was no cost feedback loop.
Discovery Phase
Business Context
- Revenue Model: $350/month per seat, 4,200 enterprise customers, $14.7M ARR
- Growth Stage: Series B ($28M raised), preparing for Series C in 12-18 months
- Board Pressure: "Show us unit economics" — cost to serve each customer unclear
- Engineering Structure: 8 product squads (5-8 engineers each), 1 platform team (6 engineers), 1 data team (4 engineers)
- Deployment Model: Microservices architecture, 10-12 deployments per day across 47 services
- Key Metric: Monthly Active Users (MAU) growing 22% QoQ, infrastructure costs growing 18% QoQ
Week 1: Baseline Analysis & Kubecost Installation
We installed Kubecost (open-source edition) across the clusters to establish baseline visibility.
Key Finding: 18 of the 47 services (38%) had no attribution labels at all, making them completely unattributable.
The Efficiency Gap
The average pod was requesting:
- 2.8x the CPU it actually used
- 2.7x the RAM it actually used
This over-requesting forced Kubernetes to spread workloads across more nodes than they needed, driving scheduling inefficiency and a higher AWS bill.
The Challenge: VPA Over-Correction and Service Instability
What Went Wrong
We deployed Vertical Pod Autoscaler (VPA) in recommendation mode to suggest right-sized resource requests. After 72 hours of data collection, VPA recommended reducing resource requests across 31 services. We applied VPA recommendations to 12 services in staging on Wednesday afternoon.
Thursday morning disaster:
- 5 services experienced OOMKilled errors during load testing
- 3 services had CPU throttling (P99 latency increased 240%)
- 1 service crashed repeatedly (memory leak under load)
Root Cause
The VPA recommendations were based on 72 hours of average usage, but:
- Load testing hadn't run during the VPA observation period
- Weekly batch jobs (Sunday nights) required 3.5x normal RAM
- A memory leak in the user-sync service only appeared after 6+ hours of uptime
- JVM heap tuning in analytics-pipeline required headroom beyond actual usage
The Reversal
Within 4 hours:
- Reverted all 12 services to original resource requests
- All services stabilized
- Zero production impact (staging caught everything)
The Fix
We changed our approach to VPA recommendations:
- Extended observation period: 14 days minimum (not 72 hours)
- Load testing inclusion: Ran full load tests during VPA observation
- P95 recommendations, not average: VPA configured to recommend P95 usage + 20% buffer
- Service-by-service rollout: Test one service for 48 hours before moving to the next
- Memory leak detection: Identified user-sync memory leak (fixed by dev team)
- JVM-specific handling: Java services get +35% heap headroom beyond actual usage
Lesson: VPA recommendations are great starting points, but blind application will break things. Always account for periodic workloads, load testing patterns, and language-specific memory management.
Implementation Approach
Phase 1: Visibility Foundation (Week 1)
Step 1: Kubecost Deployment
Installed Kubecost using Helm in all 3 EKS clusters:
- Cluster 1 (Production): Kubecost with 7-day retention, Prometheus integration
- Cluster 2 (Staging): Same configuration as production
- Cluster 3 (Shared Services): Monitoring only (no cost attribution needed)
- Storage: 50GB PVC per cluster for Kubecost data retention
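For reference, a minimal sketch of the kind of Helm values behind that setup, assuming the standard kubecost/cost-analyzer chart; exact value paths vary between chart versions, so treat the keys below as illustrative rather than a drop-in file.

```yaml
# values-production.yaml (illustrative; key paths vary across cost-analyzer chart versions)
# Installed with: helm install kubecost kubecost/cost-analyzer -n kubecost -f values-production.yaml
persistentVolume:
  enabled: true
  size: 50Gi              # 50GB PVC per cluster for Kubecost data retention
prometheus:
  server:
    retention: 7d         # 7-day retention for the bundled Prometheus
```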
Step 2: Label Strategy Design
Designed standardized label taxonomy with engineering leadership:
| Label | Example Values | Purpose |
|---|---|---|
| team | campaign-squad, analytics-squad, platform | Team ownership & chargeback |
| service | email-processor, api-gateway, user-sync | Service-level cost tracking |
| environment | production, staging, dev | Environment cost segregation |
| cost-center | engineering, data, infrastructure | Finance department mapping |
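Applied to a workload, the taxonomy looks like the sketch below. The manifest is trimmed to the parts that matter for attribution; the namespace and the team assignment for email-processor are illustrative, since the write-up doesn't list per-service ownership.

```yaml
# Illustrative deployment showing the label taxonomy; ownership and namespace are examples
apiVersion: apps/v1
kind: Deployment
metadata:
  name: email-processor
  namespace: production          # placeholder namespace
  labels:
    team: campaign-squad         # example owner; actual squad assignment not published
    service: email-processor
    environment: production
    cost-center: engineering
spec:
  replicas: 3                    # illustrative
  selector:
    matchLabels:
      service: email-processor
  template:
    metadata:
      labels:
        # Kubecost attributes cost at the pod level, so the taxonomy must be on pod labels too
        team: campaign-squad
        service: email-processor
        environment: production
        cost-center: engineering
    spec:
      containers:
        - name: email-processor
          image: registry.example.com/email-processor:1.4.0   # placeholder image
```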
Step 3: Label Implementation
Coordinated with 8 teams to label all deployments:
- Day 1-2: Platform team labeled shared infrastructure (12 services)
- Day 3-5: Squad leads labeled their services (35 services)
- Challenge: 3 legacy services had no clear owners — assigned to platform team temporarily
- Validation: Script to detect unlabeled pods, Slack alerts for new deployments without labels
Step 4: Dashboard Setup
Created custom Grafana dashboards fed by Kubecost API:
- Executive dashboard: Total EKS cost, cost by team, cost trend (7/30/90 day)
- Team dashboards: Each squad's cost breakdown by service, efficiency metrics
- Service dashboards: Individual service cost, resource utilization, right-sizing recommendations
Phase 1 Result: 94% cost attribution (3 legacy services = 6% unattributed)
Phase 2: Right-Sizing with VPA (Week 2)
Step 1: VPA Deployment & Configuration
After learning from initial VPA failure:
- VPA installation: Deployed VPA v0.13 with updateMode: "Off" (recommendation only)
- Observation period: 14 days minimum (not 72 hours)
- Configuration: P95 target utilization + 20% buffer, not average
- Exclusions: Stateful services (databases, caches) excluded from VPA recommendations
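A sketch of the per-service VPA object behind this configuration, assuming the autoscaling.k8s.io/v1 API. The P95-plus-buffer targeting described above is handled in the recommender settings and the review step that follows, not in this per-service object, and the min/max bounds shown here are illustrative.

```yaml
# Recommendation-only VPA, one object per Deployment; stateful services were
# excluded simply by not giving them a VPA object. Bounds are illustrative.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: email-processor
  namespace: production            # placeholder namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: email-processor
  updatePolicy:
    updateMode: "Off"              # recommend only; never evict or patch running pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi
```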
Step 2: Load Testing Integration
Ensured VPA observed real-world peak usage:
- Load tests: Ran weekly load tests during VPA observation period
- Batch jobs: Manually triggered Sunday night batch jobs mid-week for observation
- Traffic replay: Replayed production traffic patterns in staging (using GoReplay)
Step 3: Recommendation Review Process
Implemented approval workflow before applying VPA recommendations:
| Service | Current CPU | VPA Rec. | Final | Notes |
|---|---|---|---|---|
| email-processor | 2000m | 650m | 750m | Added 15% buffer |
| analytics-pipeline (JVM) | 1500m | 850m | 1150m | JVM heap needs +35% |
| user-sync | 1000m | 400m | 1000m | Memory leak found; keep current |
| campaign-executor | 800m | 550m | 550m | Applied as-is |
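Once signed off, the final numbers go back into each service's manifest as ordinary resource requests. A sketch for email-processor using the final CPU value from the table; the memory figure is illustrative because the review table only tracks CPU.

```yaml
# email-processor container resources after review
resources:
  requests:
    cpu: 750m            # VPA suggested 650m; 15% buffer added during review
    memory: 512Mi        # illustrative; the review table above only covers CPU
```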
Step 4: Phased Rollout
Rolled out service-by-service with validation gates:
- Day 1: Applied to 5 lowest-risk services (staging only)
- Day 2: Validated performance for 24 hours before proceeding
- Day 3-5: Rolled out 12 services in staging, 8 services in production
- Day 6-7: Applied remaining 22 services after validation
Phase 3: Node Optimization
Step 1: Capacity Analysis
After right-sizing pods, calculated actual node requirements:
- Production: 520 vCPU requests → 320 vCPU requests (38% reduction)
- Node calculation: 320 vCPU / (8 vCPU per m5.2xlarge * 0.75 utilization target) = 11.3 nodes
- Decision: Target 12 nodes (was 18), giving 68% target utilization
Step 2: Node Removal Strategy
Safely removed excess nodes:
- Cordon nodes: Marked 6 nodes as unschedulable
- Graceful drain: Drained pods with 5-minute termination grace period
- Pod rescheduling: Kubernetes moved pods to remaining nodes
- Validation: Monitored P99 latency, error rates for 48 hours
- Termination: Terminated drained nodes after validation
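The write-up doesn't spell out the drain safeguards, but a drain of this size generally assumes every multi-replica service carries a PodDisruptionBudget so Kubernetes never evicts the last healthy replica. An assumed example of that guardrail:

```yaml
# Assumed guardrail (not detailed in the engagement notes): keep at least one
# replica of each service available while nodes are drained.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: email-processor
  namespace: production            # placeholder namespace
spec:
  minAvailable: 1
  selector:
    matchLabels:
      service: email-processor
```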
Step 3: Cluster Autoscaler Configuration
Enabled autoscaling with proper bounds:
Production Cluster Autoscaler Config:
```yaml
min-nodes: 8                             # Handle 50% utilization at minimum
max-nodes: 18                            # Allow 2.25x scale for traffic spikes
scale-down-delay-after-add: 10m
scale-down-unneeded-time: 10m
scale-down-utilization-threshold: 0.65
```
Phase 4: Cost Culture Establishment
Step 1: Automated Reporting
Built cost visibility into team workflows:
- Weekly Slack reports: Python script queries Kubecost API, posts to team channels
- Report format: "Campaign Squad: $2,140 this week (+8% vs last week). Top services: campaign-executor ($850), campaign-scheduler ($580)"
- Budget alerts: Automated alerts when team exceeds 110% of monthly budget
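The weekly report described above runs as an ordinary in-cluster job. A sketch of the CronJob wrapping the reporting script, assuming the script is packaged as a container image; the image name and webhook secret are placeholders, and the Kubecost URL is the chart's usual in-cluster service address.

```yaml
# Weekly cost report CronJob; image, secret, and schedule details are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-cost-report
  namespace: kubecost
spec:
  schedule: "0 9 * * 1"                  # Mondays at 09:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cost-report
              image: registry.example.com/cost-report:latest   # placeholder image containing the Python script
              env:
                - name: KUBECOST_URL
                  value: "http://kubecost-cost-analyzer.kubecost.svc:9090"   # typical cost-analyzer service address
                - name: SLACK_WEBHOOK_URL
                  valueFrom:
                    secretKeyRef:
                      name: slack-webhook          # placeholder secret
                      key: url
```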
Step 2: Dashboard Distribution
Provided each team with self-service visibility:
- Grafana access: SSO integration, team-scoped dashboards
- Cost trend charts: 7/30/90 day rolling windows
- Service comparison: Cost per service within team's portfolio
- Right-sizing opportunities: VPA recommendations visible to teams
Step 3: Monthly Cost Reviews
Established recurring cost review meetings:
- Engineering leadership: Monthly review with VPE, platform lead, finance
- Team reviews: Quarterly reviews with each squad
- Budget allocation: Each squad assigned quarterly Kubernetes budget
- Accountability: Cost growth requires justification (new features, user growth, etc.)
Step 4: Knowledge Transfer
Trained platform team to maintain cost optimization:
- Documentation: Runbook for VPA recommendation review process
- Training sessions: 2-hour workshop on Kubecost, resource requests, right-sizing
- Handoff: Platform engineer now owns cost monitoring (2 hours/week)
Results in Detail
Cost Savings Breakdown
| Component | Before | After | Monthly Savings |
|---|---|---|---|
| Production nodes (18 → 12 m5.2xlarge) | $7,344 | $4,896 | −$2,448 (33.3%) |
| Staging nodes (8 → 5 m5.xlarge) | $2,764 | $1,728 | −$1,036 (37.5%) |
| Shared services (unchanged) | $1,232 | $1,176 | −$56 (4.5%) |
| Total EKS | $11,340 | $7,800 | −$3,540 (31.2%) |
Efficiency Improvements
- Node CPU utilization: 42% → 68% (+26pp)
- Node RAM utilization: 35% → 64% (+29pp)
- Pod resource request accuracy: 36% → 82%
- Over-requested CPU: 2.8x → 1.2x
- Over-requested RAM: 2.7x → 1.3x
- Cost attribution: 6% → 94%
Business Value
Immediate Impact
- $3,540/month = $42,480 annual savings
- Engineering teams now have cost feedback loop
- Platform team can justify infrastructure investments with cost data
Long-term Value
- Cost culture established: Teams now consider cost when designing services
- Unit economics visibility: Can calculate cost per customer (was impossible before)
- Budget accountability: Each squad has quarterly Kubernetes budget
- Architectural decisions informed by cost: Campaign squad considering consolidating 3 services into 1 after seeing combined cost
Real Example: Campaign Squad Cost Awareness
The campaign squad saw that their services cost $2,140/month (18.8% of total EKS spend) and that campaign-scheduler alone cost $580/month for a service handling 200 requests per day. They refactored campaign-scheduler to run on Lambda (now $14/month), voluntarily reducing their costs by $566/month through architectural optimization.
This is the power of cost visibility — teams self-optimize when they see the numbers.
Lessons Learned
What Worked
- Kubecost as visibility tool: Open-source edition sufficient, no enterprise features required
- Label standardization: Enforcing consistent labels made attribution possible
- Team engagement: Weekly cost reports in Slack created positive peer pressure
- Staging-first rollout: VPA disaster caught in staging, zero production impact
What Didn't Work
- 72-hour VPA observation: Way too short, missed periodic workloads
- Blind VPA application: Caused 5 OOMKilled errors and 3 CPU throttling incidents
- Assumption about steady-state workloads: Many services have weekly/monthly batch jobs that spike resource usage
Key Takeaways
- Cost visibility drives behavior change: Teams voluntarily optimized when they saw their spend
- VPA is powerful but dangerous: Great recommendations, but requires careful validation before application
- Observation period matters: 14 days minimum to capture periodic workloads
- Language-specific tuning: JVM services need different resource headroom than Go services
- Cost culture > cost optimization: Establishing accountability delivers compounding returns
Need Kubernetes Cost Visibility?
If your EKS infrastructure has grown without cost attribution, we can help establish team-level accountability and optimize resource utilization.
Schedule a Free Assessment
2-week engagement • Read-only audit • Reversible changes • No commitment