Case Study: From EKS Cost Blindness to 31.2% Savings with Team-Level Attribution
How we reduced Kubernetes costs from $11,340 to $7,800/month while establishing cost ownership across 8 engineering teams running 47 microservices
At a Glance
Client Profile
- Industry: Marketing automation platform
- Company Stage: Series B, $86,000/month AWS spend
- Infrastructure: 47 microservices across 3 EKS clusters
- Timeline: 2-week engagement, January 2025
Business Context
Series B pressure to demonstrate unit economics. Board asking "what does it cost to serve each customer?" — engineering couldn't answer. Need to establish cost culture before Series C discussions.
Primary Pain Point: Complete lack of cost visibility. No team knew what their services cost. EKS spend growing 18% quarterly with no accountability mechanism.
The Situation
The client's Kubernetes infrastructure had evolved from a single monolith to 47 microservices over 3 years. Their architecture spanned 3 EKS clusters:
- Production cluster: 28 services, 18 nodes (m5.2xlarge)
- Staging cluster: 19 services, 8 nodes (m5.xlarge)
- Shared services cluster: Database proxies, monitoring, logging
Their engineering team operated in a squad model with 8 product squads (each owning 4-8 microservices), a platform team (infrastructure, CI/CD, monitoring), and a data team (analytics pipelines).
The Cost Blindness Problem
Nobody knew what anything cost. The platform team knew the total EKS spend ($11,340/month), but there was no breakdown by team, service, or environment, no visibility into which services were expensive versus cheap, and no mechanism to charge costs back to product teams. Resource requests were set arbitrarily ("let's ask for 2 CPU cores to be safe").
Result: Tragedy of the commons — every team over-requested resources because there was no cost feedback loop.
Discovery Phase
Business Context
- Revenue Model: $350/month per seat, 4,200 enterprise customers, $14.7M ARR
- Growth Stage: Series B ($28M raised), preparing for Series C in 12-18 months
- Board Pressure: "Show us unit economics" — cost to serve each customer unclear
- Engineering Structure: 8 product squads (5-8 engineers each), 1 platform team (6 engineers), 1 data team (4 engineers)
- Deployment Model: Microservices architecture, 10-12 deployments per day across 47 services
- Key Metric: Monthly Active Users (MAU) growing 22% QoQ, infrastructure costs growing 18% QoQ
Week 1: Baseline Analysis & Kubecost Installation
We installed Kubecost (open-source edition) across the clusters to establish baseline visibility.
Key Finding: 18 of the 47 services (38%) had no attribution labels at all, making them completely unattributable.
The Efficiency Gap
The average pod was requesting:
- 2.8x the CPU it actually used
- 2.7x the RAM it actually used
This over-requesting forced Kubernetes to spread workloads across more nodes than they needed, driving scheduling inefficiency and a higher AWS bill.
The Challenge: VPA Over-Correction and Service Instability
What Went Wrong
We deployed Vertical Pod Autoscaler (VPA) in recommendation mode to suggest right-sized resource requests. After 72 hours of data collection, VPA recommended reducing resource requests across 31 services. We applied VPA recommendations to 12 services in staging on Wednesday afternoon.
Thursday morning disaster:
- 5 services experienced OOMKilled errors during load testing
- 3 services had CPU throttling (P99 latency increased 240%)
- 1 service crashed repeatedly (memory leak under load)
Root Cause
The VPA recommendations were based on 72 hours of average usage, but:
- Load testing hadn't run during the VPA observation period
- Weekly batch jobs (Sunday nights) required 3.5x normal RAM
- A memory leak in the user-sync service only appeared after 6+ hours of uptime
- JVM heap tuning in analytics-pipeline required headroom beyond actual usage
The Reversal
Within 4 hours:
- Reverted all 12 services to original resource requests
- All services stabilized
- Zero production impact (staging caught everything)
The Fix
We changed our approach to VPA recommendations:
- Extended observation period: 14 days minimum (not 72 hours)
- Load testing inclusion: Ran full load tests during VPA observation
- P95 recommendations, not average: VPA configured to recommend P95 usage + 20% buffer
- Service-by-service rollout: Test one service for 48 hours before moving to the next
- Memory leak detection: Identified user-sync memory leak (fixed by dev team)
- JVM-specific handling: Java services get +35% heap headroom beyond actual usage
Lesson: VPA recommendations are great starting points, but blind application will break things. Always account for periodic workloads, load testing patterns, and language-specific memory management.
Implementation Approach
Phase 1: Visibility Foundation (Week 1)
Step 1: Kubecost Deployment
Installed Kubecost using Helm in all 3 EKS clusters:
- Cluster 1 (Production): Kubecost with 7-day retention, Prometheus integration
- Cluster 2 (Staging): Same configuration as production
- Cluster 3 (Shared Services): Monitoring only (no cost attribution needed)
- Storage: 50GB PVC per cluster for Kubecost data retention
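For reference, a minimal sketch of the kind of Helm values behind that setup, assuming the standard kubecost/cost-analyzer chart; exact value paths vary between chart versions, so treat the keys below as illustrative rather than a drop-in file.

```yaml
# values-production.yaml (illustrative; key paths vary across cost-analyzer chart versions)
# Installed with: helm install kubecost kubecost/cost-analyzer -n kubecost -f values-production.yaml
persistentVolume:
  enabled: true
  size: 50Gi              # 50GB PVC per cluster for Kubecost data retention
prometheus:
  server:
    retention: 7d         # 7-day retention for the bundled Prometheus
```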
Step 2: Label Strategy Design
Designed standardized label taxonomy with engineering leadership:
| Label | Example Values | Purpose |
|---|---|---|
| team | campaign-squad, analytics-squad, platform | Team ownership & chargeback |
| service | email-processor, api-gateway, user-sync | Service-level cost tracking |
| environment | production, staging, dev | Environment cost segregation |
| cost-center | engineering, data, infrastructure | Finance department mapping |
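Applied to a workload, the taxonomy looks like the sketch below. The manifest is trimmed to the parts that matter for attribution; the namespace and the team assignment for email-processor are illustrative, since the write-up doesn't list per-service ownership.

```yaml
# Illustrative deployment showing the label taxonomy; ownership and namespace are examples
apiVersion: apps/v1
kind: Deployment
metadata:
  name: email-processor
  namespace: production          # placeholder namespace
  labels:
    team: campaign-squad         # example owner; actual squad assignment not published
    service: email-processor
    environment: production
    cost-center: engineering
spec:
  replicas: 3                    # illustrative
  selector:
    matchLabels:
      service: email-processor
  template:
    metadata:
      labels:
        # Kubecost attributes cost at the pod level, so the taxonomy must be on pod labels too
        team: campaign-squad
        service: email-processor
        environment: production
        cost-center: engineering
    spec:
      containers:
        - name: email-processor
          image: registry.example.com/email-processor:1.4.0   # placeholder image
```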
Step 3: Label Implementation
Coordinated with 8 teams to label all deployments:
- Day 1-2: Platform team labeled shared infrastructure (12 services)
- Day 3-5: Squad leads labeled their services (35 services)
- Challenge: 3 legacy services had no clear owners — assigned to platform team temporarily
- Validation: Script to detect unlabeled pods, Slack alerts for new deployments without labels
Step 4: Dashboard Setup
Created custom Grafana dashboards fed by Kubecost API:
- Executive dashboard: Total EKS cost, cost by team, cost trend (7/30/90 day)
- Team dashboards: Each squad's cost breakdown by service, efficiency metrics
- Service dashboards: Individual service cost, resource utilization, right-sizing recommendations
Phase 1 Result: 94% cost attribution (3 legacy services = 6% unattributed)
Phase 2: Right-Sizing with VPA (Week 2)
Step 1: VPA Deployment & Configuration
After learning from initial VPA failure:
- VPA installation: Deployed VPA v0.13 with updateMode: "Off" (recommendation only)
- Observation period: 14 days minimum (not 72 hours)
- Configuration: P95 target utilization + 20% buffer, not average
- Exclusions: Stateful services (databases, caches) excluded from VPA recommendations
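A sketch of the per-service VPA object behind this configuration, assuming the autoscaling.k8s.io/v1 API. The P95-plus-buffer targeting described above is handled in the recommender settings and the review step that follows, not in this per-service object, and the min/max bounds shown here are illustrative.

```yaml
# Recommendation-only VPA, one object per Deployment; stateful services were
# excluded simply by not giving them a VPA object. Bounds are illustrative.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: email-processor
  namespace: production            # placeholder namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: email-processor
  updatePolicy:
    updateMode: "Off"              # recommend only; never evict or patch running pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi
```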
Step 2: Load Testing Integration
Ensured VPA observed real-world peak usage:
- Load tests: Ran weekly load tests during VPA observation period
- Batch jobs: Manually triggered Sunday night batch jobs mid-week for observation
- Traffic replay: Replayed production traffic patterns in staging (using GoReplay)
Step 3: Recommendation Review Process
Implemented approval workflow before applying VPA recommendations:
| Service | Current CPU | VPA Rec. | Final | Notes |
|---|---|---|---|---|
| email-processor | 2000m | 650m | 750m | Added 15% buffer |
| analytics-pipeline (JVM) | 1500m | 850m | 1150m | JVM heap needs +35% |
| user-sync | 1000m | 400m | 1000m | Memory leak found; keep current |
| campaign-executor | 800m | 550m | 550m | Applied as-is |
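Once signed off, the final numbers go back into each service's manifest as ordinary resource requests. A sketch for email-processor using the final CPU value from the table; the memory figure is illustrative because the review table only tracks CPU.

```yaml
# email-processor container resources after review
resources:
  requests:
    cpu: 750m            # VPA suggested 650m; 15% buffer added during review
    memory: 512Mi        # illustrative; the review table above only covers CPU
```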
Step 4: Phased Rollout
Rolled out service-by-service with validation gates:
- Day 1: Applied to 5 lowest-risk services (staging only)
- Day 2: Validated performance for 24 hours before proceeding
- Day 3-5: Rolled out 12 services in staging, 8 services in production
- Day 6-7: Applied remaining 22 services after validation
Phase 3: Node Optimization
Step 1: Capacity Analysis
After right-sizing pods, calculated actual node requirements:
- Production: 520 vCPU requests → 320 vCPU requests (38% reduction)
- Node calculation: 320 vCPU / (8 vCPU per m5.2xlarge * 0.75 utilization target) = 11.3 nodes
- Decision: Target 12 nodes (was 18), giving 68% target utilization
Step 2: Node Removal Strategy
Safely removed excess nodes:
- Cordon nodes: Marked 6 nodes as unschedulable
- Graceful drain: Drained pods with 5-minute termination grace period
- Pod rescheduling: Kubernetes moved pods to remaining nodes
- Validation: Monitored P99 latency, error rates for 48 hours
- Termination: Terminated drained nodes after validation
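The write-up doesn't spell out the drain safeguards, but a drain of this size generally assumes every multi-replica service carries a PodDisruptionBudget so Kubernetes never evicts the last healthy replica. An assumed example of that guardrail:

```yaml
# Assumed guardrail (not detailed in the engagement notes): keep at least one
# replica of each service available while nodes are drained.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: email-processor
  namespace: production            # placeholder namespace
spec:
  minAvailable: 1
  selector:
    matchLabels:
      service: email-processor
```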
Step 3: Cluster Autoscaler Configuration
Enabled autoscaling with proper bounds:
Production Cluster Autoscaler Config:
```yaml
min-nodes: 8                             # Handle 50% utilization at minimum
max-nodes: 18                            # Allow 2.25x scale for traffic spikes
scale-down-delay-after-add: 10m
scale-down-unneeded-time: 10m
scale-down-utilization-threshold: 0.65
```
Phase 4: Cost Culture Establishment
Step 1: Automated Reporting
Built cost visibility into team workflows:
- Weekly Slack reports: Python script queries Kubecost API, posts to team channels
- Report format: "Campaign Squad: $2,140 this week (+8% vs last week). Top services: campaign-executor ($850), campaign-scheduler ($580)"
- Budget alerts: Automated alerts when team exceeds 110% of monthly budget
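The weekly report described above runs as an ordinary in-cluster job. A sketch of the CronJob wrapping the reporting script, assuming the script is packaged as a container image; the image name and webhook secret are placeholders, and the Kubecost URL is the chart's usual in-cluster service address.

```yaml
# Weekly cost report CronJob; image, secret, and schedule details are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-cost-report
  namespace: kubecost
spec:
  schedule: "0 9 * * 1"                  # Mondays at 09:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cost-report
              image: registry.example.com/cost-report:latest   # placeholder image containing the Python script
              env:
                - name: KUBECOST_URL
                  value: "http://kubecost-cost-analyzer.kubecost.svc:9090"   # typical cost-analyzer service address
                - name: SLACK_WEBHOOK_URL
                  valueFrom:
                    secretKeyRef:
                      name: slack-webhook          # placeholder secret
                      key: url
```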
Step 2: Dashboard Distribution
Provided each team with self-service visibility:
- Grafana access: SSO integration, team-scoped dashboards
- Cost trend charts: 7/30/90 day rolling windows
- Service comparison: Cost per service within team's portfolio
- Right-sizing opportunities: VPA recommendations visible to teams
Step 3: Monthly Cost Reviews
Established recurring cost review meetings:
- Engineering leadership: Monthly review with VPE, platform lead, finance
- Team reviews: Quarterly reviews with each squad
- Budget allocation: Each squad assigned quarterly Kubernetes budget
- Accountability: Cost growth requires justification (new features, user growth, etc.)
Step 4: Knowledge Transfer
Trained platform team to maintain cost optimization:
- Documentation: Runbook for VPA recommendation review process
- Training sessions: 2-hour workshop on Kubecost, resource requests, right-sizing
- Handoff: Platform engineer now owns cost monitoring (2 hours/week)
Results in Detail
Cost Savings Breakdown
| Component | Before | After | Monthly Savings |
|---|---|---|---|
| Production nodes (18 → 12 m5.2xlarge) | $7,344 | $4,896 | −$2,448 (33.3%) |
| Staging nodes (8 → 5 m5.xlarge) | $2,764 | $1,728 | −$1,036 (37.5%) |
| Shared services (unchanged) | $1,232 | $1,176 | −$56 (4.5%) |
| Total EKS | $11,340 | $7,800 | −$3,540 (31.2%) |
Efficiency Improvements
- Node CPU utilization: 42% → 68% (+26pp)
- Node RAM utilization: 35% → 64% (+29pp)
- Pod resource request accuracy: 36% → 82%
- Over-requested CPU: 2.8x → 1.2x
- Over-requested RAM: 2.7x → 1.3x
- Cost attribution: 6% → 94%
Business Value
Immediate Impact
- $3,540/month = $42,480 annual savings
- Engineering teams now have cost feedback loop
- Platform team can justify infrastructure investments with cost data
Long-term Value
- Cost culture established: Teams now consider cost when designing services
- Unit economics visibility: Can calculate cost per customer (was impossible before)
- Budget accountability: Each squad has quarterly Kubernetes budget
- Architectural decisions informed by cost: Campaign squad considering consolidating 3 services into 1 after seeing combined cost
Real Example: Campaign Squad Cost Awareness
The campaign squad saw that their services cost $2,140/month (18.8% of total EKS spend) and that campaign-scheduler alone cost $580/month for a service handling 200 requests per day. They refactored campaign-scheduler to run on Lambda (now $14/month), voluntarily reducing their costs by $566/month through architectural optimization.
This is the power of cost visibility — teams self-optimize when they see the numbers.
Lessons Learned
What Worked
- Kubecost as visibility tool: Open-source edition sufficient, no enterprise features required
- Label standardization: Enforcing consistent labels made attribution possible
- Team engagement: Weekly cost reports in Slack created positive peer pressure
- Staging-first rollout: VPA disaster caught in staging, zero production impact
What Didn't Work
- 72-hour VPA observation: Way too short, missed periodic workloads
- Blind VPA application: Caused 5 OOMKilled errors and 3 CPU throttling incidents
- Assumption about steady-state workloads: Many services have weekly/monthly batch jobs that spike resource usage
Key Takeaways
- Cost visibility drives behavior change: Teams voluntarily optimized when they saw their spend
- VPA is powerful but dangerous: Great recommendations, but requires careful validation before application
- Observation period matters: 14 days minimum to capture periodic workloads
- Language-specific tuning: JVM services need different resource headroom than Go services
- Cost culture > cost optimization: Establishing accountability delivers compounding returns
Need Kubernetes Cost Visibility?
If your EKS infrastructure has grown without cost attribution, we can help establish team-level accountability and optimize resource utilization.
Schedule a Free Assessment
2-week engagement • Read-only audit • Reversible changes • No commitment