Case Study: Cutting FinTech Storage Costs 27.4% While Meeting Compliance Requirements
How we reduced EBS costs from $14,200 to $10,310/month for a payment processing platform while maintaining IOPS performance and meeting SOC 2 audit requirements
At a Glance
Client Profile
- Industry: Payment processing platform
- Company Stage: Series A, 2,500+ merchants
- Infrastructure: 186 EBS volumes (780 TB total)
- Timeline: 2-week engagement, January 2025
Key Challenge
High storage costs combined with strict compliance requirements (SOC 2, PCI-DSS). Critical constraint: Zero downtime allowed during business hours (6 AM - 11 PM EST).
Primary Pain Point: Storage budget growing 8-12% monthly alongside merchant acquisition, with 62 orphaned volumes and no snapshot lifecycle policy.
The Situation
This Series A payment processing platform had grown rapidly over 18 months, onboarding 2,500+ merchants. Their infrastructure supported transaction processing, merchant data warehousing, and compliance logging across 186 EBS volumes totaling 780 TB.
Storage costs were their third-largest AWS expense at $14,200/month, growing 8-12% monthly alongside merchant acquisition. Their CFO reached out after noticing storage costs weren't scaling efficiently with revenue.
The CTO's primary concern: "We need to optimize, but we can't risk SOC 2 audit findings or customer-facing incidents. Every change needs documentation and zero downtime."
Discovery Phase
Initial Assessment
We performed read-only analysis using AWS Cost Explorer, CloudWatch, and EBS metrics to understand their storage footprint.
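To make this concrete, here is a minimal sketch of the kind of read-only orphan scan this phase involves. The helper names and the $0.10/GB-month gp2 list price are illustrative, not the client's actual tooling:

```python
GP2_PRICE_PER_GB_MONTH = 0.10  # assumed us-east-1 gp2 list price; adjust per region

def estimated_monthly_waste(volumes):
    """Rough monthly spend for a list of volume dicts with a 'Size' key (GiB)."""
    return sum(v["Size"] for v in volumes) * GP2_PRICE_PER_GB_MONTH

def find_unattached_volumes(region="us-east-1"):
    """Read-only scan: volumes in the 'available' state are attached to nothing."""
    import boto3  # deferred so the cost helper above has no AWS dependency
    ec2 = boto3.client("ec2", region_name=region)
    orphans = []
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    ):
        orphans.extend(page["Volumes"])
    return orphans
```

Because the scan only calls describe APIs, it needs no change ticket and carries no risk to production.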
Business Context
- Revenue Model: Transaction fees (2.9% + $0.30 per payment) processing $45M monthly volume
- Growth Stage: Series A with 2,500+ merchants, adding 150-200 merchants/month
- Compliance Requirements: SOC 2 Type II, PCI-DSS Level 1, quarterly audits
- Data Retention: 7-year transaction history required by regulation
- Critical Constraint: Zero downtime during business hours (6 AM - 11 PM EST)
- Team Size: 6 platform engineers supporting infrastructure
Infrastructure Inventory
Workload | Volume Count | Total Size | Monthly Cost | Avg IOPS |
---|---|---|---|---|
Primary transaction DB (gp2) | 15 | 120 TB | $3,840 | 3,800 |
Merchant data warehouse (gp2) | 52 | 285 TB | $3,210 | 1,100 |
Compliance logs (gp2) | 89 | 337 TB | $1,870 | 450 |
Dev/staging environments (gp2) | 30 | 38 TB | $0 | 200 |
Total Active | 186 | 780 TB | $8,920 | - |
Orphaned volumes | 62 | 38 TB | $2,850 | 0 |
Key Findings
- Orphaned volumes: 62 unattached volumes (38 TB) = $2,850/month waste
- Snapshot bloat: 1,847 snapshots with no lifecycle policy
- All volumes on gp2: none migrated to the cheaper gp3 type (available since late 2020)
- IOPS utilization: Only 23% average, indicating over-provisioning
- Performance baseline: Transaction DB P99 latency at 12ms (healthy)
Compliance Documentation: Worked with compliance team to ensure all changes met SOC 2 requirements: change approval tickets, performance baselines, rollback procedures, and audit trails.
The Challenge: gp3 IOPS Under-Provisioning
Our initial migration plan targeted converting the primary transaction database from gp2 to gp3 during a Sunday 2 AM maintenance window. The conversion completed successfully, but reality had other plans.
What Happened
Sunday 2:00 AM: Successfully converted 15 primary DB volumes from gp2 to gp3 with standard 3,000 IOPS baseline. Monitoring showed successful completion.
Monday 9:30 AM: Load testing revealed P99 query latency increased from 12ms to 19ms (+58% degradation). Investigation showed gp2 had been using burst credits to reach 3,800 IOPS during peak periods.
Monday 11:15 AM: Immediate rollback decision. Reverted all 15 volumes back to gp2 within 2 hours. Performance returned to baseline (11ms P99 latency). Zero customer-facing impact due to early detection during load testing.
Tuesday-Wednesday: Re-analyzed burst credit utilization, provisioned gp3 with 4,200 IOPS (not 3,000 baseline). Successfully migrated with improved performance: 10.5ms P99 latency (12% better than original).
Lesson Learned: Always analyze burst credit utilization before a gp2 → gp3 migration. CloudWatch IOPS metrics don't clearly surface burst dependency, and baseline gp3 (3,000 IOPS) may underperform burst-dependent gp2 workloads.
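The burst-dependency check we folded into the methodology afterward reduces to a simple test over the sampling window. A sketch (the function name is ours; BurstBalance is the real AWS/EBS CloudWatch metric, reported as a percentage):

```python
def is_burst_dependent(burst_balance_samples, threshold=50.0):
    """True if a gp2 volume's CloudWatch BurstBalance (%) ever dipped below
    the threshold over the sampling window. A low minimum means the workload
    leans on burst credits, so gp3's 3,000 IOPS baseline may not be enough."""
    return min(burst_balance_samples) < threshold

# In practice the samples come from CloudWatch: get_metric_statistics on the
# AWS/EBS namespace, metric "BurstBalance", Minimum statistic, 30-day window.
```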
Implementation Approach
Phase 1: Low-Risk Cleanup (Week 1)
Quick Wins
- Orphaned volumes: Removed 62 unattached volumes after a 7-day grace period = $2,850/month savings
- Snapshot lifecycle: Implemented DLM policy (7/30/365 day retention), deleted 1,241 stale snapshots = $500/month savings
- Tagging: Tagged all volumes with cost center and application ID for future attribution
Phase 1 Result: $3,350/month savings (23.6% of the total storage bill) with zero risk
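For reference, a 7/30/365-day retention scheme like the one above can be expressed as a DLM PolicyDetails document along these lines (the target tag key and schedule times are illustrative; the real values would match the client's backup window and tagging scheme):

```json
{
  "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
  "ResourceTypes": ["VOLUME"],
  "TargetTags": [{ "Key": "backup-tier", "Value": "standard" }],
  "Schedules": [
    {
      "Name": "daily-keep-7d",
      "CreateRule": { "Interval": 24, "IntervalUnit": "HOURS", "Times": ["07:00"] },
      "RetainRule": { "Interval": 7, "IntervalUnit": "DAYS" }
    },
    {
      "Name": "weekly-keep-30d",
      "CreateRule": { "CronExpression": "cron(0 7 ? * SUN *)" },
      "RetainRule": { "Interval": 30, "IntervalUnit": "DAYS" }
    },
    {
      "Name": "monthly-keep-365d",
      "CreateRule": { "CronExpression": "cron(0 7 1 * ? *)" },
      "RetainRule": { "Interval": 365, "IntervalUnit": "DAYS" }
    }
  ]
}
```

Once attached to a role with snapshot permissions, DLM enforces retention automatically, so snapshot bloat cannot silently recur.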
Phase 2: gp2 → gp3 Migration (Week 2)
Migration Strategy
Migrated 124 volumes with a zero-downtime approach after correcting our IOPS tuning methodology:
Step 1: IOPS Analysis & Planning
- Metric collection: Exported 30-day CloudWatch metrics for VolumeReadOps, VolumeWriteOps, BurstBalance
- Burst analysis: Identified volumes with BurstBalance <50% (indicating burst dependency)
- IOPS calculation: P95 actual usage + 20% buffer for each workload type
- Cost modeling: Created spreadsheet comparing gp2 cost vs gp3 with provisioned IOPS
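The P95 + 20% sizing rule above can be sketched as a small helper (illustrative; gp3's included baseline is 3,000 IOPS and its per-volume ceiling is 16,000):

```python
def provisioned_iops(iops_samples, buffer=0.20, gp3_baseline=3000, gp3_max=16000):
    """Size gp3 IOPS at P95 of observed usage plus a safety buffer,
    never below the free baseline or above the gp3 hard limit."""
    ordered = sorted(iops_samples)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return min(gp3_max, max(gp3_baseline, int(round(p95 * (1 + buffer)))))
```

For example, a workload holding ~3,500 IOPS at P95 (hypothetical figure) sizes out to 4,200 — the provisioning level used in the successful second migration attempt.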
Step 2: Testing & Validation
- Dev environment: Migrated 5 dev volumes first to validate process
- Load testing: Replayed production traffic patterns for 48 hours
- Monitoring setup: CloudWatch alarms for P99 latency, IOPS throttling, queue depth
- Rollback procedure: Documented ModifyVolume API commands for quick reversion
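The documented rollback path boils down to one ModifyVolume call per volume. A sketch (helper names are ours, not the client's runbook):

```python
def rollback_spec(volume_id, original_type="gp2"):
    """Parameters for ec2.modify_volume to revert a migrated volume in place."""
    return {"VolumeId": volume_id, "VolumeType": original_type}

def rollback(volume_ids, region="us-east-1"):
    import boto3  # deferred so rollback_spec() stays dependency-free for review
    ec2 = boto3.client("ec2", region_name=region)
    for vid in volume_ids:
        ec2.modify_volume(**rollback_spec(vid))
        # ModifyVolume is online (no detach, no downtime); poll
        # describe_volumes_modifications until the state leaves "modifying".
        # Note: EC2 generally allows one modification per volume per ~6 hours,
        # which is why the failed Sunday migration could still be reverted Monday.
```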
Step 3: Phased Migration
Workload | Volumes | gp3 IOPS | Migration Window | Status |
---|---|---|---|---|
Compliance logs (lowest risk) | 89 | 500 | Sunday 2-4 AM | ✅ Success |
Warehouse (variable load) | 52 | 1,200 | Tuesday 2-4 AM | ✅ Success |
Primary DB (1st attempt) | 15 | 3,000 | Sunday 2-4 AM | ❌ Rolled back |
Primary DB (2nd attempt) | 15 | 4,200 | Wednesday 2-4 AM | ✅ Success |
Step 4: Post-Migration Validation
- Performance monitoring: 7-day observation period for each workload
- Cost verification: Compared actual AWS bills against projections
- Compliance documentation: Updated change tickets with results for SOC 2 audit
- Team handoff: Trained platform engineers on gp3 IOPS tuning methodology
Monitoring & Optimization
Implemented automated monitoring to track performance and cost:
- CloudWatch Dashboard: Real-time IOPS utilization, burst balance (for remaining gp2), latency P99
- Cost anomaly detection: AWS Cost Anomaly Detection configured for 20% storage variance
- Weekly reports: Automated report showing IOPS headroom for right-sizing opportunities
- DLM automation: Data Lifecycle Manager policies for 7/30/365 day snapshot retention
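The weekly headroom report reduces to a small per-volume calculation (sketch; the samples would come from the same CloudWatch export used during planning):

```python
def iops_headroom_pct(provisioned_iops, iops_samples):
    """Percent of provisioned IOPS left unused at the observed P95.
    Sustained high headroom flags a volume for downward right-sizing."""
    ordered = sorted(iops_samples)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return round(100.0 * (1.0 - p95 / provisioned_iops), 1)
```

A volume consistently reporting 50%+ headroom is a candidate for lowering provisioned IOPS on the next review cycle.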
Results in Detail
Cost Savings Breakdown
Component | Before | After | Savings |
---|---|---|---|
Active volumes (gp2 → gp3) | $8,920 | $8,380 | −$540 (6.1%) |
Orphaned volumes | $2,850 | $0 | −$2,850 (100%) |
Snapshots | $2,430 | $1,930 | −$500 (20.6%) |
Total Storage | $14,200 | $10,310 | −$3,890 (27.4%) |
Performance Impact
Transaction Database
P99 latency: 12ms → 10.5ms (12% improvement)
Compliance Logging
Throughput increased 8% due to optimized IOPS allocation
Business Value
Immediate Impact
- $3,890/month = $46,680 annual savings
- Funds 1.2 additional engineering FTEs for feature development
- Extended Series A runway by ~45 days
Long-term Value
- DLM automation prevents future snapshot bloat
- Storage cost growth decoupled from merchant growth
- Predictable IOPS performance (no burst credit dependency)
- Audit-ready change documentation for SOC 2 compliance
Lessons Learned
✅ What Worked
- Compliance-first approach: Involving compliance team early prevented blockers
- Performance baselining: Detailed IOPS analysis prevented under-provisioning
- Phased rollout: Low-risk cleanup first built confidence
- Rollback readiness: 2-hour reversal with zero customer impact
❌ What Didn't Work
- Initial gp3 tuning: Underestimated burst credit dependency
- Workload assumptions: Transaction DB had 25% IOPS variance during peaks
Key Takeaways
- Never trust defaults: gp3 baseline IOPS (3,000) isn't always sufficient for gp2 migration
- Burst credits are invisible debt: CloudWatch IOPS metrics don't clearly show burst utilization
- Compliance adds time, not complexity: With proper documentation, change management was smooth
- Low-hanging fruit first: Orphaned volumes built trust and funded more complex optimizations
Optimize Your Storage Costs
If your AWS infrastructure has grown organically without storage optimization, we can help you reduce costs while maintaining compliance requirements.
Schedule a Free Assessment
2-week engagement • Read-only audit • Reversible changes • SOC 2 compliant