EBS Storage • FinTech (Payment Processing) • Series A, 35-80 employees

Case Study: Cutting FinTech Storage Costs 27.4% While Meeting Compliance Requirements

How we reduced EBS costs from $14,200 to $10,310/month for a payment processing platform while maintaining IOPS performance and meeting SOC 2 audit requirements

Monthly AWS Spend: $47,000 • Cost Reduction: 27.4% • Timeline: 2 weeks • Published: January 15, 2025

At a Glance

Client Profile

  • Industry: Payment processing platform
  • Company Stage: Series A, 2,500+ merchants
  • Infrastructure: 186 EBS volumes (780 TB total)
  • Timeline: 2-week engagement, January 2025

Key Challenge

High storage costs combined with strict compliance requirements (SOC 2, PCI-DSS). Critical constraint: Zero downtime allowed during business hours (6 AM - 11 PM EST).

Primary Pain Point: Storage budget growing 8-12% monthly alongside merchant acquisition, with 62 orphaned volumes and no snapshot lifecycle policy.

  • 27.4% monthly cost reduction: $14,200 → $10,310/month
  • 12% P99 latency improvement with more consistent IOPS (no burst-credit dependency)
  • 62 orphaned volumes removed, reclaiming 38 TB

The Situation

This Series A payment processing platform had grown rapidly over 18 months, onboarding 2,500+ merchants. Their infrastructure supported transaction processing, merchant data warehousing, and compliance logging across 186 EBS volumes totaling 780 TB.

Storage costs were their third-largest AWS expense at $14,200/month, growing 8-12% monthly alongside merchant acquisition. Their CFO reached out after noticing storage costs weren't scaling efficiently with revenue.

The CTO's primary concern: "We need to optimize, but we can't risk SOC 2 audit findings or customer-facing incidents. Every change needs documentation and zero downtime."

Discovery Phase

Initial Assessment

We performed read-only analysis using AWS Cost Explorer, CloudWatch, and EBS metrics to understand their storage footprint.
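As an illustration of what that read-only pass looks like, the sketch below (boto3; the region is a placeholder) inventories every EBS volume, tallies volume types, and flags unattached volumes without mutating anything.

```python
# Read-only EBS inventory sketch (boto3). Region is a placeholder.
from collections import Counter
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

type_counts, unattached, total_gib = Counter(), [], 0

for page in ec2.get_paginator("describe_volumes").paginate():
    for vol in page["Volumes"]:
        total_gib += vol["Size"]                 # Size is reported in GiB
        type_counts[vol["VolumeType"]] += 1
        if not vol["Attachments"]:               # no instance attached -> orphan candidate
            unattached.append((vol["VolumeId"], vol["Size"]))

print(f"Volumes by type: {dict(type_counts)}")
print(f"Total provisioned: {total_gib / 1024:.1f} TiB")
print(f"Unattached volumes: {len(unattached)} "
      f"({sum(size for _, size in unattached) / 1024:.1f} TiB)")
```

The same describe-only approach extends to snapshots and CloudWatch metrics, which is how the findings below were assembled.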

Business Context

  • Revenue Model: Transaction fees (2.9% + $0.30 per payment) processing $45M monthly volume
  • Growth Stage: Series A with 2,500+ merchants, adding 150-200 merchants/month
  • Compliance Requirements: SOC 2 Type II, PCI-DSS Level 1, quarterly audits
  • Data Retention: 7-year transaction history required by regulation
  • Critical Constraint: Zero downtime during business hours (6 AM - 11 PM EST)
  • Team Size: 6 platform engineers supporting infrastructure

Infrastructure Inventory

| Workload | Volume Count | Total Size | Monthly Cost | Avg IOPS |
| --- | --- | --- | --- | --- |
| Primary transaction DB (gp2) | 15 | 120 TB | $3,840 | 3,800 |
| Merchant data warehouse (gp2) | 52 | 285 TB | $3,210 | 1,100 |
| Compliance logs (gp2) | 89 | 337 TB | $1,870 | 450 |
| Dev/staging environments (gp2) | 30 | 38 TB | $0 | 200 |
| Total Active | 186 | 780 TB | $8,920 | - |
| Orphaned volumes | 62 | 38 TB | $2,850 | 0 |

Key Findings

  • Orphaned volumes: 62 unattached volumes (38 TB) = $2,850/month waste
  • Snapshot bloat: 1,847 snapshots with no lifecycle policy (see the age-audit sketch after this list)
  • Entire fleet on gp2: no volumes migrated to gp3, which launched in late 2020 and is roughly 20% cheaper per GB
  • IOPS utilization: Only 23% average, indicating over-provisioning
  • Performance baseline: Transaction DB P99 latency at 12ms (healthy)
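To quantify the snapshot-bloat finding, a read-only pass like the one below groups account-owned snapshots by age. The age buckets are illustrative, not the client's retention rule.

```python
# Count owned EBS snapshots and bucket them by age (read-only).
from datetime import datetime, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
now = datetime.now(timezone.utc)
buckets = {"<30d": 0, "30-365d": 0, ">365d": 0}

for page in ec2.get_paginator("describe_snapshots").paginate(OwnerIds=["self"]):
    for snap in page["Snapshots"]:
        age_days = (now - snap["StartTime"]).days
        if age_days < 30:
            buckets["<30d"] += 1
        elif age_days <= 365:
            buckets["30-365d"] += 1
        else:
            buckets[">365d"] += 1

print(buckets)  # distribution of snapshot ages across the three buckets
```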

Compliance Documentation: We worked with the client's compliance team to ensure every change met SOC 2 requirements: change-approval tickets, performance baselines, rollback procedures, and audit trails.

The Challenge: gp3 IOPS Under-Provisioning

Our initial migration plan targeted converting the primary transaction database from gp2 to gp3 during a Sunday 2 AM maintenance window. The conversion completed successfully, but reality had other plans.

What Happened

Sunday 2:00 AM: Successfully converted 15 primary DB volumes from gp2 to gp3 with standard 3,000 IOPS baseline. Monitoring showed successful completion.

Monday 9:30 AM: Load testing revealed P99 query latency increased from 12ms to 19ms (+58% degradation). Investigation showed gp2 had been using burst credits to reach 3,800 IOPS during peak periods.

Monday 11:15 AM: Immediate rollback decision. Reverted all 15 volumes back to gp2 within 2 hours. Performance returned to baseline (11ms P99 latency). Zero customer-facing impact due to early detection during load testing.

Tuesday-Wednesday: Re-analyzed burst credit utilization, provisioned gp3 with 4,200 IOPS (not 3,000 baseline). Successfully migrated with improved performance: 10.5ms P99 latency (12% better than original).

Lesson Learned: Always analyze burst credit utilization before gp2→gp3 migration. CloudWatch IOPS metrics don't clearly show burst dependency. Baseline gp3 (3,000 IOPS) may underperform burst-dependent gp2 workloads.
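That lesson translates into a concrete pre-flight check: pull the BurstBalance metric for each gp2 volume over the last 30 days and treat anything that dips well below 100% as burst-dependent. A minimal sketch, assuming the same 50% threshold used above (volume IDs are placeholders):

```python
# Flag gp2 volumes whose BurstBalance dipped below 50% in the last 30 days.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

def min_burst_balance(volume_id: str) -> float:
    """Lowest BurstBalance (%) observed for the volume over the window."""
    resp = cw.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=start,
        EndTime=end,
        Period=3600,                 # hourly granularity keeps the call under datapoint limits
        Statistics=["Minimum"],
    )
    points = [p["Minimum"] for p in resp["Datapoints"]]
    return min(points) if points else 100.0   # no datapoints usually means a non-burstable volume

for vol_id in ["vol-0123456789abcdef0"]:       # placeholder volume IDs
    floor = min_burst_balance(vol_id)
    if floor < 50.0:
        print(f"{vol_id}: burst-dependent (min BurstBalance {floor:.0f}%) "
              "-> provision gp3 IOPS above the 3,000 baseline")
```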

Implementation Approach

Phase 1: Low-Risk Cleanup (Week 1)

Quick Wins

  • Orphaned volumes: Removed 62 unattached volumes after 7-day grace period = $2,850/month savings
  • Snapshot lifecycle: Implemented DLM policy (7/30/365-day retention; sketched after this phase), deleted 1,241 stale snapshots = $500/month savings
  • Tagging: Tagged all volumes with cost center and application ID for future attribution

Phase 1 Result: $3,350/month savings (23.6% of the $14,200 monthly storage spend) with zero risk
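A minimal sketch of the DLM piece, assuming tag-targeted volumes. The role ARN, account ID, and tag are placeholders, and only the daily tier is shown; the 30- and 365-day tiers follow the same pattern as additional schedules on the policy.

```python
# Create a Data Lifecycle Manager policy: daily snapshots retained for 7 days.
# Role ARN, account ID, and target tag are placeholders.
import boto3

dlm = boto3.client("dlm", region_name="us-east-1")

dlm.create_lifecycle_policy(
    ExecutionRoleArn="arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole",
    Description="EBS snapshots: daily tier, 7-day retention",
    State="ENABLED",
    PolicyDetails={
        "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
        "ResourceTypes": ["VOLUME"],
        "TargetTags": [{"Key": "snapshot-tier", "Value": "standard"}],
        "Schedules": [
            {
                "Name": "daily-7d",
                "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
                "RetainRule": {"Interval": 7, "IntervalUnit": "DAYS"},
                "CopyTags": True,
            }
            # 30- and 365-day tiers would be added here as additional schedules
        ],
    },
)
```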

Phase 2: gp2 → gp3 Migration (Week 2)

Migration Strategy

Migrated 156 volumes (compliance logs, data warehouse, and primary DB) with a zero-downtime approach after correcting the IOPS tuning methodology:

Step 1: IOPS Analysis & Planning
  • Metric collection: Exported 30-day CloudWatch metrics for VolumeReadOps, VolumeWriteOps, BurstBalance
  • Burst analysis: Identified volumes with BurstBalance <50% (indicating burst dependency)
  • IOPS calculation: P95 of actual usage + 20% buffer for each workload type (see the sketch after this step)
  • Cost modeling: Created spreadsheet comparing gp2 cost vs gp3 with provisioned IOPS
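The "P95 + 20%" rule from Step 1 reduces to a short calculation over CloudWatch data. A sketch under the same assumptions described above (hourly buckets, placeholder volume IDs):

```python
# Derive a gp3 IOPS target: P95 of observed IOPS over 30 days, plus 20% headroom.
# Hourly buckets keep each get_metric_statistics call under the 1,440-datapoint limit;
# finer granularity would require splitting the window into chunks.
from datetime import datetime, timedelta, timezone
import math
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(days=30)
PERIOD = 3600  # seconds

def observed_iops(volume_id: str) -> list[float]:
    """Combined read+write operations per second for each hourly bucket."""
    totals: dict = {}
    for metric in ("VolumeReadOps", "VolumeWriteOps"):
        resp = cw.get_metric_statistics(
            Namespace="AWS/EBS", MetricName=metric,
            Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
            StartTime=start, EndTime=end, Period=PERIOD, Statistics=["Sum"],
        )
        for point in resp["Datapoints"]:
            totals[point["Timestamp"]] = totals.get(point["Timestamp"], 0.0) + point["Sum"]
    return [ops / PERIOD for ops in totals.values()]

def gp3_iops_target(volume_id: str) -> int:
    samples = sorted(observed_iops(volume_id))
    if not samples:
        return 3000                                   # gp3 included baseline
    p95 = samples[math.ceil(0.95 * len(samples)) - 1]
    return max(3000, int(p95 * 1.2))                  # P95 + 20% buffer, never below baseline

print(gp3_iops_target("vol-0123456789abcdef0"))       # placeholder volume ID
```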
Step 2: Testing & Validation
  • Dev environment: Migrated 5 dev volumes first to validate process
  • Load testing: Replayed production traffic patterns for 48 hours
  • Monitoring setup: CloudWatch alarms for P99 latency, IOPS throttling, queue depth
  • Rollback procedure: Documented ModifyVolume API commands for quick reversion (sketched below)
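The rollback itself is a single ModifyVolume call per volume. The sketch below shows the general shape of such a procedure (boto3, placeholder volume IDs): revert to gp2, then poll the modification state until the change has been applied.

```python
# Revert a set of volumes to gp2 and wait for each modification to leave the
# 'modifying' state. Volume IDs are placeholders.
# Note: EBS generally allows only one modification per volume every ~6 hours.
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ROLLBACK_VOLUMES = ["vol-0123456789abcdef0"]

for vol_id in ROLLBACK_VOLUMES:
    ec2.modify_volume(VolumeId=vol_id, VolumeType="gp2")

pending = set(ROLLBACK_VOLUMES)
while pending:
    resp = ec2.describe_volumes_modifications(VolumeIds=list(pending))
    for mod in resp["VolumesModifications"]:
        # States progress: modifying -> optimizing -> completed;
        # the volume serves I/O at the new settings once it leaves 'modifying'.
        if mod["ModificationState"] != "modifying":
            pending.discard(mod["VolumeId"])
    if pending:
        time.sleep(30)

print("Rollback applied; volumes are serving I/O with gp2 characteristics")
```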
Step 3: Phased Migration
| Workload | Volumes | gp3 IOPS | Migration Window | Status |
| --- | --- | --- | --- | --- |
| Compliance logs (lowest risk) | 89 | 500 | Sunday 2-4 AM | ✓ Success |
| Warehouse (variable load) | 52 | 1,200 | Tuesday 2-4 AM | ✓ Success |
| Primary DB (1st attempt) | 15 | 3,000 | Sunday 2-4 AM | ✗ Rolled back |
| Primary DB (2nd attempt) | 15 | 4,200 | Wednesday 2-4 AM | ✓ Success |
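Put together, each wave in the table reduces to a loop over that wave's volumes with the per-workload IOPS target. A sketch of the shape of one wave (volume IDs, IOPS, and throughput are placeholders; throughput is set explicitly because the gp3 default of 125 MiB/s can be lower than what a large gp2 volume was delivering):

```python
# One migration wave: convert a batch of volumes to gp3 with provisioned IOPS
# and throughput, then poll until no volume is still in the 'modifying' state.
# Volume IDs and performance targets are placeholders.
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

WAVE = {
    "vol-0123456789abcdef0": {"iops": 4200, "throughput": 250},  # throughput in MiB/s
    "vol-0fedcba9876543210": {"iops": 4200, "throughput": 250},
}

for vol_id, target in WAVE.items():
    ec2.modify_volume(
        VolumeId=vol_id,
        VolumeType="gp3",
        Iops=target["iops"],
        Throughput=target["throughput"],
    )

pending = set(WAVE)
while pending:
    resp = ec2.describe_volumes_modifications(VolumeIds=list(pending))
    for mod in resp["VolumesModifications"]:
        if mod["ModificationState"] in ("optimizing", "completed"):
            pending.discard(mod["VolumeId"])
        elif mod["ModificationState"] == "failed":
            raise RuntimeError(f"{mod['VolumeId']} failed to modify")
    if pending:
        time.sleep(60)

print("Wave complete: volumes are online with gp3 performance settings")
```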
Step 4: Post-Migration Validation
  • Performance monitoring: 7-day observation period for each workload
  • Cost verification: Compared actual AWS bills against projections (see the Cost Explorer sketch after this list)
  • Compliance documentation: Updated change tickets with results for SOC 2 audit
  • Team handoff: Trained platform engineers on gp3 IOPS tuning methodology
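For the cost-verification step, one way to pull the relevant line items is to query Cost Explorer grouped by usage type and keep the EBS volume entries. A sketch (boto3; the dates are placeholders, and Cost Explorer must be enabled on the account):

```python
# Sum monthly EBS volume spend from Cost Explorer, grouped by usage type.
# Dates are placeholders.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-12-01", "End": "2025-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for period in resp["ResultsByTime"]:
    ebs_total = sum(
        float(group["Metrics"]["UnblendedCost"]["Amount"])
        for group in period["Groups"]
        if "EBS:VolumeUsage" in group["Keys"][0]   # e.g. "EBS:VolumeUsage.gp3"
    )
    print(period["TimePeriod"]["Start"], f"${ebs_total:,.2f}")
```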

Monitoring & Optimization

Implemented automated monitoring to track performance and cost:

  • CloudWatch Dashboard: Real-time IOPS utilization, burst balance for remaining gp2 volumes (see the alarm sketch after this list), and P99 latency
  • Cost anomaly detection: AWS Cost Anomaly Detection configured for 20% storage variance
  • Weekly reports: Automated report showing IOPS headroom for right-sizing opportunities
  • DLM automation: Data Lifecycle Manager policies for 7/30/365 day snapshot retention
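As one concrete piece of that monitoring, a per-volume BurstBalance alarm catches any remaining gp2 volume that starts leaning on burst credits. A sketch, with the SNS topic ARN, volume ID, and 40% threshold as placeholder assumptions:

```python
# Alarm when a remaining gp2 volume's BurstBalance drops below 40% for 15 minutes.
# SNS topic ARN and volume ID are placeholders.
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="ebs-gp2-burst-balance-vol-0123456789abcdef0",
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,                # three consecutive 5-minute periods
    Threshold=40.0,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:storage-alerts"],
)
```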

Results in Detail

Cost Savings Breakdown

| Component | Before | After | Savings |
| --- | --- | --- | --- |
| Active volumes (gp2 → gp3) | $8,920 | $6,380 | -$2,540 (28.5%) |
| Orphaned volumes | $2,850 | $0 | -$2,850 (100%) |
| Snapshots | $2,430 | $1,930 | -$500 (20.6%) |
| Total Storage | $14,200 | $10,310 | -$3,890 (27.4%) |

Performance Impact

Transaction Database

P99 latency: 12ms → 10.5ms (12% improvement)

Compliance Logging

Throughput increased 8% due to optimized IOPS allocation

Business Value

Immediate Impact

  • $3,890/month = $46,680 annual savings
  • Funds 1.2 additional engineering FTEs for feature development
  • Extended Series A runway by ~45 days

Long-term Value

  • DLM automation prevents future snapshot bloat
  • Storage cost growth decoupled from merchant growth
  • Predictable IOPS performance (no burst credit dependency)
  • Audit-ready change documentation for SOC 2 compliance

Lessons Learned

✓ What Worked

  • Compliance-first approach: Involving compliance team early prevented blockers
  • Performance baselining: Detailed IOPS analysis prevented under-provisioning
  • Phased rollout: Low-risk cleanup first built confidence
  • Rollback readiness: 2-hour reversal with zero customer impact

✗ What Didn't Work

  • Initial gp3 tuning: Underestimated burst credit dependency
  • Workload assumptions: Transaction DB had 25% IOPS variance during peaks

Key Takeaways

  • Never trust defaults: gp3 baseline IOPS (3,000) isn't always sufficient for gp2 migration
  • Burst credits are invisible debt: CloudWatch IOPS metrics don't clearly show burst utilization
  • Compliance adds time, not complexity: With proper documentation, change management was smooth
  • Low-hanging fruit first: Orphaned volumes built trust and funded more complex optimizations

Optimize Your Storage Costs

If your AWS infrastructure has grown organically without storage optimization, we can help you reduce costs while maintaining compliance requirements.

Schedule a Free Assessment

2-week engagement • Read-only audit • Reversible changes • SOC 2 compliant