Case Study: Reducing S3 Storage Costs 38.7% for a Media Platform with 2.4 PB of Data
How we cut storage costs from $58,400 to $35,800/month by implementing intelligent tiering and lifecycle policies for a video streaming platform, while improving retrieval performance
At a Glance
Client Profile
- Industry: Video streaming platform
- Company Stage: Series C, $340,000/month AWS spend
- Infrastructure: 2.4 PB across 180 S3 buckets
- Timeline: 2-week engagement, January 2025
Business Context
Series C profitability pressure: investors want path to positive unit economics. Content library is strategic asset (can't delete old videos), but current storage strategy unsustainable at scale.
Primary Pain Point: Treating all data equally. Videos uploaded in 2018 sat in S3 Standard at the same price as today's uploads, and storage costs were growing 6-8% monthly.
The Situation
The client's video platform had accumulated 2.4 PB of data over 6 years of operation. Storage breakdown:
- Video library: 1,960 TB (source files + transcoded formats)
- Thumbnails & metadata: 180 TB
- User uploads: 140 TB
- Analytics logs: 85 TB
- Database backups: 35 TB
All data stored in S3 Standard storage class, regardless of age (videos from 2018 cost same as videos from 2024), access frequency (90% of videos accessed < 5 times/year), or business value (failed user uploads stored same as published content).
The Access Pattern Reality
Analysis of 90-day CloudFront access logs revealed:
- 15% of videos (library from last 6 months) = 82% of views
- 60% of videos (1-3 years old) = 16% of views
- 25% of videos (3+ years old) = 2% of views
Classic long-tail distribution: Most content rarely accessed, but can't be deleted.
Business Context
- Revenue Model: Subscription-based ($9.99-$29.99/month) with 580,000 active subscribers = $11.2M MRR
- Growth Stage: Series C, preparing for Series D (need to show improving unit economics)
- Team Structure: 285 total employees (42 engineering, 18 infrastructure/DevOps, 85 content operations)
- Key Business Metrics: 97.2% content availability SLA, 2-second video start time, 99.5% uptime commitment
- Critical Constraints: Can't delete old content (creator agreements require 10-year retention), must maintain instant playback for all videos
- Strategic Pressure: Investors want path to profitability — storage costs growing 6-8% monthly while revenue growing 4-5% monthly
Discovery Phase
Week 1: Access Pattern Analysis & Storage Audit
We analyzed S3 access logs, CloudFront patterns, and storage inventory:
Infrastructure Inventory
Workload Type | Volume | Object Count | Monthly Cost | Storage Class |
---|---|---|---|---|
Video library (source files) | 1,200 TB | 2.8M files | $27,600 | S3 Standard |
Transcoded formats | 760 TB | 12.4M files | $17,480 | S3 Standard |
Thumbnails & metadata | 180 TB | 85M files | $4,140 | S3 Standard |
User uploads | 140 TB | 1.2M files | $3,220 | S3 Standard |
Analytics logs | 85 TB | 420K files | $1,955 | S3 Standard |
Database backups | 35 TB | 2,840 files | $805 | S3 Standard |
Incomplete multipart uploads | 18 TB | 24,800 parts | $414 | S3 Standard |
Total | 2,418 TB | 102M+ files | $55,614 | -- |
Note: Costs exclude S3 request charges ($1,840/month) and data transfer ($946/month); including them, total S3 spend was $58,400/month.
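The monthly figures in the inventory table are consistent with a flat S3 Standard rate of $0.023/GB-month, counting 1 TB as 1,000 GB. A minimal Python sketch of that cost model follows; the single flat rate is an assumption (published S3 Standard pricing steps down slightly above 50 TB and 500 TB per month, so a flat rate overstates a petabyte-scale bill).

```python
# Flat-rate S3 Standard cost model.
# Assumptions: $0.023 per GB-month for every GB, and 1 TB = 1,000 GB.
STANDARD_PRICE_PER_GB = 0.023

# Volumes (TB) from the inventory table.
INVENTORY_TB = {
    "Video library (source files)": 1_200,
    "Transcoded formats": 760,
    "Thumbnails & metadata": 180,
    "User uploads": 140,
    "Analytics logs": 85,
    "Database backups": 35,
    "Incomplete multipart uploads": 18,
}

# Non-storage charges from the note under the table (USD/month).
REQUEST_CHARGES = 1_840
DATA_TRANSFER = 946


def monthly_storage_cost(tb: float, price_per_gb: float = STANDARD_PRICE_PER_GB) -> float:
    """Monthly storage cost in USD for `tb` terabytes at a flat per-GB price."""
    return round(tb * 1_000 * price_per_gb, 2)


storage_total = sum(monthly_storage_cost(tb) for tb in INVENTORY_TB.values())
all_in_total = storage_total + REQUEST_CHARGES + DATA_TRANSFER

for name, tb in INVENTORY_TB.items():
    print(f"{name:30} {tb:>6,} TB  ${monthly_storage_cost(tb):>10,.2f}")
print(f"Storage: ${storage_total:,.2f}  All-in: ${all_in_total:,.2f}")
```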
Retrieval Requirements
Critical constraint: Video playback SLA = 2 seconds
- S3 Standard: Immediate retrieval
- S3 Intelligent-Tiering: Immediate retrieval
- S3 Standard-IA: Immediate retrieval
- S3 Glacier Flexible Retrieval: 1-5 minutes with Expedited restores, 3-5 hours with Standard restores (doesn't meet SLA)
- S3 Glacier Instant Retrieval: Immediate (milliseconds)
This constrained our tiering strategy significantly.
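The retrieval constraint can be encoded as a simple filter over storage classes. This sketch uses the S3 API storage-class names; the latency values are rough assumptions for illustration (millisecond classes are treated as at most 100 ms, Glacier Flexible Retrieval uses its fastest Expedited restore, Deep Archive its 12-hour Standard restore), not measured figures.

```python
# Worst-case seconds until the first byte of a GET, per S3 storage class.
# Assumed illustrative values, not measurements.
FIRST_BYTE_SECONDS = {
    "STANDARD": 0.1,
    "INTELLIGENT_TIERING": 0.1,  # Frequent, Infrequent, Archive Instant tiers only
    "STANDARD_IA": 0.1,
    "GLACIER_IR": 0.1,
    "GLACIER": 5 * 60,           # Glacier Flexible Retrieval, Expedited restore
    "DEEP_ARCHIVE": 12 * 3600,   # Standard restore
}


def classes_meeting_sla(sla_seconds: float) -> list:
    """Storage classes whose worst-case first byte fits inside the SLA."""
    return [cls for cls, secs in FIRST_BYTE_SECONDS.items() if secs <= sla_seconds]


print(classes_meeting_sla(2.0))
# ['STANDARD', 'INTELLIGENT_TIERING', 'STANDARD_IA', 'GLACIER_IR']
```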
The Challenge: Glacier Flexible Retrieval Doesn't Meet Video SLA
What Went Wrong
Initial cost model showed maximum savings by moving 1,380 TB to Glacier Flexible Retrieval: Storage cost $0.0036/GB/month (84% cheaper than Standard), projected savings of about $26,800/month. We created a lifecycle policy to transition videos older than 2 years to Glacier Flexible.
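The storage-only arithmetic behind that initial model, using the per-GB prices quoted here and counting 1 TB as 1,000 GB as the inventory table does (request and restore fees ignored):

```latex
\[
1{,}380{,}000\ \text{GB} \times (\$0.023 - \$0.0036)/\text{GB-month}
  = 1{,}380{,}000 \times \$0.0194
  \approx \$26{,}772\ \text{per month}
\]
```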
Monday morning problem: User reported old video "stuck loading" — playback never started.
Root Cause: Glacier Flexible Retrieval requires restore request before access:
- User clicks video (3 years old, in Glacier Flexible)
- Application requests object from S3
- S3 returns a 403 InvalidObjectState error: the object is archived and must be restored first
- Application must initiate a restore (1-5 minutes even with Expedited retrieval, 3-5 hours with Standard)
- User waits... and waits... video never plays
This breaks the "instant playback" user experience.
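Playback code can check for this state before handing a URL to the player. A minimal sketch, assuming the input is the dict returned by boto3's head_object: S3 omits StorageClass for S3 Standard objects, reports a finished restore as ongoing-request="false" in the Restore field, and sets ArchiveStatus for objects in Intelligent-Tiering's opt-in asynchronous archive tiers. The function names, bucket, and key are hypothetical.

```python
# Storage classes whose objects must be restored before GET succeeds.
ASYNC_STORAGE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}
# Intelligent-Tiering's opt-in archive tiers, reported via ArchiveStatus.
ASYNC_IT_TIERS = {"ARCHIVE_ACCESS", "DEEP_ARCHIVE_ACCESS"}


def _restore_finished(head: dict) -> bool:
    # A completed restore looks like: ongoing-request="false", expiry-date="..."
    return 'ongoing-request="false"' in head.get("Restore", "")


def is_instantly_readable(head: dict) -> bool:
    """True if a GET on this object can start streaming immediately.

    `head` is a boto3 s3.head_object(...) response. S3 omits StorageClass
    for S3 Standard objects, hence the default.
    """
    if head.get("ArchiveStatus") in ASYNC_IT_TIERS:
        return _restore_finished(head)
    if head.get("StorageClass", "STANDARD") in ASYNC_STORAGE_CLASSES:
        return _restore_finished(head)
    return True


# Usage (bucket and key are hypothetical):
#   head = boto3.client("s3").head_object(Bucket="video-library", Key="2021/abc.mp4")
#   if not is_instantly_readable(head): fall back or trigger a restore
```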
The Reversal
Within 6 hours:
- Identified all videos in Glacier Flexible (2,847 objects)
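- Initiated restores for all affected objects and copied each restored object back into S3 Standard (a restore alone only creates a temporary copy)
- Updated lifecycle policy to stop further transitions to Glacier Flexible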
- User experience restored to normal
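A reversal like this takes two passes, because a restore only stages a temporary readable copy: request the restores, then, once each one finishes, copy the object over itself with a new storage class. A minimal sketch, assuming a boto3 S3 client is passed in; the helper names and the 7-day restore window are hypothetical, and the Standard restore tier (typically 3-5 hours) is an assumed default, with Expedited as the faster, pricier option.

```python
from typing import Iterable, List


def request_restores(s3, bucket: str, keys: Iterable[str],
                     tier: str = "Standard", days: int = 7) -> None:
    """Ask S3 to stage a temporary readable copy of each archived object."""
    for key in keys:
        s3.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={"Days": days, "GlacierJobParameters": {"Tier": tier}},
        )


def copy_back_to_standard(s3, bucket: str, keys: Iterable[str]) -> List[str]:
    """Rewrite each restored object in place as S3 Standard; return keys still pending."""
    pending = []
    for key in keys:
        head = s3.head_object(Bucket=bucket, Key=key)
        if 'ongoing-request="false"' not in head.get("Restore", ""):
            pending.append(key)  # restore not finished; retry on the next pass
            continue
        # copy() is boto3's managed transfer: it uses multipart copy for large
        # objects, which a single copy_object call cannot handle above 5 GB.
        s3.copy({"Bucket": bucket, "Key": key}, bucket, key,
                ExtraArgs={"StorageClass": "STANDARD"})
    return pending
```

Running copy_back_to_standard repeatedly until it returns an empty list moves every object permanently, rather than leaving temporary restored copies that expire.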
The Fix
Revised tiering strategy to respect 2-second playback SLA:
Content Type | Access Pattern | Storage Class | Retrieval |
---|---|---|---|
Recent videos (0-6 mo) | High | S3 Intelligent-Tiering | Immediate |
Medium-age (6mo-3yr) | Medium | S3 Intelligent-Tiering | Immediate |
Old videos (3+ years) | Low | S3 Glacier Instant Retrieval | Immediate |
Analytics logs | Rare | S3 Glacier Deep Archive | 12-48 hours |
Database backups | Rare | S3 Glacier Deep Archive | 12-48 hours |
Lesson: Storage class selection must respect application SLAs. Glacier Flexible Retrieval is great for backups/archives, but unusable for user-facing content that requires instant access.
Implementation Approach
Phase 1: Hot Tier Optimization (Week 1)
Step 1: Enable S3 Storage Lens for Visibility
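First, we needed detailed analytics on storage patterns.
Result: Storage Lens dashboards for bucket-level trends, plus daily S3 Inventory reports listing size, last-modified date (object age), and storage class for all 102M objects. S3 does not record per-object last-access time, so access frequency came from the S3 access logs and CloudFront logs.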
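Per-object age and storage class come from an S3 Inventory report rather than from Storage Lens itself. A minimal sketch of the InventoryConfiguration that boto3's put_bucket_inventory_configuration accepts; the configuration ID, destination bucket, and prefix are hypothetical, and the destination bucket also needs a bucket policy that lets S3 write the reports.

```python
def inventory_config(config_id: str, dest_bucket_arn: str,
                     prefix: str = "inventory/") -> dict:
    """InventoryConfiguration for a daily report of current object versions."""
    return {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": dest_bucket_arn,  # an ARN, e.g. arn:aws:s3:::name
                "Format": "Parquet",
                "Prefix": prefix,
            }
        },
        # Object age comes from LastModifiedDate; S3 has no last-access field.
        "OptionalFields": [
            "Size",
            "LastModifiedDate",
            "StorageClass",
            "IntelligentTieringAccessTier",
            "IsMultipartUploaded",
        ],
    }


# Usage (bucket names are hypothetical):
#   boto3.client("s3").put_bucket_inventory_configuration(
#       Bucket="video-library", Id="daily-audit",
#       InventoryConfiguration=inventory_config("daily-audit", "arn:aws:s3:::inventory-reports"))
```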
Step 2: Test Intelligent-Tiering on Subset
Before full rollout, tested on 5% of video library (50 TB, 140K videos):
Monitoring Period: 7 days tracking CloudWatch metrics for video start time, error rates, and CloudFront cache hit ratio.
- Video start time: No degradation (remained 1.8-2.1 seconds)
- Error rate: No increase (0.02% before/after)
- Cost: $420/month savings on test bucket (16.8% reduction)
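A pilot like this can be scoped with a lifecycle rule that moves a single prefix into Intelligent-Tiering. A minimal sketch of one rule in the shape boto3's put_bucket_lifecycle_configuration expects; the rule ID, bucket, and prefix are hypothetical. That call replaces a bucket's entire lifecycle configuration, so any existing rules must be included alongside this one.

```python
def intelligent_tiering_rule(rule_id: str, prefix: str, days: int = 0) -> dict:
    """Lifecycle rule moving objects under `prefix` into Intelligent-Tiering.

    days=0 transitions objects as soon as the lifecycle job picks them up.
    """
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": days, "StorageClass": "INTELLIGENT_TIERING"}],
    }


# Usage (bucket and prefix are hypothetical):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="video-library",
#       LifecycleConfiguration={"Rules": [intelligent_tiering_rule("it-pilot", "pilot/")]},
#   )
```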
Step 3: Phased Rollout to Production
Rolled out Intelligent-Tiering in 4 waves over 7 days:
Wave | Date | Data Volume | Buckets | Status |
---|---|---|---|---|
Wave 1 | Day 1 | 255 TB | 45 | Success |
Wave 2 | Day 3 | 255 TB | 45 | Success |
Wave 3 | Day 5 | 255 TB | 45 | Success |
Wave 4 | Day 7 | 255 TB | 45 | Success |
Total | -- | 1,020 TB | 180 | -- |
Wave Strategy: Each wave included mix of high-traffic and low-traffic buckets to detect issues early. 24-hour monitoring period between waves.
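The wave mix can be produced mechanically by ranking buckets by traffic and dealing them out round-robin, so every wave receives some of the busiest and some of the quietest buckets. A minimal sketch; the function name is hypothetical, and the traffic metric (for example, monthly GET count per bucket) is an assumption, since the ranking method isn't specified.

```python
from typing import Dict, List


def assign_waves(bucket_traffic: Dict[str, float], n_waves: int = 4) -> List[List[str]]:
    """Spread buckets across waves so each wave mixes high- and low-traffic buckets.

    Buckets are ranked busiest-first and dealt round-robin, so wave k gets the
    k-th, (k+n)-th, (k+2n)-th ... busiest buckets.
    """
    ranked = sorted(bucket_traffic, key=bucket_traffic.get, reverse=True)
    waves: List[List[str]] = [[] for _ in range(n_waves)]
    for i, bucket in enumerate(ranked):
        waves[i % n_waves].append(bucket)
    return waves
```

With 180 buckets and 4 waves this yields 45 buckets per wave, matching the table.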
Step 4: Post-Implementation Validation
After 30 days (time for Intelligent-Tiering to move objects to Infrequent Access tier):
- Cost Reduction: $24,840/month → $16,320/month = $8,520/month savings (34.3%)
- Tier Distribution: 38% in Frequent Access, 52% in Infrequent Access, 10% in Archive Instant Access
- Performance: Zero degradation in video start time or error rate
- Monitoring Fee: $0.0025 per 1,000 objects = $255/month (included in savings above)
Result: $8,520/month savings (34.3%) with zero user-facing impact and automatic ongoing optimization
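A rough estimator for Intelligent-Tiering's blended cost, given a tier distribution like the one above. The per-tier prices are assumed us-east-1 list prices (Frequent $0.023, Infrequent $0.0125, Archive Instant $0.004 per GB-month) and may differ by region and over time. Objects smaller than 128 KB are not monitored and always bill at Frequent Access rates, so only larger objects belong in monitored_objects.

```python
from typing import Dict

# Assumed us-east-1 list prices (USD per GB-month) for the automatic tiers.
TIER_PRICE_PER_GB = {
    "frequent": 0.023,
    "infrequent": 0.0125,
    "archive_instant": 0.004,
}
# Assumed monitoring and automation fee per 1,000 monitored objects per month.
MONITORING_PER_1000 = 0.0025


def intelligent_tiering_monthly_cost(total_gb: float, tier_share: Dict[str, float],
                                     monitored_objects: int) -> float:
    """Blended monthly cost (USD) of data spread across Intelligent-Tiering tiers.

    tier_share maps tier name -> fraction of bytes; fractions must sum to 1.
    monitored_objects should count only objects of 128 KB or larger.
    """
    if abs(sum(tier_share.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    storage = sum(total_gb * share * TIER_PRICE_PER_GB[tier]
                  for tier, share in tier_share.items())
    monitoring = monitored_objects / 1_000 * MONITORING_PER_1000
    return storage + monitoring


# Usage (inputs are illustrative):
#   intelligent_tiering_monthly_cost(
#       100_000, {"frequent": 0.38, "infrequent": 0.52, "archive_instant": 0.10}, 1_000_000)
```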
Phase 2: Cold Tier Migration (Week 2)
After learning from the Glacier Flexible mistake (see Challenge section above), we implemented a Glacier Instant Retrieval strategy:
Step 1: Identify Cold Data Candidates
Using S3 Storage Lens and CloudFront logs to identify truly cold data:
Results: 540 TB of videos (3+ years old, accessed < 3x per year), 85 TB of logs (90+ days old), 35 TB of backups (180+ days old).
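The selection criteria above reduce to a small classification function. A minimal sketch; the ObjectStats record and its category labels are hypothetical, with age assumed to come from the inventory's last-modified date and views per year from CloudFront logs. The 3-year cutoff is written as 1,095 days to match the transition rule in the migration table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectStats:
    """Hypothetical per-object record joined from inventory and access logs."""
    key: str
    category: str          # "video", "log", or "backup" (hypothetical labels)
    age_days: int          # days since LastModifiedDate
    views_per_year: float  # from CloudFront logs


def cold_target(obj: ObjectStats) -> Optional[str]:
    """Target storage class for a cold object, or None to leave it where it is."""
    if obj.category == "video" and obj.age_days >= 1_095 and obj.views_per_year < 3:
        return "GLACIER_IR"  # still millisecond reads for playback
    if obj.category == "log" and obj.age_days >= 90:
        return "DEEP_ARCHIVE"
    if obj.category == "backup" and obj.age_days >= 180:
        return "DEEP_ARCHIVE"
    return None
```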
Step 2: Test Glacier Instant Retrieval on Sample
Before migration, validated instant retrieval performance:
- Test Sample: 100 videos (50 GB) manually transitioned to Glacier Instant Retrieval
- Playback Testing: Simulated 500 concurrent users accessing these 100 videos via CloudFront
- Latency Results: First byte time: 42ms (vs. 38ms from S3 Standard) — well within 2-second SLA
- Cost Validation: Storage cost: $0.004/GB/month (vs. $0.023 for Standard) = 82.6% cheaper
- Retrieval Cost: $0.01 per 1,000 GET requests (vs. $0.0004 for Standard) plus a data retrieval fee of $0.03/GB read (us-east-1 list price); acceptable given low access frequency
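Glacier Instant Retrieval trades cheaper storage for per-read charges, so it only pays off below some read frequency. A minimal break-even sketch; the $0.03/GB data-retrieval fee is an assumed us-east-1 list price, per-request fees are ignored as negligible for large video files, and only reads that reach S3 count (CloudFront cache hits do not). Glacier Instant Retrieval also bills a 90-day minimum storage duration, which this ignores.

```python
# Assumed us-east-1 list prices, USD.
STANDARD_PER_GB_MONTH = 0.023
GLACIER_IR_PER_GB_MONTH = 0.004
GLACIER_IR_RETRIEVAL_PER_GB = 0.03  # data retrieval fee per GB read


def glacier_ir_monthly_delta(gb: float, origin_reads_per_year: float) -> float:
    """USD saved per month by moving `gb` from Standard to Glacier IR (negative = loss).

    origin_reads_per_year counts full-object reads that reach S3, not CDN cache hits.
    Request fees and the 90-day minimum storage duration are ignored.
    """
    storage_saving = gb * (STANDARD_PER_GB_MONTH - GLACIER_IR_PER_GB_MONTH)
    retrieval_cost = gb * (origin_reads_per_year / 12) * GLACIER_IR_RETRIEVAL_PER_GB
    return storage_saving - retrieval_cost


def breakeven_reads_per_year() -> float:
    """Origin reads per year at which retrieval fees cancel the storage saving."""
    monthly_saving_per_gb = STANDARD_PER_GB_MONTH - GLACIER_IR_PER_GB_MONTH
    return monthly_saving_per_gb / GLACIER_IR_RETRIEVAL_PER_GB * 12
```

With these assumed prices the break-even is about 7.6 full-object origin reads per year, well above the fewer-than-3-views-per-year cutoff used to select cold videos.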
Step 3: Phased Migration with Monitoring
Migrated cold data in 3 phases over 5 days:
Phase | Content Type | Volume | Storage Class | Transition Rule |
---|---|---|---|---|
Phase 1 | Videos (3+ years) | 540 TB | Glacier Instant | After 1,095 days |
Phase 2 | Analytics logs | 85 TB | Glacier Deep Archive | After 90 days |
Phase 3 | Database backups | 35 TB | Glacier Deep Archive | After 180 days |
Total | -- | 660 TB | -- | -- |
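The three transition rules can be expressed as one lifecycle configuration. A minimal sketch for boto3's put_bucket_lifecycle_configuration; the prefixes and rule IDs are hypothetical. In practice each rule would be attached to whichever bucket holds that data and merged with that bucket's existing rules, since the call replaces the whole configuration. Lifecycle Days count from object creation.

```python
def cold_tier_lifecycle(video_prefix: str, log_prefix: str, backup_prefix: str) -> dict:
    """LifecycleConfiguration mirroring the three migration phases."""

    def rule(rule_id: str, prefix: str, days: int, storage_class: str) -> dict:
        return {
            "ID": rule_id,
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [{"Days": days, "StorageClass": storage_class}],
        }

    return {
        "Rules": [
            rule("videos-to-glacier-ir", video_prefix, 1_095, "GLACIER_IR"),
            rule("logs-to-deep-archive", log_prefix, 90, "DEEP_ARCHIVE"),
            rule("backups-to-deep-archive", backup_prefix, 180, "DEEP_ARCHIVE"),
        ]
    }


# Usage (bucket and prefixes are hypothetical):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="media-archive",
#       LifecycleConfiguration=cold_tier_lifecycle("videos/", "logs/", "backups/"))
```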
Step 4: Validate Performance & Cost Impact
After migration completed (transitions happen within 24-48 hours):
- Video Playback Performance: No degradation — P95 latency remained 1.84 seconds (well within 2-second SLA)
- Error Rate: No increase in 404s or timeout errors (0.02% unchanged)
- User Complaints: Zero complaints about "slow loading" or "video not found"
- Cost Reduction:
- Videos (540 TB): $12,420/month → $2,160/month = $10,260/month savings (82.6%)
- Logs (85 TB): $1,955/month → $85/month = $1,870/month savings (95.6%)
- Backups (35 TB): $805/month → $35/month = $770/month savings (95.6%)
- Retrieval Costs: $180/month additional for Glacier Instant GET requests (negligible compared to savings)
Result: Additional $12,900/month in storage savings ($12,720/month net of the $180/month in retrieval costs) with zero SLA violations
Phase 3: Incomplete Multipart Upload Cleanup
Discovery: 18 TB of incomplete multipart uploads (failed user uploads from 2019-2024)
- Enabled S3 Lifecycle rule to abort incomplete uploads after 7 days
- Aborted existing incomplete uploads, freeing 18 TB of orphaned parts
- Set up monitoring for failed upload rate
Result: $414/month immediate savings + ongoing waste prevention
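The rule and a one-off cleanup might look like this. A minimal sketch; the rule ID and helper names are hypothetical, and the cleanup helper takes a boto3 S3 client. Lifecycle rules are evaluated against existing incomplete uploads as well, so the explicit cleanup only matters if the space is wanted back before the next lifecycle run.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple

# Lifecycle rule (rule ID hypothetical): abort uploads left incomplete for 7 days.
ABORT_INCOMPLETE_RULE = {
    "ID": "abort-incomplete-mpu-7d",
    "Status": "Enabled",
    "Filter": {"Prefix": ""},  # empty prefix = whole bucket
    "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
}


def stale_uploads(uploads: Iterable[dict], max_age_days: int = 7,
                  now: Optional[datetime] = None) -> List[Tuple[str, str]]:
    """(Key, UploadId) pairs from a ListMultipartUploads page older than the cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [(u["Key"], u["UploadId"]) for u in uploads if u["Initiated"] < cutoff]


def abort_stale_uploads(s3, bucket: str, max_age_days: int = 7) -> int:
    """Abort existing stale uploads now; returns how many were aborted."""
    aborted = 0
    paginator = s3.get_paginator("list_multipart_uploads")
    for page in paginator.paginate(Bucket=bucket):
        for key, upload_id in stale_uploads(page.get("Uploads", []), max_age_days):
            s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
            aborted += 1
    return aborted
```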
Phase 4: Optimization of Hot Tier
With cold data moved to Glacier, remaining hot tier (380 TB) was high-traffic:
- Enabled S3 Transfer Acceleration for user uploads (reduced latency)
- Configured CloudFront to cache more aggressively: Video thumbnails 7 days → 30 days, Video metadata 1 day → 7 days
- Reduced S3 GET requests by 42% via better caching
Result: $580/month savings on requests + 18% lower P95 latency
Monitoring & Ongoing Optimization
CloudWatch Dashboard Setup
Created comprehensive monitoring dashboard tracking storage optimization metrics:
- Storage Metrics:
- Total storage by storage class (S3 Standard, Intelligent-Tiering, Glacier Instant, Glacier Deep)
- Daily storage growth rate and trend
- Intelligent-Tiering distribution (Frequent Access, Infrequent Access, Archive Instant Access tiers)
- Cost Metrics:
- Daily S3 storage costs by storage class
- Request costs (GET, PUT, POST, LIST)
- Data transfer costs
- Month-to-date spend vs. forecast
- Performance Metrics:
- Video start time (P50, P95, P99) by storage class
- CloudFront cache hit ratio
- 4XX and 5XX error rates
- First byte time for Glacier Instant retrievals
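The storage-by-class panels can be fed by the daily BucketSizeBytes metric that S3 publishes to CloudWatch, split by the StorageType dimension. A minimal sketch that builds MetricDataQueries for CloudWatch's GetMetricData; the bucket name is hypothetical and the storage-type list covers only the classes used here.

```python
from typing import List

STORAGE_TYPES = [
    "StandardStorage",
    "IntelligentTieringFAStorage",
    "IntelligentTieringIAStorage",
    "IntelligentTieringAIAStorage",
    "GlacierInstantRetrievalStorage",
    "DeepArchiveStorage",
]


def storage_by_class_queries(bucket: str) -> List[dict]:
    """GetMetricData queries: daily BucketSizeBytes per storage type for one bucket."""
    return [
        {
            "Id": f"m{i}",
            "Label": storage_type,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/S3",
                    "MetricName": "BucketSizeBytes",
                    "Dimensions": [
                        {"Name": "BucketName", "Value": bucket},
                        {"Name": "StorageType", "Value": storage_type},
                    ],
                },
                "Period": 86400,  # S3 storage metrics are published once per day
                "Stat": "Average",
            },
            "ReturnData": True,
        }
        for i, storage_type in enumerate(STORAGE_TYPES)
    ]


# Usage (bucket name is hypothetical):
#   end = datetime.now(timezone.utc)
#   boto3.client("cloudwatch").get_metric_data(
#       MetricDataQueries=storage_by_class_queries("video-library"),
#       StartTime=end - timedelta(days=30), EndTime=end)
```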
Automated Alerts
Configured CloudWatch Alarms for anomaly detection:
- Video Performance Alert: Triggers if P95 video start time > 2.5 seconds
- Cost Anomaly Alert: Triggers if weekly S3 spend increases > 10% week-over-week
- Error Rate Alert: Triggers if 5XX errors exceed 0.1% of requests
- Incomplete Upload Alert: Triggers if incomplete multipart uploads exceed 100 GB
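The playback alert maps onto a CloudWatch alarm on a percentile statistic. A minimal sketch of keyword arguments for boto3's put_metric_alarm; the alarm name, namespace, metric name (assumed to be a custom metric the player reports in milliseconds), 5-minute period, and 3-period evaluation window are hypothetical choices, while the 2.5-second threshold comes from the alert described above.

```python
from typing import Optional


def p95_start_time_alarm(threshold_ms: float = 2_500,
                         sns_topic_arn: Optional[str] = None) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm (names hypothetical)."""
    alarm = {
        "AlarmName": "video-start-time-p95",
        "Namespace": "VideoPlatform/Playback",  # hypothetical custom namespace
        "MetricName": "VideoStartTimeMs",        # hypothetical metric, milliseconds
        "ExtendedStatistic": "p95",              # percentile instead of Average
        "Period": 300,                           # 5-minute buckets (assumed)
        "EvaluationPeriods": 3,                  # 3 breaching periods in a row (assumed)
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        alarm["AlarmActions"] = [sns_topic_arn]
    return alarm


# Usage: boto3.client("cloudwatch").put_metric_alarm(**p95_start_time_alarm())
```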
Weekly Optimization Report
Automated Lambda function generating weekly optimization insights:
- Storage Efficiency: Tracks % of data in optimal storage class based on access patterns
- Cost Trend: Compares current week's costs to 4-week moving average
- Lifecycle Progress: Reports on automatic transitions (e.g., "42 TB transitioned to Glacier Instant this week")
- Action Items: Flags anomalies requiring investigation (e.g., "3 buckets show 20% WoW storage growth")
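The report's two core calculations, the 4-week baseline comparison and the week-over-week growth flag, fit in a few lines. A minimal sketch with hypothetical function names; inputs are assumed to be weekly cost totals and per-bucket stored bytes (for example, from the BucketSizeBytes CloudWatch metric).

```python
from typing import Dict, List, Sequence


def cost_trend(weekly_costs: Sequence[float]) -> float:
    """Latest week's cost vs. the mean of the four weeks before it (0.10 = +10%)."""
    if len(weekly_costs) < 5:
        raise ValueError("need the current week plus four prior weeks")
    *history, current = weekly_costs[-5:]
    baseline = sum(history) / len(history)
    return current / baseline - 1.0


def fast_growing_buckets(last_week: Dict[str, float], this_week: Dict[str, float],
                         threshold: float = 0.20) -> List[str]:
    """Buckets whose stored bytes grew by more than `threshold` week over week."""
    return sorted(
        bucket for bucket, size in this_week.items()
        if last_week.get(bucket, 0) > 0 and size / last_week[bucket] - 1.0 > threshold
    )
```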
Quarterly Storage Review
Scheduled quarterly reviews to refine lifecycle policies:
- Q1 2025: Adjusted the lifecycle transition into Intelligent-Tiering from 30 days → 45 days (reduced thrashing for seasonal content; Intelligent-Tiering's own 30-day Infrequent Access threshold is fixed)
- Q2 2025 Goal: Evaluate Glacier Instant → Glacier Flexible for videos > 5 years old (accessed < 1x/year)
- Q3 2025 Goal: Implement automated thumbnail regeneration to delete old thumbnail formats (reduce redundant storage)
Ongoing Optimization: Monitoring infrastructure enables continuous improvement. In first 90 days post-implementation, identified 3 additional optimization opportunities worth $1,200/month.
Results in Detail
Cost Savings Breakdown
Component | Before | After | Monthly Savings |
---|---|---|---|
Hot tier (Intelligent-Tiering) | $24,840 | $16,320 | −$8,520 (34.3%) |
Cold tier (Glacier Instant) | $12,420 | $2,160 | −$10,260 (82.6%) |
Logs/backups (Glacier Deep) | $2,760 | $120 | −$2,640 (95.7%) |
Incomplete uploads | $414 | $0 | −$414 (100%) |
Requests (CDN caching) | $1,840 | $1,260 | −$580 (31.5%) |
Total S3 | $58,400 | $35,800 | −$22,600 (38.7%) |
Performance Impact
Video Playback
- P95 latency: 1,840ms → 1,510ms (18% improvement)
- Glacier Instant retrieval: < 50ms (no user-facing impact)
- CloudFront cache hit rate: 78% → 89%
Upload Management
- Upload completion rate: 94.2% → 94.8%
- SLA violations: Zero during/after migration
Business Value
Immediate Impact
- $22,600/month = $271,200 annual savings
- Improved gross margin by about 0.2 percentage points ($22,600 against $11.2M MRR); S3 storage was about 0.5% of revenue before optimization
- Funds 3 additional engineers or significant CDN expansion
Long-term Value
- Intelligent-Tiering: Automatically optimizes as access patterns change
- Lifecycle automation: New content automatically tiered as it ages
- Scalable cost structure: Storage costs now scale sub-linearly with library growth
- Incomplete upload prevention: Ongoing savings of ~$400-600/month
Strategic Impact
Before optimization: Library growing 8% per month, storage costs growing 8% per month (linear cost scaling).
After optimization: Library growing 8% per month, storage costs growing 3.2% per month (sub-linear cost scaling via automatic lifecycle transitions).
Projection: At current growth rate, would reach $85,000/month storage cost in 18 months. With new tiering strategy, projected to reach only $48,000/month — $37,000/month avoided cost.
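Projections like these hinge on how monthly growth compounds, so they are easy to recompute from any starting cost and rate. A minimal sketch; the function names are illustrative.

```python
def project(cost_now: float, monthly_growth: float, months: int) -> float:
    """Monthly cost after `months` of compound growth (0.08 means 8% per month)."""
    return cost_now * (1.0 + monthly_growth) ** months


def implied_monthly_growth(cost_now: float, cost_later: float, months: int) -> float:
    """Constant monthly growth rate that turns cost_now into cost_later over `months`."""
    return (cost_later / cost_now) ** (1.0 / months) - 1.0


# Usage: project(58_400, 0.08, 18) compounds the pre-optimization rate;
# implied_monthly_growth(58_400, 85_000, 18) gives the rate a projection assumes.
```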
Lessons Learned
What Worked
- Intelligent-Tiering for unpredictable access: Perfect for video library with long-tail distribution
- Access pattern analysis: 90 days of CloudFront logs revealed true usage patterns
- Phased rollout: Hot tier first (low risk), cold tier second (higher risk)
- Testing retrieval behavior: After the Glacier Flexible incident, a 100-video Glacier Instant Retrieval test confirmed the playback SLA before migration
What Didn't Work
- Glacier Flexible for user-facing content: 1-5 minute restoration breaks video playback SLA
- Cost-first optimization: Initially chose cheapest storage class without testing retrieval behavior
- Assumption about "rarely accessed" content: Rare doesn't mean "user willing to wait"
Key Takeaways
- Storage class selection must respect application SLAs: Cheapest option isn't always right option
- Intelligent-Tiering is underrated: Automatic optimization without application changes
- Glacier Instant Retrieval is the sweet spot: 83% cheaper than Standard, instant retrieval for user-facing content
- Access logs tell the truth: Don't guess access patterns, measure them
- Incomplete uploads are invisible waste: 18 TB of forgotten data costing $400+/month
Need S3 Storage Optimization?
If your S3 storage costs are growing with your data, we can help implement intelligent tiering and lifecycle policies to reduce costs while maintaining performance.
Schedule a Free Assessment
2-week engagement • Read-only audit • Reversible changes • SLA-compliant