EBS cost optimization: volumes, IOPS/throughput, and snapshots

EBS cost is rarely mysterious: it is mostly provisioned GB-month, plus provisioned IOPS/throughput for some volume types, plus snapshot storage. Because you pay for what you provision rather than what you use, the waste comes from unattached volumes, oversized volumes, and default performance settings that are higher than required.

EBS savings checklist

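Right-size: shrink over-provisioned volumes and delete unattached ones.
gp2 to gp3: gp3 has a lower per-GB price and includes a 3,000 IOPS / 125 MiB/s baseline regardless of volume size.
Snapshots: prune retention and delete snapshots you no longer need.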

Step 0: identify your dominant driver

Capacity: large volumes provisioned far above actual usage.
Performance: provisioned IOPS/throughput set above what workloads use.
Snapshots: long retention and frequent snapshots on large changing datasets.
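To make the three drivers concrete, here is a minimal sketch that splits an estimated monthly bill for a single gp3 volume into capacity, performance, and snapshot components and reports the largest one. The prices are illustrative assumptions, not quotes, and the function names are hypothetical; substitute current prices for your region.

```python
# Illustrative gp3 and snapshot prices in USD. These are assumptions,
# not quotes; replace them with current prices for your region.
GP3_PER_GB_MONTH = 0.08
GP3_PER_EXTRA_IOPS_MONTH = 0.005   # per provisioned IOPS above the included baseline
GP3_PER_EXTRA_MIBPS_MONTH = 0.04   # per provisioned MiB/s above the included baseline
GP3_INCLUDED_IOPS = 3000
GP3_INCLUDED_MIBPS = 125
SNAPSHOT_PER_GB_MONTH = 0.05


def monthly_breakdown(size_gb, iops, throughput_mibps, snapshot_gb):
    """Split an estimated monthly bill into capacity, performance, and snapshots."""
    extra_iops = max(0, iops - GP3_INCLUDED_IOPS)
    extra_mibps = max(0, throughput_mibps - GP3_INCLUDED_MIBPS)
    return {
        "capacity": size_gb * GP3_PER_GB_MONTH,
        "performance": extra_iops * GP3_PER_EXTRA_IOPS_MONTH
        + extra_mibps * GP3_PER_EXTRA_MIBPS_MONTH,
        "snapshots": snapshot_gb * SNAPSHOT_PER_GB_MONTH,
    }


def dominant_driver(breakdown):
    """Return the name of the largest cost component."""
    return max(breakdown, key=breakdown.get)


if __name__ == "__main__":
    costs = monthly_breakdown(size_gb=1000, iops=6000, throughput_mibps=250, snapshot_gb=4000)
    print(costs, "->", dominant_driver(costs))
```

Running the same breakdown over a whole account (summing each component) gives a quick answer to which of the three areas deserves attention first.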

High-leverage savings levers

Delete unattached volumes: orphaned volumes accumulate after instance termination and migrations.
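Right-size GB: reduce volume size where safe (after validating used space and growth). EBS volumes cannot be shrunk in place, so this means migrating data to a new, smaller volume.
Choose the right type: gp3 is typically cheaper per GB than gp2 and lets you set IOPS and throughput independently of volume size.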
Right-size IOPS/throughput: set based on measured utilization, not defaults.
Snapshot lifecycle: keep only what you need; avoid keeping daily snapshots forever.
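As a rough illustration of the type lever, the sketch below compares a gp2 volume with a gp3 volume provisioned to at least the gp2 baseline IOPS (3 IOPS per GiB, minimum 100, maximum 16,000). Prices are illustrative assumptions and the helper names are hypothetical. The comparison matches IOPS only; pass a higher `throughput_mibps` if the workload relies on more than the gp3 included throughput.

```python
# Illustrative prices in USD (assumptions; check current pricing for your region).
GP2_PER_GB_MONTH = 0.10
GP3_PER_GB_MONTH = 0.08
GP3_PER_EXTRA_IOPS_MONTH = 0.005
GP3_PER_EXTRA_MIBPS_MONTH = 0.04
GP3_INCLUDED_IOPS = 3000
GP3_INCLUDED_MIBPS = 125


def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000."""
    return min(16000, max(100, 3 * size_gib))


def gp2_monthly_cost(size_gib):
    """gp2 is billed on capacity only; performance scales with size."""
    return size_gib * GP2_PER_GB_MONTH


def gp3_monthly_cost(size_gib, iops=GP3_INCLUDED_IOPS, throughput_mibps=GP3_INCLUDED_MIBPS):
    """gp3 is billed on capacity plus any performance above the included baseline."""
    extra_iops = max(0, iops - GP3_INCLUDED_IOPS)
    extra_mibps = max(0, throughput_mibps - GP3_INCLUDED_MIBPS)
    return (size_gib * GP3_PER_GB_MONTH
            + extra_iops * GP3_PER_EXTRA_IOPS_MONTH
            + extra_mibps * GP3_PER_EXTRA_MIBPS_MONTH)


def compare_gp2_to_gp3(size_gib, throughput_mibps=GP3_INCLUDED_MIBPS):
    """Compare gp2 with a gp3 volume provisioned to at least the gp2 baseline IOPS."""
    target_iops = max(GP3_INCLUDED_IOPS, gp2_baseline_iops(size_gib))
    gp2 = gp2_monthly_cost(size_gib)
    gp3 = gp3_monthly_cost(size_gib, target_iops, throughput_mibps)
    return {"gp2": gp2, "gp3": gp3, "gp3_iops": target_iops, "savings": gp2 - gp3}


if __name__ == "__main__":
    for size in (100, 500, 2000):
        print(size, compare_gp2_to_gp3(size))
```

Matching the gp2 baseline is a conservative starting point. If measured IOPS are well below it, provisioning gp3 to the measured level instead saves more.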

Common cost traps

Oversized root volumes (default AMI settings) across large fleets.
Provisioned performance far above actual usage (especially when set “just in case”).
Snapshots without lifecycle policies, retained indefinitely.
Staging/dev volumes with production-sized disks and retention policies.

Snapshot cost drivers (what actually increases snapshot GB)

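Change rate: snapshots are incremental, so each one stores only the blocks that changed since the previous snapshot; write-heavy workloads grow snapshot usage faster.
Retention: every retained snapshot keeps its changed blocks billable, so keeping daily snapshots for months is usually the largest contributor.
Copies: snapshots copied across regions or accounts are stored and billed separately, adding to total stored GB.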

If snapshots are a top line item, start by reviewing retention and copies before touching performance settings.
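The interaction of change rate, retention, and copies can be approximated with simple arithmetic. The sketch below is a deliberately simplified model, not how AWS computes the bill: it assumes one full baseline equal to the used space, a constant amount of changed data per day, one snapshot per day, and the same retention in every location. The price and the function names are assumptions for illustration.

```python
SNAPSHOT_PER_GB_MONTH = 0.05  # illustrative price in USD (assumption)


def snapshot_storage_gb(used_gb, daily_change_gb, retained_snapshots, locations=1):
    """Estimate stored snapshot GB under a constant daily change rate.

    The oldest retained snapshot holds the full baseline; each later one
    adds only the blocks that changed since the previous snapshot.
    `locations` counts the source region plus any cross-region or
    cross-account copies, each stored separately.
    """
    if retained_snapshots <= 0:
        return 0.0
    per_location = used_gb + daily_change_gb * (retained_snapshots - 1)
    return float(per_location * locations)


def snapshot_monthly_cost(used_gb, daily_change_gb, retained_snapshots, locations=1):
    """Convert the storage estimate into an approximate monthly cost."""
    stored = snapshot_storage_gb(used_gb, daily_change_gb, retained_snapshots, locations)
    return stored * SNAPSHOT_PER_GB_MONTH


if __name__ == "__main__":
    for days in (7, 30, 90):
        print(days, snapshot_storage_gb(500, 10, days), snapshot_monthly_cost(500, 10, days))
```

Under these assumptions, a 500 GB volume with 10 GB of daily change stores about 560 GB at 7 days of retention and about 1,390 GB at 90 days, which is why retention is usually the first thing to review.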

Right-sizing workflow (practical)

List top volumes by GB-month cost and identify unattached volumes.
For each volume class (a group of volumes with a similar role and sizing), measure used space, growth, and p95 IOPS/throughput.
Decide: reduce size, change type (gp2 to gp3), or reduce provisioned performance.
Validate in canary, then roll across the fleet with monitoring and rollback.
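The first step of the workflow above can be sketched as a small triage function. It assumes the input is a list of dictionaries shaped like the `Volumes` entries returned by the EC2 `DescribeVolumes` API (fields `VolumeId`, `Size`, `VolumeType`, `State`), where a state of `available` means the volume is not attached. The prices, volume IDs, and function names are hypothetical.

```python
# Illustrative per-GB-month prices in USD (assumptions; replace with your region's prices).
PRICE_PER_GB_MONTH = {"gp2": 0.10, "gp3": 0.08}
DEFAULT_PRICE_PER_GB_MONTH = 0.10  # hypothetical fallback for other volume types


def capacity_cost(volume):
    """Estimated monthly capacity cost of one volume (Size is in GiB)."""
    price = PRICE_PER_GB_MONTH.get(volume["VolumeType"], DEFAULT_PRICE_PER_GB_MONTH)
    return volume["Size"] * price


def triage(volumes, top_n=10):
    """Return the top_n volumes by capacity cost and the IDs of unattached volumes."""
    ranked = sorted(volumes, key=capacity_cost, reverse=True)[:top_n]
    unattached = [v["VolumeId"] for v in volumes if v["State"] == "available"]
    return ranked, unattached


if __name__ == "__main__":
    # Hypothetical sample data standing in for a DescribeVolumes response.
    sample = [
        {"VolumeId": "vol-aaa", "Size": 100, "VolumeType": "gp3", "State": "in-use"},
        {"VolumeId": "vol-bbb", "Size": 2000, "VolumeType": "gp2", "State": "available"},
        {"VolumeId": "vol-ccc", "Size": 500, "VolumeType": "gp3", "State": "in-use"},
    ]
    ranked, unattached = triage(sample, top_n=2)
    print([v["VolumeId"] for v in ranked], unattached)
```

In practice the list would come from your inventory tooling or an SDK call such as boto3's `describe_volumes`. Treat the unattached list as candidates for review rather than automatic deletion, and consider taking a final snapshot before removing anything.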

Validation checklist

For each volume class, measure used space and growth rate over a window that includes a busy month.
Measure IOPS and throughput utilization before changing performance settings.
For gp2->gp3 changes, validate latency and throughput under representative load.
After snapshot policy changes, validate that restore requirements (RPO/RTO) are still met.
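For the performance checks, one way to turn measurements into a setting is to take the p95 of observed IOPS, add headroom, and never go below the gp3 included baseline. The sketch below uses a nearest-rank percentile. The 20% headroom and the function names are assumptions for illustration, and the samples are expected to be per-interval IOPS values you have already collected from monitoring.

```python
import math

GP3_INCLUDED_IOPS = 3000  # gp3 baseline included in the per-GB price


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommended_iops(samples, headroom_pct=20):
    """p95 of observed IOPS plus headroom, floored at the gp3 included baseline."""
    p95 = percentile(samples, 95)
    with_headroom = math.ceil(p95 * (100 + headroom_pct) / 100)
    return max(GP3_INCLUDED_IOPS, with_headroom)


if __name__ == "__main__":
    # Hypothetical measurements: 100 samples from 100 to 10,000 IOPS.
    observed = list(range(100, 10100, 100))
    print(percentile(observed, 95), recommended_iops(observed))
```

Using p95 rather than the maximum avoids provisioning for rare spikes. If those spikes matter for latency, validate under representative load before lowering the setting.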