ECS autoscaling cost pitfalls (and how to avoid them)
Start with a calculator if you need a first-pass estimate, then use this guide to validate the assumptions and catch the billing traps.
Effective autoscaling fixes start only after you know which of these is the real ECS cost driver: noisy scaling signals, retry multiplication, oversized tasks, or traffic-driven side costs. Otherwise teams tune the wrong system.
This page is for production intervention: signal cleanup, cooldown tuning, retry control, and scaling-safe validation.
Autoscaling should reduce cost by matching capacity to demand. When costs go up instead, it usually means one of three things: oscillation, retry storms, or non-compute line items growing with traffic.
If the main uncertainty is still task shape or bill scope, go back to ECS task sizing or ECS pricing first.
1) Noisy signals cause oscillation
- CPU% can spike briefly (GC, cold caches, bursty traffic) and trigger scale-out.
- If scale-in is slow or conservative, you spend long periods above the true average.
- Fix: use smoothing windows, realistic targets, and cooldowns that match task startup time.
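A minimal sketch of why a smoothing window helps: averaging over a few samples keeps a one-minute spike from crossing a scale-out threshold. The sample values, window length, and 70% alarm threshold are illustrative assumptions, not a recommendation.

```python
# Trailing moving average over the last `window` samples.
def smoothed(samples, window):
    out = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# One-minute GC spike to 95% CPU inside an otherwise ~40% steady state.
cpu = [40, 42, 41, 95, 43, 40, 41]

raw_peak = max(cpu)                  # 95 -> a raw-CPU alarm would fire
smooth_peak = max(smoothed(cpu, 3))  # ~59.7 -> stays under a 70% alarm
```

CloudWatch alarms express the same idea through the evaluation period and "datapoints to alarm" settings rather than an explicit moving average.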
2) Target utilization is not a "maximize CPU" goal
Many teams set utilization targets too high, then see rising latency and retries. That can increase both compute and non-compute costs.
- Pick a target that preserves headroom for deploys and bursts.
- Separate scaling for CPU-bound vs IO-bound services (CPU is not the only bottleneck).
- Validate with p95 latency and error rate, not only utilization.
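One way to pick a target with headroom, sketched below: scale the highest utilization the service handles cleanly in load tests down by a headroom fraction. The 85% sustainable level and 30% headroom figure are illustrative assumptions.

```python
# Derive a CPU target that leaves room for deploys and bursts.
def target_utilization(sustainable_util, headroom_fraction):
    """Scale a load-tested sustainable utilization down by a headroom fraction."""
    return sustainable_util * (1 - headroom_fraction)

# Service stays healthy up to ~85% CPU in load tests; keep 30% headroom.
target = target_utilization(85, 0.30)  # 59.5 -> round to a 60% target
```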
3) Retries multiply traffic (and cost)
- Timeouts and transient errors trigger client retries and SDK retries.
- Retries increase request volume, logs, and sometimes egress/NAT.
- Fix: backoff, circuit breakers, and faster failure detection before scaling reacts.
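A minimal backoff sketch: capped exponential backoff with full jitter, so retries spread out over time instead of arriving as a synchronized storm. The base delay and cap are illustrative assumptions; most AWS SDKs implement a variant of this for you.

```python
import random

def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Delay (seconds) before retry `attempt` (0-based):
    a random value in [0, min(cap, base * 2**attempt)] -- "full jitter"."""
    return rng() * min(cap, base * (2 ** attempt))

# Ceiling doubles per attempt (0.1, 0.2, 0.4, ...) until the 5 s cap.
delays = [backoff_delay(a) for a in range(6)]
```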
4) Hidden line items scale with traffic
- Logs: ingestion grows with request volume and verbosity.
- NAT/egress: external calls and downloads can spike costs.
- Load balancers: capacity units can increase with connections and throughput.
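These line items scale roughly linearly with request volume, which the arithmetic below makes concrete for log ingestion. The per-GB price and per-request log size are illustrative assumptions; check the CloudWatch pricing page for current rates.

```python
# Log ingestion cost grows linearly with request volume.
def monthly_log_cost(requests_per_day, bytes_per_request, usd_per_gb=0.50):
    gb_per_month = requests_per_day * 30 * bytes_per_request / 1e9
    return gb_per_month * usd_per_gb

baseline = monthly_log_cost(5_000_000, 2_000)      # 5M req/day, ~2 KB logs each
after_spike = monthly_log_cost(15_000_000, 2_000)  # retries tripled traffic
# Log cost triples with traffic even though the fleet "scaled correctly".
```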
5) Task sizing mistakes look like "autoscaling problems"
- Over-sized tasks keep cost high even when scaling works.
- Under-sized tasks cause timeouts and retries, which trigger scale-out and inflate costs.
- Fix: size tasks from measured p95 usage and validate headroom.
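A sketch of sizing from measured p95: take p95 usage, add headroom, and round up to the next valid task size. The CPU steps below are a simplified illustrative subset; check the ECS docs for the real supported CPU/memory combinations.

```python
# Simplified list of Fargate-style CPU steps (CPU units) -- an assumption.
CPU_STEPS = [256, 512, 1024, 2048, 4096]

def pick_cpu(p95_cpu_units, headroom=0.25):
    """Smallest CPU step covering measured p95 usage plus headroom."""
    needed = p95_cpu_units * (1 + headroom)
    return next(s for s in CPU_STEPS if s >= needed)

size = pick_cpu(700)  # 700 * 1.25 = 875 -> 1024 CPU units
```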
Stability checklist (quick wins)
- Scale-out should react faster than scale-in (avoid immediate oscillation).
- Match cooldowns to task startup time (slow startup + fast scale-in causes flapping).
- Use multiple signals for safety (latency/error rate + CPU), not CPU alone.
- Keep a "busy month" scenario: deploys and incidents change behavior.
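The cooldown item on the checklist can be turned into a sanity check: flag any configuration where the scale-in cooldown is shorter than task startup, since new tasks can be killed before they ever absorb load. The threshold values are illustrative assumptions.

```python
# Flag configs where slow startup + fast scale-in causes flapping.
def flapping_risk(startup_seconds, scale_in_cooldown_seconds):
    """True when tasks may be scaled in before they finish starting up."""
    return scale_in_cooldown_seconds < startup_seconds

risky = flapping_risk(startup_seconds=90, scale_in_cooldown_seconds=60)    # True
safer = flapping_risk(startup_seconds=90, scale_in_cooldown_seconds=300)   # False
```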
Validation checklist
- Compare desired vs running tasks: do you spend long periods above baseline after spikes?
- Track retries/timeouts during spikes (cost and reliability signal).
- Track log ingestion GB/day and NAT processed GB during scaling events.
- After changes, validate p95 latency and error rate during a busy window.
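The first validation item ("long periods above baseline") is easy to quantify from per-minute running-task samples, as sketched below. The sample series and baseline are illustrative assumptions.

```python
# Count minutes the fleet sat above its baseline task count after a spike.
def minutes_above(running_tasks, baseline):
    """`running_tasks` is one sample per minute; returns minutes above baseline."""
    return sum(1 for n in running_tasks if n > baseline)

# Baseline is 4 tasks; a spike scaled to 10 and scale-in lagged.
running = [4, 4, 10, 10, 9, 8, 8, 7, 6, 5, 4, 4]
excess = minutes_above(running, baseline=4)  # 8 minutes above baseline
```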
Use a simple measure-change-remeasure loop
- Measure baseline desired tasks, running tasks, retries, and traffic-driven side costs during a representative week.
- Change one scaling lever at a time so the next cost comparison stays readable.
- Remeasure the same busy window and keep only the changes that reduce spend without hurting latency or reliability.
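The keep/revert decision in this loop can be written down as an explicit guardrail check, sketched here. The 5% p95 tolerance and 1% error-rate ceiling are illustrative assumptions; use your own SLOs.

```python
# Accept a scaling change only if it cuts cost without hurting SLOs.
def keep_change(before, after, max_p95_regress=1.05, max_error_rate=0.01):
    """`before`/`after` are dicts with cost_usd, p95_ms, error_rate."""
    return (after["cost_usd"] < before["cost_usd"]
            and after["p95_ms"] <= before["p95_ms"] * max_p95_regress
            and after["error_rate"] <= max_error_rate)

before = {"cost_usd": 1200, "p95_ms": 240, "error_rate": 0.004}
after = {"cost_usd": 1050, "p95_ms": 248, "error_rate": 0.005}
keep = keep_change(before, after)  # True: cheaper, p95 within 5%, errors OK
```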
Sources
- ECS autoscaling: docs.aws.amazon.com
- CloudWatch pricing (logs/metrics often show up here): aws.amazon.com/cloudwatch/pricing