ECS autoscaling cost pitfalls (and how to avoid them)

Autoscaling should reduce cost by matching capacity to demand. When it increases cost instead, the cause is usually one of three things: oscillation, retry storms, or non-compute line items growing with traffic.

1) Noisy signals cause oscillation

  • CPU% can spike briefly (GC, cold caches, bursty traffic) and trigger scale-out.
  • If scale-in is slow or conservative, you spend long periods above the true average.
  • Fix: use smoothing windows, realistic targets, and cooldowns that match task startup time.
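The smoothing idea above can be sketched with a simple exponential moving average: a brief spike barely moves the smoothed value, so the scaler never sees it. This is an illustration, not any specific CloudWatch feature; `alpha` and the sample values are made up.

```python
def smooth(samples, alpha=0.2):
    """Exponential moving average over raw CPU% samples.
    A transient spike (GC, cold cache) barely moves the smoothed value."""
    ema = samples[0]
    out = []
    for s in samples:
        ema = alpha * s + (1 - alpha) * ema
        out.append(ema)
    return out

# Steady ~40% load with one transient 95% spike.
raw = [40, 42, 38, 95, 41, 39, 40]
smoothed = smooth(raw)
```

Here the raw signal peaks at 95% (enough to trip most scale-out alarms), but the smoothed signal stays near 50%, so no scale-out fires for a one-sample blip.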

2) Target utilization is not a "maximize CPU" goal

Many teams set targets too high, then see latency climb and retries kick in. That can increase both compute and non-compute costs.

  • Pick a target that preserves headroom for deploys and bursts.
  • Separate scaling for CPU-bound vs IO-bound services (CPU is not the only bottleneck).
  • Validate with p95 latency and error rate, not only utilization.
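One way to pick a target that preserves headroom: work backwards from how far traffic bursts above average and how much capacity a rolling deploy takes out of rotation. The formula and inputs below are illustrative assumptions, not measured values.

```python
def max_safe_target(burst_factor: float, deploy_capacity_loss: float) -> float:
    """Highest average CPU target that still absorbs a short burst
    while a rolling deploy takes part of the fleet out of rotation.
    E.g. bursts of 1.3x average with 10% of tasks draining during
    deploys means average utilization must stay under (1 - 0.10) / 1.3."""
    return (1.0 - deploy_capacity_loss) / burst_factor

target_pct = round(max_safe_target(burst_factor=1.3, deploy_capacity_loss=0.10) * 100)
```

With those assumptions the ceiling lands around 69%, which is why a "maximize CPU" target of 85-90% leaves no room for bursts or deploys.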

3) Retries multiply traffic (and cost)

  • Timeouts and transient errors trigger client retries and SDK retries.
  • Retries increase request volume, logs, and sometimes egress/NAT.
  • Fix: backoff, circuit breakers, and faster failure detection before scaling reacts.
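The backoff fix above can be sketched as full-jitter exponential backoff with a hard retry cap, so a transient error cannot turn into a sustained traffic multiplier. The parameter values are placeholders to tune per service.

```python
import random

def backoff_delays(max_retries=4, base=0.1, cap=5.0):
    """Full-jitter exponential backoff: delay_n ~ U(0, min(cap, base * 2**n)).
    Caps both the retry count and the per-attempt delay, spreading retries
    out instead of letting clients hammer the service in lockstep."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(max_retries)]

delays = backoff_delays()
```

The cap on `max_retries` matters as much as the delays: with unbounded retries, a 10% transient error rate compounds into noticeably more request volume, and the scaler reacts to that inflated volume rather than real demand.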

4) Hidden line items scale with traffic

  • Logs: ingestion grows with request volume and verbosity.
  • NAT/egress: external calls and downloads can spike costs.
  • Load balancers: capacity units can increase with connections and throughput.
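A back-of-envelope model makes these line items visible: logs and NAT grow linearly with request volume, so a retry storm that doubles requests doubles them too. The per-GB rates and per-request sizes below are placeholders, not actual AWS pricing; substitute your region's real rates.

```python
def traffic_linked_costs(requests_millions, log_kb_per_req, nat_kb_per_req,
                         log_rate_per_gb, nat_rate_per_gb):
    """Monthly line items that grow linearly with request volume (and with
    retries, which multiply request volume). All rates are placeholders."""
    reqs = requests_millions * 1_000_000
    log_gb = reqs * log_kb_per_req / 1024 ** 2   # KB -> GB
    nat_gb = reqs * nat_kb_per_req / 1024 ** 2
    return log_gb * log_rate_per_gb + nat_gb * nat_rate_per_gb

baseline = traffic_linked_costs(100, 2.0, 1.0, 0.50, 0.045)
retry_storm = traffic_linked_costs(200, 2.0, 1.0, 0.50, 0.045)  # 2x requests
```

Note that none of this depends on task count: a retry storm inflates these items even if autoscaling holds capacity flat.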

5) Task sizing mistakes look like "autoscaling problems"

  • Over-sized tasks keep cost high even when scaling works.
  • Under-sized tasks cause timeouts and retries, which trigger scale-out and inflate costs.
  • Fix: size tasks from measured p95 usage and validate headroom.
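Sizing from measured p95 can be as simple as: take the p95 usage, add headroom, and round up to the next available task size. The size options below mirror common Fargate CPU units (in millicores), but treat them as an assumption and swap in your platform's actual options.

```python
def size_task_cpu(p95_millicores, headroom=0.3,
                  options=(256, 512, 1024, 2048, 4096)):
    """Smallest task CPU size that covers measured p95 usage plus headroom.
    Falls back to the largest option if nothing fits."""
    needed = p95_millicores * (1 + headroom)
    for size in options:
        if size >= needed:
            return size
    return options[-1]

# Measured p95 of 600 millicores + 30% headroom = 780 -> next size up.
chosen = size_task_cpu(600)
```

Working from p95 rather than peak avoids over-sizing for rare spikes (those are what scale-out is for), while the headroom term avoids the under-sizing that causes timeouts and retry-driven scale-out.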

Stability checklist (quick wins)

  • Scale-out should react faster than scale-in (avoid immediate oscillation).
  • Match cooldowns to task startup time (slow startup + fast scale-in causes flapping).
  • Use multiple signals for safety (latency/error rate + CPU), not CPU alone.
  • Model a "busy month" scenario: deploys and incidents change scaling behavior.
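One way to encode "scale out faster than scale in" is a target-tracking policy with asymmetric cooldowns. A sketch using the AWS CLI; the cluster name, service name, target value, and cooldown numbers are placeholders to tune against your measured task startup time:

```shell
# Target-tracking on average service CPU with asymmetric cooldowns:
# scale-out can react after 60s, scale-in waits 300s.
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/my-service \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 60.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'
```

If tasks take two minutes to become healthy, a 60-second scale-in cooldown would start removing capacity before new tasks finish starting; keeping `ScaleInCooldown` well above startup time is what prevents flapping.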

Validation checklist

  • Compare desired vs running tasks: do you spend long periods above baseline after spikes?
  • Track retries/timeouts during spikes (cost and reliability signal).
  • Track log ingestion GB/day and NAT processed GB during scaling events.
  • After changes, validate p95 latency and error rate during a busy window.
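The first check in the list above (time spent above baseline after spikes) is easy to compute from per-minute samples of running task count. The sample series below is invented for illustration.

```python
def minutes_above_baseline(running_tasks, baseline):
    """Given per-minute samples of running task count, count how long the
    service lingered above its steady-state baseline -- a quick way to see
    whether scale-in is keeping up after a spike."""
    return sum(1 for n in running_tasks if n > baseline)

# One spike to 9 tasks, then a slow one-task-per-minute scale-in.
samples = [4, 4, 9, 9, 8, 7, 6, 5, 4, 4]
lingering = minutes_above_baseline(samples, baseline=4)
```

If this number is large relative to the spike itself, you are paying for the scale-in policy, not the traffic; that points at cooldowns or step sizes rather than the scale-out trigger.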

Related guides

ECS cost model beyond compute: the checklist that prevents surprise bills
A practical ECS cost model checklist beyond compute: load balancers, logs/metrics, NAT/egress, cross-AZ transfer, storage, and image registry behavior. Use it to avoid underestimating total ECS cost.
ECS vs EKS cost: a practical checklist (compute, overhead, and add-ons)
Compare ECS vs EKS cost with a consistent checklist: compute model, platform overhead, scaling behavior, and the line items that often dominate (load balancers, logs, data transfer).
AWS Fargate pricing (cost model + pricing calculator)
A practical Fargate pricing guide and calculator companion: what drives compute cost (vCPU-hours + GB-hours), how to estimate average running tasks, and the non-compute line items that usually matter (logs, load balancers, data transfer).
EC2 cost estimation: a practical model (compute + the hidden line items)
A practical EC2 cost estimation guide: model instance-hours with uptime and blended rates, then add the hidden line items that often dominate (EBS, snapshots, load balancers, NAT/egress, logs).
Fargate vs EC2 cost: how to compare compute, overhead, and hidden line items
A practical Fargate vs EC2 cost comparison: normalize workload assumptions, compare unit economics (vCPU/memory-hours vs instance-hours), and include the line items that change the answer (idle capacity, load balancers, logs, transfer).
Lambda vs Fargate cost: a practical comparison (unit economics)
Compare Lambda vs Fargate cost with unit economics: cost per 1M requests (Lambda) versus average running tasks (Fargate), plus the non-compute line items that often dominate (logs, load balancers, transfer).

FAQ

Why does ECS autoscaling increase cost unexpectedly?
Because scaling triggers can be noisy and overreact to transient spikes. Without correct targets and cooldowns, the system can oscillate and spend most of the time above average capacity.
What non-compute costs grow with autoscaling?
Logs (ingestion), NAT/egress, and load balancer capacity can scale with traffic and retries, not just task count.

Last updated: 2026-01-27