Azure NAT Gateway cost: model hours, GB processed, and the real spike drivers
NAT Gateway pricing is simple on paper and tricky in production. The math is "hours + GB processed", but the bill is driven by what traverses NAT during peak periods: deploys, scale-outs, incidents, and dependency retries. This guide helps you build a model you can validate and improve.
0) Inventory what routes through NAT (the boundary)
Before you estimate anything, list the outbound paths that actually use NAT.
- Which subnets send their default outbound traffic through NAT (i.e., have a NAT gateway associated, or a route that ends there)?
- Which workloads live in those subnets (AKS nodes, VMSS, build runners, jump boxes)?
- Which destinations are hit at scale (container registries, package repos, external APIs, SaaS, telemetry)?
If you cannot answer this, you will estimate the wrong thing. NAT costs are not "internet egress" in general; they are "traffic that your routing sends through NAT".
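One way to keep this inventory honest is to write it down as data rather than prose. A minimal sketch (subnet, workload, and destination names are illustrative placeholders, not your real topology):

```python
# Hypothetical inventory of outbound paths that traverse NAT.
# Every name below is an illustrative assumption.
nat_inventory = [
    {"subnet": "aks-nodes", "workloads": ["AKS node pool"],
     "destinations": ["container registry", "package repos", "telemetry"]},
    {"subnet": "ci-runners", "workloads": ["build runners"],
     "destinations": ["package repos", "external APIs"]},
]

# Anything not in this list should not be attributed to NAT cost.
for entry in nat_inventory:
    print(entry["subnet"], "->", ", ".join(entry["destinations"]))
```

The point of the structure is the negative space: if a flow is not in the inventory, it should not appear in your NAT estimate.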
1) Baseline hours (count NAT gateways)
The baseline is the number of NAT gateways times hours per month. If you have multiple environments or regions, model each separately (prod/stage/dev). If your architecture is hub-and-spoke, be explicit about whether one NAT is shared or whether each spoke has its own.
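The baseline math is a one-liner; a sketch with a placeholder rate (check your region's current pricing, the figure below is an assumption, not the real price):

```python
# Baseline NAT gateway cost: count × hours in the billing month × hourly rate.
HOURLY_RATE_USD = 0.045  # placeholder assumption; use your region's price

def baseline_hours_cost(gateway_count: int, days_in_month: int = 30) -> float:
    hours = 24 * days_in_month
    return gateway_count * hours * HOURLY_RATE_USD

# e.g. prod + stage + dev, one gateway each
print(round(baseline_hours_cost(3), 2))
```

Model each environment and region as its own line item so a shared hub NAT and per-spoke NATs do not get blended together.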
2) GB processed (the throughput driver)
The practical way to estimate GB processed is to list the big outbound flows and estimate each flow's monthly GB. Do not guess with one blended number if you have a few dominant flows.
- Container image pulls: AKS/VMSS churn, new node pools, autoscaling.
- Package downloads: OS updates, language deps, CI caches.
- External APIs: request volume × response sizes (retries amplify).
- Telemetry/log shipping: logs and metrics exporters often run "always on".
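The flow-by-flow approach from the list above can be sketched as a small table you sum, instead of one blended number. All GB figures here are illustrative assumptions you would replace with measured values:

```python
# Estimate monthly GB processed as a sum of dominant flows.
# Every figure below is an assumption for illustration.
flows_gb_per_month = {
    "container_image_pulls": 400,   # node churn × image size
    "package_downloads": 150,       # OS + language deps, CI cache misses
    "external_apis": 250,           # request volume × avg response size
    "telemetry_shipping": 300,      # always-on log/metric exporters
}

baseline_gb = sum(flows_gb_per_month.values())
print(baseline_gb)
```

Keeping flows separate also tells you which lever (cache, proxy, retry fix) attacks which line item.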
3) Model the peak month (retries + churn)
A realistic model has at least two scenarios: baseline and peak. Peak is where NAT costs surprise teams.
- Retries/timeouts: each retry repeats the full payload transfer; add a multiplier for incident windows.
- Node churn: new nodes pull images and dependencies with cold caches.
- Cold-start spikes: deployment rollouts can temporarily increase outbound dependency calls.
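The peak scenario combines these effects: a retry multiplier on the affected flows plus a one-off churn add-on for cold caches. A sketch with assumed numbers:

```python
# Peak-month add-on: retries repeat full payload transfers on the
# affected flows, and new nodes pull images/deps with cold caches.
# All inputs are illustrative assumptions.
def peak_gb(affected_flow_gb: float, retry_rate: float,
            churn_nodes: int, gb_per_cold_node: float) -> float:
    retried = affected_flow_gb * (1 + retry_rate)  # retries are paid traffic
    churn = churn_nodes * gb_per_cold_node         # cold-cache pulls
    return retried + churn

# 250 GB of API traffic with 20% retries during an incident window,
# plus 50 new nodes each pulling ~4 GB cold.
print(peak_gb(250, 0.20, 50, 4.0))
```

Run the model once with retry_rate = 0 and once with your worst observed incident window; the gap between the two is the surprise you are pricing in.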
4) Practical levers to reduce cost (without breaking security)
- Cache the big outbound flows: registry mirrors, package proxies, artifact caching.
- Move large dependencies off the hot path: avoid downloading large artifacts at runtime.
- Reduce retries: fix timeout budgets and backoff; retries are paid traffic.
- Consider private access for high-volume Azure services (compare against Private Link).
Related: Azure Private Link costs.
Worked estimate template (copy/paste)
- NAT gateways = count per env/region
- Hours/month = 24 × days in the month (≈730 averaged over a year)
- Baseline GB/month = sum of big outbound flows routed via NAT
- Peak add-on GB = deploy + incident windows (image pulls + retries)
- Retry multiplier = 1 + retry_rate (apply to the affected flows)
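The template above can be turned into one function. Rates and volumes below are placeholders; substitute your region's pricing and the measured GB from your own flows:

```python
# Worked estimate combining the template. Both rates are assumptions;
# look up current per-hour and per-GB pricing for your region.
HOURLY_RATE = 0.045   # assumed $/gateway-hour
GB_RATE = 0.045       # assumed $/GB processed

def nat_monthly_cost(gateways: int, days: int,
                     baseline_gb: float, peak_addon_gb: float,
                     retry_rate: float, retry_affected_gb: float) -> float:
    hours_cost = gateways * 24 * days * HOURLY_RATE
    retry_extra = retry_affected_gb * retry_rate   # only the affected flows
    gb_total = baseline_gb + peak_addon_gb + retry_extra
    return hours_cost + gb_total * GB_RATE

print(round(nat_monthly_cost(
    gateways=3, days=30,
    baseline_gb=1100, peak_addon_gb=400,
    retry_rate=0.2, retry_affected_gb=250), 2))
```

Because the retry multiplier applies only to the affected flows, it is passed separately rather than folded into the blended baseline.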
How to validate
- Validate routing: which subnets and workloads are actually using NAT.
- Validate the top outbound destinations and their bytes in a representative week.
- Validate deploy/scale-out periods: compare GB/hour during peak vs baseline.
- After changes, re-measure outbound GB and confirm the model moves in the same direction as the bill.