Azure Monitor metrics pricing: estimate custom metrics, retention, and API calls
Metrics costs scale as "time series * frequency * retention". Costs spike when you accidentally create a huge number of distinct series (high cardinality) or when dashboards/alerts query wide time windows frequently. A good estimate makes cardinality explicit instead of hoping it stays small.
0) Define what a "time series" is in your model
A time series is a unique metric name plus a unique combination of dimension/label values. If you add dimensions like pod, container, path, or customerId, the number of unique combinations can explode.
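A small sketch makes this concrete. The metric name and dimension values below are hypothetical; the point is that every distinct combination of dimension values is its own series, even for a single metric name:

```python
from itertools import product

# Hypothetical dimension value sets for one metric, http_requests_total.
dimensions = {
    "environment": ["dev", "staging", "prod"],
    "region": ["eastus", "westeurope"],
    "status_class": ["2xx", "4xx", "5xx"],
}

# Each unique combination of dimension values is a distinct time series.
series = list(product(*dimensions.values()))
print(len(series))  # 3 * 2 * 3 = 18 series for a single metric name
```

Add one more dimension with 100 values and the same metric jumps to 1,800 series.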
1) Estimate time series count (cardinality)
Model cardinality explicitly. A simple approximation is: time_series ~= metrics * (dim1_values * dim2_values * ...).
- Good dimensions: environment, region, service (bounded sets).
- Dangerous dimensions: userId, requestId, URL path, pod name (unbounded or high churn).
- If you need per-entity detail, consider sampling or aggregating before emitting metrics.
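The approximation above can be wrapped in a tiny helper. The metric and dimension counts below are made up for illustration; the result is an upper bound, since it assumes every dimension combination actually occurs:

```python
from math import prod

def estimate_series(num_metrics: int, dim_value_counts: list[int]) -> int:
    """Approximate series count: metrics * product of dimension value counts.
    Upper bound: assumes every dimension combination actually occurs."""
    return num_metrics * prod(dim_value_counts)

# Bounded dimensions: 3 environments * 4 regions * 10 services.
safe = estimate_series(20, [3, 4, 10])        # 2,400 series

# One high-churn dimension (e.g. ~500 pod names) multiplies everything.
risky = estimate_series(20, [3, 4, 10, 500])  # 1,200,000 series
print(safe, risky)
```

Running both scenarios side by side is usually enough to show stakeholders why a single unbounded dimension dominates the estimate.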
2) Sample rate (frequency)
Sample rate multiplies ingestion volume. Going from 60s to 10s is a 6x increase. For planning, model both a "normal" and a "high-frequency" scenario (and justify why you need high frequency).
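A sketch of the two scenarios, using the hypothetical 2,400-series estimate from above and a 30-day month:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 (30-day month)

def samples_per_month(series: int, interval_s: int) -> int:
    """Ingested samples per month at a fixed per-series sample interval."""
    return series * (SECONDS_PER_MONTH // interval_s)

series = 2_400
normal = samples_per_month(series, 60)  # "normal": one sample per minute
high = samples_per_month(series, 10)    # "high-frequency": every 10 seconds
print(normal, high, high // normal)     # high is 6x normal
```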
3) Retention
Retention is a storage multiplier. Long retention can be expensive if you store high-resolution data for months. A common pattern is: keep high-res for days, downsample or keep aggregates for weeks/months.
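The tiered pattern can be modeled the same way. The policy below (7 days at full 60s resolution, then 5-minute rollups out to 90 days) is an example, not a recommendation:

```python
def stored_samples(series: int, interval_s: int, highres_days: int,
                   downsample_interval_s: int, total_days: int) -> int:
    """Samples retained under a tiered policy: full resolution for
    highres_days, then downsampled aggregates for the rest of total_days."""
    highres = series * (highres_days * 86400 // interval_s)
    down = series * ((total_days - highres_days) * 86400 // downsample_interval_s)
    return highres + down

series = 2_400
tiered = stored_samples(series, 60, 7, 300, 90)  # 7d at 60s, then 5m rollups
flat = series * (90 * 86400 // 60)               # 90d at full 60s resolution
print(tiered, flat)  # tiered stores a fraction of what flat retention does
```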
4) Dashboards, alerts, and API calls
Repeated queries (dashboards refreshing every minute, alert rules scanning 24h windows) can create significant query/API load. Model refresh frequency explicitly.
- A dashboard refreshing every minute is 1,440 refreshes/day.
- An alert evaluating every minute with a 24h window repeatedly scans the same historical data.
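Query load per day falls out of the refresh cadence directly. Remember that each panel on a dashboard typically issues its own query per refresh (panel counts below are illustrative):

```python
def daily_queries(refresh_interval_s: int, panels: int = 1) -> int:
    """Queries issued per day by one dashboard or alert rule
    (each panel fires its own query on every refresh)."""
    return (86400 // refresh_interval_s) * panels

dashboard = daily_queries(60, panels=12)  # 1-minute refresh, 12 panels
alert = daily_queries(60)                 # 1-minute alert evaluation
print(dashboard, alert)                   # 17,280 and 1,440 queries/day
```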
Worked estimate template (copy/paste)
- Time series = metrics * product(dim value counts)
- Samples/month = time series * samples/minute * minutes/month
- Retention = retention days (split high-res vs downsampled if applicable)
- Query load = dashboards/day + alerts/day (include refresh cadence)
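The template above can be turned into a runnable worksheet. All inputs below are hypothetical placeholders; substitute your own workload numbers:

```python
from math import prod

# --- Inputs (hypothetical; replace with your own workload numbers) ---
metrics = 20
dim_value_counts = [3, 4, 10]      # e.g. environment, region, service
samples_per_minute = 1             # one sample per series per minute
minutes_per_month = 30 * 24 * 60   # 43,200 (30-day month)
dashboard_refreshes_per_day = 1_440  # one dashboard, 1-minute refresh
alert_evals_per_day = 1_440          # one alert rule, 1-minute evaluation

# --- Template ---
time_series = metrics * prod(dim_value_counts)
samples_month = time_series * samples_per_minute * minutes_per_month
query_load_per_day = dashboard_refreshes_per_day + alert_evals_per_day

print(time_series, samples_month, query_load_per_day)
```

Keep the inputs in version control next to the dashboards they describe, so the estimate can be re-run when dimensions or refresh rates change.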
Common pitfalls
- Unbounded or high-churn dimensions (customerId, requestId, pod name) causing cardinality explosion.
- Using high-frequency sampling everywhere instead of only where it adds value.
- Keeping long retention for high-resolution data by default.
- Dashboards/alerts querying wide windows with very frequent refresh.
- Emitting per-request metrics (too granular) instead of aggregating.
How to validate
- List top dimensions and estimate their unique value counts (bounded vs unbounded).
- Validate emit/scrape intervals across environments (dev often differs from prod).
- Audit dashboards: refresh intervals, time windows, and number of panels (queries multiply).
- Audit alerts: evaluation frequency and window sizes (avoid repeated wide scans).