Cloud Monitoring metrics pricing (GCP): time series, sample rate, and retention
Metrics systems are "time series × frequency × retention". Costs spike when you accidentally create too many unique time series (high cardinality) or when dashboards/alerts query wide windows frequently. A good estimate makes cardinality explicit instead of hoping it stays small.
0) Define what a time series is
A time series is a unique metric name plus a unique combination of dimension/label values. If you add dimensions like pod, container, path, or customerId, the number of unique combinations can explode.
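The "unique combination" rule is easy to demonstrate: every distinct tuple of label values is its own series. A minimal sketch, using hypothetical label sets for a single metric:

```python
from itertools import product

# Hypothetical label sets for ONE metric, e.g. http_request_count.
labels = {
    "environment": ["dev", "staging", "prod"],
    "region": ["us-east1", "europe-west1"],
    "status_class": ["2xx", "4xx", "5xx"],
}

# Each unique combination of label values is a distinct time series.
series = [dict(zip(labels, combo)) for combo in product(*labels.values())]
print(len(series))  # 3 * 2 * 3 = 18 series for a single metric
```

Adding one more label with 100 values would turn those 18 series into 1,800, which is why unbounded labels are the first thing to audit.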
1) Estimate cardinality (time series count)
Model cardinality explicitly. A simple approximation is: series ~= metrics * (dim1_values * dim2_values * ...). Treat this as an upper bound, since not every label combination necessarily occurs in practice.
- Safe dimensions: environment, region, service (bounded sets).
- Dangerous dimensions: requestId, userId, URL path, pod name (unbounded or high churn).
- If you need per-entity detail, consider sampling or aggregating before emitting metrics.
Tool: Metrics time series cost calculator.
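The approximation above can be sketched directly. The metric and dimension counts below are illustrative assumptions, not measurements:

```python
from math import prod

def estimated_series(n_metrics: int, dim_cardinalities: list[int]) -> int:
    """Upper-bound estimate: assumes every label combination occurs."""
    return n_metrics * prod(dim_cardinalities)

# Bounded dimensions only: environment(3) x region(4) x service(20)
safe = estimated_series(50, [3, 4, 20])        # 12,000 series
# Add one high-churn dimension, e.g. pod name (~500 values at a time):
risky = estimated_series(50, [3, 4, 20, 500])  # 6,000,000 series
print(safe, risky)
```

One unbounded dimension multiplied the estimate by 500. This is the lever to check first when a metrics bill spikes.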
2) Sample rate (frequency)
Sample rate multiplies ingestion volume. Going from 60s to 10s is a 6× increase. Model both a "normal" and a "high-frequency" scenario and justify why you need high frequency.
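The 6x claim falls straight out of the arithmetic. A small sketch, assuming the 12,000-series estimate from above and a 30-day month:

```python
def samples_per_month(series: int, interval_s: int, days: int = 30) -> int:
    """Ingested samples for a given scrape/emit interval."""
    return series * (days * 24 * 3600) // interval_s

normal = samples_per_month(12_000, 60)  # 60s interval
fast = samples_per_month(12_000, 10)    # 10s interval
print(normal, fast, fast / normal)      # the ratio is exactly 6.0
```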
3) Retention
Retention is a storage multiplier. Long retention can be expensive if you store high-resolution data for months. A common pattern is: keep high-res for days, keep downsampled aggregates for weeks/months.
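The split-retention pattern can be modeled as two tiers. The intervals and day counts below are illustrative assumptions:

```python
def stored_points(series: int, high_res_interval_s: int, high_res_days: int,
                  downsampled_interval_s: int, total_days: int) -> tuple[int, int]:
    """Points retained in the high-res tier vs the downsampled tier."""
    high = series * high_res_days * 86_400 // high_res_interval_s
    low = series * (total_days - high_res_days) * 86_400 // downsampled_interval_s
    return high, low

# 12,000 series: 7 days at 10s resolution, then 5-minute rollups to day 90.
high, low = stored_points(12_000, 10, 7, 300, 90)
print(high, low)  # 725,760,000 vs 286,848,000 points
```

Note that one week of 10s data holds more points than the following 83 days of rollups, which is why defaulting to long high-resolution retention is expensive.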
4) Dashboards and alerts (repeated queries)
Dashboards refreshing frequently and alerts scanning wide windows can create repeated query load. Treat refresh rates and window sizes as explicit drivers.
- A dashboard refreshing every minute is 1,440 refreshes/day.
- An alert evaluating every minute with a 24h window repeatedly re-scans the same historical data.
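The refresh math above is worth making explicit, because panels multiply it. A sketch with assumed refresh rates and panel counts:

```python
def daily_queries(refresh_interval_min: int, panels: int = 1) -> int:
    """Queries per day for one dashboard or alert rule."""
    return (24 * 60 // refresh_interval_min) * panels

dashboard = daily_queries(1, panels=12)  # 1-min refresh, 12 panels
alert = daily_queries(1)                 # 1-min evaluation, 1 query
print(dashboard, alert)  # 17280 1440
```

A single always-open 12-panel dashboard issues over 17,000 queries a day; widening each query's time window multiplies the data scanned on top of that.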
Worked estimate template (copy/paste)
- Time series = metrics × product(dim value counts)
- Samples/month = time series × samples/minute × minutes/month (~43,200 for a 30-day month)
- Retention = retention days (split high-res vs downsampled if applicable)
- Query load = dashboards/day + alerts/day (include refresh cadence)
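The template above can be wired into a single runnable estimate. All inputs here are illustrative assumptions; swap in your own counts:

```python
from math import prod

# --- Inputs (illustrative assumptions, not real workloads or pricing) ---
N_METRICS = 50
DIM_CARDINALITIES = [3, 4, 20]   # environment, region, service
SAMPLE_INTERVAL_S = 60
DASHBOARDS = [(1, 12)]           # (refresh interval in minutes, panels)
ALERTS = [(1, 1)]                # (evaluation interval in minutes, queries per eval)

# Time series = metrics x product(dim value counts)
series = N_METRICS * prod(DIM_CARDINALITIES)
# Samples/month, assuming a 30-day month
samples_month = series * (30 * 24 * 3600) // SAMPLE_INTERVAL_S
# Query load/day across dashboards and alerts
queries_day = sum(24 * 60 // interval * n for interval, n in DASHBOARDS + ALERTS)

print(f"time series:   {series:,}")
print(f"samples/month: {samples_month:,}")
print(f"queries/day:   {queries_day:,}")
```

Multiply the outputs by your provider's per-sample and per-query rates to get a cost range; keep a "normal" and a "high-frequency" input set side by side.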
Common pitfalls
- Unbounded or high-churn dimensions causing cardinality explosions.
- Using high-frequency sampling everywhere instead of only where it adds value.
- Keeping long retention for high-resolution data by default.
- Dashboards/alerts querying wide windows with very frequent refresh.
- Emitting per-request metrics instead of aggregating.
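The last two pitfalls share one fix: aggregate before emitting. A minimal sketch of pre-aggregation, where `route_template` is a hypothetical helper that collapses unbounded URL paths into a bounded set of route labels:

```python
from collections import Counter

# Per-request emission creates one point per request; with a requestId
# label it creates one SERIES per request. Aggregate first instead.
requests = [
    {"requestId": "a1", "path": "/users/1", "status": 200},
    {"requestId": "b2", "path": "/users/2", "status": 200},
    {"requestId": "c3", "path": "/orders/9", "status": 500},
]

def route_template(path: str) -> str:
    # Assumption for illustration: the first path segment names the route.
    return "/" + path.strip("/").split("/")[0] + "/{id}"

# One counter point per (route, status class) per interval, instead of
# one point per request. requestId is dropped entirely.
counts = Counter((route_template(r["path"]), r["status"] // 100) for r in requests)
print(dict(counts))  # {('/users/{id}', 2): 2, ('/orders/{id}', 5): 1}
```

Three requests collapse to two series regardless of traffic volume, because the label set is bounded by routes and status classes, not by requests.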
How to validate
- List top dimensions and estimate unique value counts (bounded vs unbounded).
- Validate emit/scrape intervals across environments (dev often differs from prod).
- Audit dashboards: refresh intervals, time windows, number of panels (queries multiply).
- Audit alerts: evaluation frequency and window sizes (avoid repeated wide scans).