How to reduce logging and observability costs (without losing signal)
Observability bills usually grow because volume and cardinality grow silently. The highest-leverage reductions remove low-value data before it’s ingested, while keeping the high-value signals (errors, latency, key business events) reliable for debugging.
What to measure first (so you optimize the right thing)
- Log ingestion: GB/day by source (service, endpoint, environment)
- Retention: days kept per log group / category
- Query/scan: how much data you scan or search (often billed separately from ingestion)
- Metrics series: time series count and top high-cardinality labels
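The four measurements above can be combined into a rough monthly cost model before you start cutting. A minimal sketch; every rate below is a hypothetical placeholder, not any vendor's real pricing:

```python
# Rough monthly observability cost model. All $ rates are hypothetical
# placeholders -- substitute your vendor's actual pricing.

def monthly_cost(ingest_gb_per_day, retention_days, scan_gb_per_day,
                 metric_series,
                 ingest_per_gb=0.50, storage_per_gb_month=0.03,
                 scan_per_gb=0.005, series_per_month=0.05):
    ingest = ingest_gb_per_day * 30 * ingest_per_gb
    # Steady state: retained volume ~= daily ingest * retention days
    storage = ingest_gb_per_day * retention_days * storage_per_gb_month
    scan = scan_gb_per_day * 30 * scan_per_gb
    metrics = metric_series * series_per_month
    return {"ingest": ingest, "storage": storage, "scan": scan,
            "metrics": metrics,
            "total": ingest + storage + scan + metrics}

costs = monthly_cost(ingest_gb_per_day=200, retention_days=30,
                     scan_gb_per_day=500, metric_series=40_000)
```

A model like this makes the levers comparable: you can see immediately whether ingestion, retention, scanning, or metric series dominates your bill.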
High-impact levers (priority order)
- Source-side filtering: drop noisy logs before shipping (debug/info for hot paths, health checks).
- Sampling: keep full fidelity for errors; sample high-volume success logs.
- Retention strategy: short retention for noisy logs, longer for sparse high-value logs.
- Format hygiene: smaller structured logs (avoid huge JSON payloads, duplicate fields).
- Cardinality hygiene: stop exploding metric labels (userId, requestId, full URL, stack traces as labels).
1) Reduce volume at the source (the “saves everywhere” fix)
- Turn off debug logs by default; use timeboxed flags for temporary debugging.
- Drop low-value endpoints: health checks, static asset requests, frequent polling.
- Collapse repetitive logs: “request started” + “request finished” can be merged or sampled.
- Prefer counters and metrics for high-frequency events instead of logs.
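One way to apply source-side filtering is a log filter attached before the shipping handler. A sketch using Python's standard logging module; the path list and logger name are illustrative assumptions:

```python
import logging

class DropNoise(logging.Filter):
    """Drop debug logs and noisy-endpoint access logs before they ship."""
    NOISY_PATHS = ("/healthz", "/ping", "/static/")  # hypothetical examples

    def filter(self, record):
        # Drop DEBUG everywhere; keep a timeboxed flag to re-enable it.
        if record.levelno <= logging.DEBUG:
            return False
        # Drop INFO-level access logs for health checks and static assets.
        msg = record.getMessage()
        if record.levelno == logging.INFO and any(p in msg for p in self.NOISY_PATHS):
            return False
        return True

handler = logging.StreamHandler()
handler.addFilter(DropNoise())
logging.getLogger("access").addHandler(handler)
```

Filtering at the handler (or in the shipping agent's config) means the dropped lines never reach the ingestion pipeline, so they cost nothing downstream.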
2) Sample without losing signal
Sampling is safe when you keep the important paths unsampled:
- Never sample: errors, warnings, security events, deployment markers.
- Sample: successful requests on hot endpoints, verbose debug/info logs.
- Keep traces: consider tracing for deep investigations instead of logging full payloads.
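These rules can be expressed as a small sampling decision. A sketch, assuming a per-request trace ID is available; hash-based sampling keeps or drops all logs for a request together, so sampled requests remain fully debuggable:

```python
import hashlib

SUCCESS_SAMPLE_RATE = 0.01  # hypothetical: keep 1% of hot-path success logs

def should_keep(level, trace_id):
    # Never sample errors, warnings, or security events.
    if level in ("ERROR", "WARNING", "SECURITY"):
        return True
    # Deterministic hash-based sampling: the same trace_id always gets
    # the same decision, so a kept request keeps all of its log lines.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < SUCCESS_SAMPLE_RATE * 10_000
```

Deterministic sampling by trace ID is usually preferable to random per-line sampling, which leaves you with fragments of many requests and the whole of none.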
3) Retention tiers (keep what matters longer)
- Hot: short retention for high-volume logs you rarely query.
- Warm: medium retention for operational investigations.
- Archive: long retention only for compliance and sparse high-value events.
A common mistake is uniform retention across everything; it turns low-value noise into a steady storage bill.
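The storage impact of tiering is easy to model at steady state (retained volume ≈ daily ingest × retention days). A sketch with hypothetical volumes and a placeholder $/GB-month rate:

```python
STORAGE_PER_GB_MONTH = 0.03  # hypothetical $/GB-month placeholder

def retained_storage_cost(sources):
    """sources: iterable of (gb_per_day, retention_days) pairs.
    Steady-state monthly storage cost across all tiers."""
    return sum(gb_per_day * retention_days * STORAGE_PER_GB_MONTH
               for gb_per_day, retention_days in sources)

# Hypothetical split: 180 GB/day of noisy access logs, 20 GB/day of
# sparse high-value events.
uniform = retained_storage_cost([(180, 90), (20, 90)])   # everything at 90 days
tiered  = retained_storage_cost([(180, 7), (20, 365)])   # hot 7d, archive 1 year
```

In this sketch the tiered plan retains the high-value logs four times longer and still costs less, because the noisy bulk stops accumulating.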
4) Fix the “incident multipliers”
- Retry storms: retries multiply both traffic and logs; fix timeouts and backoff.
- Deploy storms: temporary error spikes can create huge log bursts.
- Bot/noise: abusive traffic creates large access log volume with little business value.
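For retry storms specifically, capped exponential backoff with jitter bounds both the retry traffic and the log lines each retry emits. A minimal sketch with illustrative defaults:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0):
    """Yield delays for capped exponential backoff with full jitter.
    Bounding retries bounds the log volume a failing dependency can generate."""
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))
```

Full jitter also spreads retries out in time, so a downstream outage produces a plateau of retry logs rather than synchronized bursts.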
5) Metrics cardinality hygiene (avoid the silent series explosion)
- Don’t label metrics with high-uniqueness fields (requestId, userId, full path, stack trace).
- Bucket values (status family, route template, tenant tier) instead of raw identifiers.
- Audit top label keys and explicitly ban the worst offenders.
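The bucketing rule can live in one small helper so raw identifiers never reach a metric label. A sketch; the field names are illustrative assumptions:

```python
def bucket_labels(status_code, route_template, tenant_tier):
    """Build low-cardinality metric labels: families and templates,
    never raw identifiers like userId, requestId, or full URLs."""
    return {
        "status_family": f"{status_code // 100}xx",  # 404 -> "4xx"
        "route": route_template,                     # "/users/{id}", not the full path
        "tenant_tier": tenant_tier,                  # "free"/"pro", not tenantId
    }
```

With a chokepoint like this, banning an offending label is a one-line code review comment instead of a hunt through every instrumented call site.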
Validate in a “before/after week” report
- Ingestion GB/day by top sources (did the intended source actually drop?)
- Scan/query volume (did you reduce searchable data without breaking investigations?)
- Error detection (do you still catch and diagnose incidents fast?)
- Metrics series count (did cardinality stabilize?)
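The before/after report can be a plain percentage-change table over the four checks above. A sketch with made-up numbers, where negative percentages are reductions:

```python
def before_after(before, after):
    """Percentage change per metric between two comparable weeks.
    Negative values are reductions; detection metrics should stay flat."""
    return {key: round(100 * (after[key] - before[key]) / before[key], 1)
            for key in before}

# Hypothetical week-over-week numbers for illustration.
change = before_after(
    {"ingest_gb_day": 200, "scan_gb_day": 500, "series": 40_000, "errors_detected": 12},
    {"ingest_gb_day": 130, "scan_gb_day": 310, "series": 41_000, "errors_detected": 12},
)
```

The shape you want: big negative numbers on ingestion and scan, a flat series count, and no change in error detection.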
Related guides
AWS CloudWatch Metrics Pricing & Cost Guide
CloudWatch metrics cost model: custom metrics, API requests, dashboards, and retention.
CloudWatch Logs pricing: ingestion, retention, and queries
A practical CloudWatch Logs pricing guide: model ingestion (GB/day), retention (GB-month), and query/scan costs (Insights/Athena). Includes pitfalls and a validation checklist.
CloudWatch Logs Insights cost optimization (reduce GB scanned)
A practical playbook to reduce CloudWatch Logs Insights costs: measure GB scanned, fix query patterns, time-bound dashboards, and avoid repeated incident scans.
CloudWatch metrics cost optimization: reduce custom metric sprawl
A practical playbook to reduce CloudWatch metrics costs: control custom metric cardinality, right-size resolution, reduce API polling, and validate observability coverage.
API Gateway access logs cost: how to estimate ingestion and retention
A practical guide to estimate API Gateway access logs cost: estimate average bytes per request, convert to GB/day, model retention (GB-month), and reduce log spend safely.
CloudFront logs cost: estimate storage, retention, and queries
How to estimate CloudFront log costs: log volume (GB/day), retention (GB-month), and downstream query/scan costs (Athena/SIEM). Includes practical cost-control levers.
Related calculators
Log Cost Calculator
Estimate total log costs: ingestion, storage, and scan/search.
Log Ingestion Cost Calculator
Estimate monthly log ingestion cost from GB/day or from event rate and $/GB pricing.
Log Retention Storage Cost Calculator
Estimate retained log storage cost from GB/day, retention days, and $/GB-month pricing.
Log Search Scan Cost Calculator
Estimate monthly scan charges from GB scanned per day and $/GB pricing.
Metrics Time Series Cost Calculator
Estimate monthly metrics cost from active series and $ per series-month pricing.
CloudWatch Metrics Cost Calculator
Estimate CloudWatch metrics cost from custom metrics, alarms, dashboards, and API requests.
FAQ
What usually drives logging cost spikes?
Ingestion volume and cardinality. Common causes are debug logs left on, noisy endpoints, retry storms during incidents, high-cardinality labels in metrics, and verbose access logs.
Is sampling safe?
Sampling is safest when you keep full fidelity for errors and important events, and sample low-value logs (debug/info) at the source. Validate that your core investigations still work.
What’s the best first lever?
Source-side filtering. Dropping low-value logs before ingestion reduces ingestion, retention, and query costs together.
How do I quantify savings quickly?
Estimate current ingestion GB/day and retention days. Model a reduction factor (e.g., -30% ingestion) and compare monthly costs before/after.
Last updated: 2026-01-27