How to reduce logging and observability costs (without losing signal)

Observability bills usually grow because volume and cardinality grow silently. The highest-leverage reductions come from removing low-value data before it’s ingested, while keeping the high-value signals (errors, latency, key business events) reliable for debugging.

What to measure first (so you optimize the right thing)

  • Log ingestion: GB/day by source (service, endpoint, environment)
  • Retention: days kept per log group / category
  • Query/scan: how much you scan/search (this can be a separate bill)
  • Metrics series: time series count and top high-cardinality labels
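To make the first bullet concrete, a minimal sketch of measuring ingestion by source: aggregate serialized log size per service from a sample of structured log lines. The `"service"` field name is an assumption; adapt it to your log schema.

```python
from collections import defaultdict

def ingestion_by_source(log_lines):
    """Approximate ingestion bytes per source from structured log lines.

    Assumes each line is a dict with a 'service' key (an assumption --
    substitute whatever identifies a source in your pipeline).
    """
    totals = defaultdict(int)
    for line in log_lines:
        # Serialized size approximates what the vendor bills for ingestion.
        totals[line["service"]] += len(str(line))
    return dict(totals)
```

Run this over a day's sample and sort descending: the top two or three sources usually dominate the bill, which tells you where filtering pays off first.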

High-impact levers (priority order)

  1. Source-side filtering: drop noisy logs before shipping (debug/info for hot paths, health checks).
  2. Sampling: keep full fidelity for errors; sample high-volume success logs.
  3. Retention strategy: short retention for noisy logs, longer for sparse high-value logs.
  4. Format hygiene: smaller structured logs (avoid huge JSON payloads, duplicate fields).
  5. Cardinality hygiene: stop exploding metric labels (userId, requestId, full URL, stack traces as labels).

1) Reduce volume at the source (the “saves everywhere” fix)

  • Turn off debug logs by default; use timeboxed flags for temporary debugging.
  • Drop low-value endpoints: health checks, static asset requests, frequent polling.
  • Collapse repetitive logs: “request started” + “request finished” can be merged or sampled.
  • Prefer counters and metrics for high-frequency events instead of logs.
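As a sketch of source-side filtering using Python's standard `logging` module: a filter that drops DEBUG records and health-check lines before the handler ships them. The `/healthz` path convention is an assumption; substitute your own noisy endpoints.

```python
import logging

class DropNoise(logging.Filter):
    """Drop DEBUG records and health-check logs before they are shipped."""

    def filter(self, record):
        if record.levelno < logging.INFO:
            return False  # debug off by default; re-enable via a timeboxed flag
        if "/healthz" in record.getMessage():
            return False  # health checks carry no investigative value
        return True

handler = logging.StreamHandler()
handler.addFilter(DropNoise())
```

The same idea applies at the collector or agent layer (most shippers support drop rules); filtering there saves CPU and bandwidth on the application host too.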

2) Sample without losing signal

Sampling is safe when you keep the important paths unsampled:

  • Never sample: errors, warnings, security events, deployment markers.
  • Sample: successful requests on hot endpoints, verbose debug/info logs.
  • Keep traces: consider tracing for deep investigations instead of logging full payloads.
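A minimal head-sampling decision that follows these rules: never sample errors or warnings, and keep 1-in-N of everything else. Hashing a trace id (assumed to be present on every log line) makes the decision deterministic, so all logs of one request are kept or dropped together.

```python
import hashlib

def should_keep(level: str, trace_id: str, sample_rate: int = 10) -> bool:
    """Head-sampling decision: full fidelity for errors, 1-in-N otherwise.

    Assumes a trace_id is attached to every log line; the hash makes the
    choice deterministic per request rather than random per line.
    """
    if level in ("ERROR", "WARNING"):
        return True
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return digest % sample_rate == 0
```

If you report counts from sampled logs, remember to scale them back up by the sample rate, or move those counts to metrics instead.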

3) Retention tiers (keep what matters longer)

  • Hot: short retention for high-volume logs you rarely query.
  • Warm: medium retention for operational investigations.
  • Archive: long retention only for compliance and sparse high-value events.

A common mistake is uniform retention across everything; it turns low-value noise into a steady storage bill.
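The storage impact of tiering is easy to model: steady-state stored volume for a log group is roughly daily ingestion times retention days. The numbers below are illustrative assumptions, not vendor pricing.

```python
def stored_gb(daily_gb: float, retention_days: int) -> float:
    """Steady-state stored GB for one log group: daily ingestion x retention."""
    return daily_gb * retention_days

tiers = {  # illustrative numbers, not real pricing or volumes
    "access-logs":  {"daily_gb": 50.0, "retention_days": 7},    # hot, short
    "app-logs":     {"daily_gb": 10.0, "retention_days": 30},   # warm
    "audit-events": {"daily_gb": 0.5,  "retention_days": 365},  # archive
}
stored = {name: stored_gb(t["daily_gb"], t["retention_days"])
          for name, t in tiers.items()}
uniform = sum(t["daily_gb"] for t in tiers.values()) * 365  # everything kept a year
```

Comparing `sum(stored.values())` against `uniform` shows why flat long retention is so expensive: the high-volume, low-value group dominates the stored total.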

4) Fix the “incident multipliers”

  • Retry storms: retries multiply both traffic and logs; fix timeouts and backoff.
  • Deploy storms: temporary error spikes can create huge log bursts.
  • Bot/noise: abusive traffic creates large access log volume with little business value.
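For the retry-storm item, the standard fix is capped exponential backoff with jitter, sketched below under the assumption of a synchronous call site. Jitter spreads retries out so a shared outage does not turn into a synchronized retry storm (and a matching log storm).

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base=0.2, cap=5.0):
    """Retry fn with capped exponential backoff and full jitter.

    base/cap are illustrative defaults in seconds; tune per dependency.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # Full jitter: uniform over [0, min(cap, base * 2^attempt)]
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Log one summary line per exhausted retry loop rather than one line per attempt; that alone can cut incident-time log bursts dramatically.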

5) Metrics cardinality hygiene (avoid the silent series explosion)

  • Don’t label metrics with high-uniqueness fields (requestId, userId, full path, stack trace).
  • Bucket values (status family, route template, tenant tier) instead of raw identifiers.
  • Audit top label keys and explicitly ban the worst offenders.
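The bucketing bullet can be sketched as two small normalizers: collapse status codes into families, and collapse raw path identifiers into a route template. The regex patterns (numeric ids, long hex ids) are illustrative assumptions; match them to your URL scheme.

```python
import re

def status_family(code: int) -> str:
    """Bucket HTTP status codes into families, e.g. 204 -> '2xx'."""
    return f"{code // 100}xx"

def route_template(path: str) -> str:
    """Replace raw identifiers with placeholders so the label set stays small.

    The id patterns below are assumptions; adapt them to your routes.
    """
    path = re.sub(r"/\d+", "/:id", path)          # numeric ids
    path = re.sub(r"/[0-9a-f]{8,}", "/:hash", path)  # long hex ids
    return path
```

Labeling metrics with `status_family` and `route_template` output keeps series counts proportional to your route table, not to your traffic.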

Validate in a “before/after week” report

  • Ingestion GB/day by top sources (did the intended source actually drop?)
  • Scan/query volume (did you reduce searchable data without breaking investigations?)
  • Error detection (do you still catch and diagnose incidents fast?)
  • Metrics series count (did cardinality stabilize?)

FAQ

What usually drives logging cost spikes?
Ingestion volume and cardinality. Common causes are debug logs left on, noisy endpoints, retry storms during incidents, high-cardinality labels in metrics, and verbose access logs.
Is sampling safe?
Sampling is safest when you keep full fidelity for errors and important events, and sample low-value logs (debug/info) at the source. Validate that your core investigations still work.
What’s the best first lever?
Source-side filtering. Dropping low-value logs before ingestion reduces ingestion, retention, and query costs together.
How do I quantify savings quickly?
Estimate current ingestion GB/day and retention days. Model a reduction factor (e.g., -30% ingestion) and compare monthly costs before/after.
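The savings model from the answer above in a few lines of arithmetic. The price per GB is an illustrative assumption, not any vendor's rate; ingestion-only costs are modeled here (retention and query savings come on top).

```python
def monthly_ingestion_cost(gb_per_day: float, price_per_gb: float, days: int = 30) -> float:
    """Illustrative ingestion-only cost model; price_per_gb is an assumption."""
    return gb_per_day * price_per_gb * days

before = monthly_ingestion_cost(100, 0.50)         # 100 GB/day at $0.50/GB
after = monthly_ingestion_cost(100 * 0.70, 0.50)   # modeling a -30% reduction
savings = before - after
```

Because source-side filtering also shrinks retained and scanned data, the real saving is usually larger than this ingestion-only estimate.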

Last updated: 2026-01-27