CloudWatch Logs Insights cost optimization (reduce GB scanned)

Logs Insights pricing is a function of GB scanned per query. To reduce cost, reduce how many bytes your queries scan. The highest-leverage fixes are almost always about query shape and log organization, not about deleting logs blindly.

Step 0: measure before optimizing

  • Identify the top 10 queries by GB scanned (dashboards + ad-hoc investigations).
  • Measure scanned GB/day for a representative week, including incident days.
  • Identify which log groups drive scanning (noisy vs high-signal).
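
To see which log groups drive volume, one rough sketch is to run a single Insights query across the candidate groups and sum approximate message sizes per group (strlen undercounts true ingested bytes, but the ranking is usually good enough to find the noisy groups):

fields @log
| stats count(*) as events, sum(strlen(@message)) as approx_bytes by @log
| sort approx_bytes desc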

High-leverage levers

  • Time-bound every query: avoid “last 30 days” defaults; start with 15–60 minutes and expand only if needed.
  • Filter early: restrict by service, path, status code, or request id before expensive parsing and sorting.
  • Split noisy logs: keep debug/verbose logs separate so you don’t scan them during normal operations.
  • Reduce repeated scans: dashboards should query small windows; incident playbooks should avoid scanning “all time” repeatedly.
  • Make queries reusable: save “golden queries” that are scoped and time-bounded to avoid ad-hoc broad scans.
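
As a sketch of "filter early", match a cheap pattern before parsing structured fields; the field names level and requestId here are hypothetical examples, not fields your logs necessarily have:

fields @timestamp, @message
| filter @message like /"level":"ERROR"/
| parse @message '"requestId":"*"' as requestId
| stats count(*) as errors by requestId
| sort errors desc
| limit 20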

Query patterns that usually scan less

  • Start narrow: 15 minutes, one log group, one service, then widen if the signal is missing.
  • Filter before parsing: restrict by a simple substring or status code early, then parse JSON fields.
  • Scope log groups: select only the groups relevant to the question instead of everything you have.
  • Use “top N” sparingly: sorting large ranges can encourage bigger scans; scope first.

Example query shape (pseudocode):

# time range is set outside the query; keep it narrow (e.g. 15 minutes)
fields @timestamp, @message
# cheap pattern match first, before any parsing or sorting
| filter @message like /ERROR/
| filter service = "payments"
| sort @timestamp desc
| limit 50
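
For "top N" questions, a scoped aggregation over a short window usually beats sorting raw events over a long range. A sketch (the error pattern and 5-minute bin size are illustrative):

fields @timestamp
| filter @message like /ERROR/
| stats count(*) as errors by bin(5m)
| sort errors desc
| limit 10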

Operational guardrails

  • Dashboards: default to short windows (15m/1h) and avoid auto-refresh on long ranges.
  • Incident playbooks: prefer “narrow then widen” steps instead of repeated “search everything”.
  • Ownership: track who owns the top scanning dashboards and review them quarterly.
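
A "narrow then widen" playbook step can be written down as the query to run first, with widening as a fallback. A sketch, where the payments substring is a placeholder and the time range is set outside the query text (console or API call):

# step 1: run over the last 15 minutes, one log group
fields @timestamp, @message
| filter @message like /ERROR/ and @message like /payments/
| sort @timestamp desc
| limit 20
# step 2: only if this returns nothing, widen to 1 hour, then to more
# log groups, keeping the same filters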

Common cost traps

  • Wide time ranges: scanning 7 days instead of 1 hour is up to a 168× (7 × 24) jump in scanned GB.
  • Dashboards as scanners: auto-refresh dashboards can re-scan the same large dataset all day.
  • Unbounded search patterns: “find error anywhere” across all groups becomes expensive at scale.
  • Noisy log groups: high-volume success logs dominate scan volume for simple questions.
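
To put numbers on the wide-range and auto-refresh traps, assume roughly $0.005 per GB scanned (a common list price; check your region's pricing) and a service ingesting about 2 GB of logs per hour:

1 hour window  →   2 GB scanned  ≈ $0.01 per run
7 day window   → 336 GB scanned  ≈ $1.68 per run
7 day dashboard auto-refreshing every 5 minutes → 288 runs/day ≈ $484/day

The query itself is unchanged in all three rows; only the window and the refresh cadence differ.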

Validation checklist

  • For every “optimized query”, compare scanned GB before vs after (each run reports bytes scanned in its query statistics).
  • Confirm dashboards are time-bounded and do not scan long ranges on auto-refresh.
  • Confirm noisy logs are separated or sampled so routine queries scan high-signal data.
  • After changes, validate incident workflows still work (debug ability preserved).

FAQ

What drives Logs Insights cost?
GB scanned by queries. Frequent broad queries over long time ranges are the fastest way to create a large bill.
How do I reduce GB scanned without losing debug ability?
Make queries time-bounded, filter early, split noisy logs into separate groups, and avoid dashboards that re-scan large ranges continuously.

Last updated: 2026-01-27