BigQuery cost estimation: storage, bytes scanned, and the dashboard trap
Analytics warehouses are predictable when you treat them as three independent problems: storage, bytes processed (scans), and data movement (streaming in and exporting out). Most estimates miss because teams underestimate scans driven by dashboards and scheduled jobs.
0) Pick your cost drivers (what to measure)
- Stored data: average GB stored across the month (and growth).
- Bytes processed: TB scanned/day by queries, dashboards, and scheduled jobs.
- Ingestion/exports: streaming ingestion volume and any cross-region/internet export transfer.
1) Storage (GB-month) and growth
Storage is the baseline. Use average GB across the month, and model growth if you retain daily partitions.
Tool: Storage growth model.
- If you have daily partitions, average GB tends to rise unless you enforce retention and deletion.
- Separate raw tables from curated/derived tables; derived tables can double storage if you materialize aggressively.
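The growth model above can be sketched in a few lines. This assumes linear daily growth and an illustrative storage price; `price_per_gb` and the example numbers are placeholders, not current list prices:

```python
def storage_gb_month(start_gb, daily_growth_gb, days=30, price_per_gb=0.02):
    # Average GB stored across the month, assuming linear daily growth.
    # price_per_gb is an illustrative placeholder, not a quoted rate.
    daily = [start_gb + d * daily_growth_gb for d in range(days)]
    avg_gb = sum(daily) / days
    return avg_gb, avg_gb * price_per_gb

# Hypothetical table: 1,000 GB today, growing 10 GB/day of new partitions.
avg_gb, monthly_cost = storage_gb_month(start_gb=1_000, daily_growth_gb=10)
```

Note that billing uses the *average* across the month, so a table that ends the month at 1,290 GB is billed closer to its mid-month size.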
2) Bytes processed (query scans)
For scan-priced analytics, the practical model is: TB/day scanned = sum over workloads of (queries/day × TB/query). Track baseline and peak separately.
Tool: Scan cost model (same math: GB scanned × $/GB).
- Dashboards: refresh interval × number of panels × time window can dominate TB scanned.
- Scheduled jobs: hourly/daily jobs are often the largest consistent scanner.
- Peak month: incident investigations and ad-hoc analysis can multiply query volume.
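The workload sum above can be sketched as follows. The panel counts, refresh interval, and per-query TB figures are hypothetical, chosen only to show how a dashboard multiplies into queries/day:

```python
def queries_per_day(panels, refresh_minutes):
    # Each panel issues one query per refresh.
    return panels * (24 * 60 // refresh_minutes)

def daily_scan_tb(workloads):
    # workloads: iterable of (queries_per_day, tb_per_query) pairs.
    return sum(q * tb for q, tb in workloads)

# Hypothetical mix: a 12-panel dashboard on a 15-minute refresh,
# 24 hourly jobs at 0.5 TB each, and ~40 ad-hoc queries at 0.1 TB.
dash_q = queries_per_day(panels=12, refresh_minutes=15)
baseline_tb = daily_scan_tb([(dash_q, 0.02), (24, 0.5), (40, 0.1)])
```

Even with a small 0.02 TB per panel query, the dashboard alone contributes over half the daily scan in this mix, which is the "dashboard trap" in the title.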
3) Partitioning and clustering (the biggest lever)
Most scan overages happen because queries read more data than intended. Partitioning and clustering are the primary levers to reduce bytes processed.
- Validate that common queries filter on the partition key (so partitions prune).
- Avoid wide time windows by default (30–90 day dashboards scan huge datasets repeatedly).
- If you materialize derived tables, ensure they are not re-scanned unnecessarily.
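The effect of pruning can be approximated with a simple ratio, assuming roughly uniform partition sizes (a simplification; real partitions vary):

```python
def scanned_gb(table_gb, total_partitions, partitions_read):
    # Assuming roughly uniform partition sizes, bytes scanned
    # scale with the number of partitions actually read.
    return table_gb * partitions_read / total_partitions

# Hypothetical table: 3,650 GB across 365 daily partitions.
no_pruning = scanned_gb(3_650, 365, 365)  # query ignores the partition key
pruned = scanned_gb(3_650, 365, 7)        # filters on the key, 7-day window
```

The same query pattern scans 50x less when it prunes to a week, which is why validating filters on the partition key comes before any pricing tweak.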
4) Streaming ingestion and exports
If you stream events in, model ingestion as its own volume line item. If you export results frequently (to object storage, to another region, to external systems), model export transfer separately.
Tools: Object storage, Egress.
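Both line items reduce to volume × unit price. A minimal sketch, with hypothetical event counts and decimal GB (swap in 1024**3 if your provider bills binary GiB):

```python
def streaming_gb_per_day(events_per_day, bytes_per_event):
    # Decimal GB; use 1024**3 instead of 1e9 for binary GiB billing.
    return events_per_day * bytes_per_event / 1e9

# Hypothetical: 50M events/day at ~1 KB each.
ingest_gb = streaming_gb_per_day(events_per_day=50_000_000, bytes_per_event=1_000)

# Exports are a separate line: GB/month moved cross-region or to the internet.
export_gb_month = 200
```

Keeping ingest and export as separate line items matters because they are usually priced differently and grow independently.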
Worked estimate template (copy/paste)
- Stored GB-month = average stored GB (include growth and derived tables)
- Scanned TB/day = dashboards + scheduled jobs + ad-hoc queries (baseline + peak)
- Streaming ingest = events/day × bytes/event (if applicable)
- Exports = GB/month moved out (cross-region/internet)
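The four template lines combine into one monthly estimate. All unit prices below are illustrative placeholders, not current list prices; substitute your provider's actual rates:

```python
def monthly_estimate(stored_gb_month, scanned_tb_day, ingest_gb_day=0.0,
                     export_gb_month=0.0, days=30,
                     storage_per_gb=0.02, scan_per_tb=5.0,
                     ingest_per_gb=0.01, egress_per_gb=0.12):
    # Unit prices are illustrative placeholders, not quoted rates.
    return {
        "storage": stored_gb_month * storage_per_gb,
        "scans": scanned_tb_day * days * scan_per_tb,
        "streaming": ingest_gb_day * days * ingest_per_gb,
        "exports": export_gb_month * egress_per_gb,
    }

est = monthly_estimate(stored_gb_month=1_145, scanned_tb_day=39,
                       ingest_gb_day=50, export_gb_month=200)
```

With these hypothetical inputs, scans dominate the total by two orders of magnitude over storage, which is the usual shape for scan-priced warehouses.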
Common pitfalls
- Dashboards refreshing too frequently with wide time windows.
- Queries that do not prune partitions (scan explosion).
- Materializing derived tables without retention (storage doubles quietly).
- Not modeling ad-hoc investigations and incident analysis (peak month).
- Ignoring export/egress costs for large result sets.
How to validate
- List top 10 queries/dashboards and measure bytes processed per run.
- Validate refresh intervals and panel counts (queries multiply fast).
- Validate partition pruning and clustering behavior on common filters.
- Validate storage growth and retention (raw + derived tables).