BigQuery cost estimation: storage, bytes scanned, and the dashboard trap

Analytics warehouses are predictable when you treat them as three independent problems: storage, bytes processed (scans), and data movement (streaming in and exporting out). Most estimates go wrong by underestimating the scans driven by dashboards and scheduled jobs.

0) Pick your cost drivers (what to measure)

  • Stored data: average GB stored across the month (and growth).
  • Bytes processed: TB scanned/day by queries, dashboards, and scheduled jobs.
  • Ingestion/exports: streaming ingestion volume and any cross-region/internet export transfer.
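
As a minimal sketch, these drivers can be captured in one small structure before doing any math (the field names are ours for illustration, not a standard schema):

  from dataclasses import dataclass

  @dataclass
  class WarehouseCostDrivers:
      """Illustrative container for the three driver groups above."""
      stored_gb_month: float        # average GB stored across the month
      scanned_tb_per_day: float     # TB processed/day by all workloads
      streamed_gb_per_day: float    # streaming ingestion volume
      exported_gb_per_month: float  # cross-region/internet export volume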

1) Storage (GB-month) and growth

Storage is the baseline. Use average GB across the month, and model growth if you retain daily partitions.

Tool: Storage growth model.

  • If you have daily partitions, average GB tends to rise unless you enforce retention and deletion.
  • Separate raw tables from curated/derived tables; derived tables can double storage if you materialize aggressively.
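
A minimal sketch of the growth math, assuming linear daily partition growth and a hard retention cap (all numbers are placeholders):

  def average_stored_gb(start_gb: float, daily_growth_gb: float,
                        retention_days: int, days_in_month: int = 30) -> float:
      """Average GB stored over one month with daily partitions.

      Simplification: once retention_days partitions exist, daily deletes
      offset daily inserts, so size plateaus at daily_growth_gb * retention_days.
      """
      plateau = daily_growth_gb * retention_days
      size = start_gb
      sizes = []
      for _ in range(days_in_month):
          size = min(size + daily_growth_gb, plateau)
          sizes.append(size)
      return sum(sizes) / len(sizes)

  # Hypothetical: 10 GB/day of partitions retained 90 days, starting at 200 GB.
  # Size climbs toward a 900 GB plateau; the monthly average reflects the ramp.
  print(round(average_stored_gb(200, 10, 90)))  # ~355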

2) Bytes processed (query scans)

For scan-priced analytics, the practical model is: TB/day scanned = sum over workloads of (queries/day × TB/query). Keep baseline and peak as separate figures; the sketch after the list below turns the sum into numbers.

Tool: Scan cost model (same math: GB scanned × $/GB).

  • Dashboards: refresh frequency × number of panels sets queries/day, and the time window sets TB/query; together they often dominate TB scanned.
  • Scheduled jobs: hourly/daily jobs are often the largest consistent scanner.
  • Peak month: incident investigations and ad-hoc analysis can multiply query volume.
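
To make the dashboard trap concrete, here is a small sketch of the sum above, with invented workload numbers:

  def scanned_tb_per_day(workloads):
      """Sum TB/day over workloads; each entry is (queries_per_day, tb_per_query)."""
      return sum(q * tb for q, tb in workloads)

  # Hypothetical numbers. A 10-panel dashboard refreshing every minute fires
  # 10 * 1440 = 14,400 queries/day; at 2 GB/query it swamps everything else.
  dashboard   = (10 * 1440, 0.002)  # (queries/day, TB/query)
  hourly_jobs = (24, 0.05)
  ad_hoc      = (40, 0.01)
  incident    = (200, 0.01)         # ad-hoc spike in a peak month

  baseline = scanned_tb_per_day([dashboard, hourly_jobs, ad_hoc])
  peak     = scanned_tb_per_day([dashboard, hourly_jobs, incident])
  print(f"baseline {baseline:.1f} TB/day, peak {peak:.1f} TB/day")  # 30.4 / 32.0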

3) Partitioning and clustering (the biggest lever)

Most scan overages happen because queries read more data than intended. Partitioning and clustering are the primary levers to reduce bytes processed; the sketch after the list below shows the arithmetic.

  • Validate that common queries filter on the partition key (so partitions prune).
  • Avoid wide time windows by default (30–90 day dashboards scan huge datasets repeatedly).
  • If you materialize derived tables, ensure they are not re-scanned unnecessarily.
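
A rough sketch of why pruning matters, assuming uniformly sized daily partitions (numbers are purely illustrative):

  def bytes_scanned_gb(table_gb: float, days_retained: int,
                       days_in_filter: int, prunes: bool) -> float:
      """GB scanned by one query over a table of uniform daily partitions.

      If the query filters on the partition key, only the filtered days are
      read; otherwise the engine scans the whole table.
      """
      if not prunes:
          return table_gb
      return table_gb * days_in_filter / days_retained

  # Hypothetical 9 TB table retained for 90 days:
  print(bytes_scanned_gb(9000, 90, 7, prunes=True))   # 700.0 GB for a 7-day window
  print(bytes_scanned_gb(9000, 90, 7, prunes=False))  # 9000 GB: a full scan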

4) Streaming ingestion and exports

If you stream events in, model ingestion as its own volume line item. If you export results frequently (to object storage, to another region, to external systems), model export transfer separately.

Tools: Object storage, Egress.
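
A minimal sketch of both line items; the unit prices are placeholders, not quoted rates:

  def streaming_gb_per_day(events_per_day: float, bytes_per_event: float) -> float:
      """Streaming ingestion volume in GB/day: events/day * bytes/event."""
      return events_per_day * bytes_per_event / 1e9

  def export_cost(gb_per_month: float, price_per_gb: float) -> float:
      """Export/egress line item: GB moved out * per-GB rate (placeholder rate)."""
      return gb_per_month * price_per_gb

  # Hypothetical: 50M events/day at 1 KB each is ~50 GB/day streamed in;
  # 300 GB/month exported cross-region at an assumed $0.01/GB.
  print(streaming_gb_per_day(50e6, 1000))  # 50.0
  print(export_cost(300, 0.01))            # 3.0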

Worked estimate template (copy/paste)

  • Stored GB-month = average stored GB (include growth and derived tables)
  • Scanned TB/day = dashboards + scheduled jobs + ad-hoc queries (baseline + peak)
  • Streaming ingest = events/day × bytes/event (if applicable)
  • Exports = GB/month moved out (cross-region/internet)
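
The same template as one hedged calculation; every price below is a placeholder to replace with your provider's current regional rates:

  def monthly_estimate(stored_gb_month: float, scanned_tb_per_day: float,
                       ingest_gb_per_day: float, export_gb_per_month: float,
                       storage_per_gb: float = 0.02,  # $/GB-month, placeholder
                       scan_per_tb: float = 5.00,     # $/TB scanned, placeholder
                       ingest_per_gb: float = 0.01,   # $/GB streamed, placeholder
                       export_per_gb: float = 0.01) -> float:  # $/GB out, placeholder
      """Roll the four driver lines into one figure, assuming a 30-day month."""
      return (stored_gb_month * storage_per_gb
              + scanned_tb_per_day * 30 * scan_per_tb
              + ingest_gb_per_day * 30 * ingest_per_gb
              + export_gb_per_month * export_per_gb)

  # Using the hypothetical numbers from the sections above; scans dominate.
  print(f"${monthly_estimate(355, 30.4, 50, 300):,.0f}/month")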

Common pitfalls

  • Dashboards refreshing too frequently with wide time windows.
  • Queries that do not prune partitions (scan explosion).
  • Materializing derived tables without retention (storage doubles quietly).
  • Not modeling ad-hoc investigations and incident analysis (peak month).
  • Ignoring export/egress costs for large result sets.

How to validate

  • List the top 10 queries/dashboards and measure bytes processed per run (see the sketch after this list).
  • Validate refresh intervals and panel counts (queries multiply fast).
  • Validate partition pruning and clustering behavior on common filters.
  • Validate storage growth and retention (raw + derived tables).
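
One way to measure bytes processed per query is BigQuery's INFORMATION_SCHEMA jobs views. A minimal sketch with the google-cloud-bigquery client; the region qualifier and the 7-day lookback are assumptions to adjust:

  from google.cloud import bigquery

  client = bigquery.Client()  # uses your default project and credentials

  # Top 10 jobs by bytes processed over the last 7 days.
  # `region-us` is an assumption; match it to where your jobs actually run.
  sql = """
  SELECT
    user_email,
    LEFT(query, 80) AS query_head,
    total_bytes_processed
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    AND job_type = 'QUERY'
    AND state = 'DONE'
  ORDER BY total_bytes_processed DESC
  LIMIT 10
  """
  for row in client.query(sql).result():
      print(row.user_email, row.query_head, row.total_bytes_processed)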

FAQ

What usually drives analytics warehouse costs?
Bytes scanned (or slot usage) is often the biggest variable, while storage is the steady baseline. Inefficient queries, missing partition pruning, and dashboards refreshing frequently cause most surprises.
How do I estimate quickly?
Estimate stored GB and retention for storage, then estimate scanned TB/day for queries and scheduled jobs. Keep streaming ingestion and exports as separate line items.
What is the most common cost mistake?
Treating 'one dashboard' as free. A dashboard refreshing every minute triggers 1,440 refreshes/day, each running one query per panel, and can dominate bytes processed.
How do I validate?
Validate partitioning/clustering and ensure common queries prune partitions. Validate dashboard refresh rates and scheduled jobs, and measure bytes processed for the top queries.

Last updated: 2026-01-27