RDS snapshot retention policy: cost model and safe defaults
Snapshot retention is a trade-off between recovery objectives and cost. Most cost blowups come from long retention combined with high churn, plus manual snapshots that never expire.
1) Define what you actually need (RPO/RTO → retention)
- Operational recovery: typical restore windows (days to weeks).
- Compliance retention: long-term retention if required (months to years).
- RPO/RTO: how far back you must be able to restore, and how quickly.
If you can’t describe the use case for long-term retention (audit requirement, contract, policy), you probably don’t need it on every database.
2) Model cost with churn x retention
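When data changes steadily, backup storage grows roughly in proportion to daily changed GB x retention days. For example, 50 GB of daily change retained for 14 days is on the order of 700 GB of backup storage.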
Use a low and high churn scenario if you do not have strong measurements yet.
Related: Estimate backup GB-month.
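The churn x retention model above can be sketched in a few lines of Python. This is a rough linear approximation, not a billing calculator: the function name, the `BackupEstimate` type, and the example price of 0.10 per GB-month are all illustrative assumptions, so substitute measured churn and your region's actual backup storage price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupEstimate:
    storage_gb: float     # approximate steady-state backup storage (GB)
    monthly_cost: float   # storage_gb * price_per_gb_month


def estimate_backup(daily_changed_gb: float, retention_days: int,
                    price_per_gb_month: float) -> BackupEstimate:
    """Linear model from the text: storage ~ daily changed GB x retention days."""
    if daily_changed_gb < 0 or retention_days < 0 or price_per_gb_month < 0:
        raise ValueError("inputs must be non-negative")
    storage_gb = daily_changed_gb * retention_days
    return BackupEstimate(storage_gb, storage_gb * price_per_gb_month)


if __name__ == "__main__":
    # Hypothetical low and high churn scenarios for a 14-day retention window.
    # 0.10 is a placeholder price, not a quoted AWS rate.
    for label, churn_gb in (("low", 20), ("high", 80)):
        est = estimate_backup(churn_gb, retention_days=14, price_per_gb_month=0.10)
        print(f"{label}: {est.storage_gb:.0f} GB, ~{est.monthly_cost:.2f} per month")
```

Running both scenarios side by side gives a cost range rather than a single number, which is usually more honest when churn has not been measured yet.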
3) Avoid the common retention traps (the “silent” costs)
- Same retention everywhere (dev/staging backups that linger for months).
- Manual snapshots without a lifecycle policy.
- Frequent snapshots for fast-changing datasets without cost guardrails.
- Cross-region copies that are never cleaned up after a project ends.
4) Use two tiers: short operational + targeted long-term
Keep the short tier for day-to-day recovery. Add long-term retention only where required and keep it scoped (critical databases, monthly snapshots, etc.).
- Example operational tier: prod 7–14 days, staging 3–7 days, dev 1–3 days.
- Example long-term tier: monthly snapshots kept for 6–12 months, only for regulated or business-critical databases.
- Prefer explicit ownership: tag snapshots with owner/team and enforce lifecycle rules so “no owner” snapshots expire.
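One way to express the two tiers and the ownership rule as an expiry check is sketched below. The retention numbers take the upper bound of each example range above. The `Snapshot` type, the `tier` and `owner` fields, and the 7-day grace period for unowned snapshots are hypothetical choices for illustration, not an AWS feature.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Upper bounds of the example ranges in the text; adjust to your own policy.
OPERATIONAL_RETENTION_DAYS = {"prod": 14, "staging": 7, "dev": 3}
LONG_TERM_RETENTION_DAYS = 365  # monthly snapshots kept for up to about 12 months


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    created_at: datetime
    environment: str             # "prod", "staging", or "dev"
    tier: str = "operational"    # "operational" or "long_term"
    owner: Optional[str] = None  # value of an owner/team tag, if present


def is_expired(snapshot: Snapshot, now: datetime,
               unowned_grace_days: int = 7) -> bool:
    """Return True if the snapshot has outlived the policy for its tier."""
    age = now - snapshot.created_at
    # Snapshots with no owner expire after a short grace period (assumed value).
    if not snapshot.owner:
        return age > timedelta(days=unowned_grace_days)
    if snapshot.tier == "long_term":
        return age > timedelta(days=LONG_TERM_RETENTION_DAYS)
    if snapshot.environment not in OPERATIONAL_RETENTION_DAYS:
        # Refuse to guess a retention window for an unknown environment.
        raise ValueError(f"unknown environment: {snapshot.environment!r}")
    limit_days = OPERATIONAL_RETENTION_DAYS[snapshot.environment]
    return age > timedelta(days=limit_days)
```

The check only decides whether a snapshot violates policy. Whether to delete automatically or just report is a separate decision, and reporting first is the safer starting point.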
Cost guardrails (prevent retention from drifting)
- Schedule a monthly review: list snapshots by age and owner, and delete anything that violates policy.
- Alert on backup storage growth (GB-month) by account/environment so drift is visible within days, not quarters.
- Require a reason for exceptions (long retention) and tie it to an audit ticket or compliance requirement so it can be revisited.
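The growth alert can be approximated with a simple window comparison. The sketch below assumes you already have one backup storage reading (in GB) per day, oldest first, from whatever source you trust (billing export, metrics, or a manual log). The function names, the 7-day window, and the 20% threshold are illustrative assumptions.

```python
from typing import Sequence


def storage_growth_pct(daily_gb: Sequence[float], window_days: int = 7) -> float:
    """Percent change between the latest reading and the one window_days earlier."""
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    if len(daily_gb) <= window_days:
        raise ValueError("need more than window_days daily readings")
    baseline = daily_gb[-1 - window_days]
    latest = daily_gb[-1]
    if baseline <= 0:
        raise ValueError("baseline reading must be positive")
    return (latest - baseline) / baseline * 100.0


def should_alert(daily_gb: Sequence[float], window_days: int = 7,
                 threshold_pct: float = 20.0) -> bool:
    """True when backup storage grew more than threshold_pct over the window."""
    return storage_growth_pct(daily_gb, window_days) > threshold_pct
```

Running this per account or environment keeps a noisy dev account from hiding drift in prod, and a weekly window surfaces retention drift within days rather than at the next quarterly review.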
Validation checklist (don’t shorten retention blindly)
- Test restore workflows (PITR / snapshot restore) for the retention window you propose.
- Use Cost Explorer to compare backup storage GB-month before/after the policy change.
- Audit manual snapshots monthly (or automate cleanup) so they can’t accumulate indefinitely.
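The manual snapshot audit can be automated as a report-only script. The sketch below assumes boto3 is installed and credentials are configured. It uses the RDS `describe_db_snapshots` paginator with `SnapshotType="manual"` and reads the `owner` tag from each snapshot's `TagList`. The `owner` tag key and the function names are assumptions, and nothing here deletes anything.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, Iterator, List, Optional


def stale_manual_snapshots(snapshots: Iterable[Dict[str, Any]],
                           max_age_days: int,
                           now: Optional[datetime] = None) -> List[Dict[str, Any]]:
    """Return snapshots older than max_age_days, oldest first.

    Each input dict is shaped like an entry of DBSnapshots in the
    describe_db_snapshots response.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for snap in snapshots:
        created = snap.get("SnapshotCreateTime")
        # Snapshots still being created may not have a creation time yet.
        if created is None or created >= cutoff:
            continue
        tags = {t.get("Key"): t.get("Value", "") for t in snap.get("TagList", [])}
        stale.append({
            "id": snap["DBSnapshotIdentifier"],
            "age_days": (now - created).days,
            "owner": tags.get("owner") or "(no owner)",  # assumed tag key
        })
    return sorted(stale, key=lambda s: s["age_days"], reverse=True)


def fetch_manual_snapshots(region_name: str) -> Iterator[Dict[str, Any]]:
    """Yield all manual DB snapshots in one region (requires AWS credentials)."""
    import boto3  # imported lazily so the function above has no AWS dependency

    rds = boto3.client("rds", region_name=region_name)
    paginator = rds.get_paginator("describe_db_snapshots")
    for page in paginator.paginate(SnapshotType="manual"):
        yield from page["DBSnapshots"]
```

A typical use would be `stale_manual_snapshots(fetch_manual_snapshots("us-east-1"), max_age_days=90)`, reviewed by a person before any cleanup. Aurora cluster snapshots are listed through a separate API call and are not covered by this sketch.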
FAQ
What retention policy keeps costs predictable?
Use short operational retention (days to weeks) and keep long-term retention only where required. Model costs using churn x retention and validate with real snapshot growth.
Why do manual snapshots often create surprise bills?
Because they can accumulate without a lifecycle policy. Long-lived manual snapshots can quietly dominate backup GB-month over time.
Should every environment have the same retention?
Usually no. Prod often needs longer operational retention, while dev/staging can use much shorter retention to avoid paying for non-critical history.
How do I pick a safe default if I'm unsure?
Start with a modest operational retention window (for example, 7–14 days), implement a lifecycle policy for long-term retention, and validate restore needs with real incident and recovery data.
Last updated: 2026-01-27