Skip to content

Cloud SQL Scaling and Reliability

Scaling, backup, high availability, and monitoring guidance for Cloud SQL Postgres beyond the baseline template configuration.

The template provisions a minimal Cloud SQL instance (db-custom-1-3840, Enterprise edition, private IP only, IAM database auth). This document covers what to change as your workload grows, and when each option becomes worth the cost.

Baseline Configuration

The template's database.tf creates:

  • Instance tier: db-custom-1-3840 (1 shared vCPU, 3.75 GB RAM)
  • Edition: Enterprise
  • Availability: Zonal (single zone, no automatic failover)
  • Backups: Daily at 03:00 UTC, 7-day retention, point-in-time recovery enabled
  • Maintenance window: Sunday 06:00 UTC, stable update track (offset from backup window)
  • Connection pooling: None (application connects directly through Auth Proxy sidecar)
  • Networking: Private IP only, enforced Auth Proxy, enforced TLS, IAM database auth

All scaling options below are additive changes to database.tf. The template does not include them because the right configuration depends on your workload, budget, and availability requirements.

Instance Tier Scaling

When to upgrade: CPU or memory utilization consistently exceeds 70%, query latency increases, or connection count approaches the tier's limit.

Tier Options

Cloud SQL uses custom machine types with the format db-custom-{vCPUs}-{memoryMB}:

Tier vCPUs RAM Max Connections Approximate Monthly Cost
db-custom-1-3840 (baseline) 1 (shared) 3.75 GB ~100 ~$50
db-custom-2-7680 2 7.5 GB ~200 ~$100
db-custom-4-15360 4 15 GB ~400 ~$200
db-custom-8-30720 8 30 GB ~800 ~$400

Note

Costs are approximate for us-central1 with Enterprise edition. Actual costs vary by region and sustained use discounts. Check Cloud SQL pricing for current rates.

Connection limits are approximate. Cloud SQL imposes its own per-tier limits below the theoretical maximum (RAM / ~10 MB per connection) due to reserved memory for system processes, shared buffers, and background workers.

How to Change

Update the tier field in your database.tf:

settings {
  edition = "ENTERPRISE"
  tier    = "db-custom-2-7680"  # 2 vCPUs, 7.5 GB RAM
  # ...
}

Recommendation: Start by upgrading the instance tier before adding connection pooling or high availability. Tier upgrades are the simplest scaling lever and typically complete in under 5 minutes (brief restart required).

Automated Backups

The template enables daily backups with 7-day retention and point-in-time recovery (PITR) by default. PITR uses write-ahead logs to restore to any point within the retention window.

Backup and Maintenance Window Scheduling

The backup window (start_time) defines the start of a 4-hour window during which the backup begins. The template sets backups at 03:00 UTC and maintenance at 06:00 UTC Sunday to avoid overlap — maintenance involves a brief restart (~5–10 minutes, <30s connectivity loss for Enterprise edition) that could interrupt a backup in progress. Google's documentation does not explicitly address this interaction, but the maintenance overview states that "maintenance is canceled if an instance operation, such as an export, is ongoing" and advises to "ensure that no other instance operations are planned when maintenance is scheduled." Whether an automated backup qualifies as an "instance operation" in this context is unstated — offsetting the windows avoids the question entirely. For large databases where backups may exceed the 4-hour window, consider increasing the offset.

Adjusting Retention

The template's 7-day retention covers most workloads. To increase retention for production, update database.tf:

Update the backup_configuration block inside settings in your database.tf:

settings {
  # ... existing settings ...

  backup_configuration {
    enabled                        = true
    point_in_time_recovery_enabled = true
    start_time                     = "03:00"

    # Increase retention for production
    transaction_log_retention_days = 14
    backup_retention_settings {
      retained_backups = 14
    }
  }
}

Cost impact: Backup storage is billed at the standard Cloud SQL storage rate. PITR retains transaction logs, which adds storage proportional to write volume. For a low-traffic session database, expect minimal additional cost.

Retention trade-offs:

  • 7 days (template default): Sufficient for most workloads. Covers accidental deletes or corruption discovered within a week.
  • 14 days: Better safety margin for issues discovered late. Recommended for production.
  • Longer retention: Increases storage cost linearly. Rarely needed for session data — consider database exports for long-term archival instead.

High Availability

When to enable: Production environments where downtime is unacceptable. Do not enable for dev or staging unless you are testing HA failover behavior.

Regional high availability creates a standby instance in a different zone within the same region. If the primary fails, Cloud SQL automatically promotes the standby. Failover typically completes in under 60 seconds.

How to Enable

Change availability_type in your database.tf:

resource "google_sql_database_instance" "sessions" {
  # ... existing config ...

  settings {
    availability_type = "REGIONAL"  # default is "ZONAL"
    # ... existing settings ...
  }
}

Warning

Regional HA approximately doubles the instance cost because Cloud SQL runs a full standby replica. A db-custom-1-3840 instance goes from ~$50/month to ~$100/month.

What HA covers:

  • Zone-level outages (hardware failure, zone maintenance)
  • Instance crashes (automatic restart on standby)
  • Planned maintenance (minimal downtime with maintenance windows)

What HA does not cover:

  • Data corruption (use backups for this)
  • Region-level outages (use cross-region read replicas if needed)
  • Application-level errors (bad queries, accidental deletes)

Application impact: Failover causes a brief connection interruption. The Auth Proxy sidecar reconnects automatically. ADK's DatabaseSessionService uses pool_pre_ping=True (auto-set for non-SQLite), which validates connections before use and discards stale ones. No application code changes are needed.

Managed Connection Pooling

When to enable: When autoscaling Cloud Run to many concurrent instances (roughly 10+) causes connection exhaustion. Not needed for single-instance or low-scale deployments.

Each Cloud Run instance runs its own Auth Proxy sidecar, and each sidecar opens a separate connection pool to Cloud SQL. With ADK's default SQLAlchemy settings (pool_size=5, max_overflow=10), each instance can open up to 15 connections. At 10 Cloud Run instances, that is 150 connections — which may exceed the tier's limit.

Prerequisites

Managed connection pooling has specific requirements:

  • Cloud SQL Enterprise Plus edition (not Enterprise) — higher base cost
  • Cloud SQL Auth Proxy >= 2.15.2 — earlier versions do not support the pooling endpoint
  • Compatible with IAM database auth — pooling is transparent to the application, which continues connecting to localhost:5432 through the Auth Proxy

How to Enable

  1. Upgrade to Enterprise Plus edition in database.tf:
settings {
  edition = "ENTERPRISE_PLUS"
  tier    = "db-custom-2-16384"  # Enterprise Plus requires minimum 2 vCPUs, 16 GB RAM
  # ...
}
  1. Enable connection pooling via the Cloud SQL console or Terraform:
settings {
  # ... existing settings ...

  managed_connection_pooling {
    enabled = true
  }
}
  1. No application code changes needed. Managed connection pooling is transparent to the application and proxy configuration. Continue connecting to localhost:5432.

Important

Enterprise Plus edition has a significantly higher base cost than Enterprise. Evaluate whether upgrading the instance tier (more connections per instance) is sufficient before switching editions. For many workloads, a db-custom-4-15360 on Enterprise (~$200/month) handles more connections than a minimum Enterprise Plus instance.

Decision Framework

Use this sequence to address connection scaling:

  1. Upgrade instance tier — cheapest, simplest. Increase RAM to support more connections.
  2. Tune SQLAlchemy pool settings — reduce pool_size and max_overflow in application code if connections are underutilized.
  3. Enable managed connection pooling — when tier upgrades are no longer cost-effective or you need 50+ Cloud Run instances.

Monitoring

Track these Cloud SQL metrics in Cloud Monitoring to anticipate scaling needs before they become incidents.

Key Metrics

Connections:

  • cloudsql.googleapis.com/database/postgresql/num_backends — active connection count. Compare against the tier's max connections. Alert at 70% utilization.
  • cloudsql.googleapis.com/database/network/connections — total connection attempts including failed ones. Spikes indicate connection exhaustion.

CPU:

  • cloudsql.googleapis.com/database/cpu/utilization — CPU usage as a fraction (0.0 to 1.0). Sustained values above 0.7 indicate a tier upgrade is needed.
  • cloudsql.googleapis.com/database/cpu/reserved_cores — number of vCPUs reserved. Useful for confirming tier configuration.

Memory:

  • cloudsql.googleapis.com/database/memory/utilization — memory usage fraction. PostgreSQL uses memory for shared buffers, connection overhead, and sort/hash operations. Alert at 80%.
  • cloudsql.googleapis.com/database/memory/usage — absolute bytes used.

Disk:

  • cloudsql.googleapis.com/database/disk/utilization — disk usage fraction. Cloud SQL auto-grows storage by default, but monitor to avoid surprises.
  • cloudsql.googleapis.com/database/disk/write_ops_count — write IOPS. High values may indicate insufficient disk throughput.

Replication (if using HA):

  • cloudsql.googleapis.com/database/replication/replica_lag — lag between primary and standby in seconds. Should be near zero under normal operation.

Alerting Recommendations

Create Cloud Monitoring alert policies for production:

Metric Condition Action
CPU utilization > 0.7 for 15 minutes Evaluate tier upgrade
Memory utilization > 0.8 for 15 minutes Evaluate tier upgrade
Connection count > 70% of tier max for 5 minutes Check for connection leaks, evaluate scaling
Disk utilization > 80% Review storage growth, consider cleanup
Replica lag (HA only) > 10 seconds for 5 minutes Investigate replication health

Sources


Back to References | Documentation