Fail over a managed instance v1.1 to another zone

SOC2/CI-110

In MI v1.1, all resources are still zonal, and managed instances are not automatically resilient to zone-wide failures. If a GCP zone fails, resources can be failed over to another zone, but the process is manual and tedious. Perform the failover in the following order: Cloud SQL first, then the compute instance (VM).

Fail over Cloud SQL

Use cases:

  • The current Cloud SQL zone is down

Edit Cloud SQL location

Open sourcegraph/deploy-sourcegraph-managed and edit $CUSTOMER/terraform.tfvars. Add the following override, replacing the zone with one that is currently available in the same region:

cloud_sql_zone = "us-central1-a"
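To find a candidate zone, it may help to list the zones in the same region that gcloud currently reports as UP. The following is a minimal sketch, not part of the documented process: `region_of_zone` and `list_up_zones` are hypothetical helper names, and it assumes the gcloud CLI is authenticated with a default project set. The reported status may not reflect an ongoing incident, so treat the result as a hint and cross-check it against the GCP status page.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative and not part of the mg tooling.

# Derive the region from a zone name, e.g. us-central1-a -> us-central1.
region_of_zone() {
  printf '%s\n' "${1%-*}"
}

# List zones in the same region as the failed zone that gcloud reports as UP,
# excluding the failed zone itself. Uses the default gcloud project.
list_up_zones() {
  local failed_zone="$1"
  local region
  region="$(region_of_zone "$failed_zone")"
  gcloud compute zones list \
    --filter="name~^${region}- AND status=UP" \
    --format="value(name)" | grep -v -x -F "$failed_zone"
}

# Usage (assumes the failed zone is us-central1-f):
#   list_up_zones us-central1-f
```

Any zone printed by `list_up_zones` is a candidate value for `cloud_sql_zone`.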

Apply the changes

terraform apply

It will take some time for Cloud SQL to move to the new zone.
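One way to confirm that the move has finished is to poll the zone reported for the instance. This is only a sketch under assumptions: the helper names are hypothetical, the Cloud SQL instance name must be looked up first (for example with `gcloud sql instances list`), `$PROJECT_ID` is assumed to hold the customer's GCP project ID, and the `gceZone` field is assumed to update once the move completes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative and not part of the mg tooling.

# Print the zone that Cloud SQL currently reports for an instance.
cloud_sql_zone() {
  local instance="$1" project="$2"
  gcloud sql instances describe "$instance" \
    --project "$project" \
    --format="value(gceZone)"
}

# Poll until the instance reports the target zone.
# Arguments: instance, project, target zone, max attempts (default 40),
# delay between attempts in seconds (default 30).
wait_for_cloud_sql_zone() {
  local instance="$1" project="$2" target_zone="$3"
  local attempts="${4:-40}" delay="${5:-30}"
  local i=1 zone
  while [ "$i" -le "$attempts" ]; do
    zone="$(cloud_sql_zone "$instance" "$project")" || zone=""
    if [ "$zone" = "$target_zone" ]; then
      echo "Cloud SQL instance $instance now reports zone $zone"
      return 0
    fi
    echo "attempt $i/$attempts: current zone is ${zone:-unknown}" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage (SQL_INSTANCE is an assumed variable holding the instance name):
#   wait_for_cloud_sql_zone "$SQL_INSTANCE" "$PROJECT_ID" us-central1-a
```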

Commit your change and open a Pull Request.

Fail over the compute instance (VM)

Configure environment variables

eval $(mg --customer $CUSTOMER workon)
export PROJECT_ID=$PROJECT_PREFIX-$CUSTOMER
export INSTANCE_NAME=default-$OLD_DEPLOYMENT-instance

Locate the most recent snapshot of the current data disk and note its name as SNAPSHOT_NAME. Because we use a blue/green model for some infrastructure changes, the list may also contain snapshots of the previous instance's data disk. Make sure to use the snapshot that belongs to the last active instance.

gcloud compute snapshots list --project $PROJECT_ID
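To narrow the list to the newest snapshot of a particular disk, the output can be filtered and sorted. A minimal sketch, assuming the data disk name of the last active instance contains a recognizable pattern; `latest_snapshot` is a hypothetical helper, and using `$OLD_DEPLOYMENT` as the pattern is an assumption about disk naming. Check the result against the full list before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name is illustrative and not part of the mg tooling.

# Print the name of the newest snapshot whose source disk matches a pattern.
# The pattern is an assumption: adjust it so that it matches only the data disk
# of the last active instance.
latest_snapshot() {
  local project="$1" disk_pattern="$2"
  gcloud compute snapshots list \
    --project "$project" \
    --filter="sourceDisk~${disk_pattern}" \
    --sort-by="~creationTimestamp" \
    --limit=1 \
    --format="value(name)"
}

# Usage:
#   export SNAPSHOT_NAME="$(latest_snapshot "$PROJECT_ID" "$OLD_DEPLOYMENT")"
#   echo "$SNAPSHOT_NAME"
```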

Follow the machine upgrade process to complete the failover, with the following changes:

  • Create the NEW_DEPLOYMENT instance from the latest snapshot, $SNAPSHOT_NAME
  • Change the zone to a working zone
  • Temporarily modify the GCP backend service resource so that it stops referencing the existing network endpoint resource, which allows the network endpoint resource to be moved to the new zone
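Before starting the machine upgrade process, a quick pre-flight check can catch a missing variable or an unchanged zone. This is a sketch only: `check_failover_inputs` is a hypothetical helper, and `OLD_ZONE` and `NEW_ZONE` are assumed variable names that the process above does not define.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; OLD_ZONE and NEW_ZONE are assumed variable names.

# Return 0 only if every required variable is set and the target zone differs
# from the failed zone. Problems are reported on stderr.
check_failover_inputs() {
  local missing=0 name value
  for name in PROJECT_ID INSTANCE_NAME SNAPSHOT_NAME OLD_ZONE NEW_ZONE; do
    # Indirect lookup of the variable named by $name (names come from the fixed list above).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  if [ "$missing" -eq 0 ] && [ "$OLD_ZONE" = "$NEW_ZONE" ]; then
    echo "NEW_ZONE must differ from the failed zone ($OLD_ZONE)" >&2
    return 1
  fi
  return "$missing"
}

# Usage:
#   export OLD_ZONE=us-central1-f NEW_ZONE=us-central1-a
#   check_failover_inputs && echo "inputs look complete"
```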