Restore a managed instance v1.1 when it is completely gone
SOC2/CI-110
Are you experiencing a zonal failure? Follow the failover process
In MI v1.1, some resources now run outside the VM (i.e., Cloud SQL), so the restoration has multiple stages. Perform the restoration in the following order.
Restoring Cloud SQL
Use cases:
- Cloud SQL data is corrupted by a broken database migration
- Cloud SQL data is deleted
Restore from automated backup
The process below is derived from the GCP documentation.
The restoration is performed with gcloud. Learn more about why not terraform?
Locate the SQL instance and note its name as SQL_INSTANCE:
gcloud sql instances list --project $PROJECT_ID
List all backups and note the ID of the latest SUCCESSFUL backup (or the last one taken before the database state was corrupted) as SQL_BACKUP_ID:
gcloud sql backups list --instance $SQL_INSTANCE --project $PROJECT_ID
Restore the backup to the current instance.
gcloud sql backups restore $SQL_BACKUP_ID --restore-instance $SQL_INSTANCE --project $PROJECT_ID
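As a minimal sketch, the backup lookup above could be scripted so that the newest SUCCESSFUL backup ID is selected automatically. The helper name latest_successful_backup is hypothetical, and the field names status, windowStartTime, and id are assumed to match the Cloud SQL backup run resource. Compare the result against the full backup list before restoring.

```shell
# Sketch only: latest_successful_backup is a hypothetical helper, not part of gcloud.
# Assumes the backup run resource exposes `status`, `windowStartTime`, and `id`.
latest_successful_backup() {
  local instance="$1" project="$2"
  gcloud sql backups list \
    --instance "$instance" \
    --project "$project" \
    --filter="status=SUCCESSFUL" \
    --sort-by="~windowStartTime" \
    --limit=1 \
    --format="value(id)"
}

# Example usage (prints the restore command instead of running it):
# SQL_BACKUP_ID="$(latest_successful_backup "$SQL_INSTANCE" "$PROJECT_ID")"
# echo gcloud sql backups restore "$SQL_BACKUP_ID" \
#   --restore-instance "$SQL_INSTANCE" --project "$PROJECT_ID"
```

Printing the restore command rather than running it keeps the destructive step a deliberate, manual action. If the corruption happened after the latest backup was taken, pick an earlier ID by hand instead.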
Restore the compute instance (VM)
Use cases:
- Docker daemon state is corrupted
- Application data is deleted
- VM is deleted
Assess what is deleted
Navigate to the sourcegraph-managed-$CUSTOMER project and look at the existing compute instance.
Does the VM still exist?
- No, the VM is gone. Does the data disk default-$CURRENT_DEPLOYMENT-data-disk still exist?
  - Yes, follow re-create the VM with existing data disk.
  - No, follow re-create the VM with new data disk from disk snapshot.
- Yes, follow the operation guides to troubleshoot the condition of the services. If you are unable to recover the running services on the VM, fall back to restore snapshot on a live VM.
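The assessment above can also be done from the command line. The following is a rough sketch, not an established tool: choose_restore_path is a hypothetical helper, the VM name default-$DEPLOYMENT-instance is assumed from the stop/start commands later in this page, and the mapping of the missing-disk case to the snapshot procedure is an assumption based on the section titles.

```shell
# Sketch only: choose_restore_path is a hypothetical helper.
# Assumes the VM is named default-$DEPLOYMENT-instance and the data disk
# default-$DEPLOYMENT-data-disk.
choose_restore_path() {
  local project="$1" deployment="$2"
  local vm disk
  # An empty result means no instance with that name exists in the project.
  vm="$(gcloud compute instances list --project "$project" \
    --filter="name=default-${deployment}-instance" --format="value(name)")"
  if [ -n "$vm" ]; then
    echo "vm-exists: troubleshoot services, or restore snapshot on a live VM"
    return 0
  fi
  # The VM is gone; check whether its data disk survived.
  disk="$(gcloud compute disks list --project "$project" \
    --filter="name=default-${deployment}-data-disk" --format="value(name)")"
  if [ -n "$disk" ]; then
    echo "vm-gone, disk-exists: re-create the VM with existing data disk"
  else
    echo "vm-gone, disk-gone: re-create the VM with new data disk from disk snapshot"
  fi
}

# Example usage:
# choose_restore_path "sourcegraph-managed-$CUSTOMER" "$CURRENT_DEPLOYMENT"
```

The helper only reports which procedure applies; it changes nothing in the project.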
Re-create the VM with existing data disk
- Open sourcegraph/deploy-sourcegraph-managed and change into the $CUSTOMER directory.
- Run terraform apply to reconcile the infrastructure to its definition in code.
- Follow confirm instance health.
Re-create the VM with new data disk from disk snapshot
- Run the following command and copy the name of the latest snapshot:
  gcloud compute snapshots list --project=sourcegraph-managed-$CUSTOMER --sort-by="~creationTimestamp" --limit=5 --format="table(name,creationTimestamp)"
- Go to sourcegraph/deploy-sourcegraph-managed and create a new branch $CUSTOMER/restore-instance.
- cd $CUSTOMER
- Edit $CUSTOMER/terraform.tfvars as shown. NOTE: the key could be black instead of red, depending on the current active instance.
  disks = { red = { from_snapshot = "REPLACE_ME_WITH_SNAPSHOT_NAME" } }
- Run terraform apply to reconcile the infrastructure to its definition in code.
- Follow confirm instance health.
- Commit your changes and open a Pull Request.
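The snapshot lookup and the terraform.tfvars edit in the steps above can be sketched as two small shell helpers. Both names (latest_snapshot_name and render_disks_stanza) are hypothetical. The stanza is only printed, so it can be reviewed before being pasted into terraform.tfvars.

```shell
# Sketch only: both helpers are hypothetical and not part of any existing tooling.

# Prints the name of the most recent snapshot in the customer's project.
latest_snapshot_name() {
  local customer="$1"
  gcloud compute snapshots list \
    --project="sourcegraph-managed-${customer}" \
    --sort-by="~creationTimestamp" \
    --limit=1 \
    --format="value(name)"
}

# Prints the disks stanza for terraform.tfvars.
# key is red or black, depending on the current active instance.
render_disks_stanza() {
  local key="$1" snapshot="$2"
  printf 'disks = { %s = { from_snapshot = "%s" } }\n' "$key" "$snapshot"
}

# Example usage:
# render_disks_stanza red "$(latest_snapshot_name "$CUSTOMER")"
```

The latest snapshot is not always the right one; if the data was already damaged when it was taken, choose an earlier name from the five-entry table listing instead.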
Restore snapshot on a live VM
- Run the following command and copy the name of the latest snapshot:
  gcloud compute snapshots list --project=sourcegraph-managed-$CUSTOMER --sort-by="~creationTimestamp" --limit=5 --format="table(name,creationTimestamp)"
- Go to sourcegraph/deploy-sourcegraph-managed and create a new branch $CUSTOMER/restore-instance.
- cd $CUSTOMER
- Edit $CUSTOMER/terraform.tfvars as shown. NOTE: the key could be black instead of red, depending on the current active instance.
  disks = { red = { from_snapshot = "REPLACE_ME_WITH_SNAPSHOT_NAME" } }
- Run terraform apply twice.
- Run
  gcloud compute instances stop default-$OLD_DEPLOYMENT-instance --project $PROJECT_ID
- Run
  gcloud compute instances start default-$OLD_DEPLOYMENT-instance --project $PROJECT_ID
- Follow confirm instance health.
- Commit your changes and open a Pull Request.
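The stop and start steps above could be wrapped so that the instance is only started again if the stop succeeded. This is a minimal sketch: restart_instance is a hypothetical helper, and as with the commands above, gcloud may prompt for a zone if none is configured.

```shell
# Sketch only: restart_instance is a hypothetical helper.
restart_instance() {
  local deployment="$1" project="$2"
  local instance="default-${deployment}-instance"
  # Stop first; start again only if the stop succeeded.
  gcloud compute instances stop "$instance" --project "$project" &&
    gcloud compute instances start "$instance" --project "$project"
}

# Example usage:
# restart_instance "$OLD_DEPLOYMENT" "$PROJECT_ID"
```

Chaining the two commands with && avoids issuing a start against an instance that is still in an unknown state after a failed stop.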