How to resolve a “Sourcegraph.com is deleted entirely” incident
Assess in which way it is deleted entirely
- Navigate to the `sourcegraph-dev` project and look at the existing Kubernetes clusters. Does the `cloud` cluster still exist?
  - No, the `cloud` cluster is gone: Do the disks for the now-deleted cluster nodes still exist? Check by navigating to Compute -> Disks and searching for `pDName`.
    - Yes, the disks still exist: Go to Recreating GKE cluster and follow the with existing disks steps.
    - No, the disks are gone: Go to Recreating GKE cluster and follow the from snapshots steps.
  - Yes, the `cloud` cluster exists: Go to Recreating Kubernetes objects
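The decision tree above can be sketched as a small shell helper. This is only an illustration: the function names `next_step` and `assess` are hypothetical, the cluster is assumed to be literally named `cloud`, and the disk name pattern is passed in by the operator rather than taken from this runbook.

```shell
# Hypothetical helper: maps the two checks to the next runbook section.
# $1 = does the cloud cluster exist ("yes"/"no")
# $2 = do the node disks exist ("yes"/"no")
next_step() {
  if [ "$1" = "yes" ]; then
    echo "Recreating Kubernetes objects"
  elif [ "$2" = "yes" ]; then
    echo "Recreating GKE cluster (with existing disks)"
  else
    echo "Recreating GKE cluster (from snapshots)"
  fi
}

# Hypothetical wrapper that gathers both answers with gcloud.
# $1 = regular expression matching the expected disk names (an assumption;
#      substitute whatever you would type into the Compute -> Disks search).
assess() {
  cluster="$(gcloud container clusters list --project sourcegraph-dev \
    --filter='name=cloud' --format='value(name)')"
  disks="$(gcloud compute disks list --project sourcegraph-dev \
    --filter="name~$1" --format='value(name)')"
  cluster_exists=no
  disks_exist=no
  [ -n "$cluster" ] && cluster_exists=yes
  [ -n "$disks" ] && disks_exist=yes
  next_step "$cluster_exists" "$disks_exist"
}
```

Calling `assess '<disk name pattern>'` prints the section to continue with. Confirm the result against the Google Cloud console before acting on it.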
Recreating GKE cluster
We use Terraform to manage our infrastructure.
- Navigate to the `cloud` repo.
- Follow the instructions there to run `terraform plan` to see if the infrastructure has drifted from what is specified there.
- Run `terraform apply` to reconcile the infrastructure with its definition in code.
- With existing disks: go to Recreating Kubernetes objects.
- From snapshots: go to Restore disks from snapshots.
- Go to Confirm health of Sourcegraph.com
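One way to make the drift check explicit is the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The sketch below assumes it is run from the Terraform directory that the repo's instructions point to. The function names `plan_status` and `check_drift` are hypothetical.

```shell
# Hypothetical helper: translate the exit code of
# `terraform plan -detailed-exitcode` into a short status.
plan_status() {
  case "$1" in
    0) echo "no drift" ;;
    2) echo "drift detected" ;;
    *) echo "plan failed" ;;
  esac
}

# Hypothetical wrapper: run the plan and report the status.
# Assumes `terraform init` has already been run as the repo describes.
check_drift() {
  terraform plan -detailed-exitcode >/dev/null
  plan_status "$?"
}
```

If `check_drift` reports drift, review the full `terraform plan` output before running `terraform apply`.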
Recreating Kubernetes objects
- Navigate to the `cloud` cluster on the Google Cloud console and click Connect, then run the `gcloud` command it gives you. `kubectl -n prod get deployments` should show partial or no Kubernetes deployments, while confirming that you are connected to the right cluster.
- In the latest `release` branch of the https://github.com/sourcegraph/deploy-sourcegraph-cloud repository, run `kubectl-apply-all.sh`, which will recreate all Kubernetes objects.
- Sourcegraph.com uses static disk attachments, so the volumes should still be valid and no data should have been lost.
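Because `kubectl-apply-all.sh` acts on whatever cluster `kubectl` currently points at, a guard on the current context can help. The sketch below relies on the `gke_<project>_<location>_<cluster>` naming that `gcloud container clusters get-credentials` gives to contexts. The function name `is_expected_context` is hypothetical.

```shell
# Hypothetical guard: succeed only if the context name matches
# gke_<project>_<any location>_<cluster>.
# $1 = context name, $2 = expected project, $3 = expected cluster
is_expected_context() {
  case "$1" in
    gke_"$2"_*_"$3") return 0 ;;
    *) return 1 ;;
  esac
}

# Example use (not executed here):
#   is_expected_context "$(kubectl config current-context)" sourcegraph-dev cloud \
#     && ./kubectl-apply-all.sh
```

A renamed context will fail this check even when it points at the right cluster, so treat a failure as a prompt to verify manually.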
Go to Confirm health of Sourcegraph.com
Restore disks from snapshots
- We use Velero to manage our disaster recovery process.
- Navigate to the `cloud` cluster on the Google Cloud console and click Connect, then run the `gcloud` command it gives you.
- Ensure you have Velero installed locally (`brew install velero`).
- Check whether the `velero` namespace exists: `kubectl get ns velero`
- If it does not, you need to install and configure Velero:
      gcloud config set project sourcegraph-dev

      SERVICE_ACCOUNT_EMAIL=$(gcloud iam service-accounts list \
        --filter="displayName:Velero service account" \
        --format 'value(email)')

      gcloud iam service-accounts keys create credentials-velero \
        --iam-account $SERVICE_ACCOUNT_EMAIL

      velero install \
        --provider gcp \
        --plugins velero/velero-plugin-for-gcp:v1.4.0 \
        --bucket sg-velero-cloud-backup \
        --secret-file ./credentials-velero
- Follow the steps in the Velero restore documentation.
  a. First, patch the backup location:

      kubectl patch backupstoragelocation default \
        --namespace velero \
        --type merge \
        --patch '{"spec":{"accessMode":"ReadOnly"}}'

  b. Find the most recent backup with `velero backup get` and run `velero restore create --from-backup <BACKUPNAME>`.
  c. Finally, revert the accessMode:

      kubectl patch backupstoragelocation default \
        --namespace velero \
        --type merge \
        --patch '{"spec":{"accessMode":"ReadWrite"}}'
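Picking the most recent backup in step b can be sketched as follows. The helper reads `NAME CREATED` pairs and prints the name with the latest timestamp, relying on ISO 8601 UTC timestamps sorting correctly as plain text. The function name `latest_backup` is hypothetical, and the sketch does not check whether the backup completed successfully, so confirm its status in the `velero backup get` output.

```shell
# Hypothetical helper: read "NAME CREATED" lines on stdin and print the
# name of the newest backup. CREATED is assumed to be an ISO 8601 UTC
# timestamp such as 2021-03-03T00:00:00Z.
latest_backup() {
  awk 'NF >= 2 { print $2, $1 }' | LC_ALL=C sort | tail -n 1 | awk '{ print $2 }'
}

# Example use (not executed here); lists Velero Backup objects directly:
#   kubectl -n velero get backups.velero.io --no-headers \
#     -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp \
#     | latest_backup
```

The printed name is what you would pass as `<BACKUPNAME>` to `velero restore create --from-backup`.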
Confirm health of Sourcegraph.com
- Check that `kubectl -n prod get pods` shows all pods as healthy and starting.
- Check that sourcegraph.com is accessible and you can run searches.
- Check that https://sourcegraph.com/site-admin shows a large number of users, repositories, etc. indicating postgres data exists.
- Check that the following are online and working:
- Check that:
  - Regular expression searches like this are working.
  - `type:symbol errorf` works
  - `repo:/sourcegraph/sourcegraph$ type:symbol index:no errorf` works
  - hover, jump-to-definition, find-references work
- OpsGenie shows no more open alerts
- https://sourcegraph.com/-/debug/grafana shows no unexpected alerts
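The pod check at the top of this list can be made less error-prone with a small filter over the `kubectl` output. This is a sketch only: the function name `unhealthy_pods` is hypothetical, and it assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical filter: read `kubectl get pods --no-headers` output on stdin
# and print the names of pods that are not fully ready.
# Pods in the Completed state (finished jobs) are skipped.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 == "Completed") next
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Example use (not executed here):
#   kubectl -n prod get pods --no-headers | unhealthy_pods
```

An empty result suggests all pods are running and ready. It does not replace the search and code intelligence checks in the list.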
Follow the documented regular incident follow-up procedures.