Playbooks for incidents
Many of these tips require kubectl; refer to the deployment page's Kubernetes section to get set up and started with basic commands.
CI playbook
Refer to the CI incidents playbook
Pausing a release
In some incidents, we may learn that we need to pause a release.
- The incident lead:
  - Alerts @delivery-support, @cs, @ce so the teams can plan accordingly
  - Handles the internal details of pausing a release (engaging the messenger to help with any relevant customer communication)
- The messenger alerts customers via a post on the status page
Go-to-market (license and subscription) issues
If a customer is experiencing an issue related to their license key or subscription status, any member of the Sourcegraph team has the authority to generate a new, valid license key for any customer, for any number of users, valid for up to 7 days, from the site-admin Subscriptions page on Sourcegraph.com. This prevents the initial incident responder from being bottlenecked on a member of the go-to-market team who can validate the customer's subscription status.
The incident responder will need to select a Sourcegraph.com account to attach the subscription to (typically the account should belong to the customer, so they can access the license key directly from their user profile, but in an emergency, the incident responder can use their own account in lieu of asking the customer), and can then manually generate a license key. No license “tags” are necessary.
If a customer's instance is already reporting "license expired", note that there is a 72-hour grace period before non-admin users are locked out.
Figure out why a pod is in CrashLoopBackOff
CrashLoopBackOff indicates that a service is repeatedly failing to start. Instead of using the Google Cloud Console to check the logs, the best way to figure out why a pod is in this state is to use the following commands:
kubectl get pods -o=wide --namespace=prod # find the pod you are interested in
kubectl describe --namespace=prod pod $POD_NAME
In the output, you’ll find logs and a snippet like the following that can give more information about why a service has not started:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Message: 690
Certain states, like Terminated, are not clearly indicated in the Google Cloud Console, and will usually not be properly indicated in logs either.
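If the describe output is not enough, the logs of the last failed run are often the quickest way to see the actual crash. This is a sketch using the same namespace and $POD_NAME as above; $CONTAINER_NAME is a placeholder you only need if the pod runs more than one container:
kubectl logs --namespace=prod $POD_NAME --previous # logs from the previous (crashed) container instance
kubectl logs --namespace=prod $POD_NAME --previous -c $CONTAINER_NAME # only needed for multi-container pods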
Restarting pods
This action is also referred to as "bouncing pods" or "bouncing a pod". Follow the steps to authenticate kubectl, then restart the desired deployment with:
kubectl rollout restart deployment/$SERVICE
Note the use of deployment/: you cannot restart services (svc/). Stateful sets have a different process (see "Making updates to stateful sets" below).
You can also bounce pods by manually finding them via kubectl get pods and deleting them with kubectl delete pod $POD_ID. Once a pod is deleted, it will automatically be recreated.
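To confirm that a restart has finished rolling out, kubectl rollout status blocks until the deployment is healthy again (a sketch, using the same $SERVICE as above):
kubectl --namespace=prod rollout status deployment/$SERVICE # waits until the new pods are ready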
Making updates to stateful sets
Stateful sets differ from deployments in that all pods must be in a healthy state before changes can be made.
Sourcegraph currently uses stateful sets for the following services:
- gitserver
- grafana
- indexed-search
To push an update to a failing stateful set, take the following steps:
1. Update the stateful set YAML with the appropriate change and apply it using:
kubectl apply -f <service.StatefulSet.yaml>
2. Delete the pods in the stateful set:
REPLICAS=`kubectl get sts -n prod <statefulset> -o jsonpath={.spec.replicas}`; for i in `seq 0 $[REPLICAS-1]`; do POD=<statefulset>-$i; echo "Deleting POD $POD"; kubectl delete pod -n prod $POD ; done
3. The stateful set controller will now restart the pods. Verify the pods are starting successfully (a rollout status alternative is sketched after step 4).
watch -n1 kubectl get all -n prod -l app=gitserver -o wide
4. Verify the service is restored
Open the alert UI, click on the check URL that was failing, and verify it's now working again.
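As an alternative to the watch in step 3, kubectl rollout status also works for stateful sets and blocks until all replicas are healthy (a sketch using the gitserver example above; adjust the name as needed):
kubectl -n prod rollout status statefulset/gitserver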
Useful dashboards
Metrics
Dashboards for Prometheus metrics are available at /-/debug/grafana (for example, sourcegraph.com/-/debug/grafana).
Tracing
Jaeger is available at /-/debug/jaeger (for example, sourcegraph.com/-/debug/jaeger).
Prod logs
- Go to Kubernetes Engine > Workloads, then search for and click on the workload you're interested in, e.g. sourcegraph-frontend.
- On the Deployment details page, there is a row called Logs, which has both Container logs and Audit logs.
- Click on Container logs; you will be redirected to the Logs Viewer page.
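If you only need a quick look at recent logs, kubectl can also tail them without going through the Cloud Console. A sketch, using the sourcegraph-frontend example above; swap in whichever deployment or pod you are debugging:
kubectl -n prod logs deployment/sourcegraph-frontend --tail=100 # recent logs from one pod of the deployment
kubectl -n prod logs -f deployment/sourcegraph-frontend # follow logs as they are written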
Free gitserver disk space on sourcegraph.com
1. See the status of all gitserver pods
kubectl get all -n prod -l app=gitserver -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod/gitserver-0 0/1 CrashLoopBackOff 12 1h 10.52.14.6 gke-dot-com-dot-com-93160456-s7zw <none>
pod/gitserver-1 1/1 Running 0 1h 10.52.13.189 gke-dot-com-dot-com-93160456-b7bk <none>
pod/gitserver-2 1/1 Running 0 1h 10.52.0.165 gke-dot-com-dot-com-93160456-271w <none>
2. Verify that it’s a disk space problem
Replace gitserver-0 with the name of the affected pod.
kubectl -n prod describe pod gitserver-0 | grep -A5 "Message:"
Message: t=2019-04-24T07:46:46+0000 lvl=info msg="Distributed tracing enabled" tracer=Lightstep
2019/04/24 07:46:46 failed to setup temporary directory: mkdir /data/repos/.tmp-old227473847: no space left on device
3. SSH into the node where the affected pod is scheduled
The node address is present in the output of the first command you ran: kubectl get all -n prod -l app=gitserver -o wide. Replace the node name in the command below.
gcloud compute ssh gke-dot-com-dot-com-93160456-s7zw --zone us-central1-f
4. Become root and find the mount path containing repos-gitserver-
sudo su
df -h | sort -r -k 5 -i | head -n8
Filesystem Size Used Avail Use% Mounted on
/dev/sdc 3.9T 3.8T 68G 99% /var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/repos-gitserver-0-dot-com
/dev/root 1.2G 466M 756M 39% /
overlayfs 1.0M 172K 852K 17% /etc
tmpfs 1.0M 144K 880K 15% /var/lib/cloud
overlay 95G 8.2G 87G 9% /var/lib/docker/overlay2/ed84d3f946264457738a05ba00d586e8de53e731983a21b6825faa6fef01f068/merged
overlay 95G 8.2G 87G 9% /var/lib/docker/overlay2/d7ca574132a2511a82b3bbaa0010ee49e7e616a1ebee9ca61ad094d4aef286dd/merged
overlay 95G 8.2G 87G 9% /var/lib/docker/overlay2/d7bd7cc72a2e47cb32f2542bd7416597d4ad7a6409d0c069b041bf28766dea2a/merged
5. Delete repos from disk until Use% is below 98% (a sketch for targeting the largest repositories first appears after step 6)
cd /var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/$DISK_NAME/github.com
for repo in *; do echo $repo; rm -rf $repo; sleep 0.1; df -h .; done
While this command runs, check the status of the gitserver pods in a different terminal split or window:
watch -n1 kubectl get all -n prod -l app=gitserver -o wide
6. Verify the service is restored
Open the alert UI, click on the check URL that was failing, and verify it's now working again.
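For step 5, if you want to target the largest repositories first instead of deleting them in alphabetical order, a du listing of the mount directory helps (a sketch; run it from the same github.com directory you cd'd into above):
du -sk * | sort -n | tail -n 20 # the 20 largest repositories, sizes in KB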
PostgreSQL database problems
We provide two sets of instructions here: shell commands, and PostgreSQL commands to be run inside a psql session. PostgreSQL commands are denoted by the prompt sg=# in this documentation; the actual prompt corresponds to the Postgres user name.
Backing up & restoring a Cloud SQL instance (production databases)
Before any potentially risky operation you should ensure the databases have recent ( < 1 hour) backups. We currently have daily backups enabled.
You can create a backup of a Cloud SQL instance via gcloud sql backups create --instance=${instance_name} --project=sourcegraph-dev, or choose from existing backups via gcloud sql backups list --instance=${instance_name}.
To restore a Cloud SQL instance to a previous backup, use gcloud sql backups restore $BACKUP_ID --restore-instance=${instance_name}.
You can also perform these operations from the Google Cloud SQL UI.
🚨 You should notify the #dev-ops channel if a situation arises where a restore may be required. It should also be filed in our ops-incident log.
Restore database using point-in-time recovery (creates new database clone!)
🚨 You should open an incident (unless one is already open) and notify the #dev-ops channel if a situation arises where a point-in-time recovery is required.
Explanation of CloudSQL PostgreSQL point in time recovery
⚠️ WARNING: Before starting a point-in-time restore, ensure that the previous approach (restoring from a daily backup) is not possible: a point-in-time restore has to create a new database instance, and all applications have to be reconfigured to use the new instance!
You can create a clone of the instance from a point in time via gcloud sql instances clone ${instance_name} ${clone_instance_name} --point-in-time '${TIMESTAMP}' (a UTC timestamp for the source instance, in RFC 3339 format).
You have to create a PR that modifies all Cloud SQL Proxy configurations in the base folder of the repository to point to the new instance, i.e. sourcegraph-dev:us-central1:${instance_name} -> sourcegraph-dev:us-central1:${clone_instance_name}.
After the PR is approved and merged, applications are redeployed and use the new instance.
Shell commands
These commands assume you’re on a local machine, and trying to access the live systems. Also refer to the deployment page’s Kubernetes section for kubectl tips.
Helpful aliases
Note: For the sourcegraph.com database you should refer to the instructions here
alias dbpod='kubectl get pods --namespace=prod | grep pgsql | cut -d " " -f 1'
alias proddb='kubectl exec -it --namespace=prod $(dbpod) -- psql -U sg -P pager=off';
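With these aliases loaded, a typical session looks like this (the query is only an illustrative example; any read-only query works):
dbpod # prints the pgsql pod name
proddb # opens a psql prompt on that pod
sg=# select count(*) from repo;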
Check load average
kubectl exec -it --namespace=prod $(dbpod) -- top
You might also like the shorter output of uptime.
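For example, using the dbpod alias from above:
kubectl exec --namespace=prod $(dbpod) -- uptime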
PostgreSQL interactions
The psql command gives you, by default, a prompt looking like user=#, where "user" is the provided user name; it is sg for production. SQL queries and commands (SELECT, UPDATE, INSERT, DELETE, DROP) need to be terminated by a semicolon. Commands that start with a backslash are not SQL commands and do not need semicolons.
Examples:
sourcegraph=# select * from schema_migrations;
version | dirty
------------+-------
1528395540 | f
(1 row)
sourcegraph=# \d schema_migrations
Table "public.schema_migrations"
Column | Type | Modifiers
---------+---------+-----------
version | bigint | not null
dirty | boolean | not null
Indexes:
"schema_migrations_pkey" PRIMARY KEY, btree (version)
If you have entered an SQL command and gotten a prompt back immediately, check the character right before the #. If you're missing a semicolon, that character changes to -. If you have mismatched parentheses, it changes to (. Examples:
v-- this column here is really useful!
sourcegraph=# select (version, dirty
sourcegraph(# )
sourcegraph-# from schema_migrations
sourcegraph-# ;
row
----------------
(1528395540,f)
(1 row)
List slow queries
Note that typical queries usually run in milliseconds, not seconds. “1 minute” is far enough out to be highly unusual.
select pid, now() - pg_stat_activity.query_start AS duration, query, state from pg_stat_activity where (now() - pg_stat_activity.query_start) > interval '1 minute';
Cancel or terminate slow queries
Queries can be cancelled (requested to stop) or terminated (killed). You can only cancel a query which is running. Idle queries, which have completed but are waiting (typically on the client consuming the results or acknowledging them in some way), may not respond to a cancel, and require termination. (Under the hood: these are just sending signals. Cancel is SIGINT, terminate is SIGTERM.)
Cancel:
select pg_cancel_backend(pid), now() - pg_stat_activity.query_start AS duration, query, state from pg_stat_activity where (now() - pg_stat_activity.query_start) > interval '1 minute';
Terminate:
select pg_terminate_backend(pid), now() - pg_stat_activity.query_start AS duration, query, state from pg_stat_activity where (now() - pg_stat_activity.query_start) > interval '1 minute';
In each case, the column containing the results of the pg_*_backend(pid) call will have a t for successes and an f for failures. Note that even a terminated process may not disappear instantaneously. (Failures should not happen if your database role is a superuser, which is the configuration we document and support, and is what we use on prod.)
Check/fix migration table
When migrations run, the schema_migrations table is updated to show the state of migrations. The dirty column indicates whether a migration is in process, and the version column indicates the version of the migration the database is on or converting to. On startup, frontend will abort if the dirty column is set to true. (The table has only one row.)
If frontend fails at startup with a complaint about a dirty migration, a migration was started but not recorded as completing. This could mean that it actually failed. It could also mean that it completed, but took long enough that the readiness check killed frontend before the state was recorded. So it’s possible that one or more commands from the migration ran successfully.
Do not mark the migration table as clean if you have not verified that the migration was successfully completed.
To check the state of the migration table:
SELECT * FROM schema_migrations;
Typical output would look like:
version | dirty
------------+-------
1528395539 | t
(1 row)
This indicates that migration 1528395539 was running, but has not yet completed. Check the actual state of the migration directly; if it has completed, you can manually clear the dirty bit:
UPDATE schema_migrations SET dirty = 'f' WHERE version = 1528395539;
Checking on the status of the migration requires looking at the migration's commands. The source for each migration is in sourcegraph/sourcegraph/migrations, in a file named something like 1528395539_.up.sql. The number is the migration's serial number, and the text after it (usually empty in recent migrations) describes the purpose of the migration. There should be a corresponding .down.sql file to reverse the migration.
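To review what a given migration was supposed to do, look at its .up.sql file in a checkout of the repository (a sketch, run from the root of a sourcegraph/sourcegraph clone; the exact path of your checkout is an assumption):
ls migrations/ | grep 1528395539 # find the files for this migration version
cat migrations/1528395539_.up.sql # the statements the migration should have applied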
A full explanation of how to determine whether arbitrary SQL commands have successfully executed is beyond the scope of this document, but the most common/easy case is described below.
Describe a table and its indexes
Many migrations do nothing but create tables and/or indexes, or alter them. You can get a description of a table and its associated indexes quickly using the \d command (note the lack of a semicolon):
sg=# \d global_dep
Table "public.global_dep"
Column | Type | Modifiers
----------+---------+-----------
language | text | not null
dep_data | jsonb | not null
repo_id | integer | not null
hints | jsonb |
Indexes:
"global_dep_idx_package" btree ((dep_data ->> ('package'::text COLLATE "C")))
"global_dep_idxgin" gin (dep_data jsonb_path_ops)
"global_dep_language" btree (language)
"global_dep_repo_id" btree (repo_id)
Foreign-key constraints:
"global_dep_repo_id" FOREIGN KEY (repo_id) REFERENCES repo(id) ON DELETE RESTRICT
Using this information, you can determine whether a table exists, what columns it contains, and what indexes exist on it. This allows you to determine whether a given command ran successfully. For instance, if one of the commands in a migration was CREATE INDEX global_dep_idx_package [...], and you got the above output from \d, that would indicate that the index was successfully created. If the index wasn't present, that would indicate that the CREATE INDEX had not been executed, and you would need to run that statement before clearing the dirty flag in schema_migrations.
Redis interactions
Connecting to the Redis databases in production
Find the Redis pods:
$ kubectl -n prod get pods | grep redis
redis-cache-8475866c76-swxlz 2/2 Running 0 22d
redis-store-7c5fb8dd89-jzx44 2/2 Running 0 22d
Get a redis-cli prompt:
$ kubectl -n prod exec -it redis-store-7c5fb8dd89-jzx44 -- redis-cli
Defaulting container name to redis-store.
Use 'kubectl describe pod/redis-store-7c5fb8dd89-jzx44 -n prod' to see all of the containers in this pod.
127.0.0.1:6379>
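Once connected, a few standard read-only Redis commands are usually enough for a first look: DBSIZE shows the number of keys, INFO memory shows memory usage, and SLOWLOG GET 10 lists recent slow commands. For example:
127.0.0.1:6379> DBSIZE
127.0.0.1:6379> INFO memory
127.0.0.1:6379> SLOWLOG GET 10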
DNS/SSL settings on Cloudflare
- Visit https://dash.cloudflare.com/login (get the credentials from 1Password)
- Click on "sourcegraph.com"
- Click on DNS and make sure the IP matches the IP from kubectl get -nprod services, after setting the K8s context with gcloud container clusters get-credentials main-cluster-5 --zone us-central1-f --project sourcegraph-dev (the commands are spelled out below)
- Click on Crypto in the top bar and make sure it's set to Full
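Spelled out, the commands for that DNS check are the following (both taken from the steps above); compare the EXTERNAL-IP of the relevant service with the record shown in Cloudflare:
gcloud container clusters get-credentials main-cluster-5 --zone us-central1-f --project sourcegraph-dev
kubectl get -nprod services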
Rolling back Sourcegraph.com to an old frontend image version
- Run gcloud container clusters get-credentials main-cluster-5 --zone us-central1-f --project sourcegraph-dev
- Visit https://hub.docker.com/r/sourcegraph/frontend/tags/ to find an older tag
- Run kubectl edit deployment -nprod sourcegraph-frontend, find the line that looks like image: sourcegraph/frontend:22342_2018-10-30_a0ba521, and update the tag (an alternative using kubectl set image is sketched below)
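As an alternative to editing the deployment interactively, kubectl set image can pin the tag directly. This is a sketch: it assumes the container inside the deployment is named frontend, which you should verify before running.
kubectl -nprod get deployment sourcegraph-frontend -o jsonpath='{.spec.template.spec.containers[*].name}' # confirm the container name
kubectl -nprod set image deployment/sourcegraph-frontend frontend=sourcegraph/frontend:22342_2018-10-30_a0ba521 # use the older tag you picked
kubectl -nprod rollout status deployment/sourcegraph-frontend # wait for the rollback to finish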
Deleting a repository and its data
1. Get pod names in production: kubectl --namespace=prod get pods
2. Connect to Postgres: kubectl --namespace=prod exec -ti [pgsql pod name] -- psql -U sg
3. Delete the row: DELETE FROM repo WHERE uri = 'github.com/example/example';
4. Connect to gitserver-1: kubectl --namespace=prod exec -ti [gitserver-1 pod name] -- sh
5. Find and delete the repo directory in /data/repos/, if it exists (see the sketch after this list)
6. Do steps 4 and 5 with gitserver-2 as well
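For step 5, a sketch of the removal, run inside the gitserver shell opened in step 4 and using the same hypothetical github.com/example/example repository as the SQL above (substitute the repository you actually deleted):
ls -d /data/repos/github.com/example/example # confirm the directory exists
rm -rf /data/repos/github.com/example/example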
Failing about.sourcegraph.com
Several of our static sites are hosted by Netlify at about.sourcegraph.com. You can check the service status at https://www.netlifystatus.com/, as well as notifications in the #alerts-external channel.
If you discover an issue with Netlify but their status page is not displaying it, open a support request using the credentials in 1Password.
Sites:
- about.sourcegraph.com
- sourcegraph.com/about (307 redirect to about.sourcegraph.com/about)
- sourcegraph.com/pricing (307 redirect to about.sourcegraph.com/pricing)
For a full list of paths redirected from sourcegraph.com/$KEY, reference the router code in our frontend.
Moving a GCE disk between projects
Moving a disk between projects requires multiple steps to ensure the disk is detached from its original source.
- The initial disk can only be created using the same type. To change its type, you will have to snapshot the disk and restore it as a new type.
- The initial disk can only be moved to the same region as its source project.
Create a new disk in your target project from the old disk:
gcloud compute disks create ${TARGET_DISK_NAME} --source-disk=https://www.googleapis.com/compute/v1/projects/${SOURCE_PROJECT}/zones/${SOURCE_ZONE}/disks/${PVC_NAME} --project=${TARGET_PROJECT} --zone=${TARGET_ZONE} --type ${SOURCE_TYPE}
We now have to snapshot and restore the disk to ensure it is unlinked from its source disk. This process is the same as the one used to change a disk zone or type; please follow that playbook.
Changing a GCE disk zone or type
To change disk region or type, we need to first create a snapshot and then restore it to the new region or type.
Snapshot:
gcloud compute disks snapshot ${SOURCE_DISK_NAME} --project=${PROJECT} --zone=${SOURCE_ZONE} --snapshot-names ${SNAPSHOT_NAME}
At this point, you can optionally delete the source disk.
Restore:
gcloud compute disks create ${TARGET_DISK_NAME} --project=${PROJECT} --size ${TARGET_DISK_SIZE} --source-snapshot ${SNAPSHOT_NAME} --zone=${TARGET_ZONE} --type ${TARGET_DISK_TYPE}
TARGET_DISK_SIZE has to be equal to or larger than the size of the original source disk.
There is an alias command that performs the snapshot -> restore operation automatically to switch regions, but it fails on big disks.
Changing Persistent Volume disk or other immutable parameters (PV replacement)
Changing gcePersistentDisk (or most other PV parameters) is not allowed, so take the following steps to replace the PV with a minimal downtime window:
1. Prepare the new disk; use the "Changing a GCE disk zone or type" paragraph above.
2. Create the new PV yaml file (sample here).
3. Apply the new changes by opening, approving, and merging a PR.
4. Check that the new PV is available: kubectl get pv -n prod.
5. Prepare a PR removing the old PV, which should be deleted from the base folder.
6. Delete the PV and PVC that are being replaced, via kubectl:
kubectl delete pv <PV-to-be-removed>
kubectl delete pvc <PVC-to-be-replaced> -n prod
7. Merge the PR from step 5; this will trigger another deploy, recreating the PVC via a Buildkite job.
8. Verify that the new PV is used (it should be in Bound status):
kubectl get pv
Linking a PV/PVC to a GCE disk
When we need to re-link a disk to a PersistentVolume and its associated PersistentVolumeClaim (for example, because we are restoring a disk from a snapshot), we need to adapt the Kubernetes resource definitions to link the existing GCE disk to GKE.
Example PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: MyDisk
  namespace: SomeNamespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
We will first ensure that the claim is directly linked to a fixed PersistentVolume by setting claimRef on the volume to the desired PVC. You also need to ensure that the accessModes, capacity, and all other settings match between the claim and the volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: MyDisk
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    name: MyDisk
    namespace: SomeNamespace
  storageClassName: SomeStorageClass
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: MyDisk
  namespace: SomeNamespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
Lastly, we need to link the volume to a GCE disk by setting gcePersistentDisk. We must ensure that gcePersistentDisk.pdName matches the disk name in GCE.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: MyDisk
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    name: MyDisk
    namespace: SomeNamespace
  gcePersistentDisk:
    fsType: ext4
    pdName: SomeGCEDisk
  storageClassName: SomeStorageClass
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: MyDisk
  namespace: SomeNamespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
StatefulSets
There are some additional steps when dealing with a StatefulSet, as claims are not statically defined and we need to ensure the generated PersistentVolumeClaims match the static PersistentVolumes.
First, discard the PersistentVolumeClaim sections from the previous step, and inspect the volumeClaimTemplates section of the StatefulSet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: MySet
  namespace: SomeNamespace
  [...]
spec:
  [...]
  replicas: 1
  volumeClaimTemplates:
    - metadata:
        name: DataDisk
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
        storageClassName: SomeStorageClass
Kubernetes will generate the PersistentVolumeClaim name from metadata.name and spec.volumeClaimTemplates.metadata.name using the format ${spec.volumeClaimTemplates.metadata.name}-${metadata.name}-${replica-index}, and the claim is created in the StatefulSet's metadata.namespace, so claimRef.namespace must match it.
Example PersistentVolume using the previous volumeClaimTemplates:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: DataDisk-0
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    name: DataDisk-MySet-0
    namespace: SomeNamespace
  gcePersistentDisk:
    fsType: ext4
    pdName: SomeGCEDisk
  storageClassName: SomeStorageClass
Export/Import a Cloud SQL database
Exporting or importing a database to Cloud SQL using gcloud requires the export to exist in a Google Storage bucket; it can be performed across GCP projects.
Export:
gcloud --project=${SOURCE_PROJECT} sql export sql ${SOURCE_SERVER_NAME} 'gs://some-bucket/Cloud_SQL_Export_${SQL_SERVER_NAME}_${DB_NAME}' --database=${SOURCE_DB_NAME}
Import:
gcloud --project=${TARGET_PROJECT} sql import sql ${TARGET_SERVER_NAME} 'gs://some-bucket/Cloud_SQL_Export_${SOURCE_SERVER_NAME}_${SOURCE_DB_NAME}' --database=${TARGET_DB_NAME}
Sourcegraph.com is deleted entirely
This playbook has a dedicated page.