Cloud

The Cloud team is a special focus team reporting directly to the CEO, modeled on the question “if AWS were to offer ‘Managed Sourcegraph’ like they do Elasticsearch, Redis, PostgreSQL, etc., how would they do it?” The team is responsible for maintaining existing managed instances and building the next generation of them. The Cloud team has no other responsibilities.

Members

Rafal Leszczynski, Engineering Manager, Cloud
- Joe Chen, Software Engineer
- Dax McDonald, Software Engineer
- Robert Lin, Software Engineer
- Rafal Gajdulewicz, Software Engineer
- Filip Haftek, Software Engineer
- Daniel Dides, Software Engineer
- Michael Lin, Software Engineer

Mission statement

Build a fully managed platform for using Sourcegraph that can (by EOFY23) support 200+ customers using dedicated Sourcegraph instances, providing feature compatibility with self-hosted while being cost-efficient for customers and Sourcegraph.

Fully managed

Observability allowing Sourcegraph to react before user impact is noticed, while respecting user privacy
Frequent, invisible Sourcegraph upgrades
Invisible infrastructure updates
Zero infrastructure access for customers

Platform

Low customer onboarding cost
Zero customer maintenance cost
Secure (SOC 2, documented security posture)
Reliable (ability to offer SLA, internal SLO of 99.9%)
Automatable (in due time, feature releases / billing / upgrades / analytics are built-in)
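
To make the 99.9% internal SLO above concrete, the following sketch converts an availability target into an allowed-downtime budget. The function name `error_budget_minutes` and the 30-day window are illustrative assumptions, not part of any Cloud team tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the mg cli): allowed downtime in minutes
# for an availability SLO over a window of days.
# Usage: error_budget_minutes <slo_percent> <window_days>
error_budget_minutes() {
  local slo="$1" days="$2"
  # LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
  LC_ALL=C awk -v slo="$slo" -v days="$days" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

# Downtime budget for a 99.9% SLO over an assumed 30-day window.
error_budget_minutes 99.9 30
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days.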

Support 200+ customers

Targeting 200+ customers in FY23 to invest in supporting 1000 in FY24
- Support 300 production-grade instances (accommodating trials / testing)
Compatible with current MI use cases
- Infrastructure / Domain / Isolation boundary per customer

Dedicated Sourcegraph instances

One Sourcegraph instance serves a single customer
(FY23/FY24) Dedicated, Sourcegraph-provided Cloud infrastructure
(FY23/FY24) GCP only

Feature compatibility

Feature set on-par with self-hosted
- With time, getting more powerful than self-hosted
Features are opt-in (for a fee)
New features available on Cloud before self-hosted
Existing features have higher adoption on Cloud than self-hosted

Cost-efficiency

Expected to support teams from 50 to 5000 users (EOFY23) at a minimum infrastructure cost of $500/month
Infrastructure cost covered by Sourcegraph
Administration / operations provided by Sourcegraph
(FY24) Self service provisioning / release channels for upgrades

Not in scope (for FY23 / FY24):

supporting customer provided GCP infrastructure
supporting cloud providers other than GCP
managing Sourcegraph installations in clusters not provisioned by the Cloud team (Bring-your-own-Kubernetes)
supporting customers smaller than X1 ARR
optimizing cost below X2 $/month

Roadmap

The Cloud team roadmap is available here.

Q3FY23 goals

Support the initiative to make the Cloud a preferred deployment method from the platform and infrastructure perspective
Cloud v2 - migrate the current Cloud (managed instances) from a single-VM, Docker Compose based architecture to a multi-node, GKE based architecture

How to contact the team and ask for help

For emergencies and incidents, alert the team using Slack command /genie alert [message] for cloud.
For internal Sourcegraph teammates, join us in the #cloud Slack channel to ask questions or request help from our team.
For special request types, or requests for help that require action from Cloud team engineers (e.g. coding, infrastructure changes), please create a GitHub issue and assign the team/cloud label. You can also post a follow-up message in the #cloud Slack channel.

When to offer a Managed Instance

See below for the SLAs and Technical implementation details (including Security) related to managed instances.

Please message #cloud for any answers or information missing from this page.

When offering customers a Managed Instance, CE and Sales should communicate and gather information on the following topics:

Customers are comfortable with the security implications of using a managed instance
Customers’ code hosts should be publicly accessible or able to allow incoming traffic from Sourcegraph-owned static IP addresses. (Note: we do not have proper support for other connectivity methods, e.g. site-to-site VPN.)

Trial Managed Instances (aka PoC)

Documentation

Managed Instance Requests

Customer Engineers (CE) or Sales may request to:

Create a managed instance - [Issue Template]
- For new customers or prospects who currently do not have a managed instance.
- After determining a managed instance is viable for a customer/prospect
Suspend a managed instance - [Issue Template]
- For customers or prospects who currently have a managed instance and need to pause their journey, but intend to come back within a couple of months.
Tear down a managed instance - [Issue Template]
- For customers or prospects who have elected to stop their managed instance journey entirely. They accept that they will no longer have access to the data from the instance as it will be permanently deleted.
Extend trial Managed Instance issue - [Issue Template]
- For prospects who need to extend the trial.
Convert Trial Managed Instance to paid issue - [Issue Template]
- For prospects who sign the deal after the trial expires.
Enable telemetry on a managed instance - [Issue Template]
- For customers or prospects who currently have a managed instance for which you would like to enable collection of user-level metrics.
Disable telemetry on a managed instance - [Issue Template]
- For customers or prospects who currently have a managed instance for which you would like to disable collection of user-level metrics.

Workflow

CE seeks Managed Instance approval from their regional CE Manager
The Regional CE Manager will review the following criteria:
- Overall, is the deal qualified?
- Is it technically qualified? We have documented POC success criteria and the customer agrees to the criteria. We have documented the basic technical requirements of the customer (languages, repo types, security, etc.)
- If anything is non-standard, it must pass the tech review process
If approved, then CE proceeds based on whether this is a standard or non-standard managed instance scenario:
- For standard managed instance requests (i.e., new instance, no scale concerns, no additional security requirements), CE submits a request to the Cloud team using the corresponding issue template in the sourcegraph/customer repo.
- For non-standard managed instance requests (i.e., any migrations, special scale or security requirements, or anything considered unusual), CE submits the opportunity to Tech Review before making a request to the Cloud team.
Message the team in #cloud.
If denied, the CE/AE can appeal through the CE/AE leadership chain of command.

SLAs for managed instances

Support SLAs for Sev 1 and Sev 2 can be found here. Other engineering SLAs are listed below.

Request	Description	Response time	Resolution time
New instance Creation	Spin up new instance for a new customer	Within 24 hours of becoming aware of the need	Within 7 working days from agreement
New Trial instance Creation	Spin up new trial instance for a new customer	1 hour within office hours (Monday to Friday, 7:00 AM GMT - 10:00 PM GMT), or 1 working day if requested outside of office hours	1 hour within office hours (Monday to Friday, 7:00 AM GMT - 10:00 PM GMT), or 1 working day if requested outside of office hours
Existing instance suspension	Suspend an existing managed instance temporarily	Within 24 hours of becoming aware of the need	Within 15 working days from agreement
Existing instance deletion/teardown	Decommission/delete an existing managed instance	Within 24 hours of becoming aware of the need	Within 15 working days from agreement
New Feature Request	Feature request from new or existing customers	Within 24 hours of becoming aware of the need	Dependent on the request
Maintenance: Monthly Update to latest release	Updating an instance to the latest release	NA	Within 1 week after latest release
Maintenance: patch/emergency release Update	Updating an instance with a patch or emergency release	NA	Within 1 week after patch / emergency release
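
The trial-instance SLA in the table depends on whether a request arrives within office hours (Monday to Friday, 7:00 AM to 10:00 PM GMT). A minimal sketch of that check follows. The function name `in_office_hours` is hypothetical, and treating 7:00 as inclusive and 22:00 as exclusive is an assumption the table does not spell out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a request time falls within office hours.
# Usage: in_office_hours <day_of_week 1=Mon..7=Sun> <hour_gmt 0..23>
in_office_hours() {
  local dow="$1" hour="$2"
  # Assumption: the window is 07:00 (inclusive) to 22:00 (exclusive) GMT.
  [ "$dow" -ge 1 ] && [ "$dow" -le 5 ] && [ "$hour" -ge 7 ] && [ "$hour" -lt 22 ]
}

# Classify the current moment, using GMT for both values.
dow=$(date -u +%u)    # ISO day of week, 1 = Monday
hour=$(date -u +%H)   # hour of day, 00-23
hour=${hour#0}        # drop a leading zero ("08" becomes "8")
if in_office_hours "$dow" "$hour"; then
  echo "within office hours: 1 hour SLA applies"
else
  echo "outside office hours: 1 working day SLA applies"
fi
```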

Recovery Time Objective and Recovery Point Objective (RTO & RPO)

We have a maximum Recovery Point Objective (RPO) of 24 hours. Snapshots are performed at least daily on managed instances. Some components may have lower RPOs (e.g. database).
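
As an illustration of the 24-hour RPO, the sketch below checks whether the newest snapshot is recent enough. The function `within_rpo` and its epoch-second inputs are assumptions for illustration only. It does not query real snapshot metadata.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the newest snapshot is within the RPO.
# Usage: within_rpo <snapshot_epoch_seconds> <now_epoch_seconds> [rpo_hours]
within_rpo() {
  local snapshot="$1" now="$2" rpo_hours="${3:-24}"
  local age=$(( now - snapshot ))
  # A snapshot dated in the future is treated as invalid.
  [ "$age" -ge 0 ] && [ "$age" -le $(( rpo_hours * 3600 )) ]
}

# Example: a snapshot taken six hours ago.
now=$(date +%s)
if within_rpo "$(( now - 6 * 3600 ))" "$now"; then
  echo "snapshot is within the 24h RPO"
else
  echo "snapshot is older than the 24h RPO"
fi
```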

Our maximum Recovery Time Objective is defined by our support SLAs for P1 & P2 incidents.

Incident Response

Incidents which affect managed instances are handled according to our incidents process.

Accessing/Debugging Managed Instances

Action	Who can do it	Description	How
Reload config	CE/CS	Reload MI site config (restart frontend)	restart frontend
View GCP project metrics	Cloud/Security/All SG employees via policy attachment	Access to all MI metrics aggregated in a single project	GCP scoped dashboard
View GCP project logs	Cloud/Security/All SG employees via policy attachment	Access customer GCP project logs	GCP logs - change to proper customer name
GCP ssh, tunnel ports	Cloud/CS	Required for troubleshooting the customer environment and performing pre-defined playbooks	install mg cli; ssh to MI; port-forward to MI; gcloud commands
Access CloudSQL database	Cloud/Security/CS	Login to CloudSQL DB	install mg cli; access CloudSQL via mg cli; gcloud commands
Login to customer MI web UI	Cloud/CE	Login to customer web UI (requires enabled OIDC on customer instance or access to `1password customer instances vault`) - change URL to customer slug	login with GSuite (OIDC) or user/password from 1password (if OIDC not enabled)
Login to customer Grafana	Cloud/CE	Login to customer Grafana (requires enabled OIDC on customer instance or access to `1password customer instances vault`) - change URL to customer slug	login with GSuite (OIDC) or user/password from 1password (if OIDC not enabled)
List Managed Instances	Cloud/CE	List Managed Instances, filtered by instance type (trial/production/internal) and (optionally) by responsible CE	list Managed Instances

More Managed Instances can be found here

How we work

Cloud launch process

Issue tracking

The Cloud team GitHub Project is the single source of truth.

How we use GitHub Projects (Beta)

Grooming and Estimation process

On-call

We maintain an on-call rotation in Opsgenie. Responsibilities of the teammate who is on-call include:

Acknowledging incoming alerts
Initiating incident procedures
Publishing postmortems

Managed Instance technical documentation

Team slack channels

#cloud - external channel for the Cloud team where other Sourcegraphers can ask for help or leave questions for the team
#cloud-internal - internal channel for the Cloud team for all day to day communication within the team

FAQ

FAQ: Can customers disable the “Builtin username-password authentication”?

Yes, you may disable the builtin authentication provider and only allow creation of accounts from configured SSO providers.

However, in order to preserve site admin access for Sourcegraph operators, we need to add Sourcegraph’s internal Okta as an authentication provider. Please reach out to our team prior to disabling the builtin provider.

FAQ: How do I restart the frontend after changing the site-config?

Are you a member of our CE & CS teams?

Visit sourcegraph/deploy-sourcegraph-managed
Locate the slug of the customer instance from the list of folders
Visit https://github.com/sourcegraph/deploy-sourcegraph-managed/actions/workflows/reload_frontend.yml
Click Run workflow and input the slug of customer instance
Click the Run workflow green button
Done! It shouldn’t take more than 2 minutes

FAQ: What are Cloud plans for observability - can I see data from customer instances in Honeycomb / Grafana Cloud / X?

Cloud instances provisioned for customers provide the same monitoring data / tooling as all other Sourcegraph instances (Grafana/Prometheus for metrics, Jaeger for traces). GCP Logging is used to store / query logs written by Sourcegraph workloads, and GCP Monitoring is used for infrastructure-level metrics / uptime checks.

Access to data from Cloud instances is governed by the Cloud Access Control Policy.

Long-term, we will collaborate with the DevX team (as owners of Sourcegraph observability) to support monitoring / observability solutions that are qualified for use with customer data.

FAQ: What are Cloud plans for continuous deployment - how often do we deploy code to Cloud instances?

Cloud instances provisioned for customers run released Sourcegraph versions and are currently updated at least once a month (for minor releases), unless we need to deploy a patch release.

Sourcegraph-owned instances are continuously deployed (with versions that have not been officially released); the DevX team owns continuous deployment to those environments.

FAQ: What are Cloud plans for analytics - where can I see data from Cloud instances in Looker / Amplitude?

Cloud instances do not expose analytics data other than pings. Future work in this area is owned by the Analytics team and managed through the “Improve our data collection” cross-functional project.

FAQ: Does Cloud support data migrations?

Cloud instances are created without any customer data (repos / code-host connections / code / user accounts / code insights etc.), and the Cloud team does not support importing customer data from self-managed / jointly-managed / Cloud-managed Sourcegraph instances.

FAQ: How to use mg cli for Managed Instances operations?

git clone https://github.com/sourcegraph/deploy-sourcegraph-managed
cd deploy-sourcegraph-managed
echo "export MG_DEPLOY_SOURCEGRAPH_MANAGED_PATH=$(pwd)" >> ~/.bashrc
mkdir -p ~/.bin
export GOBIN=$HOME/.bin
echo "export PATH=\$HOME/.bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
make install
mg --help

FAQ: How do I generate a password reset link for customer admin?

For #cloud engineers, run mg reset-customer-password -email <> and it will generate a 1password share link for you.

The password reset link expires after 24h, so the CE commonly has to generate a new link during the initial hand-off process.

If access to the instance is restricted, either via VPN or CIDR whitelist, please reach out to #cloud for assistance.

If access is not restricted, the CE responsible for the customer is added as site-admin, so the CE can log in with the “Sourcegraph Employee” (Google Workspace) auth provider and reset the customer admin password. If that is not possible, please reach out to #cloud for assistance.

IMPORTANT: Please do not share the password reset URL directly with the customer admin over email or Slack. More context.

Open 1password, create a new Secure Note item, and paste the password reset URL into it. Then use the 1password share item feature to securely share the link with the customer admin. Make sure you configure the following options while sharing the item:

Link expires after: 1 day
Available to: <insert customer admin email>

This ensures only the customer admin is able to gain access to the password reset URL.

FAQ: I have a new feature I want to deploy to Cloud, how do I do that?

Read through our Cloud Cost Policy

FAQ: How to list trial, production or internal instances?

You can either use:

GitHub Action (the CE email parameter is optional).
mg cli via command:

mg info --ce <NAME>@sourcegraph.com --instance-type [trial|production|internal] (both parameters are optional)
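
Because both parameters are optional, it can help to see the combinations spelled out. The sketch below only prints the `mg info` command line for a given filter, without running it. The wrapper name `mg_info_cmd` and the example email address are hypothetical, and the only flags used are the two shown above.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper: prints an mg info command, does not execute it.
# Usage: mg_info_cmd [instance_type] [ce_email]
mg_info_cmd() {
  local instance_type="$1" ce="$2"
  local cmd="mg info"
  if [ -n "$ce" ]; then
    cmd="$cmd --ce $ce"
  fi
  case "$instance_type" in
    "") ;;  # no filter on instance type
    trial|production|internal) cmd="$cmd --instance-type $instance_type" ;;
    *) echo "unknown instance type: $instance_type" >&2; return 1 ;;
  esac
  echo "$cmd"
}

mg_info_cmd                                  # all instances
mg_info_cmd trial                            # trial instances only
mg_info_cmd production jane@sourcegraph.com  # production instances for one CE (example address)
```

Rejecting unknown instance types before calling the CLI is a design choice of this sketch, not documented behavior of mg.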