Google GCP#

Prerequisites#

  • You should have a basic understanding of cloud computing, in particular the Google Cloud Platform (GCP) and Kubernetes (K8S). But don’t worry, you do not need to be an expert to set up CoCalc!

  • This guide is specific to GKE Clusters. If you want to use another cloud provider for your Kubernetes cluster, you have to adapt the instructions. For example, there are notes for AWS/EKS.

  • 2022-05: Specific to GKE on GCP, and starting with K8S 1.22 you need this plugin for authentication: google-cloud-sdk-gke-gcloud-auth-plugin

    • more info.

    • also, export USE_GKE_GCLOUD_AUTH_PLUGIN=True in your ~/.bashrc (see the command sketch after this list).

  • If you’re not an owner or admin of the GCP project, you need a couple of “Admin” roles for your user. Exactly which ones is hard to tell and probably changes over time. On top of a Basic “Editor” role you certainly need: “Compute Admin”, “Compute Network Admin”, and “Compute Storage Admin”. Please ask the owner of the GCP project to assign those roles to your “Editor” user.
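
A minimal sketch of the plugin setup mentioned above, assuming gcloud itself is already installed (cocalc-1 and europe-west3 are the example cluster name and region used later in this guide):

# install the auth plugin (if gcloud came from a package manager, install the
# google-cloud-sdk-gke-gcloud-auth-plugin package with that package manager instead)
gcloud components install gke-gcloud-auth-plugin
echo 'export USE_GKE_GCLOUD_AUTH_PLUGIN=True' >> ~/.bashrc
source ~/.bashrc

# once the cluster exists (see below), fetch the kubeconfig credentials
gcloud container clusters get-credentials cocalc-1 --region europe-west3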

Setup#

All the settings are recommendations – feel free to look into the settings in more detail. If something is actually required, it is mentioned. Often, you can change the settings later on as well.

The specific parameters are meant for a small cluster to get started. You can scale up later on by changing the node types and CoCalc’s configuration parameters for the HELM charts.

Basics#

Let’s start. In the GCP Console under Kubernetes, you can create a new cluster (a gcloud sketch of the same settings follows the list below):

  • Name: e.g. cocalc-1 (you can come up with whatever name you want)

  • Release channel: regular version track (not static)

  • As of writing this (2023-02-26), the version is 1.24.9-gke.3200

  • Location: e.g. region europe-west3 and specifying europe-west3-b as node location. With that, all nodes will be in the same place.

  • Automation: maintenance window on Saturday + Sunday, starting at 00:00 for 6 hours.
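
If you prefer the CLI, a rough gcloud equivalent of the basics above (a sketch only – the maintenance window and other console defaults are omitted):

gcloud container clusters create cocalc-1 \
  --region europe-west3 \
  --node-locations europe-west3-b \
  --release-channel regular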

Node Pools#

Now we add two node pools. That’s where CoCalc will actually run. Two node pools are not strictly necessary, but they make it easier to scale up and down. In particular, projects will run in one pool only, while all services are in the other pool.

The pool for services is of fixed size (e.g. 2 nodes), while the other pool of variable size is for the projects. Please read Architecture for more details about this. There is also room to deviate from exactly these settings – they are listed to give you an idea of what is necessary.

  • Service Pool:

    • name: “services-1” (if you later need to change parameters that cannot be edited in place, create a new pool and increment that number)

    • size: 2 (1 is not enough, unless you allow some services to run on project nodes as well)

    • surge update: max=1

    • image type: container optimized

    • Type: at least e2-standard-2.

    • 50gb standard disk

    • Security: secure boot (leave the others as they are)

    • Metadata: Label: cocalc-role=services (key=value)

  • Projects Pool:

    • name: “projects-1”

    • size: 1

    • Spot VM: if you understand and can tolerate that spot VMs get rebooted at random, and hence interrupt running projects, enable this – it saves you a lot of money! In that case, set the size to 2 as well.

    • Surge update: max=2 (temporarily more nodes)

    • Image type: container optimized

    • Machine type: e2-highmem-4 – this of course depends on what you really want to do. A “standard” project uses maybe around 0.5 gb ram and only a little bit of CPU (1/10 of a core on average). Hence, you usually need more memory than CPU.

    • 100gb balanced disk – the project images are huge, and having a faster disk speeds up downloading the image on a new node, and running programs in general. There is an optional “prepull” service, which loads the latest project image first, before the node is set to be available for projects.

    • Security: secure boot

    • Metadata:

      • Label: cocalc-role=projects

      • Taint: to make the prepull service work, the initial taints must be set to these two (the format is key=value:effect; see the gcloud sketch after this list):

        • cocalc-projects-init=false:NoExecute

        • cocalc-projects=init:NoSchedule

  • Networking

    • Default, public cluster. Of course, if you know what you’re doing, you can also set up a private cluster and use a VPN or something like that. This is beyond the scope of this guide.

    • HTTP Load Balancing: enabled

    • Dataplane V2: enabled (this enables network policy enforcement, so the network configuration files used here take effect)

    • DNS: the default kube-dns is fine – unless you want to access internal services, then maybe you want to run Cloud DNS.

  • Security

    • Shielded GKE nodes: yes

  • Features

    • Enable Compute Engine persistent disk CSI Driver

    • Disable image streaming: I tried running with it enabled, but – maybe because the images are so large, or for other reasons – it didn’t really work. Instead, make sure to configure the “prepull” service. Also, the image streaming service occupies some amount of memory, which is better spent on projects and disk caching.

    • Logging: yes, but only “System” – in particular, projects and hubs generate a lot of log lines, which can end up becoming expensive.

    • Cloud monitoring: yes, but only “System” (both cost money, so, being conservative here)
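
For reference, a rough gcloud equivalent of the two node pools described above (a sketch under the same assumptions; add --spot to the projects pool if you opted for Spot VMs):

gcloud container node-pools create services-1 \
  --cluster cocalc-1 --region europe-west3 \
  --num-nodes 2 --machine-type e2-standard-2 \
  --disk-type pd-standard --disk-size 50 \
  --node-labels cocalc-role=services

gcloud container node-pools create projects-1 \
  --cluster cocalc-1 --region europe-west3 \
  --num-nodes 1 --machine-type e2-highmem-4 \
  --disk-type pd-balanced --disk-size 100 \
  --node-labels cocalc-role=projects \
  --node-taints cocalc-projects-init=false:NoExecute,cocalc-projects=init:NoSchedule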

Database: Cloud SQL#

CoCalc requires a PostgreSQL database. We use a Cloud SQL instance for that. If you know what you’re doing, you can also run the DB in the cluster yourself – there is nothing special about using Cloud SQL. (A gcloud sketch of these settings follows below.)

  • name: cocalc-db (choose whatever you want)

  • Postgres 14

  • Same region as the cluster

    • Maybe opt in to run with high availability. You can change this later.

  • Machine

    • It’s fine to start small: shared core, 1 vCPU, ~0.6 gb ram (or ~1.5gb).

    • Of course, check monitoring and adjust as needed!

  • Storage

    • SSD, 10gb, automatic storage increases

    • Keep in mind that the database stores all changes to documents. Therefore, its size increases with user activity. That said, you probably won’t see the database grow beyond a GB anytime soon.

  • Network:

    • Enabled: “private IP”

      • I had to enable the Service Networking API (which requires the “Network Admin” role)

      • I selected automatic allocation of an IP range

    • Disabled “public IP” (it costs money, we don’t need it, and I assume this is way more secure anyway)

    • Hint: to access the DB, run the ../database/db-shell.sh script. Its first argument must be the private IP address, followed by the database name and username. This script starts a small pod in the cluster and connects to the instance from there. Hence, this script is also a test to check whether you can connect to the DB from the cluster.

  • Backup: opt-in if you like, start small

    • nightly, at 4 am

    • region: same as the database

    • 7 days of backup

    • Point in time recovery: 1 day

  • Maintenance window:

    • It should be fine to pick something at night during the weekend, e.g. Sunday, 4-5 am. YMMV.

  • Flags: max_connections: 100. With the default (which is low, presumably due to the small amount of memory) there weren’t enough slots; the errors were: “remaining connection slots are reserved for non-replication superuser connections”

I guess you can change almost all of the above later on as well.
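
If you prefer the CLI, a rough gcloud sketch of the instance settings above (db-g1-small is an assumption for a small shared-core tier, and the default VPC network is assumed for the private IP):

gcloud sql instances create cocalc-db \
  --database-version POSTGRES_14 \
  --region europe-west3 \
  --tier db-g1-small \
  --storage-type SSD --storage-size 10GB --storage-auto-increase \
  --network default --no-assign-ip \
  --backup-start-time 04:00 \
  --database-flags max_connections=100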

Post setup:

  • Create user “cocalc” (or whatever you want) with a password. Save the password somewhere; we’ll later add it as a secret to the kubernetes cluster.

  • Create database “cocalc” (or whatever you want)
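
A minimal sketch of this post setup via gcloud (assuming the instance is called cocalc-db as above; pick your own password):

gcloud sql users create cocalc --instance cocalc-db --password 'CHOOSE_A_PASSWORD'
gcloud sql databases create cocalc --instance cocalc-db

# check connectivity from within the cluster: private IP, database name, username
../database/db-shell.sh <PRIVATE_IP> cocalc cocalc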

Cost Control#

The above cluster + associated services and resources incur costs. You can check up on that by going to “Billing” (the billing account of your project) → Cost Management: “Reports”

  • You can see a daily graph of your usage; use the dropdown at the top right above the chart to switch to “daily cumulative” to see a trend for the current billing period (for me, it’s a month).

  • On the right-hand side, you can get more details by selecting “SKU” in the “Group by” selector. (A “stock keeping unit” is the smallest item GCP sells to you.)

  • In the table below the chart, click on “Cost ↓” to sort in decreasing order, or on “Subtotal ↓” to sort by the cost after discounts & co. are applied.

  • What you should see is that the cluster itself costs something, but you get a credit for one cluster. From the GKE pricing notes: “The GKE free tier provides $74.40 in monthly credits per billing account that are applied to zonal and Autopilot clusters. If you only use a single Zonal or Autopilot cluster, this credit will at least cover the complete cost of that cluster each month.”

  • The LoadBalancer + external IP address also costs a rather fixed amount per month.

  • Logging costs are proportional to the amount of data, hence we disabled everything except “System”.

  • If you use GCP’s “SQL” for running the PostgreSQL database, don’t use an external IP, since you would also be charged a fee for renting it.

  • The bulk of your costs is CPU + memory, though. See the notes about “Spot VM” above for running the CoCalc projects on those.

  • Disk storage is rather cheap.

  • Egress network traffic is the last item to think about: e.g. if your users watch a lot of videos by streaming them from the platform, you might end up being charged significantly.

Storage#

We continue setting up the cluster. So far, we have the “control plane” in GKE and some nodes. Now, we need to set up the storage.

  • Above under Features, we enabled the “Compute Engine Persistent Disk CSI Driver” (more info).

  • The config files here use this driver to set up suitable PVCs and storage classes.

  • The names of these PVCs must match the references in the CoCalc deployment.

Run the following command to set up the storage classes:

kubectl apply -f pd-classes.yaml

NFS Server#

The goal is to set up an NFS storage provisioner, which uses the PVC “nfs-data” to store the data of projects, shared files, and global data/software.

helm repo add nfs https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/
helm repo update
helm search repo nfs

You should see nfs/nfs-server-provisioner in the output.

Now, look at what nfs.yaml specifies: it will create a disk storing the data of all projects and shared files. Tune the config file to your needs!

helm upgrade --install nfs nfs/nfs-server-provisioner -f gke/nfs.yaml

NOTE: as of writing this, there was a problem with publishing newer docker images. Hence, according to this ticket, I had to add --version=1.5.0 to install an older version of that chart.
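
In that case, the command above becomes:

helm upgrade --install nfs nfs/nfs-server-provisioner --version=1.5.0 -f gke/nfs.yaml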

Now:

kubectl get storageclasses

should list nfs.

Note: completely independent of the above, you can use other storage solutions as well. For that, you have to create PVCs yourself, which expose a ReadWriteMany filesystem. In the CoCalc deployment, you then have to configure the names of these PVCs under global: {storage: {...}} and disable creating them automatically via storage: {create: false}. See ../cocalc/values.yaml for more information; a rough sketch follows below.
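
As a sketch of what that could look like in your values file (the exact keys and PVC names are defined in ../cocalc/values.yaml – treat everything here as a placeholder):

global:
  storage:
    # names of your pre-created ReadWriteMany PVCs go here
    # (see ../cocalc/values.yaml for the exact keys)
    create: false   # do not create the PVCs automatically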

Disk Backup#

Once you have deployed the NFS server, you’ll notice a new disk, listed under “Compute Engine” → “Disks”.

The simplest way to get some backup is to set up a “Snapshot schedule”. With that, GCP makes consistent snapshots of the disk, which you can restore from – or you can create a new disk from an older snapshot.

For that, go to “Compute Engine” → “Disks” → “pvc-(the uuid you see in kubectl get pv)” → “Edit” → “Create snapshot schedule”. Daily for two weeks sounds good.

BTW, that’s also the place where you can increase the disk size.
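
If you prefer the CLI, a sketch of the same (the schedule name is made up, and the zone must match the zone of your disk):

gcloud compute resource-policies create snapshot-schedule daily-backup \
  --region europe-west3 \
  --daily-schedule --start-time 04:00 \
  --max-retention-days 14

# attach the schedule to the disk found via kubectl get pv
gcloud compute disks add-resource-policies pvc-<uuid> \
  --resource-policies daily-backup \
  --zone europe-west3-b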

Next steps#

The next steps are to set up the NGINX ingress + load balancer. So, continue in the /ingress-nginx and /letsencrypt subdirectories.

You also have to setup the credentials for pulling from the private docker registry.

Once all this is done, you can configure and deploy the HELM Chart for CoCalc.

Testing#

  • First steps:

    • After the initial deployment, set the IP you see in the LoadBalancer (kubectl get svc → look for a LoadBalancer with an external IP; see the commands after this list) at your DNS provider.

    • Then try to open https://[cocalc-your-domain.tld]/ in your browser.

    • You should be able to sign in directly as Admin, with the credentials set in your my-values.yaml config file. Of course, you should change your password.

  • Functionality:

    • A good test is to create a new project, and then open a terminal and run htop. You should see a script starting the project hub, a little bit of CPU activity, and not much more – maybe the sshd server for connecting via the SSH gateway.

    • Next, create some Jupyter Notebooks (Python 3, R, …), a LaTeX file (e.g. latex.tex), and maybe some other files. Each one of these should work as expected.

    • Finally, explore your “Admin” panel, and see if the “Server Settings” are as expected. At the bottom you can test the email setup, by sending a password reset email.

    • As Admin, you can also create a file like data.cocalc-crm, which will allow you to look at various database tables, tie user activity to projects, etc.
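
For the DNS step above, a quick way to find the external IP and to check that everything is running (namespaces depend on your deployment):

kubectl get svc --all-namespaces | grep LoadBalancer   # external IP for the DNS record
kubectl get pods --all-namespaces                      # pods should be Running or Completed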

Monitoring / Uptime Check#

The “uptime check” in GCP periodically pings your page.

Price: it has a free quota, hence we dial the check down a bit to stay below it. Make sure to read about its pricing. E.g. 31 days, 3 ping locations, every 5 minutes: 31 * 24 * (60 / 5) * 3 = 26784 checks. Well below the 1M free quota, as of writing this.

To get started very simply, you can set up something like this:

  1. Open /monitoring/uptime/create in the GCP console to create a new uptime check

  2. Target:

    • HTTPS

    • URL (to check from the “outside” if everything is ok)

    • Hostname: the “DNS” entry

    • Path: keep it blank for “/” (i.e. “hub-next”). Other interesting targets are /stats (hub-websocket) or /static/app.html (static).

    • Check frequency: 5 minutes (that’s the 60/5 in the calculation above)

    • Expand the target options:

    • Regions: Just pick 3, not all of them.

    • GET Method on Port 443 & Validate SSL certificate!

  3. Validation:

    • Timeout 10s (or maybe better 30s, i.e. something is going on, but not a real issue yet?)

    • Content matching: here you need to get creative. Maybe check for a small string in the content, e.g. the custom name of your instance, or <html> for static.

    • No logging (it just adds to the logging quota, I guess)

    • Response code 2xx

  4. Alerts:

    • Name: “[your instance name] is down”

    • Duration: 5 Minutes (?)

    • Notification: here, you have to select how to get notified; there is a whole setup behind this. At minimum, it should send you an email.

At the very end there is a “Test” button. Check that it actually says that the page is up before arming it :-) The first time around it might take a bit longer to respond; subsequent tests should be quicker, once Next.js has warmed up.

Then click “create”, of course. All of the above can be changed later as well…

Note

Since the above just checks paths at certain domains, you can set up the same at another service as well.