Deployment

After you have completed the Setup, you can start configuring and deploying your CoCalc Cloud instance.

The deployment consists of a main chart in the /cocalc directory and several related sub-charts in /cocalc/charts/*.

To deploy your instance, you have to override some global or chart-specific values. For that, maintain your own values file – called my-values.yaml in the following. We recommend using Git to keep track of your changes.

Please start by studying these general instructions. On top of that, there are also more specific notes for various public cloud providers.

Your my-values.yaml

  • The /cocalc directory contains a central values.yaml file. It defines the configuration for the sub-charts and some global values. All parameters are explained in detail as comments inside that file.

  • Feel free to check out the sub-directories in ./charts in case you want to know all the details.

  • After familiarizing yourself with that, create your own my-values.yaml file. It will override the default values with the ones relevant for your setup via the -f my-values.yaml switch of helm (ref: Helm Install).

    • To overwrite values in sub-charts, write the values indented under their "sub-chart-name": section.

    • To define global values, list them in the global: section.

    For example, to configure the storage backend, the chart files are in /cocalc/charts/storage/, which means these settings come under storage:. There are also global storage settings used by other charts, which come under global.storage:.

    Learn more about Helm sub-charts and global values.

Note

Regarding YAML, global.storage in the text above means that these values go under the global: section of your my-values.yaml file. Inside that indented block there is a storage: section, indented one level further, and the actual values are defined inside that doubly indented block.
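
For illustration, a my-values.yaml overriding both the storage sub-chart and the global storage values is nested like this (the values shown are just examples):

storage:              # values for the "storage" sub-chart
  class: "nfs"

global:               # global values, visible to all charts
  storage:            # the doubly indented block mentioned above
    data:
      claimName: projects-data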

Note

Feel free to copy cocalc-eu.yaml and use it as a starting point.

Configuration

Basics

  • Set global.dns to your (sub)domain name.

  • You should start by going through the Site Settings in the global.settings section: give your site a nice name, etc.

  • Here, you’ll also have to tell the services how to connect to the database in global.database. If you need TLS, see Database TLS for more details.

  • Don’t forget to set global.imagePullSecrets if the secret is not regcred (see Docker registry).

  • Set global.setup_registration_token to restrict account creation – you probably want that.

  • Via global.setup_admin: {...} you define your initial admin account credentials.

  • Tweak the default resource requests and limits of projects by adjusting the global.settings.default_quotas parameters. See Architecture/Project for some context. Beyond that, Resource Management explains how to manage resource allocation for projects.

  • Peek into cocalc-eu.yaml to see how things are set up for that cluster.
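
Putting these basics together, the top of a my-values.yaml could look like this sketch (all values are placeholders):

global:
  dns: cocalc.example.com
  setup_registration_token: "ch4ng3-m3"
  setup_admin:
    email: your.admin@email.address
    password: R3pLaC3mE
    name: "Your Name"
  # the site name etc. go under global.settings, the database
  # connection parameters under global.database – both are
  # documented as comments in /cocalc/values.yaml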

Warning

Do not enable the SSH Gateway for the initial deployment. It won't work. Instead, make sure your cluster works well first, and then enable it in a subsequent update.

Storage

Earlier, we discussed the options for setting up storage. Now, let's see how to configure it!

By default, the Helm chart creates PersistentVolumeClaims for the data and software using the nfs StorageClass. You can configure the storage class and size of the PVCs in the values.yaml file. Look out for a section like this:

storage:
  class: "nfs"
  size:
    software: 10Gi
    data: 10Gi

Alternatively, you can create the PVCs yourself and use them. For that, configure these two aspects:

  1. Don’t create them via the Helm charts, i.e. set storage.create: false.

  2. Let CoCalc know about their names. E.g. if they’re called pvc-data and pvc-software, the relevant part in the config file would look like this:

storage:
  create: false

global:
  [...]
  storage:
    data:
      claimName: pvc-data
    software:
      claimName: pvc-software

The default names are projects-data and projects-software.

Example 1

If you bring your own NFS server and want to set up everything manually, you can follow the notes on setting up a PV/PVC on Azure. They walk you through creating a PV and PVC for software and project data in two subdirectories, then deploy a small setup job that creates these subdirectories and sets proper ownership and permissions.

Example 2

As a reference, the following PV, PVCs and StorageClasses exist on a cluster on Google’s GKE:

$ kubectl get -o wide pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                      STORAGECLASS   REASON   AGE    VOLUMEMODE
pvc-58ada9eb-220d-46d9-ba5d-31ceb0c0fc45   10Gi       ROX            Retain           Bound    cocalc/projects-software                   nfs                     111d   Filesystem
pvc-ace52cd2-fb85-4fb9-96f7-19cd9575f5c2   20Gi       RWO            Retain           Bound    cocalc/data-nfs-nfs-server-provisioner-0   pd-standard             112d   Filesystem
pvc-ceae334d-8f0a-447b-b5c6-fcae6843a498   10Gi       RWX            Retain           Bound    cocalc/projects-data                       nfs                     111d   Filesystem

$ kubectl get -o wide pvc
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE    VOLUMEMODE
data-nfs-nfs-server-provisioner-0   Bound    pvc-ace52cd2-fb85-4fb9-96f7-19cd9575f5c2   20Gi       RWO            pd-standard    112d   Filesystem
projects-data                       Bound    pvc-ceae334d-8f0a-447b-b5c6-fcae6843a498   10Gi       RWX            nfs            111d   Filesystem
projects-software                   Bound    pvc-58ada9eb-220d-46d9-ba5d-31ceb0c0fc45   10Gi       ROX            nfs            111d   Filesystem


$ kubectl get -o wide storageclass nfs
NAME   PROVISIONER                                RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs    cluster.local/nfs-nfs-server-provisioner   Retain          Immediate           true                   111d

$ kubectl get -o wide storageclass pd-standard
NAME          PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
pd-standard   pd.csi.storage.gke.io   Retain          Immediate           true                   111d

Note

  • The nfs storage class is created by the nfs-ganesha-server-and-external-provisioner Helm chart. It uses a pd-standard disk to store the data.

  • projects-data and projects-software are provided by that NFS service.

  • There is no PV/PVC for a database, because this cluster uses GCP’s managed PostgreSQL service.

Advanced

Here is some context for more advanced configuration options.

General

  • global.registry is the upstream Docker registry used for pulling images. You can change it to your own registry if you mirror all images. Note: if you just customize the project’s software environment, you have to change the manage.project.registry setting instead (see the sketch at the end of this list)!

  • global.imagePullSecrets: see Docker registry.

  • global.setup_admin: see Admin Setup, e.g.:

    global:
      setup_admin:
        email: your.admin@email.address
        password: R3pLaC3mE
        name: "Your Name"
    

    You can also leave out the password and set Helm chart parameters on the command line via helm [...] --set global.setup_admin.password=[password].

  • global.setup_registration_token: if your server is publicly available, you probably don’t want anyone to be able to create an account. This sets an initial token that a user must know in order to sign up. It does not affect SSO logins, because with those you are already in control of who is allowed to get access. See Admin for more.

  • global.kubectl is the version tag string of an image used for running Jobs that need kubectl commands. The version should roughly match the API server version of the Kubernetes cluster you’re running.

  • global.ingress: this is used to populate the Ingress rules. Look at the letsencrypt/README.md file for more details. Obviously, this has to match whatever you have set up in Networking earlier.

  • global.networkingConfiguration allows you to disable all Ingress or NetworkPolicy rules. This is useful if you have a cluster with a different networking setup.

  • global.datastore: see Datastore.

  • global.priorityClasses and global.podDisruptionBudgets: if enabled, these define PriorityClasses to make some Pods more important than others, while the PodDisruptionBudgets define how many pods of a replicated service can be interrupted during maintenance, cluster changes, etc.
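
To illustrate the distinction between the two registry settings above, here is a sketch (the registry hosts are placeholders):

global:
  registry: registry.example.com/mirror         # upstream images are pulled from here

manage:
  project:
    registry: registry.example.com/my-images    # used for a customized project software environment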

Site Settings

Settings in global.settings can either be configured as an Admin in Admin → Site Settings or via environment variables, which are picked up by the hub-websocket services. Setting them in your values file makes them read-only in the interface, so you can track all changes explicitly in your Git repository.

In the future, there will be a more general Admin Guide. Settings are explained in the values.yaml file, though.

Storage

The storage: section is explained in Storage.

Project Pods

  • manage.timeout_pending_projects_min: the Manage services are responsible for starting and stopping project pods. If such a project pod stays in the Pending state for too long, it is killed. This is that timeout, in minutes.

  • manage.project.tag is the default image tag on which a project will run. It’s possible to customize the software environment in various ways, including changing the default project image; this is where the upstream project image tag is set.

  • The Prepull service can be enabled or disabled via manage.prepull.enabled. Associated with it are the taints and labels for project nodes, which are defined in manage.project.dedicatedProjectNodesTaint and *.dedicatedProjectNodesLabel.

  • manage.mem_request_adjustment: a safeguard against overly large memory requests by project pods, applied to the memory request computed from a project’s pod quotas – see Scaling/Projects for more context. lim is a hard upper bound in MiB; pct is dynamic, in percent of the memory limit.

  • manage.watchdog_stream_s: “manage” listens to a stream of changes from your Kubernetes API. If the API server is behind a LoadBalancer, this stream might be interrupted without “manage” noticing. So, if no new data comes in for this amount of time, the stream is restarted.

  • manage.project.serviceAccountName customizes the ServiceAccount used by project pods (if this isn’t set, the template omits it and the default ServiceAccount is used).

  • Settings related to hub’s resources and multiple_replicas are explained in scaling frontend service.

  • Projects are prohibited from accessing internal services. That’s accomplished using a set of NetworkPolicy rules. If you want to allow projects to access internal services, you can set networkPolicy.allowProjectEgress to a list of rules (see the sketch after this list). Check out the template file cocalc/templates/network-policies.yaml for more details.

  • manage.project.init: this is a setup script, which is sourced right before the local server of a project is started. It allows you to tune parameters, change environment variables, or call scripts from your compute environment. Use your powers wisely! The /cocalc/values.yaml file shows two configurable aspects; tune them to your needs based on your experience (a combined sketch follows after this list):

    • Blobstore: this is a local store for data generated in Jupyter Notebooks – in particular, images. The files are served from a small web server for efficiency, and e.g. TimeTravel uses them to show old plots that are no longer part of the notebook. There are two implementations: an SQLite3-based one (old; it used to be the only implementation) and a file-based store (new; now the default, made especially for NFS-backed file systems, which are known to cause problems with SQLite databases). You select the implementation by setting COCALC_JUPYTER_BLOBSTORE_IMPL=sqlite or disk. Disk is recommended and the default. The other tuning parameters for the disk-based store are a size limit and a maximum number of files:

      • JUPYTER_BLOBSTORE_DISK_PRUNE_SIZE_MB=100: prune the Jupyter blobstore when its disk usage exceeds this size

      • JUPYTER_BLOBSTORE_DISK_PRUNE_ENTRIES=1000: maximum number of files in that blobstore cache

    • Jupyter Kernel Pool: a mechanism to spin up one or more kernels without a notebook, so that they are ready for use. E.g. when you restart a kernel, it picks one from the pool and the notebook is running without a delay. The tradeoff is memory usage. By default, one kernel is kept in the pool. You can tune this as well:

      • COCALC_JUPYTER_POOL_SIZE=1: size of pool, set to 0 to disable it

      • COCALC_JUPYTER_POOL_TIMEOUT_S=900: after that time, clean up old kernels in the pool

      • COCALC_JUPYTER_POOL_LAUNCH_DELAY_MS=5000: delay before spawning an additional kernel
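
Putting the two aspects together, a manage.project.init snippet in your my-values.yaml could look like this sketch (the values are examples; we assume the script is sourced by a POSIX shell, so exported variables reach the project’s processes):

manage:
  project:
    init: |
      # sourced right before a project's local server starts
      export COCALC_JUPYTER_BLOBSTORE_IMPL=disk
      export JUPYTER_BLOBSTORE_DISK_PRUNE_SIZE_MB=200
      export COCALC_JUPYTER_POOL_SIZE=2
      export COCALC_JUPYTER_POOL_TIMEOUT_S=900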
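
And for the networkPolicy.allowProjectEgress setting mentioned above, a rule list could look like the following sketch (the CIDR and port are placeholders – check cocalc/templates/network-policies.yaml for the exact schema that is expected):

networkPolicy:
  allowProjectEgress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/24
      ports:
        - protocol: TCP
          port: 443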

Software Environments

In a nutshell, the global.software dict defines default settings and a list of environments, where each of them consists of a title, description, Docker image tag and registry. Users can then choose from these environments when creating a new project or when changing the software environment of an existing project.

See Software Environment for more context about how to install additional software.
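
As a purely illustrative sketch (the field names are guesses – consult /cocalc/values.yaml for the actual schema), such an environment entry could look like this:

global:
  software:
    environments:
      my-python:                        # hypothetical environment name
        title: "Python Data Science"
        descr: "Python 3 with extra data-science packages"
        tag: "2024-01-01"               # Docker image tag
        registry: registry.example.com/my-images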

Single Sign On

Besides signing up via email/password credentials (optionally restricted by the setup_registration_token mentioned above), users can also sign in via SSO. This is configured in global.sso. Each entry in that dict is the name of an SSO provider. See values.yaml for a detailed breakdown. In a nutshell, you set the type and other general parameters in the conf section, while user-facing parameters like a name and icon go in the info section.
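
The overall shape is sketched below (the provider name and the type value are placeholders; the exact keys inside conf and info are documented in values.yaml):

global:
  sso:
    my-university:      # name of the SSO provider
      conf:
        type: oauth2    # the type and other general parameters
        [...]
      info:
        [...]           # user-facing parameters like name/icon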

Miscellaneous

  • hub.deleteProjectsIntervalH: there is a dedicated service that “unlinks” projects which have been marked as deleted. The data itself is retained, though: you have to periodically check in the database which projects are deleted and remove the associated files yourself.

  • hub.debug, manage.debug, manage.project.debug, …: tweak the $DEBUG variable used with Debug JS, in order to control how much information is logged by the Pods.

  • ssh-gateway.recent_projects: to make a project start automatically when connecting via the SSH Gateway, there must be a prepared entry for the project – a limitation of how this is implemented. This setting controls how many projects are considered “recent” and hence prepared for that. Increase this to e.g. 1 year if you do not have many projects – i.e. fewer than about 1000 per year. Otherwise, you have to start the project first, wait up to loop_delay_secs, and then try to connect via SSH.

Installation

After setting up your config file my-values.yaml, run

helm install cocalc ~/path/to/cocalc-cloud-helm/cocalc --timeout 20m -f my-values.yaml

in your own directory. This is a standard Helm installation: the Helm chart and sub-charts in the /cocalc directory are rendered and populated with all configuration parameters – your my-values.yaml parameters are merged into the defaults. The resulting Kubernetes YAML files are installed under the Helm deployment name cocalc in the current namespace.

Since pulling the images and running some setup jobs could take a while, the timeout is increased.

If you want to check the status on Helm’s side, you can run:

helm status cocalc

Note

If something goes horribly wrong, you can always uninstall the deployment via helm uninstall cocalc. The only caveat: if PV/PVCs have been created, they might not be deleted automatically.

Admin Setup

  • After the installation is complete, you should be able to access the CoCalc frontend via the DNS you specified in your my-values.yaml.

  • Use the credentials of the initial admin user you configured in your my-values.yaml to sign in. Change your admin password in the account settings; it won’t be overwritten.

  • We also recommend removing your credentials by changing that field back to global.setup_admin: {}. Alternatively, you could use it to create a second admin account during your next update.

  • If you specified a global.setup_registration_token (which is highly recommended!), it is set up initially. Open “Admin” → “Registration Token” to set up your own, disable this one, etc.

More details about Admins, and how to create additional ones, are described in Admin.

Testing

Of course, first just check whether the Kubernetes Pods show up and are running. See Troubleshooting for some ideas on how to investigate problems.

Once the services are running, there are very basic tests that check whether they return some information:

helm test cocalc

(where cocalc is the name of your CoCalc Cloud deployment). This starts a few Kubernetes Jobs and checks if they succeed.

Beyond that, a good end-to-end test is to

  • Open the website, sign in as admin, and go to your projects.

  • Open a project of yours, or create one.

  • Open or create a terminal.term file and run basic Linux commands like uptime or htop.

  • In that terminal, check whether the CoCalc-related environment variables make sense: $ env | grep -i COCALC.

  • Open or create a Jupyter Notebook, select a popular kernel like “Python 3” and evaluate a cell with code like:

    import sys
    sys.version
    

    or:

    import pandas as pd
    print(pd.__version__)
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    df.describe()
    
    • Do the same for R, Octave and Sage (if installed).

    • For Sage, make sure evaluating code works. If it doesn’t, try running sage in a terminal; if you get an “ILLEGAL INSTRUCTION” error, your hardware is too old for the Sage binary. Contact us.

  • Create a latex.tex document and check if it compiles.

  • Once some files are open in your project, hit the browser’s refresh button. After reloading, the files should still be there, ready to be edited.

Updating

Make sure to update regularly!
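
An update generally follows the same pattern as the installation – for example, assuming the same chart path and values file:

helm upgrade cocalc ~/path/to/cocalc-cloud-helm/cocalc --timeout 20m -f my-values.yaml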