Resource Management#

At this point, we assume your cluster is running and there are a couple of projects started by users. However, some users ask you for more resources for their projects, or you want their projects to run on a dedicated subset of all nodes.

Licenses: define an upgrade schema that your users can apply to one or more of their projects.
Quick’n’dirty: directly upgrade a project, without creating a license.
Heterogeneous nodes: distribute projects in a heterogeneous cluster, where some nodes are dedicated to specific users or groups.
GPU nodes: tell certain projects to run on nodes with GPUs.

Licenses#

Creating “Licenses” is a way to define the resource request for a specific project. For that, please open your “Admin” panel and expand the “Licenses” section. Then, click “Create license” to see a form for configuring a new license. (If you already have licenses, you can search for them and modify their parameters.)

Title: give the license a name, which identifies it in the UI. Together with the description, it helps your users understand what this license does.
License manager: search for the accounts of one or more users who should see this license. They can then select it when assigning a license to their projects (otherwise, they’ll have to know the ID).
Run limit: how many actively running projects this license can upgrade at the same time. If this is a course, and the license is distributed via the course management configuration, that limit must be at least the number of students, because each student has their own project.
Activates/Expires: define start and end dates – if no end date is given, it’s a perpetual license.
Quota: here, you can set resource parameters for the project Pod and other details:
- Higher CPU and Memory limits. The associated resource requests are computed from the overcommit ratio specified in the global.settings.default_quotas settings parameter.
- Always Running: this is a neat feature for users, because it keeps some of their precious projects around. If you enable this, the project is restarted automatically if it is stopped by the user or was running on a node that has been decommissioned. This is useful for long-running calculations, or just to make the files immediately accessible without having to wait for the project to start up again. Also, the state of running sessions, e.g. in a Jupyter Notebook, is not deleted.
Additionally, there are three special quotas for on-prem setups:
- ext_rw: gives the project read/write access to the /ext mountpoint – see Projects Software.
- dedicated_vm: allow the project to run on separate tainted nodes – see Heterogeneous nodes.
- patch: see Patching Projects
Please do not use “Upgrades”. This is a legacy feature.

Note

If licenses are compatible, more than one can be active and the quotas add up. The overall limit is defined in global.settings.max_upgrades.
To keep things simple, advise your users to use only one license per project.
You can modify an existing license, which saves users from having to change the licenses applied to their projects.
To expire a license, change the expiration date to one in the past. This takes effect when the project is restarted.

Quick’n’dirty#

Besides the structured approach of creating and distributing Licenses, you can also jump in with your powers as an Admin and directly upgrade a project. For that, ask for the project’s UUID and then open https://<your-domain>/projects/<UUID>/settings in your browser (which opens that project’s settings). Then click the “Admin Quotas…” button in the “Project usage and quotas” section.

This reveals a panel where you can set base upgrades for the project. They are complementary to the upgrades given by a license, i.e. they do not add up. Any change requires a restart of the project.

Of particular interest is probably raising memory (“Shared RAM”), increasing the “Idle Timeout”, or even setting it to “Always Running”.

Heterogeneous nodes#

Note

This feature was added in version 2.11.0.

Imagine you have several workgroups that want to run their projects on their own dedicated set of nodes. Possible motivations are:

They want to have a certain amount of (possibly very large) resources available at all times,
They want to have a certain type of hardware, e.g. with GPUs,
They pay for the specific hardware and want to make sure that only their projects run on it.

CoCalc Cloud is a single system, but you can partition your cluster so that some machines are dedicated to specific users or groups.

The idea is to add taints and labels, with a specific “name”, to certain nodes in your cluster – as explained below. Then create a license, which encodes the name of these dedicated machines and resource quotas.

Note

Important: the taint and the label must have the same name, and that name must be compatible with the Kubernetes naming scheme. Also, do not confuse this name with the “node names”!

Decide on a name for your group of one or more Dedicated VM(s) – this is the common [taint-name].
Each node in your Kubernetes cluster that should be part of this group has its own distinct node name: [node-name].

With that, for each node the following must be set:

kubectl taint nodes [node-name] cocalc-dedicated_vm=[taint-name]:NoSchedule
kubectl label nodes [node-name] cocalc-dedicated_vm=[taint-name]

For example, if your nodes are vm001 and vm002 and you name that group of Dedicated VMs foo:

kubectl taint nodes vm001 cocalc-dedicated_vm=foo:NoSchedule
kubectl label nodes vm001 cocalc-dedicated_vm=foo
kubectl taint nodes vm002 cocalc-dedicated_vm=foo:NoSchedule
kubectl label nodes vm002 cocalc-dedicated_vm=foo
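
For larger groups, the per-node commands above can be wrapped in a small helper. The following is a minimal sketch and not part of CoCalc: the function names valid_group_name and dedicate_nodes are hypothetical, and the name check assumes the Kubernetes rules for label values (at most 63 characters, alphanumeric at both ends, with "-", "_" and "." allowed in between).

```shell
# Hypothetical helpers, not part of CoCalc.

# Succeeds if $1 is a valid, non-empty Kubernetes label value.
valid_group_name() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9]([A-Za-z0-9_.-]{0,61}[A-Za-z0-9])?$'
}

# Usage: dedicate_nodes GROUP NODE [NODE...]
# Taints and labels every given node with the same group name.
dedicate_nodes() {
  group="$1"
  shift
  if ! valid_group_name "$group"; then
    echo "invalid group name: $group" >&2
    return 1
  fi
  for node in "$@"; do
    kubectl taint nodes "$node" "cocalc-dedicated_vm=${group}:NoSchedule" || return 1
    kubectl label nodes "$node" "cocalc-dedicated_vm=${group}" || return 1
  done
}
```

With these definitions loaded, calling "dedicate_nodes foo vm001 vm002" issues the same four commands as in the example above. Checking the name once up front keeps the taint and the label from diverging due to a typo.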

Finally, to create the corresponding license:

Open Admin → Site Licenses….
Create a new license.
Enter the [taint-name] of the Dedicated VM (e.g. foo) in the text field for the Dedicated VM name.
Configure RAM and CPU quotas as well.
Save the license and double-check the configuration in the shown quota JSON object.
Send the license key to those users who should be allowed to run their projects on these machines.

Once they add that license key to their project, it will restart, and the management service will outfit that project pod with the corresponding taint toleration and enforce running that pod on a node with a matching label.
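
To check that nodes and projects ended up where intended, a few standard kubectl queries are usually enough. The sketch below wraps them in hypothetical helper functions (the names are not part of CoCalc). Because the namespace of the project pods depends on your installation, the pod query looks across all namespaces.

```shell
# Hypothetical helpers for inspecting a group of Dedicated VMs.

# List all nodes carrying the label of the given group.
dedicated_nodes() {
  kubectl get nodes -l "cocalc-dedicated_vm=$1" -o name
}

# Print the taints set on a node.
node_taints() {
  kubectl get node "$1" -o jsonpath='{.spec.taints}'
}

# List all pods scheduled on a node, across all namespaces.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
```

For the example above, "dedicated_nodes foo" should list vm001 and vm002, and "pods_on_node vm001" should show the project pods of those users who applied the license.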

GPU nodes#

Note

This section is work in progress and only describes the basic idea.

If you have nodes with one or more GPUs, you can create special licenses that request a GPU for each project the license is applied to.

How this is set up in detail is beyond the scope of this guide. It requires setting up GPU support on these nodes, changing the container runtime, and customizing the project’s software image.

What you need to know is how a project pod must be configured in order to request a GPU in your cluster. That change can be defined via a Patch.
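
Before writing such a patch, it can help to confirm that the cluster actually advertises GPUs as a schedulable resource. The following is only a sketch under one assumption: that the GPUs are exposed under the resource name nvidia.com/gpu, as is the case with the NVIDIA device plugin. Other vendors use different resource names, and the helper name gpu_nodes is hypothetical.

```shell
# Hypothetical helper: show how many GPUs each node reports as allocatable.
# Assumption: GPUs are exposed under the resource name "nvidia.com/gpu".
gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Nodes without a GPU show "<none>" in the GPU column. A pod typically requests a GPU through a limit on that same resource name in its container resources.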