Amazon AWS#

This guide helps you set up CoCalc Cloud on AWS, using AWS’s EKS Kubernetes service to run CoCalc Cloud.


As of 2022-07-03, there is no out-of-the-box support for EKS. The following notes are based on the experience of setting everything up, and they certainly assume you have experience with AWS and Kubernetes. Some details could be out of date, but the general idea should still be valid.

There is also a guide for setting up CoCalc Cloud on Google GCP.

This also assumes you have checked the general documentation for the CoCalc Cloud HELM deployment: e.g. that you have set up your own values.yaml file somewhere to override configuration values, that you know how to set up a secret storing the PostgreSQL database password, etc.

For more details look into Setup.

EKS configuration#

Set up your EKS cluster and make sure you can communicate with it via your local kubectl client, etc. E.g. run

aws eks --region [your region] update-kubeconfig --name [name of cluster]

to get started.
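If the kubeconfig update worked, a quick sanity check confirms that kubectl can reach the cluster (these commands require a working cluster, of course):

    # Verify kubectl talks to the new EKS cluster
    kubectl cluster-info          # prints the API server endpoint
    kubectl get nodes -o wide     # lists the worker nodes, once node groups exist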

Node Groups#

EKS should be configured to run two groups of nodes:

  • Services: the service nodes run hubs, manage, static, etc. To get started, two small nodes should be fine.

  • Projects: these nodes will host the projects. They should be configured with a certain taint and labels right when they’re created.

Here is a minimal example to get started:

  1. “service”: this was good enough for a minimal setup:

    • 2x t3.medium (or t3a.medium), spot price, and 64GB of root disk (the project image is large!)

    • NOTE: “t3” might be a bad choice, because there is a low limit on the number of IPs per node. Also, some features are not supported for t3 nodes (but they are not used at all right now; something to explore later on)

    • disk: 50GiB

    • set the Kubernetes label to cocalc-role=services (that’s key=value)

    • scaling: 2/2/2, such that you have two such nodes running.

  2. “project”: if you expand this to have separate nodes for the projects, create nodes with rather more RAM than CPU, because memory is not elastic, but CPU is. Usually, in interactive usage, a project spends most of its time waiting for user input.

    • machine: t3.medium, disk: 100GiB. (the project image is large, and we might have to store two or more at the same time!)

    • then, to make full use of the prepull service, activate it by setting it to “true” in your values.yaml configuration file, and keep the label/taint values at their defaults:

    • set the Kubernetes label to cocalc-role=projects (that’s key=value)

    • and the initial Kubernetes taints (format: key=value:effect) to:

      • cocalc-projects-init=false:NoExecute

      • cocalc-projects=init:NoSchedule

    • The taints above signal to the prepull service that the node has not yet been initialized (the daemon set will start pods on such nodes). Once the prepull pod is done, it changes the taints to allow regular projects to run on the node, and it also removes itself from that node. If you need to audit what prepull does (which might be wise, since it needs cluster-wide permissions to change node taints), please check the included script.

    • scaling: 1/2/1 or whatever you need
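The two node groups above can also be created declaratively with eksctl. Below is a minimal sketch; the cluster name, region, instance types, and capacities are assumptions to adapt to your setup:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: cocalc-cloud        # assumed cluster name
  region: us-east-1         # assumed region
managedNodeGroups:
  - name: services
    instanceType: t3.medium
    desiredCapacity: 2
    volumeSize: 64          # GiB, the project image is large
    spot: true
    labels:
      cocalc-role: services
  - name: projects
    instanceType: t3.medium
    desiredCapacity: 1
    volumeSize: 100
    labels:
      cocalc-role: projects
    taints:
      - key: cocalc-projects-init
        value: "false"
        effect: NoExecute
      - key: cocalc-projects
        value: "init"
        effect: NoSchedule
```

Apply it via eksctl create nodegroup --config-file=[file], or bake it into the initial eksctl create cluster call.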


The projects and some services need access to a storage volume that allows ReadWriteMany. Commonly, this could be done via an NFS server, but with AWS there is EFS – much better! To get EFS running in your EKS cluster, follow the instructions. In particular, I had to install eksctl, install an “OIDC” provider, then create a service account, etc.

The next step was to install the EFS CSI driver via HELM, actually create an EFS filesystem, give it access to all subnets (in my case there were 3), create a mount target, etc.
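In my case, the sequence looked roughly like this; cluster name, region, and IDs are placeholders, and the EFS CSI driver documentation remains the authoritative reference:

    # OIDC provider for the cluster (needed for the driver's service account)
    eksctl utils associate-iam-oidc-provider --cluster [name of cluster] --approve

    # install the EFS CSI driver via HELM
    helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
    helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
      --namespace kube-system

    # create the filesystem and one mount target per subnet
    aws efs create-file-system --region [your region] --tags Key=Name,Value=cocalc
    aws efs create-mount-target --file-system-id fs-[INSERT ID] \
      --subnet-id subnet-[INSERT ID] --security-groups sg-[INSERT ID]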

Now the important part: this EFS filesystem’s “access point” is only for root, by default. To make this work with CoCalc’s services, it must be for the user/group with ID 2001:2001. To accomplish this, create a new StorageClass (you can choose the basePath as you wish; it keeps this instance of CoCalc separate from other instances or other data you have on EFS):

  1. Create a file sc-2001.yaml with the following content:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: efs-2001
    provisioner: efs.csi.aws.com
    parameters:
      provisioningMode: efs-ap
      fileSystemId: fs-[INSERT ID]
      directoryPerms: "700"
      uid: "2001"
      gid: "2001"
      basePath: "/cocalc1"
  2. Apply: kubectl apply -f sc-2001.yaml.

  3. Check: kubectl get sc should list efs-2001.

  4. Edit your values.yaml file: in the section for storage, enter this to reference the new StorageClass:

  class: "efs-2001"
  software: 10Gi
  data: 10Gi

which in turn will create the required PersistentVolumes and Claims. The requested sizes don’t matter: EFS is elastic, effectively unlimited.
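To verify that dynamic provisioning through efs-2001 actually works, a throwaway PVC can help (the name efs-test is just for this check):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-test
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-2001
  resources:
    requests:
      storage: 1Gi
```

After kubectl apply -f, the claim should reach the Bound state within seconds; delete it again once the check has passed.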

Additional hints:

  1. You can change the Reclaim Policy to Retain, so that files aren’t accidentally deleted if these PVs are removed.

  2. Set up life-cycle management of EFS to move unused files to long-term (cheaper) storage and back when they’re accessed again, e.g.:

    • Transition into IA: 60 days since last access

    • Transition out of IA: On first access
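Both hints can be applied from the command line. A sketch, where the PV name and filesystem ID are placeholders:

    # 1. keep the underlying data when a PV object is deleted
    kubectl patch pv [pv-name] \
      -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

    # 2. EFS life-cycle: into IA after 60 days, back out on first access
    aws efs put-lifecycle-configuration --file-system-id fs-[INSERT ID] \
      --lifecycle-policies \
      '[{"TransitionToIA":"AFTER_60_DAYS"},{"TransitionToPrimaryStorageClass":"AFTER_1_ACCESS"}]'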

Database / RDS PostgreSQL#

You can either run your own PostgreSQL server, or use the managed one from AWS: RDS PostgreSQL. Version 13 should be fine; you can also go ahead and use version 14.

Basically, the EKS cluster must be able to access the database (networking setup, security groups), and the database password will be stored in a Kubernetes secret (see global.database.secretName in cocalc/values.yaml).

Refer to the general instructions for the database on how to do this; i.e. kubectl create secret generic postgresql-password --from-literal=postgresql-password=$PASSWORD should do the trick.


AWS Security Groups#

At this point, your service consists of a database, the EKS cluster (with its nodes and its own VPC network), and the EFS filesystem. However, by default AWS isolates everything from everything else. You have to make sure there is a suitable setup of Security Groups that allows the EKS nodes to access the database and the EFS filesystem. This guide doesn’t contain a full description of how to do this, since it certainly depends on your overall usage of AWS. The common symptom is that pods in EKS can’t access the database or the EFS filesystem, so you see timeout errors when trying to connect, etc. EFS problems manifest in pods not being able to initialize because they can’t attach their volumes, while database problems manifest in the logs of the “hub-websocket” pods (hub-websocket is responsible for setting up all tables/schemas in the database, hence this is the one to check first).
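When debugging such connectivity problems, these commands are a reasonable starting point (the deployment and pod names are assumptions based on a default installation):

    # database side: hub-websocket creates the schema, so its logs fail first
    kubectl logs deploy/hub-websocket | tail -n 50

    # EFS side: pods stuck in Init/ContainerCreating usually mean mount problems
    kubectl get pods
    kubectl describe pod [stuck pod]   # check the Events section for mount errors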



Ingress / Load Balancer#

In the CoCalc HELM deployment, there are two ingress.yaml configurations, which are designed for K8S’s nginx ingress controller. The directory ingress-nginx/ has more details.

But just deploying it is not enough: the nginx ingress controller needs to be able to install a LoadBalancer. That’s done via an AWS Load Balancer Controller.
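A sketch of the two installs via HELM; the chart repositories are the official ones, and the cluster name is a placeholder:

    # nginx ingress controller
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
      --namespace ingress-nginx --create-namespace

    # AWS Load Balancer Controller (provisions the actual load balancer)
    helm repo add eks https://aws.github.io/eks-charts
    helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
      --namespace kube-system --set clusterName=[name of cluster]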

Once everything is running, you can check up on the Load Balancer via the AWS console: EC2 (new experience) → Load Balancing → Load balancer.

There, in the Basic Configuration, you see the DNS name – that’s the same one you get via kubectl get -A svc.

Once you have that (lengthy) automatically generated DNS name, copy it and set up your own sub-domain at your DNS provider: basically, add a CNAME entry pointing to this DNS name.
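After the CNAME entry has propagated, you can confirm it resolves to the load balancer (the domain name here is a placeholder):

    dig +short cocalc.example.com CNAME
    # should print the lengthy *.elb.amazonaws.com name from the AWS console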

What’s unclear to me: this did create a “classic” (deprecated) load balancer. Why not a more modern L4 network load balancer? It must be caused by whatever the load balancer controller does by default.
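One thing worth trying here (untested in this setup): the AWS integration chooses the load balancer type based on an annotation on the controller’s Service, so requesting an NLB for the nginx controller might avoid the classic one:

    helm upgrade ingress-nginx ingress-nginx/ingress-nginx --namespace ingress-nginx \
      --set controller.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-type"=nlb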