Troubleshooting#
Cannot “Connect”#
When working with CoCalc’s Single Page Application, there is a “Connectivity indicator” at the top right, looking like a WiFi symbol. It surfaces the status of an underlying WebSocket connection. Feel free to click on it to open a dialog with more details. If there are problems, this indicator will turn red or even say “disconnected”.
There are various reasons why someone might not be able to connect. The most common culprits are a firewall blocking the connection or a misconfigured proxy or VPN.
The range of possible connectivity issues is broad: a flaky local router, a browser (or even computer) that needs to be restarted, or plugins/extensions interfering with the website itself. In particular, it’s important to figure out whether WebSocket connections work at all.
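One way to check the WebSocket part specifically is a manual upgrade handshake with curl. This is only a sketch: the URL and path below are placeholders, so substitute your CoCalc server’s address and its actual WebSocket endpoint. A working setup answers with “HTTP/1.1 101 Switching Protocols”.
# URL and path are placeholders; use your server and its WebSocket endpoint
curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  https://cocalc.example.com/<websocket-path>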
Besides that, it might be a good idea to just refresh the page. The UI should open up again, and all opened projects and files should reappear.
You can also open the browser’s developer console (F12) to see if there are any errors in the JavaScript console.
On the server side, the logs of hub-websocket (see Hubs) could reveal issues as well.
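For example, list the hub pods and then inspect the websocket hub’s log (the ... stands for the actual pod name suffix, as shown by kubectl get pods):
kubectl get pods -l group=hub
kubectl logs hub-websocket-...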
Project does not “Start”#
There are various reasons why a project fails to start.
Note: [UUID] refers to the UUID identifying each project.
Check if there exists a pod named project-[UUID] in your namespace. If not, the manage-action service (see Manage) did not create it. What does its log say?
kubectl logs manage-action-...
Filter by the UUID to see log lines related to the project.
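For example (substitute the actual manage-action pod name and the project’s UUID):
kubectl get pod project-[UUID]
kubectl logs manage-action-... | grep [UUID]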
Maybe it’s a good idea to just kill all manage services:
kubectl delete pod --wait=false -l group=manage
They will restart automatically; then check the logs again. The “start” request for the given project will probably have timed out, so you have to use the UI to start the project again.
If the project’s pod exists and is not in state “Running”, check what
kubectl describe pod project-[UUID]
has to say. At the very bottom is a list of actions/events taken by the kubelet on the given node to set up the pod.
Did the “home” filesystem not mount? Check what command it tried to run (a mount command), what errors there are, etc. This is the part where a subdirectory of the data-project PVC must be mounted on the node and made available to the project’s pod.
Is the node where the project tries to start healthy? It could very well be that some projects run fine while there is a problem specific to one node. So, check your node setup and permissions, or maybe kill the node and create a new one.
Of course, there could be other issues. Most of them are probably Leaky Abstractions of the underlying node, networking, etc. So, check the logs of the kubelet, the node’s logs in /var/log/*, dmesg -T for the Linux kernel, and so on (a few generic commands are sketched below).
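These node checks assume shell access to the node and a systemd-managed kubelet; adjust them to your setup.
# node conditions, capacity and recent events, run from your workstation
kubectl describe node <node-name>
# on the node itself: kubelet logs (assuming systemd) and kernel messages
journalctl -u kubelet --since "1 hour ago"
dmesg -T | tail -n 100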
If the project pod is in state “Running”:
Check the project’s log (see the example commands below). It starts with an initialization sequence (a Bash script /cocalc/init/init.sh is run) and then starts the actual project “server” (a Node.js process).
If the initialization worked, the hub tries to connect to the project server to establish the communication. So, the first important log message is that incoming connection. If it fails, suspect a networking issue inside the cluster.
Your web browser (the web client) could also have a problem connecting to the service. That would look as if it is attempting to connect but never succeeds. Check the browser’s developer console for errors, and also check the logs of hub-websocket and hub-proxy.
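For example, to look at the logs mentioned above (the ... stands for the actual pod name suffix, as shown by kubectl get pods):
kubectl logs project-[UUID] --tail=200
kubectl logs hub-websocket-...
kubectl logs hub-proxy-...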
Project pod “hangs” upon termination#
The first step is to check what
kubectl describe pod project-[UUID]
has to say.
If there is a Datastore sidecar, leftover FUSE mountpoints might be a source of problems. Although there is a termination task to unmount (fusermount -u) all mountpoints, it might fail due to a faulty tool or other circumstances. The only solution is to force-delete the project pod, because Kubernetes on its own will not be able to fix that mount issue.
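Such a force-delete could look like this; it is a last resort, since it skips the graceful termination entirely:
kubectl delete pod project-[UUID] --grace-period=0 --force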
Database#
To connect to the database, you could use the included /database/db-shell.sh script. You need to know the host IP or name, the user, and the database name. The password is read from the secret.
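If you prefer to connect manually with psql, a minimal sketch could look like the following. The secret name, key, host, user, and database below are placeholders; read the actual values from your deployment’s configuration.
# all names are placeholders; substitute the values from your setup
PGPASSWORD="$(kubectl get secret <db-secret> -o jsonpath='{.data.password}' | base64 -d)" \
  psql -h <db-host> -U <db-user> <db-name>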
Interruptions#
Right after starting the cluster, or whenever the database was stopped, the hubs try to reconnect. This could fail or take a long time to resolve (a health check should eventually fail, causing the pod to restart).
However, it’s completely fine to just kill all hub- and manage-related pods:
kubectl delete pod --wait=false -l group=manage
kubectl delete pod --wait=false -l group=hub
Lockups#
In rare circumstances, or when services were disrupted, the database can end up locked: there are hanging queries that mutually want to modify or delete rows or tables, and they cannot continue because they are waiting on each other.
To resolve this safely, you can create a view of all active locks and then cancel or terminate the offending connection.
CREATE OR REPLACE VIEW public.active_locks AS
SELECT t.schemaname,
t.relname,
l.locktype,
l.page,
l.virtualtransaction,
l.pid,
l.mode,
l.granted
FROM pg_locks l
JOIN pg_stat_all_tables t ON l.relation = t.relid
WHERE t.schemaname <> 'pg_toast'::name AND t.schemaname <> 'pg_catalog'::name
ORDER BY t.schemaname, t.relname;
Then run:
SELECT * FROM active_locks;
and you will see a table like the following. Note that all PIDs are the same!
schemaname | relname | locktype | page | virtualtransaction | pid | mode | granted
------------+----------------------+----------+------+--------------------+-------+------------------+---------
public | accounts | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | compute_images | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | hub_servers | relation | | 10/7853190 | 37038 | RowExclusiveLock | t
public | hub_servers | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | registration_tokens | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | registration_tokens | relation | | 10/7853190 | 37038 | RowShareLock | t
public | server_settings | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | stats | relation | | 10/7853190 | 37038 | AccessShareLock | t
public | system_notifications | relation | | 10/7853190 | 37038 | AccessShareLock | t
(9 rows)
Then try to cancel – or if necessary, more forcefully terminate – the connection:
SELECT pg_cancel_backend(37038);
SELECT pg_terminate_backend(37038);
Run the view again, and you should see that the locks are gone.
Kubernetes Errors#
selfLink was empty#
If projects fail with selfLink was empty, can't make reference, your kubectl version and your cluster’s version are probably not compatible.
Check what kubectl version has to say.
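For example, compare the reported client and server versions. As a general Kubernetes guideline, kubectl should be within one minor version of the cluster’s API server.
# the client and server versions should not be too far apart
kubectl version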