Troubleshooting#

Cannot “Connect”#

When working with CoCalc’s Single Page Application, there is a “Connectivity indicator” at the top right, which looks like a WiFi symbol. It shows the status of the underlying WebSocket connection. Click on it to open a dialog with more details. If there are problems, this indicator turns red or even says “disconnected”.

There are various reasons why someone cannot connect. The most common causes are a firewall blocking the connection or a misconfigured proxy or VPN.

Here is an extensive collection of connectivity issues: they range from a flaky local router, to a browser (or even computer) that needs to be restarted, to plugins/extensions interfering with the website itself. In particular, it is important to figure out whether WebSocket connections work at all.

Besides that, it might be a good idea to just refresh the page. The UI should open up again, and all opened projects and files should reappear.

You can also open the browser’s developer tools (F12) and check the JavaScript console for errors.

On the server side, the logs of hub-websocket (see Hubs) could reveal issues as well.
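
For example, assuming the hub-websocket service runs as a Kubernetes deployment of that name, you can tail its logs like this:

# Tail the hub-websocket logs (the deployment name is an assumption – adjust to your setup)
kubectl logs deploy/hub-websocket --tail=200 -f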

Project does not “Start”#

There are various reasons why a project fails to start.

Note

[UUID] refers to the UUID identifying each project.

  1. Check whether a pod named project-[UUID] exists in your namespace. If not, the manage-action service (see Manage) did not create it.

    • What does its log say?

      kubectl logs manage-action-...
      

      Filter by the UUID to see log lines related to the project.

    • It can also help to simply restart all manage services:

      kubectl rollout restart deploy -l group=manage
      

      They will restart on their own; afterwards, check the logs again. The “start” request for the given project will probably have timed out by then – hence you have to use the UI to start the project again.

    • Connect to the database and reset the project’s state/status entries:

      UPDATE projects SET status=null, state=null WHERE project_id='[UUID]';
      

      and then try to start the project again.

  2. If the project’s pod exists and is not in state “Running”, check what

    kubectl describe pod project-[UUID]
    

    has to say. At the very bottom is a list of actions/events taken by the kubelet on the given node to set up the pod.

    • Did the “home” filesystem fail to mount? Check which mount command the kubelet tried to run and what errors it reported. This is the step where a sub-directory of the data-project PVC must be mounted on the node and made available to the project’s pod.

    • Is the node where the project tries to start healthy? It could very well be that some projects run fine while there is a problem specific to one node. Check your node setup and permissions; if necessary, kill the node and create a new one.

    • Of course, there could be other issues. Most of them are probably Leaky Abstractions of the underlying node, networking, etc. – so check the logs of the kubelet, the node’s logs in /var/log/*, dmesg -T for the Linux kernel, and so on.

  3. If the project pod is in state “Running”:

    • Check the project’s log (a quick way to fetch the relevant logs is sketched below this list). It starts with an initialization sequence (the Bash script /cocalc/init/init.sh is run) and then the actual project “server” (a Node.js process) is started.

    • If the initialization worked, the hub tries to connect to the project server to establish communication. The first important log message is therefore that incoming connection. If it never arrives, suspect a networking issue inside the cluster.

  4. There could also be a problem with your web browser (web client) connecting to the service. That looks as if the project keeps attempting to connect without ever succeeding. Check the browser’s developer console for errors, and also check the logs of hub-websocket and hub-proxy.
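
A minimal sketch for pulling the relevant logs; the hub deployment names are assumptions based on the services mentioned above, and [UUID] is the project’s UUID:

# Project pod log: the init.sh output, followed by the Node.js project server
# (add -c [container] if the pod runs more than one container)
kubectl logs project-[UUID]

# Hub logs, filtered for the project (deployment names are assumptions)
kubectl logs deploy/hub-websocket | grep [UUID]
kubectl logs deploy/hub-proxy | grep [UUID]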

Project pod “hangs” upon termination#

  • The first step is to check what kubectl describe pod project-[UUID] has to say.

  • If there is a Datastore sidecar, leftover FUSE mountpoints can be a source of problems. Although there is a termination task that unmounts (fusermount -u) all mountpoints, it might break due to a faulty tool or other circumstances. The only solution is to force-delete the project pod (see the command below), because Kubernetes on its own will not be able to fix that mount issue.
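
A minimal sketch of such a force delete, assuming the pod name follows the project-[UUID] pattern:

# Force-delete the stuck project pod without waiting for graceful termination
kubectl delete pod project-[UUID] --grace-period=0 --force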

Jupyter Kernel does not start#

  • A Python-based kernel stores its command history in an sqlite database, which can cause issues on an underlying NFS file-system. See Custom Jupyter Kernels for how to disable this by adding a config parameter to the kernel’s argv: [...] array in the kernel.json file (a sketch of where to look follows this list).

  • In a Terminal, run jupyter console --kernel=[kernelname] to see if the kernel starts. Adding --debug shows more logging details.

  • There might also have been a fluke in how the project started the kernel, or no proper cleanup after it terminated or crashed. Looking at the project’s log might help, but in general it is often easiest to just restart the project (Project Settings → Project Control) and refresh CoCalc’s frontend page in the web browser.
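
As a rough sketch of where to find the kernel.json mentioned above (the path below is a placeholder – use the one printed by jupyter kernelspec list):

# List installed kernels and the directories containing their kernel.json files
jupyter kernelspec list

# Inspect the kernel's argv array; per "Custom Jupyter Kernels", a Python kernel can keep
# its history out of the sqlite file, e.g. via "--HistoryManager.hist_file=:memory:"
cat [path-from-kernelspec-list]/kernel.json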

Database#

To connect to the database, you can either connect directly or use the included /database/db-shell.sh script. This script deploys a pod in the cluster’s namespace and connects to the database from there. You need to know the host IP or name, the user, and the database; the password is read from the secret. This also confirms that accessing the database from within the cluster works.
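
If you prefer to connect by hand, a minimal sketch of the same idea looks like this; the secret name/key, host, user, and database are placeholders for whatever your deployment uses:

# Read the database password from the secret (name and key are placeholders)
PGPASSWORD=$(kubectl get secret [db-secret] -o jsonpath='{.data.password}' | base64 -d)

# Start a temporary pod with psql and connect from inside the cluster
kubectl run -it --rm db-shell --image=postgres:15 --env="PGPASSWORD=$PGPASSWORD" -- \
  psql -h [db-host] -U [db-user] [db-name]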

Interruptions#

Right after starting the cluster, or whenever the database was stopped, the hubs try to reconnect. This can fail or take a long time to resolve (eventually a health check should fail, causing the pod to restart).

However, it’s completely fine to just restart all hub- and manage-related pods:

kubectl rollout restart deploy -l group=manage
kubectl rollout restart deploy -l group=hub

Lockups#

In rare circumstances, or after services were disrupted, the database might lock up. This means there are hanging queries that want to modify or delete the same rows or tables; they cannot continue because they are waiting on each other.

To resolve this safely, you can create a view of all active locks and then cancel or terminate the offending connection.

CREATE OR REPLACE VIEW public.active_locks AS
 SELECT t.schemaname,
    t.relname,
    l.locktype,
    l.page,
    l.virtualtransaction,
    l.pid,
    l.mode,
    l.granted
   FROM pg_locks l
   JOIN pg_stat_all_tables t ON l.relation = t.relid
  WHERE t.schemaname <> 'pg_toast'::name AND t.schemaname <> 'pg_catalog'::name
  ORDER BY t.schemaname, t.relname;

Then run:

SELECT * FROM active_locks;

and you will see a table like the following – note that all PIDs are the same!

 schemaname |       relname        | locktype | page | virtualtransaction |  pid  |       mode       | granted
------------+----------------------+----------+------+--------------------+-------+------------------+---------
 public     | accounts             | relation |      | 10/7853190         | 37038 | AccessShareLock  | t
 public     | compute_images       | relation |      | 10/7853190         | 37038 | AccessShareLock  | t
 public     | hub_servers          | relation |      | 10/7853190         | 37038 | RowExclusiveLock | t
 public     | hub_servers          | relation |      | 10/7853190         | 37038 | AccessShareLock  | t
 public     | registration_tokens  | relation |      | 10/7853190         | 37038 | AccessShareLock  | t
 public     | registration_tokens  | relation |      | 10/7853190         | 37038 | RowShareLock     | t
 public     | server_settings      | relation |      | 10/7853190         | 37038 | AccessShareLock  | t
 public     | stats                | relation |      | 10/7853190         | 37038 | AccessShareLock  | t
 public     | system_notifications | relation |      | 10/7853190         | 37038 | AccessShareLock  | t
(9 rows)
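
Before cancelling, it can help to see what the blocking backend is actually doing; the PID below is the one from the example output above:

SELECT pid, state, wait_event_type, wait_event, query
  FROM pg_stat_activity
 WHERE pid = 37038;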

Then try to cancel – or if necessary, more forcefully terminate – the connection:

SELECT pg_cancel_backend('37038');
SELECT pg_terminate_backend('37038');

Run the view again, and you should see that the locks are gone.

Kubernetes Errors#

HELM upgrade timeout#

No problem. The default timeout is 5 minutes. Add the --timeout parameter to increase it, for example to 15 minutes:

helm upgrade --timeout 15m ...

and run it again. HELM upgrades are idempotent, so it’s safe to run them multiple times.

You still have to check the status of the K8S workloads, though – most likely pulling the new images simply takes longer than the default timeout. The pod state ContainerCreating is a good indicator for that; if you check the pod’s details, its events mention pulling the image.
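
To keep an eye on the rollout while the images are pulled, something along these lines is enough ([pod-name] is a placeholder):

# Watch the pods come up; ContainerCreating usually means the image is still being pulled
kubectl get pods -w

# The events at the bottom of the describe output show image-pull progress or errors
kubectl describe pod [pod-name]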

Evicted due to DiskPressure#

Since the project images are large, you might encounter project or Prepull pods being evicted due to DiskPressure. Kubernetes will clean up unused images on the node, but as long as there are pods still running with the old images, it cannot free up enough space.
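
You can confirm the situation on the nodes before cleaning up, for example:

# Which nodes report the DiskPressure condition?
kubectl describe nodes | grep -i -B3 diskpressure

# Evicted pods remain in the Failed phase until they are deleted
kubectl get pods --field-selector=status.phase=Failed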

A strategy to fix this:

  1. Edit the project-image ConfigMap to point at the new project image tag:

    kubectl edit cm project-image

    and then set data.tag to the new project-[timestamp] value.

  2. Delete all running projects:

    kubectl delete pod --wait=false -l run=project

The idea is that the next project that starts pulls the new image, while the old images can be garbage collected.

Long term, you need more disk space on your nodes.