Troubleshooting a Kubernetes installation

Access logs

Run kubectl logs -c <process> <pod> to get the logs for a given process. To avoid copy-pasting the pod name, you can have kubectl look up the pod by label and pass the result into the logs command:

kubectl logs $(kubectl get pod -o name -l app=webapi | cut -d/ -f2) -c webapi

Shell access

If you need to examine the installation, you can use a management-shell pod. All the volumes are mounted read-write, so you can also update files as necessary.

kubectl exec deploy/management-shell --tty --stdin -- /bin/bash
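
Once inside, you can browse and edit the mounted volumes. The paths below are illustrative and depend on your installation:

mount | grep -i nfs
ls -l /db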

Leftover worker pods

If you have leftover worker pods, they may hold onto PVCs (PersistentVolumeClaims), preventing a new installation.

For example:

$ helm install merry-meerkat ./deephaven/ -f values.yaml
Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: PersistentVolume "dhprefix-pv-db-systems" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "merry-meerkat": current value is "limber-llama"
$ kubectl get pvc
NAME                                   STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dhprefix-nfs-pvc-etcdbackup-deephaven   Bound         dhprefix-nfs-pv-etcdbackup-deephaven-2      1Gi        RWX                           25h
dhprefix-pvc-db-systems                 Terminating   dhprefix-pv-db-systems                      1Gi        RWX                           18h
dhprefix-pvc-db-tempfiles               Terminating   dhprefix-pv-db-tempfiles                    1Gi        RWX                           18h
dhprefix-pvc-db-users                   Terminating   dhprefix-pv-db-users                        1Gi        RWX                           18h
dhprefix-pvc-db-venvs                   Terminating   dhprefix-pv-db-venvs                        1Gi        RWX                           18h
dhprefix-pvc-etc-sysconfig-deephaven    Terminating   dhprefix-pv-etc-sysconfig-deephaven         1Gi        RWX                           18h
dhprefix-pvc-var-log-deephaven          Terminating   dhprefix-pv-var-log-deephaven               1Gi        RWX                           18h
data-dhprefix-etcd-0                    Bound         pvc-bb30718e-2f36-490d-990e-d02a0da4adac    8Gi        RWO            standard       25h
pvc-dhprefix-nfs-server                 Bound         pvc-bf7e3d0e-88c7-4be2-b6cc-6dd44b354c8a    10Gi       RWO            premium-rwo    27h
$ kubectl get pods
NAME                                      READY   STATUS      RESTARTS   AGE
dhprefix-etcd-0                           1/1     Running     0          25h
merge-server-f94cf8b9-k8n4n-worker-1      0/1     Completed   0          17h
merge-server-f94cf8b9-k8n4n-worker-6      0/1     Error       0          13h
merge-server-f94cf8b9-nzftj-worker-36     0/1     Error       0          17h
nfs-server-59c49c4cd7-zqh2f               1/1     Running     0          27h
query-server-76f8db9d94-544dv-worker-1    0/1     Completed   0          17h
query-server-76f8db9d94-544dv-worker-2    0/1     Completed   0          17h
query-server-76f8db9d94-544dv-worker-8    0/1     Error       0          13h
query-server-76f8db9d94-544dv-worker-9    0/1     Error       0          13h
query-server-76f8db9d94-glw9d-worker-69   0/1     Error       0          17h
query-server-76f8db9d94-glw9d-worker-70   0/1     Error       0          17h
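
The error message shows which release owns the conflicting volume. You can also read the Helm ownership annotation directly; the volume name below comes from the example output above:

kubectl get pv dhprefix-pv-db-systems -o jsonpath="{.metadata.annotations['meta\.helm\.sh/release-name']}"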

This can be corrected by deleting those pods:

kubectl delete pods -l 'role=query-worker'
kubectl delete pods -l 'role=merge-worker'

If there are other leftover resources after uninstalling the release, they should also be removed via kubectl, as in the following example.

Caution

Removing the PVCs for intraday data will delete any unmerged intraday data.

# Remove pre-install hook resources:
kubectl delete jobs,secrets -l app.kubernetes.io/instance=dhe-k8s-test

# Remove intraday data volumes:
kubectl delete jobs,secrets,pv,pvc -l app.kubernetes.io/instance=dhe-k8s-test

Some files in the NFS data directory, including caches and generated TLS/etcd keys, should also be removed. If the initial data was extracted to /exports/dhsystem on the NFS server, then the appropriate command to clean up the outdated configuration and caches (without deleting user or system data) would be:

sudo rm -vrf /exports/dhsystem/{db/TempFiles,etc}

Restart a pod

Most pods can be restarted by scaling their deployment down to 0 and then back to 1 pod with kubectl scale deployment <deployment-name> --replicas=0 followed by kubectl scale deployment <deployment-name> --replicas=1.
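
For example, with <deployment-name> replaced by the deployment to restart:

kubectl scale deployment <deployment-name> --replicas=0
kubectl scale deployment <deployment-name> --replicas=1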

Debug template syntax errors

If you have a syntax error, it is often not clear where it comes from. To debug, first run Helm's templating engine:

helm template ./deephaven/ -f awsValues.yaml

If the YAML fails to render, you will get a message like the following:

Error: YAML parse error on deephaven/templates/acl_writer/service.yaml: error converting YAML to JSON: yaml: line 5: mapping values are not allowed in this context

Use the --debug flag to render the invalid YAML to a file:

# Adding --debug produces a lot of YAML. You can limit the output to a single template
# with the `-s` option, as follows (redirected to `/tmp/t.yaml`):
helm template --debug ./deephaven/ -f awsValues.yaml -s 'templates/acl_writer/service.yaml' >/tmp/t.yaml

Running that file through yamllint may point out your error:

$ yamllint /tmp/t.yaml

 ERROR  YAML Lint failed for 1 file

/tmp/t.yaml

 ERROR  bad indentation of a mapping entry (7:14)

 4 | kind: Service
 5 | metadata:
 6 |   name: aclwriter
 7 |   labels: app: aclwriter
------------------^
 8 | spec:
 9 |   type: ClusterIP

If a template file contains two separate YAML documents and the second one has errors, only the first will print, even with the --debug and -s options. You can comment out earlier documents to get the block of interest to render.

You may also find the --validate flag useful, as it will check your YAML against the Kubernetes object definitions and the state of the cluster.
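
For example (note that --validate requires a live connection to the cluster):

helm template --validate ./deephaven/ -f awsValues.yaml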

Debug a pre-install hook

The pre-install hook logs can be retrieved with kubectl logs. You may want to look at log files from an individual run or experiment with running different scripts. To allow this, you can make the hook sleep after a failed pre-install script by adding a debug.preInstall entry to your values.yaml as follows:

debug:
  preInstall: true
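
While the hook sleeps, its pod stays up, so you can retrieve logs or open a shell to experiment with different scripts. The job name below is a placeholder; run kubectl get jobs to find the actual name:

kubectl get jobs
kubectl logs job/<pre-install-job-name>
kubectl exec -it $(kubectl get pod -o name -l job-name=<pre-install-job-name>) -- /bin/bash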

Debug using a remote debugger

You can enable debugging of a remote process by following these steps:

  1. Configure your my-values.yaml file with port and debug information.
  2. Upgrade the Deephaven Helm installation.
  3. Scale deployment(s) to be debugged down to 0 pods, then back to 1 to ensure that they have the required settings.
  4. Forward port from your host to the pod.

Add configurations to your my-values.yaml file for the process you want to debug. See Creating a Service for more information on Kubernetes services and port settings in general.

# Use whatever port you want
process:
  controller:
    jvmArgsUser: '-Xdebug -agentlib:jdwp=transport=dt_socket,address=0.0.0.0:5005,server=y,suspend=n'

Note

Refer to the process section of the values.yaml file in the Deephaven chart to see what other processes may be configured for debugging in addition to controller.

Once the above settings have been configured, you must upgrade the Helm chart, just as you did during the chart installation. Run the following in the directory containing the Deephaven chart directory:

helm upgrade deephaven-helm-release-name ./deephaven/ -f my-values.yaml --set image.tag=<deephaven-release-version> --debug

The easiest way to restart the component is to run kubectl scale deployment/<deployment-name> --replicas=0 followed by kubectl scale deployment/<deployment-name> --replicas=1. To see the names of all deployments, run kubectl get deployments.

Next, enable port forwarding from your local machine to the deployment with a command like the following, which forwards local port 5005 to port 5005 on the controller pod:

kubectl port-forward deployment/<deployment-name> 5005:5005

After your pod has restarted, you can run a remote debugger as normal.
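
For example, you can attach from the command line with jdb; any IDE remote-debug configuration pointed at localhost:5005 works equally well:

jdb -attach 5005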

Debug the Swing console

To debug the Swing console, add similar JVM arguments to your getdown.global file, then reload the Client Update Service (CUS). You can then attach a debugger to your locally launched IrisConsole process.
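
A minimal sketch of the getdown.global entry, assuming getdown's standard jvmarg directive (the port is arbitrary, but must match your debugger configuration):

jvmarg = -agentlib:jdwp=transport=dt_socket,address=5005,server=y,suspend=n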