Kubernetes troubleshooting

This guide will help you troubleshoot a Deephaven system deployed in Kubernetes, especially if you have limited experience with Kubernetes. It assumes you are familiar with concepts such as Pods, Deployments, StatefulSets, and Secrets.

For general system troubleshooting guidance applicable to all platforms, refer to the main troubleshooting guide.

For example, Persistent Queries run in workers, and worker information is available in the Process Event Log. You can query this log in a Code Studio using the commands outlined in Debugging code studio.

Note

The kubectl and helm commands below assume that your environment is configured for your namespace by default. If not, add -n <your-namespace> to these commands.
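
For example, a sketch assuming a namespace named deephaven (substitute your own):

$ kubectl get pods -n deephaven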

How do I see the deployment names I used to install Deephaven or etcd?

Run helm list to see the deployment names.
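
For example (the release names and chart versions shown here are illustrative; yours will differ):

$ helm list
NAME        NAMESPACE   REVISION   STATUS     CHART
deephaven   deephaven   1          deployed   deephaven-...
etcd        deephaven   1          deployed   etcd-...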

How do I see the Kubernetes objects in the Deephaven cluster?

You can use kubectl get to see the list of objects of a certain type.

Example usages of kubectl get
# Get all objects in a namespace - pods, services, deployments, statefulsets, jobs, cronjobs.
$ kubectl get all
# Get all pods. You can query deployments, services, etc, similarly.
$ kubectl get pods
# Get all pods, and after listing the requested objects, watch for changes with the -w flag.
$ kubectl get pods -w
# kubectl get all does not retrieve secrets or configmaps, but you can query them.
$ kubectl get secrets
$ kubectl get configmaps
# You can use -o yaml or -o json to introspect the manifest of the results.
$ kubectl get pod query-server-c785f4445-rvppq -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-12-20T00:32:41Z"
  generateName: query-server-c785f4445-
  name: query-server-c785f4445-rvppq
  labels:
    app: query-server
...
# You can use -o jsonpath to filter fields from the yaml.
# For more info see https://kubernetes.io/docs/reference/kubectl/jsonpath/
$ kubectl get pod query-server-c785f4445-rvppq -o jsonpath='{.metadata.name}'
query-server-c785f4445-rvppq%

How do I view logs for a container?

Process logs are sent to standard out, so you can view them with the kubectl logs command. First, identify the containers in the pod, then tail the log of the specific container. Use the -f option to tail the log rather than just print it.

# See all containers in the pod with the top command.
$ kubectl top pod --containers=true query-server-c785f4445-rvppq
POD                            NAME           CPU(cores)   MEMORY(bytes)
query-server-c785f4445-rvppq   query-server   8m           1014Mi
query-server-c785f4445-rvppq   tailer         8m           604Mi

# Then tail the log of the query-server container within.
$ kubectl logs -f -c query-server query-server-c785f4445-rvppq

Note that you can use the object type and its static name instead of the pod name, which can change. However, if multiple pods exist, kubectl selects only one of them to operate on. Deephaven deployments have one pod per deployment, so this approach is safe:

# Tail logs of the las container within the las deployment's one pod.
$ kubectl logs -f -c las deploy/las

Depending on your version, some Deephaven components may be a statefulset instead of a deployment, and those may have multiple pods.

# Here we list the pods with an 'app' label equal to 'controller'. These belong
# to a statefulset, not a deployment, so pods are suffixed predictably with -0, -1, etc.
$ kubectl get pods -l app=controller
NAME           READY   STATUS    RESTARTS   AGE
controller-0   2/2     Running   0          4h48m
controller-1   2/2     Running   0          4h47m

Pods in a statefulset are named predictably, so you can get their logs by pod name without listing pods first.

# Tail logs for one controller pod in the stateful set.
$ kubectl logs -f -c controller controller-0

How do I run commands on the command line as shown elsewhere in the documentation?

The Deephaven Kubernetes deployment includes a management-shell deployment with a single pod for this purpose. You can use the kubectl exec command to open a session on the pod and run commands.

# Open a session on the management-shell pod.
$ kubectl exec -it deploy/management-shell -- /bin/bash

Here is an example using a management shell to access a Deephaven config file.

# Make a directory under /tmp.
$ mkdir /tmp/myprops

# Run the dhconfig command to export the iris-environment.prop file.
$ /usr/illumon/latest/bin/dhconfig properties export -f iris-environment.prop -d /tmp/myprops/
Exporting properties file iris-environment.prop

$ ls -l /tmp/myprops
total 8
-rw-r--r-- 1 root root 6269 Dec 24 13:16 iris-environment.prop
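
If you edit the exported file, it can be imported back with the same tool; a sketch assuming the path above (the exact flags may vary by Deephaven version):

$ /usr/illumon/latest/bin/dhconfig properties import -f /tmp/myprops/iris-environment.prop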

How do I scale my Deephaven deployment down or up?

Scale your deployment with the script provided in the setupTools directory of the Deephaven helm chart distribution, providing an argument of down or up.

$ ./setupTools/scaleAll.sh down
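
When you are ready to bring the system back, run the same script with up:

$ ./setupTools/scaleAll.sh up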

How can I tell if my persistent volume storage class allows for volume expansion?

First, get a listing of your persistent volume claims (PVCs). In the output, you will see a column for STORAGECLASS.

$ kubectl get pvc
NAME                  STATUS   VOLUME         CAPACITY   ACCESS MODES   STORAGECLASS
...
dis-db-intraday       Bound    pvc-4a2e4de2   2Gi        RWO            standard-rwo
dis-db-intradayuser   Bound    pvc-3952fce1   2Gi        RWO            standard-rwo
...

You can then get information about that storage class. If ALLOWVOLUMEEXPANSION is listed as true, then it is expandable.

$ kubectl get storageclass standard-rwo
NAME           PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION
standard-rwo   pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true
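
You can also read the field directly with jsonpath; a sketch using the storage class from the example above:

$ kubectl get storageclass standard-rwo -o jsonpath='{.allowVolumeExpansion}'
true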

How do I increase the DIS volume size used for /db/Intraday or /db/IntradayUser?

First check to see that your volume's storage class allows volume expansion. For storage classes that allow it, you can increase the volume sizes by adding or editing this section of the override YAML file you reference in your helm upgrade command. You can increase the size of one or both, as shown here.

Note

Some caveats may apply depending on your Kubernetes version and volume type; see the Kubernetes documentation for more details.

dis:
  intradaySize: 20Gi
  intradayUserSize: 20Gi

Then scale down your deployment with the script provided in the Deephaven helm chart and run your helm upgrade as documented in the upgrade guide.
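
A minimal sketch of that sequence; the release name, chart location, and override file name are placeholders for your own:

$ ./setupTools/scaleAll.sh down
$ helm upgrade my-deephaven-deployment-name /path/to/deephaven-helm-chart -f my-overrides.yaml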

How do I increase the volume size used for /var/log/deephaven/binlogs in a deployment?

First check to see that your volume's storage class allows volume expansion. The binlogs used in the Deephaven helm chart for a deployment are a persistent volume claim that can be expanded by adding or editing the configuration in the override YAML file you reference in your helm upgrade command.

# This example shows increasing the LAS deployment's binlogs size.
resources:
  las:
    binlogsSize: 3Gi

Then scale down your deployment with the script provided in the Deephaven helm chart and run your helm upgrade as documented in the upgrade guide.

How do I increase the volume size used for /var/log/deephaven/binlogs in a statefulset?

First check to see that your volume's storage class allows volume expansion. The binlogs persistent volume claim (PVC) is part of the definition of the statefulset, so it is handled differently than a binlog PVC used for a deployment. In this example, we upgrade the capacity of the controller's binlogs.

Identify the binlog PVC names by running kubectl get pvc. If you have two pods for the controller, you will see two PVCs named binlogs-controller-0 and binlogs-controller-1. To edit a PVC, use the command kubectl edit pvc binlogs-controller-0.

Change the YAML value at spec.resources.requests.storage to your desired capacity, then save and exit. In this YAML snippet, we increase a volume from its previous value to 10Gi.

...
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
...

Edit all PVCs in the statefulset before continuing. The volume should resize asynchronously, and you can check progress with kubectl get pvc binlogs-controller-0 or kubectl describe pvc binlogs-controller-0.
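
Alternatively, you can make the same change non-interactively with kubectl patch; a sketch, with the capacity shown as an example:

$ kubectl patch pvc binlogs-controller-0 -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'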

You should have the override YAML file used to install or upgrade the Helm chart. If you don't have it available, you can recreate it from the installed release with helm get values (see the sketch below the example). Add or edit the resources section to include the new capacity value of your PVCs. If this is not updated, future Helm upgrades may result in an error.

...
resources:
  las:
    binlogsSize: 10Gi
...
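
If you need to recreate the override file, a sketch (the release and output file names are placeholders):

$ helm get values my-deephaven-deployment-name > my-overrides.yaml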

The next step is deleting the statefulset. The --cascade=orphan flag removes the statefulset object but leaves its pods running.

$ kubectl delete statefulset --cascade=orphan controller

Then scale down your deployment with the script provided in the Deephaven helm chart and run your helm upgrade as documented in the upgrade guide, using the override YAML file edited as part of this process.

How do I decrease the size of a persistent volume?

There is no direct support for decreasing a volume's capacity in Kubernetes.

How do I prevent a Persistent Query's worker pod from getting removed immediately after the process terminates?

In the settings tab of the Persistent Query configuration editor, use the Extra Env Variables input box to define WORKER_POD_EXIT_WAIT_SEC to delay the pod removal; for example, WORKER_POD_EXIT_WAIT_SEC=600.

Use this judiciously -- only when required for troubleshooting -- as extended delays to pod deletions will cause query restarts and system shutdowns to wait until the processes terminate.

How do I debug a process using a remote debugger?

Follow the steps below to debug a non-worker Deephaven Java process with a remote debugger.

First, configure your override file with port and debug information by adding configurations for the process you want to debug.

# Use any port you like, but debugPort should match what is in the jvmArgsUser string
# Refer to the `process` section of the `values.yaml` file in the Deephaven chart to
# see what other processes may be configured.
process:
  query-server:
    debugPort: 5005
    jvmArgsUser: '-Xdebug -agentlib:jdwp=transport=dt_socket,address=0.0.0.0:5005,server=y,suspend=n'

Then scale down your deployment with the script provided in the Deephaven helm chart and run your helm upgrade as documented in the upgrade guide, using the override YAML file edited as part of this process.

Set up port forwarding from your local machine to the pod using the debugPort configured earlier. If you don't know the pod name, find the pod you want to debug by running kubectl get pods.

$ kubectl port-forward query-server-c785f4445-mrwsj 5005:5005

After your pod has restarted, you can run a remote debugger as normal.
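
For example, you could attach the JDK's command-line debugger through the forwarded port (the port matches the sketch above):

$ jdb -attach localhost:5005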

How do I make a certificate or other piece of external data available within my Persistent Queries?

You can create a secret and mount its data in workers by adding a workerExtraSecrets section to your chart override YAML and running a helm upgrade as documented in the upgrade guide. All keys of the secret will be mapped as files under the workerExtraSecrets.mountPath on your worker pods.

workerExtraSecrets:
  - name: "my-worker-secret-1"
    mountPath: "/data/secret-1"
    secretName: "my-secret-name"
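
As a sketch, the secret referenced above could be created from a certificate file like this (the secret and file names are examples):

$ kubectl create secret generic my-secret-name --from-file=my-cert.pem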

Similarly, you can also mount an existing persistent volume claim to worker pods with a workerExtraVolumes section like this.

workerExtraVolumes:
  - name: "my-worker-vol-1"
    claimName: "my-worker-pvc"
    mountPath: "/data/vol-1"

How do I look at the disks used by the Data Import Server?

You may want to look at the mounted disks used by the Data Import Server (DIS) to see how much space is available or perform other actions such as removing old data. You can do this by examining the filesystem on the DIS pod. For example, to see the available space on the disks, run:

kubectl exec -it -c dis deploy/dis -- df -h
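
Similarly, a sketch for summarizing how much space each top-level directory under /db/Intraday uses:

$ kubectl exec -it -c dis deploy/dis -- sh -c 'du -sh /db/Intraday/*'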

What if the DIS isn't running?

If the DIS isn't running (which can happen if the volumes are full), you'll need to create a temporary pod which mounts the same volumes as the DIS. This procedure is only helpful to debug and resolve issues when the DIS is down, because the new pod won't be able to mount the PVCs while the DIS is using them.

First, find the names of the PVCs used by the DIS:

kubectl get pvc | grep dis

This will return a result like this:

dis-binlogs                                        Bound    pvc-9b171cce-7681-4216-868d-c1bb5feb3db8          2Gi        RWO            standard-rwo   <unset>                 279d
dis-db-intraday                                    Bound    pvc-5b1ecea1-6071-4543-96d6-c0a17a76a615          10Gi       RWO            standard-rwo   <unset>                 279d
dis-db-intradayuser                                Bound    pvc-9f7fe622-1133-41f5-aa8b-963eeb669193          10Gi       RWO            standard-rwo   <unset>                 279d

Then, create a pod with the same PVCs mounted. You can use the following YAML to create a pod that mounts the same volumes as the DIS, using the names from the previous command.

You'll need the image name currently deployed in the management shell pod. You can find it with a command like this:

kubectl describe pod $(kubectl get pods | grep management-shell | awk '{print $1}') | grep "Image:"

Then create a file called my-temp-pod.yaml. You can use the following YAML as a template, but be sure to change the image to the one you found in the previous command. If your PVC names are different, you'll also need to update the claimName values in the volumes section.

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: my-temp-pod
  name: my-temp-pod
spec:
  containers:
    - command: ['/bin/sh', '-c', 'sleep infinity']
      image: us-central1-docker.pkg.dev/qa-k8s-clusters/images/jdk17/20240517/deephaven_management:1.20240517.437
      name: my-temp-container
      volumeMounts:
        - mountPath: /var/log/deephaven/binlogs
          name: my-bin
        - mountPath: /db/Intraday
          name: my-intraday
        - mountPath: /db/IntradayUser
          name: my-user
  terminationGracePeriodSeconds: 1
  volumes:
    - name: my-bin
      persistentVolumeClaim:
        claimName: dis-binlogs
    - name: my-intraday
      persistentVolumeClaim:
        claimName: dis-db-intraday
    - name: my-user
      persistentVolumeClaim:
        claimName: dis-db-intradayuser

Create the pod:

kubectl create -f my-temp-pod.yaml

Now, log on to it with a bash shell and execute commands like df to view the disks, and purge data if needed:

kubectl exec -it my-temp-pod -- bash

When you're done, delete the pod. Don't forget this or your system won't start correctly due to the mounted PVCs:

kubectl delete pod my-temp-pod