Kubernetes troubleshooting

This guide will help you troubleshoot a Deephaven system deployed in Kubernetes, especially if you have limited experience with Kubernetes. It assumes you are familiar with concepts such as Pods, Deployments, StatefulSets, and Secrets.

For general system troubleshooting guidance applicable to all platforms, refer to the main troubleshooting guide.

For example, Persistent Queries run in workers, and worker information is available in the Process Event Log. You can query this log in a Code Studio using the commands outlined in Debugging code studio.

Note

The kubectl and helm commands below assume that your environment is configured for your namespace by default. If not, add -n <your-namespace> to these commands.
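
For example, a sketch assuming a namespace named deephaven (substitute your own):

$ kubectl get pods -n deephaven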

How do I see the deployment names I used to install Deephaven or etcd?

Run helm list to see the deployment names.
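
For example (the release names and chart versions shown here are illustrative; yours will differ):

$ helm list
NAME        NAMESPACE   REVISION   STATUS     CHART
deephaven   deephaven   1          deployed   deephaven-...
etcd        deephaven   1          deployed   etcd-...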

How do I see the Kubernetes objects in the Deephaven cluster?

You can use kubectl get to see the list of objects of a certain type.

Example usages of kubectl get
# Get all objects in a namespace - pods, services, deployments, statefulsets, jobs, cronjobs.
$ kubectl get all
# Get all pods. You can query deployments, services, etc, similarly.
$ kubectl get pods
# Get all pods, and after listing the requested objects, watch for changes with the -w flag.
$ kubectl get pods -w
# kubectl get all does not retrieve secrets or configmaps, but you can query them.
$ kubectl get secrets
$ kubectl get configmaps
# You can use -o yaml or -o json to introspect the manifest of the results.
$ kubectl get pod query-server-c785f4445-rvppq -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-12-20T00:32:41Z"
  generateName: query-server-c785f4445-
  name: query-server-c785f4445-rvppq
  labels:
    app: query-server
...
# You can use -o jsonpath to filter fields from the yaml.
# For more info see https://kubernetes.io/docs/reference/kubectl/jsonpath/
$ kubectl get pod query-server-c785f4445-rvppq -o jsonpath='{.metadata.name}'
query-server-c785f4445-rvppq%

How do I view logs for a container?

Process logs are sent to standard out, so you can view them with the kubectl logs command. First, identify the containers in the pod, then tail the log of the specific container. Use the -f option to tail the log rather than just print it.

# See all containers in the pod with the top command.
$ kubectl top pod --containers=true query-server-c785f4445-rvppq
POD                            NAME           CPU(cores)   MEMORY(bytes)
query-server-c785f4445-rvppq   query-server   8m           1014Mi
query-server-c785f4445-rvppq   tailer         8m           604Mi

# Then tail the log of the query-server container within.
$ kubectl logs -f -c query-server query-server-c785f4445-rvppq

Note that you can use the object type and its static name instead of the pod name, which can change. However, if multiple pods exist, kubectl selects only one of them to operate on. Deephaven deployments have one pod per deployment, so this approach is safe:

# Tail logs of the las container within the las deployment's one pod.
$ kubectl logs -f -c las deploy/las

Depending on your version, some Deephaven components may be a statefulset instead of a deployment, and those may have multiple pods.

# Here we list the pods with an 'app' label equal to 'controller'. These belong
# to a statefulset, not a deployment, so pods are suffixed predictably with -0, -1, etc.
$ kubectl get pods -l app=controller
NAME           READY   STATUS    RESTARTS   AGE
controller-0   2/2     Running   0          4h48m
controller-1   2/2     Running   0          4h47m

Pods in a statefulset are named predictably, so you can get their logs by pod name without listing pods first.

# Tail logs for one controller pod in the stateful set.
$ kubectl logs -f -c controller controller-0

How do I run commands on the command line as shown elsewhere in the documentation?

The Deephaven Kubernetes deployment includes a management-shell deployment with a single pod for this purpose. You can use the kubectl exec command to open a session on the pod and run commands.

# Open a session on the management-shell pod.
$ kubectl exec -it deploy/management-shell -- /bin/bash

Here is an example using a management shell to access a Deephaven config file.

# Make a directory under /tmp.
$ mkdir /tmp/myprops

# Run the dhconfig command to export the iris-environment.prop file.
$ /usr/illumon/latest/bin/dhconfig properties export -f iris-environment.prop -d /tmp/myprops/
Exporting properties file iris-environment.prop

$ ls -l /tmp/myprops
total 8
-rw-r--r-- 1 root root 6269 Dec 24 13:16 iris-environment.prop
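
If you edit the exported file, it can be imported back with the same tool; a sketch assuming the path above (the exact flags may vary by Deephaven version):

$ /usr/illumon/latest/bin/dhconfig properties import -f /tmp/myprops/iris-environment.prop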

How do I scale my Deephaven deployment down or up?

Scale your deployment with the script provided in the setupTools directory of the Deephaven helm chart distribution, providing an argument of down or up.

$ ./setupTools/scaleAll.sh down
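
When you are ready to bring the system back, run the same script with up:

$ ./setupTools/scaleAll.sh up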

How can I tell if my persistent volume storage class allows for volume expansion?

First, get a listing of your persistent volume claims (PVCs). In the output, you will see a column for STORAGECLASS.

$ kubectl get pvc
NAME                  STATUS   VOLUME         CAPACITY   ACCESS MODES   STORAGECLASS
...
dis-db-intraday       Bound    pvc-4a2e4de2   2Gi        RWO            standard-rwo
dis-db-intradayuser   Bound    pvc-3952fce1   2Gi        RWO            standard-rwo
...

You can then get information about that storage class. If ALLOWVOLUMEEXPANSION is listed as true, then it is expandable.

$ kubectl get storageclass standard-rwo
NAME           PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION
standard-rwo   pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true
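
You can also read the field directly with jsonpath; a sketch using the storage class from the example above:

$ kubectl get storageclass standard-rwo -o jsonpath='{.allowVolumeExpansion}'
true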

How do I increase the DIS volume size used for /db/Intraday or /db/IntradayUser?

First check to see that your volume's storage class allows volume expansion. For storage classes that allow it, you can increase the volume sizes by adding or editing this section of the override YAML file you reference in your helm upgrade command. You can increase the size of one or both, as shown here.

Note

Some caveats may apply depending on your Kubernetes version and volume type; see the Kubernetes documentation for more details.

dis:
  intradaySize: 20Gi
  intradayUserSize: 20Gi

Then scale down your deployment with the script provided in the Deephaven helm chart and run your helm upgrade as documented in the upgrade guide.
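
A minimal sketch of that sequence; the release name, chart location, and override file name are placeholders for your own:

$ ./setupTools/scaleAll.sh down
$ helm upgrade my-deephaven-deployment-name /path/to/deephaven-helm-chart -f my-overrides.yaml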

How do I increase the volume size used for /var/log/deephaven/binlogs in a deployment?

First check to see that your volume's storage class allows volume expansion. The binlogs used in the Deephaven helm chart for a deployment are a persistent volume claim that can be expanded by adding or editing the configuration in the override YAML file you reference in your helm upgrade command.

# This example shows increasing the LAS deployment's binlogs size.
resources:
  las:
    binlogsSize: 3Gi

Then scale down your deployment with the script provided in the Deephaven helm chart and run your helm upgrade as documented in the upgrade guide.

How do I increase the volume size used for /var/log/deephaven/binlogs in a statefulset?

First check to see that your volume's storage class allows volume expansion. The binlogs persistent volume claim (PVC) is part of the definition of the statefulset, so it is handled differently than a binlog PVC used for a deployment. In this example, we upgrade the capacity of the controller's binlogs.

Identify the binlog PVC names by running kubectl get pvc. If you have two pods for the controller, you will see two PVCs named binlogs-controller-0 and binlogs-controller-1. To edit a PVC, use the command kubectl edit pvc binlogs-controller-0.

Change the YAML value at spec.resources.requests.storage to your desired capacity, then save and exit. In this YAML snippet, we increase a volume from its previous value to 10Gi.

...
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
...

Edit all PVCs in the statefulset before continuing. The volume should resize asynchronously, and you can check progress with kubectl get pvc binlogs-controller-0 or kubectl describe pvc binlogs-controller-0.
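
Alternatively, you can make the same change non-interactively with kubectl patch; a sketch, with the capacity shown as an example:

$ kubectl patch pvc binlogs-controller-0 -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'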

You should have the override YAML file used to install or upgrade the Helm chart. If you don't have it available, you can recreate it from the installed release with helm get values (see the sketch below the example). Add or edit the resources section to include the new capacity value of your PVCs. If this is not updated, future Helm upgrades may result in an error.

...
resources:
  las:
    binlogsSize: 10Gi
...
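
If you need to recreate the override file, a sketch (the release and output file names are placeholders):

$ helm get values my-deephaven-deployment-name > my-overrides.yaml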

The next step is deleting the statefulset. The --cascade=orphan flag removes the statefulset object but leaves its pods running.

$ kubectl delete statefulset --cascade=orphan controller

Then scale down your deployment with the script provided in the Deephaven helm chart and run your helm upgrade as documented in the upgrade guide, using the override YAML file edited as part of this process.

How do I decrease the size of a persistent volume?

There is no direct support for decreasing a volume's capacity in Kubernetes.

How do I prevent a Persistent Query's worker pod from getting removed immediately after the process terminates?

In the settings tab of the Persistent Query configuration editor, use the Extra Env Variables input box to define WORKER_POD_EXIT_WAIT_SEC to delay the pod removal; for example, WORKER_POD_EXIT_WAIT_SEC=600.

Use this judiciously -- only when required for troubleshooting -- as extended delays to pod deletions will cause query restarts and system shutdowns to wait until the processes terminate.

How do I debug a process using a remote debugger?

Follow the steps below to debug a non-worker Deephaven Java process with a remote debugger.

First, configure your override file with port and debug information by adding configurations for the process you want to debug.

# Use any port you like, but debugPort should match what is in the jvmArgsUser string
# Refer to the `process` section of the `values.yaml` file in the Deephaven chart to
# see what other processes may be configured.
process:
  query-server:
    debugPort: 5005
    jvmArgsUser: '-Xdebug -agentlib:jdwp=transport=dt_socket,address=0.0.0.0:5005,server=y,suspend=n'

Then scale down your deployment with the script provided in the Deephaven helm chart and run your helm upgrade as documented in the upgrade guide, using the override YAML file edited as part of this process.

Set up port forwarding from your local machine to the pod using the debugPort configured earlier. If you don't know the pod name, find the pod you want to debug by running kubectl get pods.

$ kubectl port-forward query-server-c785f4445-mrwsj 5005:5005

After your pod has restarted, you can run a remote debugger as normal.
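
For example, you could attach the JDK's command-line debugger through the forwarded port (the port matches the sketch above):

$ jdb -attach localhost:5005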

How do I make a certificate or other piece of external data available within my Persistent Queries?

You can create a secret and mount its data in workers by adding a workerExtraSecrets section to your chart override YAML and running a helm upgrade as documented in the upgrade guide. All keys of the secret will be mapped as files under the workerExtraSecrets.mountPath on your worker pods.

workerExtraSecrets:
  - name: "my-worker-secret-1"
    mountPath: "/data/secret-1"
    secretName: "my-secret-name"
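
As a sketch, the secret referenced above could be created from a certificate file like this (the secret and file names are examples):

$ kubectl create secret generic my-secret-name --from-file=my-cert.pem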

Similarly, you can also mount an existing persistent volume claim to worker pods with a workerExtraVolumes section like this.

workerExtraVolumes:
  - name: "my-worker-vol-1"
    claimName: "my-worker-pvc"
    mountPath: "/data/vol-1"

How do I look at the disks used by the Data Import Server?

You may want to look at the mounted disks used by the Data Import Server (DIS) to see how much space is available or perform other actions such as removing old data. You can do this by examining the filesystem on the DIS pod. For example, to see the available space on the disks, run:

kubectl exec -it -c dis deploy/dis -- df -h
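
Similarly, a sketch for summarizing how much space each top-level directory under /db/Intraday uses:

$ kubectl exec -it -c dis deploy/dis -- sh -c 'du -sh /db/Intraday/*'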

What if the DIS isn't running?

If the DIS isn't running (which can happen if the volumes are full), you'll need to create a temporary pod which mounts the same volumes as the DIS. This procedure is only helpful to debug and resolve issues when the DIS is down, because the new pod won't be able to mount the PVCs while the DIS is using them.

First, find the names of the PVCs used by the DIS:

kubectl get pvc | grep dis

This will return a result like this:

dis-binlogs                                        Bound    pvc-9b171cce-7681-4216-868d-c1bb5feb3db8          2Gi        RWO            standard-rwo   <unset>                 279d
dis-db-intraday                                    Bound    pvc-5b1ecea1-6071-4543-96d6-c0a17a76a615          10Gi       RWO            standard-rwo   <unset>                 279d
dis-db-intradayuser                                Bound    pvc-9f7fe622-1133-41f5-aa8b-963eeb669193          10Gi       RWO            standard-rwo   <unset>                 279d

Then, create a pod with the same PVCs mounted. You can use the following YAML to create a pod that mounts the same volumes as the DIS, using the names from the previous command.

You'll need the image name currently deployed in the management shell pod. You can find it with a command like this:

kubectl describe pod $(kubectl get pods | grep management-shell | awk '{print $1}') | grep "Image:"

Then create a file called my-temp-pod.yaml. You can use the following YAML as a template, but be sure to change the image to the one you found in the previous command. If your PVC names are different, you'll also need to update the claimName values in the volumes section.

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: my-temp-pod
  name: my-temp-pod
spec:
  containers:
    - command: ['/bin/sh', '-c', 'sleep infinity']
      image: us-central1-docker.pkg.dev/qa-k8s-clusters/images/jdk17/20240517/deephaven_management:1.20240517.437
      name: my-temp-container
      volumeMounts:
        - mountPath: /var/log/deephaven/binlogs
          name: my-bin
        - mountPath: /db/Intraday
          name: my-intraday
        - mountPath: /db/IntradayUser
          name: my-user
  terminationGracePeriodSeconds: 1
  volumes:
    - name: my-bin
      persistentVolumeClaim:
        claimName: dis-binlogs
    - name: my-intraday
      persistentVolumeClaim:
        claimName: dis-db-intraday
    - name: my-user
      persistentVolumeClaim:
        claimName: dis-db-intradayuser

Create the pod:

kubectl create -f my-temp-pod.yaml

Now, log on to it with a bash shell and execute commands like df to view the disks, and purge data if needed:

kubectl exec -it my-temp-pod -- bash

When you're done, delete the pod. Don't forget this or your system won't start correctly due to the mounted PVCs:

kubectl delete pod my-temp-pod