Troubleshooting a Kubernetes installation
Access logs
Run kubectl logs -c <process> <pod> to get the logs for a given process. To avoid needing to copy-paste the pod name, you can use kubectl to look up the pod name by label and pass it into the logs command:
kubectl logs $(kubectl get pod -o name -l app=webapi | cut -d/ -f 2) -c webapi
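If you prefer not to strip the pod/ prefix with cut, kubectl can emit just the pod name via a jsonpath expression. A sketch using the same app=webapi label and webapi container as the example above:

```shell
# Look up the first pod matching the label and fetch its container logs.
# The label (app=webapi) and container name follow the example above.
POD=$(kubectl get pod -l app=webapi -o jsonpath='{.items[0].metadata.name}')
kubectl logs "$POD" -c webapi
```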
Shell access
If you need to examine the installation, you can use a management-shell pod. All the volumes are mounted read-write, so you can also update files as necessary.
kubectl exec deploy/management-shell --tty --stdin -- /bin/bash
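You can also copy individual files out for closer inspection with kubectl cp rather than editing them in place. A sketch — the app=management-shell label selector and the file path are illustrative, so check your pod's actual labels with kubectl get pods --show-labels:

```shell
# Resolve the management-shell pod name by label, then copy a file out.
# The label selector and file path below are illustrative.
POD=$(kubectl get pod -o name -l app=management-shell | cut -d/ -f 2)
kubectl cp "$POD:/etc/sysconfig/deephaven/example.prop" ./example.prop
```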
Leftover worker pods
If you have leftover worker pods, they may hold onto PVCs (PersistentVolumeClaims), preventing a new installation.
For example:
$ helm install merry-meerkat ./deephaven/ -f values.yaml
Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: PersistentVolume "dhprefix-pv-db-systems" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "merry-meerkat": current value is "limber-llama"
$ kubectl get pvc
NAME                                    STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dhprefix-nfs-pvc-etcdbackup-deephaven   Bound         dhprefix-nfs-pv-etcdbackup-deephaven-2     1Gi        RWX                           25h
dhprefix-pvc-db-systems                 Terminating   dhprefix-pv-db-systems                     1Gi        RWX                           18h
dhprefix-pvc-db-tempfiles               Terminating   dhprefix-pv-db-tempfiles                   1Gi        RWX                           18h
dhprefix-pvc-db-users                   Terminating   dhprefix-pv-db-users                       1Gi        RWX                           18h
dhprefix-pvc-db-venvs                   Terminating   dhprefix-pv-db-venvs                       1Gi        RWX                           18h
dhprefix-pvc-etc-sysconfig-deephaven    Terminating   dhprefix-pv-etc-sysconfig-deephaven        1Gi        RWX                           18h
dhprefix-pvc-var-log-deephaven          Terminating   dhprefix-pv-var-log-deephaven              1Gi        RWX                           18h
data-dhprefix-etcd-0                    Bound         pvc-bb30718e-2f36-490d-990e-d02a0da4adac   8Gi        RWO            standard       25h
pvc-dhprefix-nfs-server                 Bound         pvc-bf7e3d0e-88c7-4be2-b6cc-6dd44b354c8a   10Gi       RWO            premium-rwo    27h
$ kubectl get pods
NAME                                      READY   STATUS      RESTARTS   AGE
dhprefix-etcd-0                           1/1     Running     0          25h
merge-server-f94cf8b9-k8n4n-worker-1      0/1     Completed   0          17h
merge-server-f94cf8b9-k8n4n-worker-6      0/1     Error       0          13h
merge-server-f94cf8b9-nzftj-worker-36     0/1     Error       0          17h
nfs-server-59c49c4cd7-zqh2f               1/1     Running     0          27h
query-server-76f8db9d94-544dv-worker-1    0/1     Completed   0          17h
query-server-76f8db9d94-544dv-worker-2    0/1     Completed   0          17h
query-server-76f8db9d94-544dv-worker-8    0/1     Error       0          13h
query-server-76f8db9d94-544dv-worker-9    0/1     Error       0          13h
query-server-76f8db9d94-glw9d-worker-69   0/1     Error       0          17h
query-server-76f8db9d94-glw9d-worker-70   0/1     Error       0          17h
This can be corrected by deleting those pods:
kubectl delete pods -l 'role=query-worker'
kubectl delete pods -l 'role=merge-worker'
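The Terminating claims should finish deleting once no pod mounts them; Kubernetes holds them open with the kubernetes.io/pvc-protection finalizer while a pod still references them. A quick way to confirm, using a PVC name from the listing above:

```shell
# Watch the PVC list until the Terminating claims disappear.
kubectl get pvc --watch
# If a claim stays stuck, check which finalizers are holding it
# (kubernetes.io/pvc-protection means some pod still mounts it):
kubectl get pvc dhprefix-pvc-db-systems -o jsonpath='{.metadata.finalizers}'
```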
If there are other leftover resources after uninstalling the release, they should also be removed via kubectl, as in the following example.
Caution
Note that removing the PVCs for intraday data will delete any unmerged intraday data.
# Remove preinstall hook:
kubectl delete jobs,secrets -l app.kubernetes.io/instance=dhe-k8s-test
# Remove intraday data volumes:
kubectl delete jobs,secrets,pv,pvc -l app.kubernetes.io/instance=dhe-k8s-test
Some files in the NFS data directory, including caches and generated TLS/etcd keys, should be removed as well. If the initial data was extracted to /exports/dhsystem on the NFS server, then the appropriate command to clean up the outdated configuration and caches (without deleting user or system data) is:
sudo rm -vrf /exports/dhsystem/{db/TempFiles,etc}
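Since the rm above runs as root with -rf, it is worth previewing what the brace expansion matches before deleting anything:

```shell
# List the targets before deleting them; the braces expand to
# /exports/dhsystem/db/TempFiles and /exports/dhsystem/etc
sudo ls -d /exports/dhsystem/{db/TempFiles,etc}
```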
Restart a pod
Most pods can be restarted by scaling their deployment down to 0 and then back to 1 pod: run kubectl scale deployment <deployment-name> --replicas=0 followed by kubectl scale deployment <deployment-name> --replicas=1.
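For example, with a hypothetical deployment name, using kubectl rollout status to wait until the replacement pod is ready:

```shell
# Scale the deployment (the name web-api is illustrative) down and back up.
kubectl scale deployment web-api --replicas=0
kubectl scale deployment web-api --replicas=1
# Block until the new pod reports Ready.
kubectl rollout status deployment web-api
```

On kubectl 1.15 and later, kubectl rollout restart deployment <deployment-name> performs a similar rolling restart without touching the replica count.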
Debug template syntax errors
If you have a syntax error, it is often not clear where it comes from. To debug, first run Helm's templating engine:
helm template ./deephaven/ -f awsValues.yaml
If the YAML fails to render, you will get a message like the following:
Error: YAML parse error on deephaven/templates/acl_writer/service.yaml: error converting YAML to JSON: yaml: line 5: mapping values are not allowed in this context
Use the --debug flag to render the invalid YAML to a file:
# If you add --debug, you will end up with lots of YAML. You can limit the file to a single one with the `-s` option, as follows (redirected to `/tmp/t.yaml`):
helm template --debug ./deephaven/ -f awsValues.yaml -s 'templates/acl_writer/service.yaml' >/tmp/t.yaml
Running that file through yamllint may point out your error:
$ yamllint /tmp/t.yaml
ERROR YAML Lint failed for 1 file 09:58:59
/tmp/t.yaml 09:58:59
ERROR bad indentation of a mapping entry (7:14) 09:58:59
4 | kind: Service
5 | metadata:
6 | name: aclwriter
7 | labels: app: aclwriter
------------------^
8 | spec:
9 | type: ClusterIP
If you have two separate YAML blocks, only the first one will print when the second one has errors, even with the --debug and -s options. You can comment out earlier blocks to get the block of interest to render.
You may also find the --validate flag useful, as it checks your YAML against the Kubernetes object definitions and the state of the cluster.
Debug a pre-install hook
The pre-install hook logs can be retrieved with kubectl logs. You may want to look at log files from an individual run or experiment with running different scripts. To do this, you can introduce a sleep after a failed pre-install script by including a debug.preInstall entry in your values.yaml as follows:
debug:
  preInstall: true
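While the hook pod is held open by the sleep, you can open a shell in it to read that run's log files or rerun its scripts by hand. The pod name below is a placeholder — find the actual hook pod with kubectl get pods:

```shell
# Exec into the paused pre-install hook pod. Replace the placeholder
# with the actual pod name from `kubectl get pods`.
kubectl exec --stdin --tty <pre-install-pod-name> -- /bin/bash
```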
Debug using a remote debugger
You can enable debugging of a remote process by following these steps:
- Configure your my-values.yaml file with port and debug information.
- Upgrade the Deephaven Helm installation.
- Scale the deployment(s) to be debugged down to 0 pods, then back to 1, to ensure they pick up the required settings.
- Forward the port from your host to the pod.
Add configurations to your my-values.yaml file for the process you want to debug. See Creating a Service for more information on Kubernetes services and port settings in general.
# Use whatever port you want
process:
  controller:
    jvmArgsUser: '-Xdebug -agentlib:jdwp=transport=dt_socket,address=0.0.0.0:5005,server=y,suspend=n'
Note
Refer to the process section of the values.yaml file in the Deephaven chart to see what other processes may be configured for debugging in addition to controller.
Once the above settings have been configured, you must upgrade the Helm chart, as was done during the chart installation. Run the following in the root directory containing the Deephaven chart directory:
helm upgrade deephaven-helm-release-name ./deephaven/ -f my-values.yaml --set image.tag=<deephaven-release-version> --debug
The easiest way to restart the component is to run kubectl scale deployment/<deployment-name> --replicas=0 followed by kubectl scale deployment/<deployment-name> --replicas=1.
To see the names of all deployments, you may run kubectl get deployments.
Next, you should enable port forwarding from your local machine to the deployment with a command like the following that forwards local port 5005 to port 5005 on the controller pod:
kubectl port-forward deployment/<deployment-name> 5005:5005
After your pod has restarted, you can run a remote debugger as normal.
Debug the Swing console
To debug the Swing console, add similar JVM arguments to your getdown.global file, then reload the CUS.
You can then debug your locally launched IrisConsole process.
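For instance, Getdown configuration files accept repeated jvmarg directives. This is a sketch only — confirm the exact key and any existing jvmarg lines in your getdown.global; the port matches the controller example above:

```text
jvmarg = -agentlib:jdwp=transport=dt_socket,address=0.0.0.0:5005,server=y,suspend=n
```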