Troubleshooting a Podman deployment

For standard Podman support questions, refer directly to the Podman troubleshooting docs.

The status of Podman pods can be viewed with podman pod ls:

POD ID        NAME          STATUS      CREATED      INFRA ID      # OF CONTAINERS
765f71d7996e  dh-infra-pod  Running     2 weeks ago  694adc64e997  2

The status of Podman containers can be viewed with podman container ls:

CONTAINER ID  IMAGE                                         COMMAND     CREATED      STATUS      PORTS                        NAMES
694adc64e997  localhost/podman-pause:4.9.4-rhel-1730457905              2 weeks ago  Up 2 weeks  10.128.1.118:8000->8000/tcp  765f71d7996e-infra
882b0bea033b  localhost/dh-infra:latest                                 2 weeks ago  Up 2 weeks  10.128.1.118:8000->8000/tcp  dh-infra

Multiple containers can run in a pod, and containers in a pod share access to the pod's resources. The container based on the podman-pause image is the infra container, which coordinates the pod's shared kernel namespaces.

These commands show only running pods or containers. To also see failed or stopped pods or containers, add -a to the commands.
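
For example, to include stopped or failed pods and containers:

podman pod ls -a
podman container ls -a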

Unable to podman load images

This is typically a permissions or configuration issue. Ensure that the user running Podman has configured entries in /etc/subuid and /etc/subgid.
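
One quick check is to look for the user's entries in both files. The user name and ID ranges below are only illustrative; actual values vary by system:

grep "^$(whoami):" /etc/subuid /etc/subgid
# expected form of the output:
# /etc/subuid:deephaven:100000:65536
# /etc/subgid:deephaven:100000:65536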

Unable to build images

Images must be built on an Intel-compatible (x86_64) system. Generally, the output of podman build gives a good indication of what went wrong.

Some common causes of build failure are:

  • No Internet access and no alternate repositories configured for the DNF packages needed during the build process.
  • Insufficient disk space because of old images or build cache. To delete images not currently being used by containers/pods and to clear the build cache, run:
podman system prune -a
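
To confirm that the build host has the expected architecture before digging further, check it with:

uname -m
# expected output on a suitable build host:
# x86_64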

start_command.sh invocation hangs at admin_init

There are two known causes: a problem resolving the name of the pod from within the pod, or "extra" instances of etcd left running from a previous pod.

Name resolution within the pod

When processes within the pod attempt to connect to other processes also running within the pod, they need to resolve the pod's internal IP address from the pod's DNS name. The admin_init.sh script gets the etcd endpoint to connect to from /etc/sysconfig/deephaven/etcd/client/root/endpoints. This is a text file that can be viewed with cat, vi, etc. The endpoint URI is of the form https://mydeephaven.myorg.com:2379. The admin_init.sh script resolves the IP address of mydeephaven.myorg.com from /etc/hosts.
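
For example (the host name shown is a placeholder):

cat /etc/sysconfig/deephaven/etcd/client/root/endpoints
# example output:
# https://mydeephaven.myorg.com:2379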

The resolution process, and the presence of etcd, can be checked from within the pod with:

curl --cacert /etc/sysconfig/deephaven/etcd/client/root/cacert  https://mydeephaven.myorg.com:2379

The latter part of the command is the endpoint URI from the endpoints file above. It should return 404 page not found. If it instead returns curl: (6) Could not resolve host: ..., then there is a problem with the entry in /etc/hosts.

The /etc/hosts file should contain entries mapping both the short and long names of the pod (e.g., mydeephaven and mydeephaven.myorg.com) to the loopback address (127.0.0.1). Normally this is sufficient.
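
A typical set of entries, using the placeholder names from above, looks like:

127.0.0.1   localhost
127.0.0.1   mydeephaven mydeephaven.myorg.com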

In some early deployments of Deephaven Podman, the pod's internal IP address was used instead of the loopback address. If some configuration still expects this internal address, the address can be found in the output of ifconfig:

ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1168377  bytes 671693078 (640.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1168377  bytes 671693078 (640.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

tap0: flags=67<UP,BROADCAST,RUNNING>  mtu 65520
        inet 10.0.2.100  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::f0c6:f1ff:fe13:5487  prefixlen 64  scopeid 0x20<link>
        inet6 fd00::f0c6:f1ff:fe13:5487  prefixlen 64  scopeid 0x0<global>
        ether f2:c6:f1:13:54:87  txqueuelen 1000  (Ethernet)
        RX packets 27564  bytes 4297036 (4.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 27408  bytes 2125868 (2.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

10.0.2.100 is typical for Podman on Linux. Podman on other platforms may be different.

Orphaned etcd processes

In some cases, Podman may stop a pod without stopping all the processes used by the pod. The cause of this is not currently known. These "orphaned" processes can cause problems when launching a Deephaven Podman pod.

If start_command.sh is executed for an infra pod on a host that has leftover etcd processes running, admin_init will hang while attempting to manage the old etcd process.

To diagnose this situation, stop the pod:

podman pod stop --time 90 dh-infra

Then, check running processes on the host with:

ps -ef | grep "etcd "

If no infra pods are running on the host, and no other etcd instances are running on the host, this should return only one line for grep itself. If other process entries are returned, these orphaned processes must be killed before start_command.sh is tried again.

# example of leftover etcd process entries
deephav+ 3943722 3931540  0 Jan15 ?        00:00:00 sudo -i -u etcd -g irisadmin /bin/bash -c echo "etcd: Starting etcd (with config file /etc/etcd/dh/latest/config.yaml, logging to /var/log/dh-etcd.log)..."; GOMAXPROCS=16 /usr/bin/etcd --config-file '/etc/etcd/dh/latest/config.yaml' >/var/log/dh-etcd.log 2>&1; ETCD_EXIT_STATUS=$?; echo "etcd: etcd exited with status $ETCD_EXIT_STATUS! etcd bash shell exiting."; exit $ETCD_EXIT_STATUS;
3713482  3943727 3943722  0 Jan15 ?        00:00:00 /bin/bash -c echo "etcd: Starting etcd (with config file /etc/etcd/dh/latest/config.yaml, logging to /var/log/dh-etcd.log)..."; GOMAXPROCS=16 /usr/bin/etcd --config-file '/etc/etcd/dh/latest/config.yaml' >/var/log/dh-etcd.log 2>&1; ETCD_EXIT_STATUS=$?; echo "etcd: etcd exited with status ! etcd bash shell exiting."; exit ;
3713482  3943757 3943727  0 Jan15 ?        00:02:45 /usr/bin/etcd --config-file /etc/etcd/dh/latest/config.yaml

Important

Before killing any suspected orphaned etcd processes, double-check that they are not related to other instances of etcd or to Deephaven Podman infra pods running correctly on the host. One item to verify is the ownership of the parent process - deephav+ in the example above. Also, if there are multiple processes, verify that they are related. In the example above, the first is the parent of the second, and the second is the parent of the third.
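
Once confident that the processes are orphaned, one approach is to kill the whole chain and then confirm nothing remains. Using the example PIDs above (substitute the actual PIDs from ps):

# kill the parent and its children from the example above
kill 3943722 3943727 3943757
# confirm that no etcd processes remain
ps -ef | grep "etcd "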

Pauses for a relatively long time (60 seconds plus) at two points during initialization

This can be seen when viewing the initialization logs interactively with:

podman logs -f dh-infra

as relatively long pauses where nothing happens.

Once the script continues after the first long pause, it generally shows a message like root_prepare execution time: 208.554s, where 208.554s is some amount of time in excess of 60 seconds. These pauses are caused by large numbers of old log files in a mounted log volume. Deleting no-longer-needed logs from the log volume reduces or eliminates the pause.
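
As a sketch, assuming the log volume is mounted under "${VOLUME_BASE_DIR:?}"/deephaven-logs (adjust the path and retention period to match the actual deployment):

# list log files older than 30 days to confirm what would be removed
find "${VOLUME_BASE_DIR:?}"/deephaven-logs -type f -name '*.log*' -mtime +30 -print
# then delete them
find "${VOLUME_BASE_DIR:?}"/deephaven-logs -type f -name '*.log*' -mtime +30 -delete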

System not usable after ungraceful shutdown and restart with podman pod start

This has been seen intermittently when the host or the podman process has been terminated ungracefully/abruptly.

If the system is not usable from the Web UI after such a restart scenario:

  • Log into each of the containers with:

     podman exec -it <container name> bash
    

    where <container name> is dh-infra, dh-query, etc.

  • In each node, run:

     /usr/illumon/latest/bin/dh_monit reload
    

    to reinitialize monit itself, and run:

    /usr/illumon/latest/bin/dh_monit restart all
    

    to ensure that all Deephaven processes are restarted.
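
Both steps can also be run across the containers from the host in one loop; a minimal sketch, assuming the containers are named dh-infra and dh-query and that dh_monit can be invoked non-interactively:

for c in dh-infra dh-query; do
  # reinitialize monit, then restart all Deephaven processes in the container
  podman exec "$c" /usr/illumon/latest/bin/dh_monit reload
  podman exec "$c" /usr/illumon/latest/bin/dh_monit restart all
done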

Avoid machine start on Linux

On Linux, a separately-spawned VM is not required. If Podman is installed, simply loading an image and starting a system is sufficient.

For anyone used to running Podman on macOS, the following cycle may have become second nature. These commands are not required on Linux.

podman machine init
podman machine start

If run unintentionally on Linux, the following error will be seen:

Error: exec: "qemu-system-x86_64": executable file not found in $PATH

start_command.sh invocation does not result in a usable pod

To tail the log file of a running (or most recently running) infra container, use:

podman logs -f dh-infra

Or for a query container:

podman logs -f dh-query

In most cases, the infra container is the more likely place to find causes of problems because the bulk of initial cluster configuration is done through the entry point of the infra container.

The dh_init process that initializes/installs/upgrades a Deephaven Podman deployment takes a few minutes to run on the Deephaven infra node. Viewing the logs during this time shows it doing things like installing and configuring Python, configuring etcd (admin_init.sh), and verifying the started status of key services such as the configuration_server and authentication_server.

If an error occurs relatively early in the dh_init process, the pod goes immediately to a failed state at that point and is not accessible. The types of issues that cause this kind of failure are usually related to something configured on the host, rather than inside the pod, and must be corrected before trying again to start the pod with start_command.sh. In rare cases, an early failure may also block podman logs from accessing the logs from the pod. If this is the case, you can create a small script that runs the start command and then tails the logs, something like:

#!/bin/bash
./start_command.sh ...
sleep 1
podman logs -f dh-infra

Most failures that occur later, including those that cause services to fail on startup, are caught by the dh_init process and enter a 10-minute (600-second) wait period before shutting down the pod. This allows a user to shell into the pod and inspect log files and configuration to find the cause of the failure. If a pod has already exceeded its 10 minutes and has been shut down, it can be restarted with podman pod start <pod_name>. For an infra pod, the default pod name is dh-infra-pod.
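
For example, using the default infra pod name shown in the podman pod ls output above:

podman pod start dh-infra-pod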

A container that is running can be accessed using:

podman exec -it <container_name> bash

For example:

podman exec -it dh-infra bash

Some common failure modes in dh_init include:

  • Fails with "Cluster initialization already started but not complete!" in logs. This is usually caused by not deleting the running configuration data. Ensure sudo rm -vrf "${VOLUME_BASE_DIR:?}"/deephaven-shared-config/{auth,dh-config,etcd,installation,trust,cluster_status} has been run. It needs to be run every time before running start_command.sh for a cluster that has already been started once.

  • Fails to complete init, and ends up in 10-minute debug window. Run the following to get into the shell:

    podman exec -it dh-infra bash
    

    Then, checking with dh_monit shows problems with the configuration_server and a failed authentication_server. The current authentication_server log (cat /var/log/deephaven/authentication_server/AuthenticationServer.log.current) shows:

    Could not start authentication server: java.lang.RuntimeException: No trust store found at "/deephaven-tls/truststore.p12"
    

This is usually caused by not having the correct issuing certificate in "${VOLUME_BASE_DIR:?}"/deephaven-tls/ca.crt.
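
To check which certificate is actually present, the PEM file's subject, issuer, and validity dates can be inspected with openssl:

openssl x509 -in "${VOLUME_BASE_DIR:?}"/deephaven-tls/ca.crt -noout -subject -issuer -dates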

  • Fails with access denied or File exists accessing some file or directory during dh_init. This is usually caused by not completely deleting running configuration, or by not applying SELinux chcon settings. Ensure sudo rm -vrf "${VOLUME_BASE_DIR:?}"/deephaven-shared-config/{auth,dh-config,etcd,installation,trust,cluster_status} and chcon -vR -u system_u -r object_r -t container_file_t "${VOLUME_BASE_DIR:?}" have been run before start_command.sh is called.
  • dh_init does not complete successfully. Checking during the 10-minute debug window shows most services reporting "Start Pending" in the dh_monit summary, and the configuration_server process starts and stops in a loop. This can be caused by attaching a java_lib volume whose contents are not readable by Deephaven processes in the container due to incorrect permissions.
  • Fails with a message stating find: 'UNKNOWN' is not the name of a known user. This is caused by dh_root_prepare being unable to recognize the owner or group of a file or directory provided from outside the container. Typically this happens because one of the volume paths includes files or directories not owned by the user running Podman. This can be corrected with: sudo chown -R <user_running_podman>:<user_running_podman> $VOLUME_BASE_DIR. Note that this usually requires sudo rights (a sudo prefix on the command line). In some cases, depending on how and where the directories were created, you may be able to correct ownership without using sudo. If you do not have sudo rights, try without sudo and see if that helps. If not, contact an administrator to update the permissions for you.
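
To see which paths have unexpected ownership before running the chown shown in the last item above, a quick check is:

# list anything under the volume base not owned by the current user
# (may need a sudo prefix if some directories are unreadable)
find "${VOLUME_BASE_DIR:?}" ! -user "$(whoami)" -ls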

General Deephaven troubleshooting procedures can also help to isolate Deephaven Podman initialization issues.