Troubleshooting a Podman deployment
For standard Podman support questions, refer directly to the Podman troubleshooting docs.
The status of Podman pods can be viewed with `podman pod ls`:
```
POD ID        NAME          STATUS   CREATED      INFRA ID      # OF CONTAINERS
765f71d7996e  dh-infra-pod  Running  2 weeks ago  694adc64e997  2
```
The status of Podman containers can be viewed with `podman container ls`:
```
CONTAINER ID  IMAGE                                         COMMAND  CREATED      STATUS      PORTS                        NAMES
694adc64e997  localhost/podman-pause:4.9.4-rhel-1730457905           2 weeks ago  Up 2 weeks  10.128.1.118:8000->8000/tcp  765f71d7996e-infra
882b0bea033b  localhost/dh-infra:latest                              2 weeks ago  Up 2 weeks  10.128.1.118:8000->8000/tcp  dh-infra
```
Multiple containers can run in a pod, and containers share access to pod resources. The container using the `podman-pause` image is the infra container, which coordinates the pod's shared kernel namespaces.
These commands show only running pods or containers. To also see failed or stopped pods or containers, add `-a` to the commands.
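For example:

```sh
# Include stopped and failed pods and containers in the listings.
podman pod ls -a
podman container ls -a
```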
Unable to `podman load` images
This is typically a permissions or configuration issue. Ensure that the user running Podman has configured entries in `/etc/subuid` and `/etc/subgid`.
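As a quick check (a sketch only; the ID ranges below are examples and the `usermod` step generally needs an administrator):

```sh
# Verify that subordinate UID/GID ranges exist for the current user.
grep "^$(whoami):" /etc/subuid /etc/subgid

# If no entries are found, an administrator can allocate ranges, for example:
#   sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $(whoami)
# After changing the mappings, refresh Podman's user namespace state:
#   podman system migrate
```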
Unable to build images
Images must be built on an Intel-type system. Generally, `podman build` provides a good indication of what went wrong in its output.
Some common causes of build failure are:
- No Internet access and no alternate repositories configured for the DNF packages needed during the build.
- Insufficient disk space because of old images or build cache. To delete images not currently being used by containers/pods and to clear the build cache, run:

  ```sh
  podman system prune -a
  ```
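Before pruning, it can help to see how much space images, containers, and volumes are consuming:

```sh
# Show Podman disk usage for images, containers, and local volumes.
podman system df
```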
`start_command.sh` invocation hangs at `admin_init`
There are two known possibilities here: a problem resolving the name of the pod from within the pod, or "extra" instances of etcd left running from a previous pod.
Name resolution within the pod
When processes within the pod attempt to connect to other processes also running within the pod, they need to resolve the pod's internal IP address from the pod's DNS name. The `admin_init.sh` script gets the etcd endpoint to connect to from `/etc/sysconfig/deephaven/etcd/client/root/endpoints`. This is a text file that can be viewed with `cat`, `vi`, etc. The endpoint URI will be of the form `https://mydeephaven.myorg.com:2379`. The `admin_init.sh` script resolves the IP address of `mydeephaven.myorg.com` from `/etc/hosts`.
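For example, to view the configured endpoint from inside the pod:

```sh
# Print the etcd endpoint URI used by admin_init.sh.
cat /etc/sysconfig/deephaven/etcd/client/root/endpoints
```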
The resolution process, and the presence of etcd, can be checked from within the pod with:
```sh
curl --cacert /etc/sysconfig/deephaven/etcd/client/root/cacert https://mydeephaven.myorg.com:2379
```
The latter part of the command is the endpoint URI from the `endpoints` file above. It should return `404 page not found`. If it instead returns `curl: (6) Could not resolve host: ...`, then there is a problem with the entry in `/etc/hosts`.
The `/etc/hosts` file should contain entries for the short and long names of the pod (e.g. `mydeephaven` and `mydeephaven.myorg.com`) with the IP loopback address of localhost (127.0.0.1). Normally this should be sufficient.
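A typical entry, using the example hostname above, looks like the following (exact names vary by deployment):

```
127.0.0.1   localhost mydeephaven mydeephaven.myorg.com
```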
In some early deployments of Deephaven Podman, the pod's internal IP address was used instead of the loopback address. If there is configuration that expects this internal address, the address can be found in the output of `ifconfig`:
```
ifconfig

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1168377  bytes 671693078 (640.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1168377  bytes 671693078 (640.5 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

tap0: flags=67<UP,BROADCAST,RUNNING>  mtu 65520
        inet 10.0.2.100  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::f0c6:f1ff:fe13:5487  prefixlen 64  scopeid 0x20<link>
        inet6 fd00::f0c6:f1ff:fe13:5487  prefixlen 64  scopeid 0x0<global>
        ether f2:c6:f1:13:54:87  txqueuelen 1000  (Ethernet)
        RX packets 27564  bytes 4297036 (4.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 27408  bytes 2125868 (2.0 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```
`10.0.2.100` is typical for Podman on Linux. Podman on other platforms may be different.
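If configuration referencing the internal address is suspected, one place to look is the Deephaven configuration inside the pod; a sketch (the address is taken from the example output above, and the search path is only a starting point):

```sh
# Search Deephaven configuration for references to the pod's internal address.
grep -r "10.0.2.100" /etc/sysconfig/deephaven 2>/dev/null
```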
Orphaned etcd processes
In some cases, Podman may stop a pod without stopping all the processes used by the pod. The cause of this is not currently known. These "orphaned" processes can cause problems when launching a Deephaven Podman pod.
If `start_command.sh` is executed for an infra pod on a host that has leftover etcd processes running, `admin_init` will hang while attempting to manage the old etcd process.
To diagnose this situation, stop the pod:
```sh
podman pod stop --time 90 dh-infra
```
Then, check running processes on the host with:
```sh
ps -ef | grep "etcd "
```
If no infra pods are running on the host, and no other etcd instances are running on the host, this should return only one line, for the `grep` itself. If other process entries are returned, these orphaned processes must be killed before `start_command.sh` is tried again.
```
# example of leftover etcd process entries
deephav+ 3943722 3931540  0 Jan15 ?  00:00:00 sudo -i -u etcd -g irisadmin /bin/bash -c echo "etcd: Starting etcd (with config file /etc/etcd/dh/latest/config.yaml, logging to /var/log/dh-etcd.log)..."; GOMAXPROCS=16 /usr/bin/etcd --config-file '/etc/etcd/dh/latest/config.yaml' >/var/log/dh-etcd.log 2>&1; ETCD_EXIT_STATUS=$?; echo "etcd: etcd exited with status $ETCD_EXIT_STATUS! etcd bash shell exiting."; exit $ETCD_EXIT_STATUS;
3713482  3943727 3943722  0 Jan15 ?  00:00:00 /bin/bash -c echo "etcd: Starting etcd (with config file /etc/etcd/dh/latest/config.yaml, logging to /var/log/dh-etcd.log)..."; GOMAXPROCS=16 /usr/bin/etcd --config-file '/etc/etcd/dh/latest/config.yaml' >/var/log/dh-etcd.log 2>&1; ETCD_EXIT_STATUS=$?; echo "etcd: etcd exited with status ! etcd bash shell exiting."; exit ;
3713482  3943757 3943727  0 Jan15 ?  00:02:45 /usr/bin/etcd --config-file /etc/etcd/dh/latest/config.yaml
```
Important
Before killing any suspected orphaned etcd processes, double-check that they are not related to other instances of etcd or to Deephaven Podman infra pods running correctly on the host. One item to verify is the ownership of the parent process (`deephav+` in the example above). Also, if there are multiple processes, verify that they are related. In the example above, the first is the parent of the second, and the second is the parent of the third.
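Once you are confident the processes are orphaned, they can be terminated by PID. This is only a sketch; substitute the PIDs from your own `ps` output, and note that `sudo` may be needed depending on which user owns the processes:

```sh
# Terminate the orphaned process tree, starting with the parent PID from the ps output.
kill 3943722
# Re-check; escalate only if processes survive a graceful TERM.
# kill -9 3943722
ps -ef | grep "etcd "
```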
Pauses for a relatively long time (60 seconds plus) at two points during initialization
This can be seen when viewing the initialization logs interactively with:

```sh
podman logs -f dh-infra
```

The pauses appear as relatively long stretches where nothing happens.
Once the script continues after the first long pause, it will generally show a message like `root_prepare execution time: 208.554s`, where `208.554s` is some amount of time in excess of 60 seconds. These pauses are caused by large numbers of old log files in a mounted log volume. Deleting logs that are no longer needed from the log volume will reduce or eliminate the pause.
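A sketch of one way to do this (the log path and the 30-day threshold are assumptions; adjust both to match the volume actually mounted for logs):

```sh
# List log files older than 30 days on the mounted log volume, then delete them
# once the listing has been reviewed.
find "${VOLUME_BASE_DIR:?}"/deephaven-logs -type f -name "*.log*" -mtime +30 -print
# find "${VOLUME_BASE_DIR:?}"/deephaven-logs -type f -name "*.log*" -mtime +30 -delete
```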
System not usable after ungraceful shutdown and restart with `podman pod start`
This has been seen intermittently when the host or the Podman process has been terminated ungracefully or abruptly.
If the system is not usable from the Web UI after such a restart scenario:
- Log into each of the containers with:

  ```sh
  podman exec -it <container name> bash
  ```

  where `<container name>` is `dh-infra`, `dh-query`, etc.

- In each node, run:

  ```sh
  /usr/illumon/latest/bin/dh_monit reload
  ```

  to reinitialize monit itself, and run:

  ```sh
  /usr/illumon/latest/bin/dh_monit restart all
  ```

  to ensure that all Deephaven processes are restarted.
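After the restart, process states can be confirmed from inside each container:

```sh
# Confirm that Deephaven processes report a running state.
/usr/illumon/latest/bin/dh_monit summary
```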
Avoid machine start on Linux
On Linux, a separately-spawned VM is not required. If Podman is installed, simply loading an image and starting a system is sufficient.
For anyone used to using Podman on macOS, running the following cycle may have become second nature. These commands are not required on Linux.

```sh
podman machine init
podman machine start
```
If run unintentionally on Linux, the following error will be seen:

```
Error: exec: "qemu-system-x86_64": executable file not found in $PATH
```
`start_command.sh` invocation does not result in a usable pod
To tail the log file of a running (or most recently running) infra container, use:

```sh
podman logs -f dh-infra
```

Or for a query container:

```sh
podman logs -f dh-query
```
In most cases, the infra container is the more likely place to find causes of problems because the bulk of initial cluster configuration is done through the entry point of the infra container.
The `dh_init` process that initializes/installs/upgrades a Deephaven Podman deployment takes a few minutes to run on the Deephaven infra node. Viewing the logs during this time will show it doing things like installing and configuring Python, configuring etcd (`admin_init.sh`), and verifying that key services like the `configuration_server` and `authentication_server` have started.
If an error occurs relatively early in the `dh_init` process, the pod will go immediately to a failed state at that point and will not be accessible. The types of issues that cause this kind of failure are usually related to something configured on the host, rather than inside the pod, and will need to be corrected before trying again to start the pod with `start_command.sh`. In rare cases, an early failure may also block `podman logs` from being able to access the logs from the pod. If this is the case, you can create a small script that runs the start command and then gets the logs, something like:

```sh
./start_command ...
sleep 1
podman logs -f dh-infra
```
Most failures that occur later, including those that cause services to fail on startup, are caught by the `dh_init` process and enter a 10-minute (600-second) wait period before shutting down the pod. This is to allow a user to shell into the pod and inspect log files and configuration to find the cause of the failure. If a pod has already exceeded its 10 minutes and has been shut down, it can be restarted with `podman pod start <pod_name>`. For an infra pod, the default pod name is `dh-infra-pod`.
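For example, using the default infra pod name shown in the `podman pod ls` output above:

```sh
# Restart the infra pod after the debug window has expired and the pod has shut down.
podman pod start dh-infra-pod
```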
A container that is running can be accessed using:

```sh
podman exec -it <container_name> bash
```

For example:

```sh
podman exec -it dh-infra bash
```
Some common failure modes in `dh_init` include:
- Fails with "Cluster initialization already started but not complete!" in the logs. This is usually caused by not deleting the running configuration data. Ensure

  ```sh
  sudo rm -vrf "${VOLUME_BASE_DIR:?}"/deephaven-shared-config/{auth,dh-config,etcd,installation,trust,cluster_status}
  ```

  has been run. It needs to be run every time before running `start_command.sh` for a cluster that has already been started once.
- Fails to complete init, and ends up in the 10-minute debug window. Run the following to get into the shell:

  ```sh
  podman exec -it dh-infra bash
  ```

  Then checking with `dh_monit` shows problems with the `configuration_server` and a failed `authentication_server`. The `authentication_server` current log (`cat /var/log/deephaven/authentication_server/AuthenticationServer.log.current`) shows:

  ```
  Could not start authentication server: java.lang.RuntimeException: No trust store found at "/deephaven-tls/truststore.p12"
  ```

  This is usually caused by not having the correct issuing certificate in `"${VOLUME_BASE_DIR:?}"/deephaven-tls/ca.crt`. A quick way to inspect that certificate is sketched after this list.
- Fails with `access denied` or `File exists` while accessing some file or directory during `dh_init`. This is usually caused by not completely deleting the running configuration, or by not applying SELinux `chcon` settings. Ensure

  ```sh
  sudo rm -vrf "${VOLUME_BASE_DIR:?}"/deephaven-shared-config/{auth,dh-config,etcd,installation,trust,cluster_status}
  chcon -vR -u system_u -r object_r -t container_file_t "${VOLUME_BASE_DIR:?}"
  ```

  have been run before `start_command.sh` is called.

- `dh_init` does not complete successfully. Checking during the 10-minute debug window shows most services reporting "Start Pending" in `dh_monit summary`, and the `configuration_server` process starts and stops in a loop. This can be caused by attaching a `java_lib` volume with contents that are not readable to Deephaven processes in the container due to incorrect permissions.

- Fails with a message stating `find: 'UNKNOWN' is not the name of a known user`. This is caused by `dh_root_prepare` being unable to recognize the owner or group of a file or directory being provided from outside the container. Typically this is because one of the volume paths includes files or directories not owned by the user running Podman. This can be corrected with:

  ```sh
  sudo chown -R <user_running_podman>:<user_running_podman> $VOLUME_BASE_DIR
  ```

  Note that this usually requires sudo rights (a `sudo` prefix on the command line). In some cases, depending on how and where the directories were created, you may be able to correct ownership without using `sudo`. If you do not have `sudo` rights, try without `sudo` and see if that helps. If not, contact an administrator to update the permissions for you.
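For the trust store failure described above, one hedged check of the issuing certificate provided on the host (the path is the one named in that bullet) is:

```sh
# Inspect the issuing certificate expected by the Deephaven TLS setup.
openssl x509 -in "${VOLUME_BASE_DIR:?}"/deephaven-tls/ca.crt -noout -subject -issuer -dates
```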
General Deephaven troubleshooting procedures can also help to isolate Deephaven Podman initialization issues.