System status overview
This guide outlines system architecture and resources, along with management tools, logs, troubleshooting, and common failure modes. It is intended as a central point for topics relevant to diagnosing unexpected behaviors, failures, and system availability issues.
Most of the topics in this section are covered in more detail in other sections of the Deephaven Enterprise documentation.
Quick check of system operation
Details of these commands and underlying processes are in the sections below.
- Check for running processes (on each server):
  /usr/illumon/latest/bin/dh_monit summary
  This should report all processes with a green OK status. The number of processes listed will vary depending on server configuration.
- Attempt to start a Web console session: from a Chrome, Firefox, or Edge browser, go to https://<your_deephaven_fqdn>:8123/iriside (default port without Envoy) or https://<your_deephaven_fqdn>:8000/iriside (default port with Envoy). This should bring up a Web client login dialog.
- Attempt to log in (credentials will vary based on your environment's setup). This should show "Loading Workspace" and eventually bring up a Web Console. If login fails with a "Cannot connect, query config has status 'Stopped'" error, try starting the WebClientData query using safe mode.
- As a member of the iris-superusers group, click on Query Monitor. Clear any filters on this page. This should show the status of persistent queries running in the cluster. At the least, the ImportHelperQuery, RevertHelperQuery, and WebClientData queries should be in the Running state.
- Click on an existing Code Studio, or click on (+) New and New Code Studio. This should bring up a New Console Connection dialog.
- Attempt, one at a time, to start Core+ Consoles using each available Query Server, a reasonable amount of heap (e.g., 4GB), and with Groovy and/or Python depending on what is used in your environment.
- For each Console connection, attempt simple queries to validate that data can be retrieved and displayed.
  Groovy:
  p = db.liveTable("DbInternal","ProcessEventLog").where("Date=today()")
  d = db.historicalTable("LearnDeephaven","StockTrades").where("Date=`2017-08-25`")
  Python:
  p = db.live_table("DbInternal","ProcessEventLog").where("Date=today()")
  d = db.historical_table("LearnDeephaven","StockTrades").where("Date=`2017-08-25`")
Note
The second query in the above example requires that the LearnDeephaven namespace and data set have been installed. See Install LearnDeephaven for details.
Common causes of outages
Disk space
One of the most common causes of outages is an "out of disk space" condition on one of Deephaven's volumes. See the file system section below for details of the different storage areas and their uses, as well as tips for checking usage.
Expired certificate
An expired certificate will block connecting to the Web UI and updating from the Deephaven Launcher tool. It will also cause workers to fail after launch since they will not be able to authenticate with their dispatchers. See checking certificates and replacing certificates for more.
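A certificate's validity dates can be checked from a shell with openssl (a minimal sketch assuming the default non-Envoy Web port 8123; substitute your own host and port):
# Print the notBefore/notAfter dates of the certificate presented on the Web UI port
openssl s_client -connect <your_deephaven_fqdn>:8123 -servername <your_deephaven_fqdn> </dev/null 2>/dev/null | openssl x509 -noout -dates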
Connectivity disruption
Deephaven processes communicate over the network. When there is a network disruption or loss of connectivity, processes may fail. Even when there is no application-level network traffic between the processes, regular heartbeats are sent. If heartbeat messages are not acknowledged, processes may consider their upstream dependencies to be unreachable, and exit.
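A quick way to test basic reachability between nodes is to probe a service port with telnet (see the Tools section below). The host and port here are placeholders; substitute values from the Deephaven process ports documentation:
# A successful connection prints "Connected to <host>"; a hang or refusal suggests a network or firewall problem
telnet <remote_deephaven_host> <service_port>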
System (re)configuration
Outages can also follow system configuration changes made outside of Deephaven. This happens in particular when configuration management tools are used, which may reset settings that Deephaven depends on, such as:
- OS settings such as resource limits.
- Required sudoers permissions.
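To spot this kind of drift, a rough check of the limits and sudoers rules on a host can be run directly (a sketch assuming the default irisadmin admin account; substitute your custom user if different):
# Show effective resource limits for the account that runs Deephaven processes
sudo -u irisadmin bash -c 'ulimit -a'
# Review custom sudoers rules (the Deephaven-related filenames under /etc/sudoers.d vary by installation)
sudo ls -l /etc/sudoers.d/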
System services, components, and configuration
A Deephaven installation consists of a set of Java processes running on one or more Linux servers. Production installations typically have three or more nodes. Most of the Deephaven configuration is stored in etcd. Having three etcd nodes (or more, but always an odd number) provides high availability access to configuration data and protects the configuration data in the case of loss of a node.
Processes
Deephaven servers will generally have three or more Deephaven processes running as system services. These processes are all interdependent, often relying on the configuration_server
and authentication_server
processes (which may be running on other nodes within the cluster). See process dependencies below.
Note
Details about Deephaven processes, including the ports on which they communicate, can be found in the Deephaven process ports, Deephaven services, and runbooks sections of the documentation.
Deephaven application processes
Deephaven uses M/Monit to manage system services. There is a wrapper script for its use (to allow non-root users to use it with sudo) at /usr/illumon/latest/bin/dh_monit
. Only users with sudo -u irisadmin
access (or equivalent custom user, based on the DH_MONIT_USER
property during initial system setup) can use this tool.
Monit uses process PID files stored in /etc/deephaven/run
to track the running status of the processes it manages. It will try periodically to start a process that failed to start. The update process for status is also rather slow, so it may be a minute or two before services being started show an OK status.
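Because status updates are slow, it can be convenient to watch them converge after starting services. A minimal sketch, assuming the watch utility is installed and your account can run dh_monit without an interactive password prompt:
# Refresh the process summary every five seconds
watch -n 5 /usr/illumon/latest/bin/dh_monit summary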
Common commands with /usr/illumon/latest/bin/dh_monit are:
- /usr/illumon/latest/bin/dh_monit summary to show a list of processes and their statuses.
- /usr/illumon/latest/bin/dh_monit start followed by all or a process name, to start a process or all processes.
- /usr/illumon/latest/bin/dh_monit restart followed by all or a process name, to restart a process or all processes.
- /usr/illumon/latest/bin/dh_monit stop followed by all or a process name, to stop a process or all processes.
- /usr/illumon/latest/bin/dh_monit status to see detailed status information, such as uptime, for all processes, or, if specified, for a particular process.
- /usr/illumon/latest/bin/dh_monit up to start all processes. This does not affect already running processes.
- /usr/illumon/latest/bin/dh_monit down to stop all processes.
- /usr/illumon/latest/bin/dh_monit reload to reload the configuration of M/Monit. This is normally needed only after adding or removing a process (conf files in /etc/sysconfig/deephaven/monit). It does not affect processes managed by M/Monit.
An optional flag, -b or --block, is available for the start, restart, stop, up, and down actions. This causes the script to wait for the actions to fully complete before returning. If either flag is used with another action, it will generate an error.
# To start all processes:
/usr/illumon/latest/bin/dh_monit start all
# or:
/usr/illumon/latest/bin/dh_monit up
# To stop all processes and wait for all processes to have stopped:
/usr/illumon/latest/bin/dh_monit stop all --block
# or:
/usr/illumon/latest/bin/dh_monit down --block
# To see a list of all processes and their uptime:
/usr/illumon/latest/bin/dh_monit status | grep "uptime\|Process"
etcd
etcd is a dependency for Deephaven, and may be installed on Deephaven server nodes, or on servers that are not part of the Deephaven cluster. If etcd is inaccessible, no Deephaven services will be able to start.
etcd is not managed by M/Monit. etcd's status can be checked with:
sudo systemctl status dh-etcd
The health of the etcd cluster can be checked by logging in (via ssh
) to a Deephaven server that is running the configuration_server
process, and executing:
# `irisadmin` or equivalent custom user based on the `DH_MONIT_USER` property during initial system setup
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status -w table
This should return a table showing all the nodes in the etcd cluster and some details about their roles and DB size. Something like:
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.10.10.53:2379 | 20f1fe672cdca01d | 3.5.5 | 91 MB | false | 23922 | 67628 |
| https://10.10.10.54:2379 | 8cdf5ce8a296848f | 3.5.5 | 91 MB | false | 23922 | 67629 |
| https://10.10.10.55:2379 | 4ea3e72f6e028887 | 3.5.5 | 91 MB | true | 23922 | 67630 |
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
If any nodes instead show context deadline exceeded
, that likely indicates that the node is down or unreachable.
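A per-member liveness check can be run the same way (a sketch assuming the etcdctl.sh wrapper passes subcommands through to etcdctl as in the status example above; it requires the same access):
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint health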
Note
Only hosts which run the configuration_server
will be able to access the etcd cluster with the /usr/illumon/latest/bin/etcdctl.sh
script. Nodes that do not have a configuration_server
process do not have the required keyfiles to access the etcd cluster.
MariaDB / MySQL
Earlier versions of Deephaven always used a SQL database hosted in MySQL or MariaDB to store internal application users and permissions. In this version of Deephaven, the access control list (ACL) data can be stored in a SQL database or in etcd. The ACL storage type is defined at installation-time with the DH_ACLS_USE_ETCD
flag, which defaults to true
(use etcd ACLs).
If SQL ACL storage is used, then the MariaDB or MySQL database service is a core dependency for Deephaven processes. If the ACL database is not available, no processes will be able to start. The SQL ACL DB process is not managed by M/Monit. SQL ACL DB status can be checked with the following:
# MariaDB's status can be checked with:
sudo systemctl status mariadb
# Or, for MySQL:
sudo systemctl status mysql
Envoy
Envoy is an optionally installed reverse proxy used to allow access to Deephaven services over a single port. It is often run in a Docker container, but can be run as a regular service on a Deephaven server. If Envoy is configured, but not working correctly, Deephaven server processes will generally appear to be in a good state, but users will not be able to connect to these services from client applications.
The Envoy process is not normally managed by M/Monit. If it is being run as a Docker container, it should be visible running in the output from:
sudo docker container ls
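If the container is running but clients still cannot connect, Envoy's recent output is often informative. A hedged sketch, assuming the container is named envoy (the actual name varies by installation; take it from the container listing above):
# Show the last 100 lines of Envoy's log output
sudo docker logs --tail 100 envoy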
Configuration
Details about Deephaven configuration are available in the configuration section, which discusses all aspects of Deephaven configuration. Server processes get their launch configuration (run as, heap allocation, etc.) from files in /etc/sysconfig/illumon.confs
, and then read the rest of their configuration settings from properties served by the configuration server, or, for the configuration server itself, from etcd.
Server process startup configuration
If a process configuration problem is related to the process's Java JVM arguments, or the account under which the process runs, those settings can be viewed in /etc/sysconfig/illumon.confs/hostconfig.system
and can be modified/overridden by editing /etc/sysconfig/illumon.confs/illumon.iris.hostconfig
.
Warning
Never edit the /etc/sysconfig/illumon.confs/hostconfig.system
script because the next Deephaven install/upgrade process will overwrite the changes. All changes should be made to /etc/sysconfig/illumon.confs/illumon.iris.hostconfig
.
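A safe, read-only way to review which launch settings have been customized on a host is to compare the override file against the system defaults:
# Lines unique to illumon.iris.hostconfig are the local overrides
diff /etc/sysconfig/illumon.confs/hostconfig.system /etc/sysconfig/illumon.confs/illumon.iris.hostconfig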
Property file configuration
Property file settings control a significant portion of Deephaven behavior. A large number of the possible settings are controlled by defaults coded in the property file iris-defaults.prop
, which is stored in etcd during install/upgrade, but is also accessible from /usr/illumon/latest/etc/iris-defaults.prop
. All custom properties should be overridden in install-specific property files, such as iris-environment.prop
.
For example, the total heap that a single dispatcher allocates for workers is in the property RemoteQueryDispatcher.maxTotalQueryProcessorHeapMB
. To see the default, run:
/usr/illumon/latest/bin/dhconfig props export iris-defaults.prop | grep "RemoteQueryDispatcher.maxTotalQueryProcessorHeapMB"
which prints the following to stdout
:
RemoteQueryDispatcher.maxTotalQueryProcessorHeapMB=354304
If a configuration change is needed to a setting that a process reads during its initialization or while it is running, there are a few steps that must be followed:
- Export the configuration property file from etcd, using:
  /usr/illumon/latest/bin/dhconfig properties export -f <filename> -d <directory to export the file to>
  Make installation-specific property changes in the iris-environment.prop property file. Never change the iris-defaults.prop or iris-endpoints.prop configuration files because the next Deephaven install/upgrade process will overwrite the changes.
- Edit the exported file and make the needed changes. Potentially, make a backup of the original version that was exported, in case settings changes need to be rolled back.
- Import the updated file into etcd, using:
  /usr/illumon/latest/bin/dhconfig properties import -f <filename> -d <directory to import the file from>
  You must provide authentication to import properties, either by specifying a privileged key or by using sudo -u irisadmin (or equivalent custom user, based on the DH_MONIT_USER property during initial system setup) to run this command as the Deephaven admin service account.
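For example, a minimal sketch of this export/edit/import cycle for iris-environment.prop (the /tmp working directory and backup filename are illustrative):
# Export the current file from etcd to a working directory
sudo -u irisadmin /usr/illumon/latest/bin/dhconfig properties export -f iris-environment.prop -d /tmp
# Keep a backup in case the change needs to be rolled back, then edit
cp /tmp/iris-environment.prop /tmp/iris-environment.prop.bak
vi /tmp/iris-environment.prop
# Import the edited file back into etcd
sudo -u irisadmin /usr/illumon/latest/bin/dhconfig properties import -f iris-environment.prop -d /tmp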
In most cases it is necessary to restart or reload processes that use the new/modified property. The authentication server (sudo -u irisadmin /usr/illumon/latest/bin/iris auth_server_reload_tool) and the iris controller can have their configuration reloaded without having to restart the processes themselves, except when they are running in Kubernetes.
Warning
It is critical that no changes are made in /etc/sysconfig/illumon.confs/hostconfig.system
or the iris-defaults.prop
and iris-endpoints.prop
files. Changes made to any of these will be overwritten during the next install/upgrade process.
Data routing configuration
The other main configuration data set that might need to be modified to address a problem is the table data services routing YAML. Editing this file is similar to editing properties files:
- Export the routing configuration file from etcd, using:
  /usr/illumon/latest/bin/dhconfig routing export -f <file name and path to export the file to>
- Edit the exported file and make the needed changes. Potentially, make a backup of the original version that was exported, in case settings changes need to be rolled back.
- Import the updated file into etcd, using:
  /usr/illumon/latest/bin/dhconfig routing import -f <file name and path to import the file from>
  You must provide authentication to import the routing file, either by specifying a key or user and password, or by using sudo to run this command as irisadmin or another Deephaven admin service account.
Warning
This file is in YAML format and has specific white space and delimiter formatting requirements. General YAML can be validated in online YAML validation tools and in some editor utilities. Deephaven-specific validation can be accomplished by adding --validate to the dhconfig arguments, or by using the /usr/illumon/latest/bin/dhconfig routing validate command.
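Before running the Deephaven-specific validation, basic YAML well-formedness can be checked locally with a one-liner (a sketch assuming Python 3 with the PyYAML module is available; the /tmp/routing.yml path is illustrative):
# Exits silently if the file parses; prints a parse error with line and column otherwise
python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' /tmp/routing.yml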
Process dependencies
From a dependency perspective, etcd is a required dependency for all Deephaven server processes. Once etcd is running, the configuration server is the next process to start, and is a dependency for all other services. The authentication server is also a common dependency for most other services. There is a co-dependency between the configuration server and the authentication server; when the configuration server starts, it will attempt repeatedly to contact the authentication server, while the authentication server will fail to start if the configuration server is not running or is not yet accepting connections. The /usr/illumon/latest/bin/dh_monit up
command starts services in the proper order, ensuring a fast and reliable startup.
Workers are processes that are launched by RemoteQueryDispatcher
services (either a db_query_service
or db_merge_service
process) on a node. Workers require a local log_aggregation_service
in order to start.
File system
Within the cluster, there are several types of data and product files, which may be on separate volumes.
Type | Purpose | Typical Path |
---|---|---|
Logs | Text and binary logs from Deephaven processes. | /var/log/deephaven |
Intraday | Appended data from Deephaven processes and external streams. | /db/Intraday |
User | Tables created and managed by users. Often shared across servers with NFS. | /db/Users |
IntradayUser | Ticking/appending tables managed by users. | /db/IntradayUser |
Ingester | Appended data from in-worker data import processes such as Kafka (optional). | /db/dataImportServers |
Historical | Organized data in Deephaven or Parquet format - usually shared to query servers with NFS. | /db/Systems |
Product Files | Binaries and default configuration files. | /usr/illumon/${version} with a link from /usr/illumon/latest |
Core+ Files | Binaries and Python virtual environments for Core+ workers. | /usr/illumon/coreplus/${version} with a link from /usr/illumon/coreplus/latest |
Core+ VEnvs | Python Virtual environment directories for Core+ workers. | /usr/illumon/coreplus/venv/${version} with a link from /usr/illumon/coreplus/venv/latest |
Configuration Files | Installation-specific files and binaries. | /etc/sysconfig/deephaven with links from /etc/sysconfig/illumon.d and /etc/sysconfig/illumon.confs |
TempFiles | Service account home directories and storage of cached per-worker classes. | /db/TempFiles |
VEnvs | Python Virtual environment directories for Legacy workers. | /db/VEnvs |
Free and used disk space by volume can be seen using the df
Linux command:
df
Filesystem 1K-blocks Used Available Use% Mounted on
devtmpfs 16379736 0 16379736 0% /dev
tmpfs 16388492 0 16388492 0% /dev/shm
tmpfs 16388492 9108 16379384 1% /run
tmpfs 16388492 0 16388492 0% /sys/fs/cgroup
/dev/sda2 104640560 15577992 89062568 15% /
/dev/sda1 204580 11464 193116 6% /boot/efi
tmpfs 3277700 0 3277700 0% /run/user/996
tmpfs 3277700 0 3277700 0% /run/user/1006
tmpfs 3277700 0 3277700 0% /run/user/9001
tmpfs 3277700 0 3277700 0% /run/user/9000
tmpfs 3277700 0 3277700 0% /run/user/0
This is a fairly healthy system - even on the /
mount point, which is hosting all Deephaven paths, only 15% is used.
To view disk space used by directory, the du
utility provides many options.
du -h -d1 /var/log/deephaven/
0 /var/log/deephaven/deploy_schema
13M /var/log/deephaven/dis
2.6M /var/log/deephaven/merge_server
4.7M /var/log/deephaven/ltds
2.8M /var/log/deephaven/query_server
223M /var/log/deephaven/tdcp
120K /var/log/deephaven/monit
808K /var/log/deephaven/install_configuration
0 /var/log/deephaven/previous_install
2.5M /var/log/deephaven/acl_write_server
14M /var/log/deephaven/authentication_server
5.2G /var/log/deephaven/binlogs
2.3M /var/log/deephaven/configuration_server
3.2M /var/log/deephaven/iris_controller
3.4M /var/log/deephaven/las
316K /var/log/deephaven/misc
0 /var/log/deephaven/plugins
3.1M /var/log/deephaven/web_api_service
224M /var/log/deephaven/tailer
5.7G /var/log/deephaven/
In this example, -h
is for "human-readable" output, and -d1
is to limit depth under /var/log/deephaven/
to one level. The result is space used by subdirectory under /var/log/deephaven/
, and a summary of total space used by /var/log/deephaven/
.
The du
command also accepts the -s
flag to print only the summary of the path you pass to it.
Note
Both the du
and df
commands accept a filesystem path as an argument, to specify disk usage and disk free space for a particular location. Both also accept the -h
argument for "human-readable" sizes, as does the sort
command. Thus, you can obtain a directory list sorted by disk usage by running du -sh /some/directory/* | sort -h
.
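For example, to rank the top-level directories under /db by size (assuming your Deephaven data volumes are mounted there):
# Run with sudo if some subdirectories are not readable by your account
du -sh /db/* | sort -h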
File system cleanup
When working with large amounts of data, large amounts of disk space are needed. You should delete old logs and unused binary files regularly to avoid running out of disk space.
Cleaning /var/log/deephaven
The /var/log/deephaven/
path is one where it is generally safe to delete old data.
Warning
Because the files that exist under /var/log/deephaven
are being actively written by live processes, rm
should not be used to batch delete files. Instead, find
must be used to find files to delete based on modification time, with the -delete
option to remove them.
For example, find /var/log/deephaven -type f -mtime +7 -delete
will find all files older than 7 days and remove them. This will include files under /var/log/deephaven/binlogs
, which may warrant their own retention period. If you need to delete write protected files, you may need to use -exec rm -f {} +
instead of -delete
.
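If binary logs warrant a longer retention period than other process logs, the same approach can be split into two passes. A hedged sketch, assuming 7-day retention for most logs and 30-day retention for binlogs (adjust both to your own policy):
# Delete non-binlog files older than 7 days
find /var/log/deephaven -type f ! -path '*/binlogs/*' -mtime +7 -delete
# Delete binary log files older than 30 days
find /var/log/deephaven/binlogs -type f -mtime +30 -delete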
Cleaning old versions of Deephaven
Deephaven uses a "versioned" installation process, where each new version is unpacked into its own set of directories, and a latest soft link is then pointed at the currently installed version.
Always keep at least two previous versions on disk in case you need to roll back to a previous version. However, keeping any more than five is unlikely to ever provide value.
The locations with latest-linked versioned product files are:
- /usr/illumon/latest - Enterprise binary files, script and default configuration files
- /usr/illumon/coreplus/latest - Core+ binary files
- /usr/illumon/coreplus/venv/latest - Core+ Python virtual environments
- /etc/sysconfig/deephaven/illumon.d.latest - Copies of user-owned system configuration files (everything in /etc/sysconfig/illumon.d)
- /etc/sysconfig/deephaven/illumon.confs.latest - "hostconfig" environment files; uses very little disk space
For all of the above directories containing a latest link, you can find a list of old versions using ls -t. Each location has one or more directories that follow a naming convention containing the Deephaven version, as well as other directories which you do not want to delete. Thus, you should use grep to filter your results before choosing what to delete.
Example: ls -t /usr/illumon | grep 20240517 | tail -n +4 lists all but the three newest versioned directories matching that version pattern; pipe the result through xargs to delete them (see the sketch below).
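A fuller, hedged sketch of the same cleanup (the 20240517 pattern and the number of versions to keep are illustrative; review the list before deleting anything):
cd /usr/illumon
# List versioned directories newest-first, skip the three newest, and remove the rest
ls -t | grep 20240517 | tail -n +4 | xargs --no-run-if-empty sudo rm -rf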
Cleaning old installation files
The installer automatically deletes logs and uploaded configuration files after an installation if you set DH_CLEANUP=true
in your cluster.cnf
. There are also large product .tar
files left in /var/lib/deephaven
which may grow over time. Having these files present speeds up reinstallation of a particular version, but they are no longer necessary once the installation process is complete.
All files in /var/lib/deephaven
can be safely deleted at any time. However, you may need to provide these files again if you wish to reinstall a particular version of Deephaven. You can provide them by uploading them directly to /var/lib/deephaven, owned by your DH_ADMIN_USER (irisadmin), or by placing them in your installer's DH_LOCAL_DIR directory before performing a reinstallation.
Cleaning Deephaven data files
Warning
Data under /db/Intraday
may include data older than one day, if the data has not been through a merge and purge process. /db/Intraday/DbInternal
is used for Deephaven internal tables, most of which are logs such as the ProcessEventLog
and AuditEventLog
whose retention is more a concern of policy than of system functionality.
The critical exception to this is the WorkspaceData
table, which is stored under /db/Intraday/DbInternal/WorkspaceData
. This table maintains all Web user dashboards and notebooks. Unless a merge job has been set up for this table, all of its history should be maintained under /db/Intraday
to preserve Web user content which may have been created or last updated some time ago. On default installations, the WorkspaceDataSnapshot is updated with a snapshot of the WorkspaceData
table, but without tracking the history of individual items.
See table storage and merging data for more information about setting up historical data storage and merge processes to manage data stored in /db/Intraday
.
See the workspaceData tool for details of how to export and import the contents of the WorkspaceData
table.
Tools
- vi, vim, etc. - used to view and edit text files.
- less - can also be used to view text files.
- tail -f - to see new rows as they are appended to a watched file.
- tail -F - similar to tail -f, but will follow logfiles that are periodically rolled and referenced via a soft link.
- telnet - can be used to verify connectivity to a port endpoint; e.g., to verify that a service is listening and the port is not blocked by a firewall.
- iriscat - Deephaven tool to dump a binary log file as text.
- dhconfig checkpoint - Deephaven tool to view the details of a table partition on disk.
- kill -3 and jstack - tools to generate a Java process thread dump.
- ps - shows running processes on a Linux system.
- top - shows resource utilization information for a Linux system.
- openssl - can be used to initiate SSL connections and examine certificates.
- curl - can retrieve data from web servers.
Status Dashboard
Deephaven includes a status dashboard process which provides data that can be integrated with Prometheus and Grafana. See the status dashboard page for further details.
Related documentation
- Checking certificates
- Configuration overview
- Deephaven process ports
- Deephaven process runbooks
- Deephaven services
- Replacing certificates
- Install
- Install LearnDeephaven
- Introduction to etcd
- iris controller
- iris-superusers
- Merging data
- Merge and purge
- Property file
- Safe mode
- Status dashboard
- Sudoers permissions
- Table storage
- Table data services routing YAML
- Upgrade
- workspaceData tool
- WorkspaceDataSnapshot