System status overview
This guide outlines system architecture and resources, along with management tools, logs, troubleshooting, and common failure modes. It is intended as a central point for topics relevant to diagnosing unexpected behaviors, failures, and system availability issues.
Most of the topics in this section are covered in more detail in other sections of the Deephaven Enterprise documentation.
Quick check of system operation
Details of these commands and underlying processes are in the sections below.
- Check for running processes (on each server):
  /usr/illumon/latest/bin/dh_monit summary
  This should report all processes with a green OK status. The number of processes listed will vary depending on server configuration.
- Attempt to start a Web console session: from a Chrome, Firefox, or Edge browser, go to https://<your_deephaven_fqdn>:8123/iriside (default port without Envoy) or https://<your_deephaven_fqdn>:8000/iriside (default port with Envoy). This should bring up a Web client login dialog.
- Attempt to log in (credentials will vary based on your environment's setup). This should show "Loading Workspace" and eventually bring up a Web Console. If login fails with a "Cannot connect, query config has status 'Stopped'" error, try starting the WebClientData query using safe mode.
- As a member of the iris-superusers group, click on Query Monitor. Clear any filters on this page. This should show the status of persistent queries running in the cluster. At the least, the ImportHelperQuery, RevertHelperQuery, and WebClientData queries should be in the Running state.
- Click on an existing Code Studio, or click on (+) New and New Code Studio. This should bring up a New Console Connection dialog.
- Attempt, one at a time, to start Core+ Consoles using each available Query Server, a reasonable amount of heap (e.g., 4GB), and with Groovy and/or Python depending on what is used in your environment.
- For each Console connection, attempt simple queries to validate that data can be retrieved and displayed.
  Groovy:
  p = db.liveTable("DbInternal","ProcessEventLog").where("Date=today()")
  d = db.historicalTable("LearnDeephaven","StockTrades").where("Date=`2017-08-25`")
  Python:
  p = db.live_table("DbInternal","ProcessEventLog").where("Date=today()")
  d = db.historical_table("LearnDeephaven","StockTrades").where("Date=`2017-08-25`")
Note
The second query in the above example requires that the LearnDeephaven namespace and data set have been installed. See Install LearnDeephaven for details.
Common causes of outages
Disk space
One of the most common causes of outages is an "out of disk space" condition on one of Deephaven's volumes. See the file system section below for details of the different storage areas and their uses, as well as tips for checking usage.
Expired certificate
An expired certificate will block connecting to the Web UI and updating from the Deephaven Launcher tool. It will also cause workers to fail after launch since they will not be able to authenticate with their dispatchers. See checking certificates and replacing certificates for more.
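A certificate's validity dates can be checked from a shell with openssl (a minimal sketch assuming the default non-Envoy Web port 8123; substitute your own host and port):
# Print the notBefore/notAfter dates of the certificate presented on the Web UI port
openssl s_client -connect <your_deephaven_fqdn>:8123 -servername <your_deephaven_fqdn> </dev/null 2>/dev/null | openssl x509 -noout -dates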
Connectivity disruption
Deephaven processes communicate over the network. When there is a network disruption or loss of connectivity, processes may fail. Even when there is no application-level network traffic between the processes, regular heartbeats are sent. If heartbeat messages are not acknowledged, processes may consider their upstream dependencies to be unreachable, and exit.
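A quick way to test basic reachability between nodes is to probe a service port with telnet (see the Tools section below). The host and port here are placeholders; substitute values from the Deephaven process ports documentation:
# A successful connection prints "Connected to <host>"; a hang or refusal suggests a network or firewall problem
telnet <remote_deephaven_host> <service_port>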
System (re)configuration
Outages can also follow system configuration changes made outside of Deephaven. This happens in particular when configuration management tools are used, which may reset settings that Deephaven depends on, such as:
- OS settings such as resource limits.
- Required sudoers permissions.
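To spot this kind of drift, a rough check of the limits and sudoers rules on a host can be run directly (a sketch assuming the default irisadmin admin account; substitute your custom user if different):
# Show effective resource limits for the account that runs Deephaven processes
sudo -u irisadmin bash -c 'ulimit -a'
# Review custom sudoers rules (the Deephaven-related filenames under /etc/sudoers.d vary by installation)
sudo ls -l /etc/sudoers.d/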
System services, components, and configuration
A Deephaven installation consists of a set of Java processes running on one or more Linux servers. Production installations typically have three or more nodes. Most of the Deephaven configuration is stored in etcd. Having three etcd nodes (or more, but always an odd number) provides high availability access to configuration data and protects the configuration data in the case of loss of a node.
Processes
Deephaven servers will generally have three or more Deephaven processes running as system services. These processes are all interdependent, often relying on the configuration_server
and authentication_server
processes (which may be running on other nodes within the cluster). See process dependencies below.
Note
Details about Deephaven processes, including the ports on which they communicate, can be found in the Deephaven process ports, Deephaven services, and runbooks sections of the documentation.
Deephaven application processes
Deephaven uses M/Monit to manage system services. There is a wrapper script for its use (to allow non-root users to use it with sudo) at /usr/illumon/latest/bin/dh_monit
. Only users with sudo -u irisadmin
access (or equivalent custom user, based on the DH_MONIT_USER
property during initial system setup) can use this tool.
Monit uses process PID files stored in /etc/deephaven/run
to track the running status of the processes it manages. It will try periodically to start a process that failed to start. The update process for status is also rather slow, so it may be a minute or two before services being started show an OK status.
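Because status updates are slow, it can be convenient to watch them converge after starting services. A minimal sketch, assuming the watch utility is installed and your account can run dh_monit without an interactive password prompt:
# Refresh the process summary every five seconds
watch -n 5 /usr/illumon/latest/bin/dh_monit summary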
Common commands with /usr/illumon/latest/bin/dh_monit are:
- /usr/illumon/latest/bin/dh_monit summary to show a list of processes and their statuses.
- /usr/illumon/latest/bin/dh_monit start followed by all or a process name, to start a process or all processes.
- /usr/illumon/latest/bin/dh_monit restart followed by all or a process name, to restart a process or all processes.
- /usr/illumon/latest/bin/dh_monit stop followed by all or a process name, to stop a process or all processes.
- /usr/illumon/latest/bin/dh_monit status to see detailed status information, such as uptime, for all processes, or, if specified, for a particular process.
- /usr/illumon/latest/bin/dh_monit up to start all processes. This does not affect already running processes.
- /usr/illumon/latest/bin/dh_monit down to stop all processes.
- /usr/illumon/latest/bin/dh_monit reload to reload the configuration of M/Monit. This is normally needed only after adding or removing a process (conf files in /etc/sysconfig/deephaven/monit). It does not affect processes managed by M/Monit.
An optional flag, -b or --block, is available for the start, restart, stop, up, and down actions. This causes the script to wait for the actions to fully complete before returning. If either flag is used with another action, it will generate an error.
# To start all processes:
/usr/illumon/latest/bin/dh_monit start all
# or:
/usr/illumon/latest/bin/dh_monit up
# To stop all processes and wait for all processes to have stopped:
/usr/illumon/latest/bin/dh_monit stop all --block
# or:
/usr/illumon/latest/bin/dh_monit down --block
# To see a list of all processes and their uptime:
/usr/illumon/latest/bin/dh_monit status | grep "uptime\|Process"
etcd
etcd is a dependency for Deephaven, and may be installed on Deephaven server nodes, or on servers that are not part of the Deephaven cluster. If etcd is inaccessible, no Deephaven services will be able to start.
etcd is not managed by M/Monit. etcd's status can be checked with:
sudo systemctl status dh-etcd
The health of the etcd cluster can be checked by logging in (via ssh
) to a Deephaven server that is running the configuration_server
process, and executing:
# `irisadmin` or equivalent custom user based on the `DH_MONIT_USER` property during initial system setup
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status -w table
This should return a table showing all the nodes in the etcd cluster and some details about their roles and DB size. Something like:
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.10.10.53:2379 | 20f1fe672cdca01d | 3.5.5 | 91 MB | false | 23922 | 67628 |
| https://10.10.10.54:2379 | 8cdf5ce8a296848f | 3.5.5 | 91 MB | false | 23922 | 67629 |
| https://10.10.10.55:2379 | 4ea3e72f6e028887 | 3.5.5 | 91 MB | true | 23922 | 67630 |
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
If any nodes instead show context deadline exceeded
, that likely indicates that the node is down or unreachable.
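A per-member liveness check can be run the same way (a sketch assuming the etcdctl.sh wrapper passes subcommands through to etcdctl as in the status example above; it requires the same access):
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint health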
Note
Only hosts which run the configuration_server
will be able to access the etcd cluster with the /usr/illumon/latest/bin/etcdctl.sh
script. Nodes that do not have a configuration_server
process do not have the required keyfiles to access the etcd cluster.
MariaDB / MySQL
Earlier versions of Deephaven always used a SQL database hosted in MySQL or MariaDB to store internal application users and permissions. In this version of Deephaven, the access control list (ACL) data can be stored in a SQL database or in etcd. The ACL storage type is defined at installation-time with the DH_ACLS_USE_ETCD
flag, which defaults to true
(use etcd ACLs).
If SQL ACL storage is used, then the MariaDB or MySQL database service is a core dependency for Deephaven processes. If the ACL database is not available, no processes will be able to start. The SQL ACL DB process is not managed by M/Monit. SQL ACL DB status can be checked with the following:
# MariaDB's status can be checked with:
sudo systemctl status mariadb
# Or, for MySQL:
sudo systemctl status mysql
Envoy
Envoy is an optionally installed reverse proxy used to allow access to Deephaven services over a single port. It is often run in a Docker container, but can be run as a regular service on a Deephaven server. If Envoy is configured, but not working correctly, Deephaven server processes will generally appear to be in a good state, but users will not be able to connect to these services from client applications.
The Envoy process is not normally managed by M/Monit. If it is being run as a Docker container, it should be visible running in the output from:
sudo docker container ls
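If the container is running but clients still cannot connect, Envoy's recent output is often informative. A hedged sketch, assuming the container is named envoy (the actual name varies by installation; take it from the container listing above):
# Show the last 100 lines of Envoy's log output
sudo docker logs --tail 100 envoy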
Configuration
Details about Deephaven configuration are available in the configuration section, which discusses all aspects of Deephaven configuration. Server processes get their launch configuration (run as, heap allocation, etc.) from files in /etc/sysconfig/illumon.confs
, and then read the rest of their configuration settings from properties served by the configuration server, or, for the configuration server itself, from etcd.
Server process startup configuration
If a process configuration problem is related to the process's Java JVM arguments, or the account under which the process runs, those settings can be viewed in /etc/sysconfig/illumon.confs/hostconfig.system
and can be modified/overridden by editing /etc/sysconfig/illumon.confs/illumon.iris.hostconfig
.
Warning
Never edit the /etc/sysconfig/illumon.confs/hostconfig.system
script because the next Deephaven install/upgrade process will overwrite the changes. All changes should be made to /etc/sysconfig/illumon.confs/illumon.iris.hostconfig
.
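A safe, read-only way to review which launch settings have been customized on a host is to compare the override file against the system defaults:
# Lines unique to illumon.iris.hostconfig are the local overrides
diff /etc/sysconfig/illumon.confs/hostconfig.system /etc/sysconfig/illumon.confs/illumon.iris.hostconfig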
Property file configuration
Property file settings control a significant portion of Deephaven behavior. A large number of the possible settings are controlled by defaults coded in the property file iris-defaults.prop
, which is stored in etcd during install/upgrade, but is also accessible from /usr/illumon/latest/etc/iris-defaults.prop
. All custom properties should be overridden in install-specific property files, such as iris-environment.prop
.
For example, the total heap that a single dispatcher allocates for workers is in the property RemoteQueryDispatcher.maxTotalQueryProcessorHeapMB
. To see the default, run:
/usr/illumon/latest/bin/dhconfig props export iris-defaults.prop | grep "RemoteQueryDispatcher.maxTotalQueryProcessorHeapMB"
which prints the following to stdout
:
RemoteQueryDispatcher.maxTotalQueryProcessorHeapMB=354304
If a configuration change is needed to a setting that a process reads during its initialization or while it is running, there are a few steps that must be followed:
- Export the configuration property file from etcd, using:
  /usr/illumon/latest/bin/dhconfig properties export -f <filename> -d <directory to export the file to>
  Make installation-specific property changes in the iris-environment.prop property file. Never change the iris-defaults.prop or iris-endpoints.prop configuration files because the next Deephaven install/upgrade process will overwrite the changes.
- Edit the exported file and make the needed changes. Potentially, make a backup of the original version that was exported, in case settings changes need to be rolled back.
- Import the updated file into etcd, using:
  /usr/illumon/latest/bin/dhconfig properties import -f <filename> -d <directory to import the file from>
  You must provide authentication to import properties, either by specifying a privileged key or by using sudo -u irisadmin (or equivalent custom user, based on the DH_MONIT_USER property during initial system setup) to run this command as the Deephaven admin service account.
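For example, a minimal sketch of this export/edit/import cycle for iris-environment.prop (the /tmp working directory and backup filename are illustrative):
# Export the current file from etcd to a working directory
sudo -u irisadmin /usr/illumon/latest/bin/dhconfig properties export -f iris-environment.prop -d /tmp
# Keep a backup in case the change needs to be rolled back, then edit
cp /tmp/iris-environment.prop /tmp/iris-environment.prop.bak
vi /tmp/iris-environment.prop
# Import the edited file back into etcd
sudo -u irisadmin /usr/illumon/latest/bin/dhconfig properties import -f iris-environment.prop -d /tmp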
In most cases it is necessary to restart or reload processes that use the new/modified property. The authentication server (sudo -u irisadmin /usr/illumon/latest/bin/iris auth_server_reload_tool) and the iris controller can have their configuration reloaded without having to restart the processes themselves, except when they are running in Kubernetes.
Warning
It is critical that no changes are made in /etc/sysconfig/illumon.confs/hostconfig.system
or the iris-defaults.prop
and iris-endpoints.prop
files. Changes made to any of these will be overwritten during the next install/upgrade process.
Data routing configuration
The other main configuration data set that might need to be modified to address a problem is the table data services routing YAML. Editing this file is similar to editing properties files:
- Export the routing configuration file from etcd, using:
  /usr/illumon/latest/bin/dhconfig routing export -f <file name and path to export the file to>
- Edit the exported file and make the needed changes. Potentially, make a backup of the original version that was exported, in case settings changes need to be rolled back.
- Import the updated file into etcd, using:
  /usr/illumon/latest/bin/dhconfig routing import -f <file name and path to import the file from>
  You must provide authentication to import the routing file, either by specifying a key or user and password, or by using sudo to run this command as irisadmin or another Deephaven admin service account.
Warning
This file is in YAML format and has specific white space and delimiter formatting requirements. General YAML can be validated in online YAML validation tools and in some editor utilities. Deephaven-specific validation can be accomplished by adding --validate to the dhconfig arguments, or by using the /usr/illumon/latest/bin/dhconfig routing validate command.
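Before running the Deephaven-specific validation, basic YAML well-formedness can be checked locally with a one-liner (a sketch assuming Python 3 with the PyYAML module is available; the /tmp/routing.yml path is illustrative):
# Exits silently if the file parses; prints a parse error with line and column otherwise
python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' /tmp/routing.yml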
Process dependencies
From a dependency perspective, etcd is a required dependency for all Deephaven server processes. Once etcd is running, the configuration server is the next process to start, and is a dependency for all other services. The authentication server is also a common dependency for most other services. There is a co-dependency between the configuration server and the authentication server; when the configuration server starts, it will attempt repeatedly to contact the authentication server, while the authentication server will fail to start if the configuration server is not running or is not yet accepting connections. The /usr/illumon/latest/bin/dh_monit up
command starts services in the proper order, ensuring a fast and reliable startup.
Workers are processes that are launched by RemoteQueryDispatcher
services (either a db_query_service
or db_merge_service
process) on a node. Workers require a local log_aggregation_service
in order to start.
File system
Within the cluster, there are several types of data and product files, which may be on separate volumes.
Type | Purpose | Typical Path |
---|---|---|
Logs | Text and binary logs from Deephaven processes. | /var/log/deephaven |
Intraday | Appended data from Deephaven processes and external streams. | /db/Intraday |
User | Tables created and managed by users. Often shared across servers with NFS. | /db/Users |
IntradayUser | Ticking/appending tables managed by users. | /db/IntradayUser |
Ingester | Appended data from in-worker data import processes such as Kafka (optional). | /db/dataImportServers |
Historical | Organized data in Deephaven or Parquet format - usually shared to query servers with NFS. | /db/Systems |
Product Files | Binaries and default configuration files. | /usr/illumon/${version} with a link from /usr/illumon/latest |
Core+ Files | Binaries and Python virtual environments for Core+ workers. | /usr/illumon/coreplus/${version} with a link from /usr/illumon/coreplus/latest |
Core+ VEnvs | Python Virtual environment directories for Core+ workers. | /usr/illumon/coreplus/venv/${version} with a link from /usr/illumon/coreplus/venv/latest |
Configuration Files | Installation-specific files and binaries. | /etc/sysconfig/deephaven with links from /etc/sysconfig/illumon.d and /etc/sysconfig/illumon.confs |
TempFiles | Service account home directories and storage of cached per-worker classes. | /db/TempFiles |
VEnvs | Python Virtual environment directories for Legacy workers. | /db/VEnvs |
Free and used disk space by volume can be seen using the df
Linux command:
df
Filesystem 1K-blocks Used Available Use% Mounted on
devtmpfs 16379736 0 16379736 0% /dev
tmpfs 16388492 0 16388492 0% /dev/shm
tmpfs 16388492 9108 16379384 1% /run
tmpfs 16388492 0 16388492 0% /sys/fs/cgroup
/dev/sda2 104640560 15577992 89062568 15% /
/dev/sda1 204580 11464 193116 6% /boot/efi
tmpfs 3277700 0 3277700 0% /run/user/996
tmpfs 3277700 0 3277700 0% /run/user/1006
tmpfs 3277700 0 3277700 0% /run/user/9001
tmpfs 3277700 0 3277700 0% /run/user/9000
tmpfs 3277700 0 3277700 0% /run/user/0
This is a fairly healthy system - even on the /
mount point, which is hosting all Deephaven paths, only 15% is used.
To view disk space used by directory, the du
utility provides many options.
du -h -d1 /var/log/deephaven/
0 /var/log/deephaven/deploy_schema
13M /var/log/deephaven/dis
2.6M /var/log/deephaven/merge_server
4.7M /var/log/deephaven/ltds
2.8M /var/log/deephaven/query_server
223M /var/log/deephaven/tdcp
120K /var/log/deephaven/monit
808K /var/log/deephaven/install_configuration
0 /var/log/deephaven/previous_install
2.5M /var/log/deephaven/acl_write_server
14M /var/log/deephaven/authentication_server
5.2G /var/log/deephaven/binlogs
2.3M /var/log/deephaven/configuration_server
3.2M /var/log/deephaven/iris_controller
3.4M /var/log/deephaven/las
316K /var/log/deephaven/misc
0 /var/log/deephaven/plugins
3.1M /var/log/deephaven/web_api_service
224M /var/log/deephaven/tailer
5.7G /var/log/deephaven/
In this example, -h
is for "human-readable" output, and -d1
is to limit depth under /var/log/deephaven/
to one level. The result is space used by subdirectory under /var/log/deephaven/
, and a summary of total space used by /var/log/deephaven/
.
The du
command also accepts the -s
flag to print only the summary of the path you pass to it.
Note
Both the du
and df
commands accept a filesystem path as an argument, to specify disk usage and disk free space for a particular location. Both also accept the -h
argument for "human-readable" sizes, as does the sort
command. Thus, you can obtain a directory list sorted by disk usage by running du -sh /some/directory/* | sort -h
.
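For example, to rank the top-level directories under /db by size (assuming your Deephaven data volumes are mounted there):
# Run with sudo if some subdirectories are not readable by your account
du -sh /db/* | sort -h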
File system cleanup
When working with large amounts of data, large amounts of disk space are needed. You should delete old logs and unused binary files regularly to avoid running out of disk space.
Cleaning /var/log/deephaven
The /var/log/deephaven/
path is one where it is generally safe to delete old data.
Warning
Because the files that exist under /var/log/deephaven
are being actively written by live processes, rm
should not be used to batch delete files. Instead, find
must be used to find files to delete based on modification time, with the -delete
option to remove them.
For example, find /var/log/deephaven -type f -mtime +7 -delete
will find all files older than 7 days and remove them. This will include files under /var/log/deephaven/binlogs
, which may warrant their own retention period. If you need to delete write protected files, you may need to use -exec rm -f {} +
instead of -delete
.
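If binary logs warrant a longer retention period than other process logs, the same approach can be split into two passes. A hedged sketch, assuming 7-day retention for most logs and 30-day retention for binlogs (adjust both to your own policy):
# Delete non-binlog files older than 7 days
find /var/log/deephaven -type f ! -path '*/binlogs/*' -mtime +7 -delete
# Delete binary log files older than 30 days
find /var/log/deephaven/binlogs -type f -mtime +30 -delete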
Cleaning old versions of Deephaven
Deephaven uses a "versioned" installation process, where each new version is unpacked into its own set of directories, and a latest soft link is then pointed at the currently installed version.
Always keep at least two previous versions on disk in case you need to roll back to a previous version. However, keeping any more than five is unlikely to ever provide value.
The locations with latest-linked versioned product files are:
- /usr/illumon/latest - Enterprise binary files, script and default configuration files
- /usr/illumon/coreplus/latest - Core+ binary files
- /usr/illumon/coreplus/venv/latest - Core+ Python virtual environments
- /etc/sysconfig/deephaven/illumon.d.latest - Copies of user-owned system configuration files (everything in /etc/sysconfig/illumon.d)
- /etc/sysconfig/deephaven/illumon.confs.latest - "hostconfig" environment files; uses very little disk space
For all of the above directories containing a latest link, you can find a list of old versions using ls -t. Each location has one or more directories that follow a naming convention containing the Deephaven version, as well as other directories which you do not want to delete. Thus, you should use grep to filter your results before choosing what to delete.
Example: ls -t /usr/illumon | grep 20240517 | tail -n +4 lists all but the three newest versioned directories matching that version pattern; pipe the result through xargs to delete them (see the sketch below).
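A fuller, hedged sketch of the same cleanup (the 20240517 pattern and the number of versions to keep are illustrative; review the list before deleting anything):
cd /usr/illumon
# List versioned directories newest-first, skip the three newest, and remove the rest
ls -t | grep 20240517 | tail -n +4 | xargs --no-run-if-empty sudo rm -rf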
Cleaning old installation files
The installer automatically deletes logs and uploaded configuration files after an installation if you set DH_CLEANUP=true
in your cluster.cnf
. There are also large product .tar
files left in /var/lib/deephaven
which may grow over time. Having these files present speeds up reinstallation of a particular version, but they are no longer necessary once the installation process is complete.
All files in /var/lib/deephaven
can be safely deleted at any time. However, you may need to provide these files again if you wish to reinstall a particular version of Deephaven. You can provide them by uploading them directly to /var/lib/deephaven, owned by your DH_ADMIN_USER (irisadmin), or by placing them in your installer's DH_LOCAL_DIR directory before performing a reinstallation.
Cleaning Deephaven data files
Warning
Data under /db/Intraday
may include data older than one day, if the data has not been through a merge and purge process. /db/Intraday/DbInternal
is used for Deephaven internal tables, most of which are logs such as the ProcessEventLog
and AuditEventLog
whose retention is more a concern of policy than of system functionality.
The critical exception to this is the WorkspaceData
table, which is stored under /db/Intraday/DbInternal/WorkspaceData
. This table maintains all Web user dashboards and notebooks. Unless a merge job has been set up for this table, all of its history should be maintained under /db/Intraday
to preserve Web user content which may have been created or last updated some time ago. On default installations, the WorkspaceDataSnapshot is updated with a snapshot of the WorkspaceData
table, but without tracking the history of individual items.
See table storage and merging data for more information about setting up historical data storage and merge processes to manage data stored in /db/Intraday
.
See the workspaceData tool for details of how to export and import the contents of the WorkspaceData
table.
Tools
- vi, vim, etc. - used to view and edit text files.
- less - can also be used to view text files.
- tail -f - to see new rows as they are appended to a watched file.
- tail -F - similar to tail -f, but will follow logfiles that are periodically rolled and referenced via a soft link.
- telnet - can be used to verify connectivity to a port endpoint; e.g., to verify that a service is listening and the port is not blocked by a firewall.
- iriscat - Deephaven tool to dump a binary log file as text.
- dhconfig checkpoint - Deephaven tool to view the details of a table partition on disk.
- kill -3 and jstack - tools to generate a Java process thread dump.
- ps - shows running processes on a Linux system.
- top - shows resource utilization information for a Linux system.
- openssl - can be used to initiate SSL connections and examine certificates.
- curl - can retrieve data from web servers.
Status Dashboard
Deephaven includes a status dashboard process which provides data that can be integrated with Prometheus and Grafana. See the status dashboard page for further details.
Related documentation
- Checking certificates
- Configuration overview
- Deephaven process ports
- Deephaven process runbooks
- Deephaven services
- Replacing certificates
- Install
- Install LearnDeephaven
- Introduction to etcd
- iris controller
- iris-superusers
- Merging data
- Merge and purge
- Property file
- Safe mode
- Status dashboard
- Sudoers permissions
- Table storage
- Table data services routing YAML
- Upgrade
- workspaceData tool
- WorkspaceDataSnapshot