Deephaven Process runbooks

This section outlines the procedures for each Deephaven process.

Incident classification key

| Severity | Description |
| :------- | :---------- |
| 0 - None | Process is running (or down as scheduled). |
| 1 - Critical | Process is down when it should be up. |
| 2 - Moderate | Process is up when it should be down; or process is up but configuration is missing. |
| 3 - Low | Process is running but producing errors or performing poorly. |
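The key can also be read as a small decision function. The sketch below is illustrative only: the function name `classify_severity` and its three arguments are hypothetical and are not part of Deephaven or Monit.

```shell
#!/usr/bin/env bash
# Map the classification key onto a numeric severity.
# Hypothetical arguments:
#   $1 expected state: "up" or "down"
#   $2 actual state:   "up" or "down"
#   $3 health:         "ok", "errors", or "misconfigured" (only meaningful when up)
classify_severity() {
  local expected="$1" actual="$2" health="${3:-ok}"
  if [ "$expected" = "up" ] && [ "$actual" = "down" ]; then
    echo 1   # Critical: down when it should be up
  elif [ "$expected" = "down" ] && [ "$actual" = "up" ]; then
    echo 2   # Moderate: up when it should be down
  elif [ "$actual" = "up" ] && [ "$health" = "misconfigured" ]; then
    echo 2   # Moderate: up but configuration is missing
  elif [ "$actual" = "up" ] && [ "$health" = "errors" ]; then
    echo 3   # Low: running but producing errors or performing poorly
  else
    echo 0   # None: running, or down as scheduled
  fi
}
```

For example, `classify_severity up down` prints `1`, and `classify_severity up up errors` prints `3`.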

Authentication Server Process

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | New users will be unable to log in or create new queries. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status authentication_server

View Application Log Files:

cat /var/log/deephaven/authentication_server/AuthenticationServer.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/authentication_server/authentication_server.log.????-??-??

Check status of MariaDB/MySQL dependency (if MySQL is used to store ACLs):

sudo systemctl status mariadb

Check etcd endpoint status (if etcd is used to store ACLs):

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table

Restart Procedure:

sudo -u irisadmin monit restart authentication_server
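The same four steps (status, application log, stdout/stderr logs, restart) recur for every Monit-managed process in this section, and in each case the stdout/stderr log prefix matches the Monit service name. A minimal sketch of that pattern follows. The function name `runbook_commands` is hypothetical, it only prints the commands rather than running them, and `tail -n 200` is swapped in for `cat` as an assumption that the most recent lines are usually what you want from a large log.

```shell
#!/usr/bin/env bash
# Print (do not run) the runbook commands for one Monit-managed process.
#   $1 Monit service name, e.g. authentication_server
#   $2 log directory under /var/log/deephaven, e.g. authentication_server
#   $3 application log prefix, e.g. AuthenticationServer
runbook_commands() {
  local svc="$1" dir="/var/log/deephaven/$2" app="$3"
  printf 'sudo -u irisadmin monit status %s\n' "$svc"
  printf 'tail -n 200 %s/%s.log.current\n' "$dir" "$app"
  printf 'ls -ltr %s/%s.log.????-??-??\n' "$dir" "$svc"
  printf 'sudo -u irisadmin monit restart %s\n' "$svc"
}
```

Calling `runbook_commands authentication_server authentication_server AuthenticationServer` prints the commands for this process so they can be reviewed before being run by hand.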

ACL Write Server Process

| Level | Impact |
| :---- | :----- |
| Sev 2 - Moderate | Administrators will not be able to update user permissions and groups. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status db_acl_write_server

View Application Log Files:

cat /var/log/deephaven/acl_write_server/DbAclWriteServer.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/acl_write_server/db_acl_write_server.log.????-??-??

Check status of MariaDB/MySQL dependency (if MySQL is used to store ACLs):

sudo systemctl status mariadb

Check etcd endpoint status (if etcd is used to store ACLs):

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table

Restart Procedure:

sudo -u irisadmin monit restart db_acl_write_server

Configuration Server Process

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | None of the system processes will be able to start. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status configuration_server

View Application Log Files:

cat /var/log/deephaven/configuration_server/ConfigurationServer.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/configuration_server/configuration_server.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart configuration_server

Persistent Query Controller Process

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | When configured to run with multiple controllers, running Core+ queries are migrated to a running controller. Core+ queries that are not yet running are terminated. All Legacy queries, including WebClientData, are terminated. Until the WebClientData reinitializes, users are not able to load the Deephaven console. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status iris_controller

View Application Log Files:

cat /var/log/deephaven/iris_controller/PersistentQueryController.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/iris_controller/iris_controller.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart iris_controller

Persistent Query Backup and Restore Process

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | The controller stores persistent queries in etcd, so it is strongly recommended that periodic backups be taken of this data. The ability to restore persistent queries is critical. |

Procedures

To export all Deephaven queries, use the following command:

sudo /usr/illumon/latest/bin/dhconfig pq export --file /tmp/PersistentQueryBackup.xml

To import your queries to any controller running the same Deephaven version, use the following command:

sudo /usr/illumon/latest/bin/dhconfig pq import --file /tmp/PersistentQueryBackup.xml

It may be useful to keep each query's serial ID so that user workspaces will continue to work. In this case, you can add the following parameter, which will keep each query's original serial, but not import any query if a query already exists with the same serial:

--retainSerial=keep

To keep the original serial IDs and also overwrite existing queries with the same IDs, instead use:

--retainSerial=replace

For full details, see the Persistent Query Controller Tool.
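For periodic backups, a dated file name keeps successive exports from overwriting each other. The sketch below wraps the export and import commands shown above. The function names, the backup directory, and the file naming scheme are assumptions, and it assumes `--retainSerial` is appended to the import command.

```shell
#!/usr/bin/env bash
# Build a dated backup path. The directory and naming scheme are assumptions,
# not Deephaven defaults.
#   $1 backup directory (default /tmp), $2 date stamp (default today, YYYY-MM-DD)
pq_backup_path() {
  local dir="${1:-/tmp}" stamp="${2:-$(date +%Y-%m-%d)}"
  printf '%s/PersistentQueryBackup-%s.xml\n' "$dir" "$stamp"
}

# Export all queries to a dated file and print the file name on success.
pq_export() {
  local file
  file="$(pq_backup_path "$@")"
  sudo /usr/illumon/latest/bin/dhconfig pq export --file "$file" && echo "$file"
}

# Import a backup, keeping original serial IDs and skipping queries whose
# serial already exists (assumes the parameter is appended to the import command).
pq_import_keep_serials() {
  sudo /usr/illumon/latest/bin/dhconfig pq import --file "$1" --retainSerial=keep
}
```

A scheduled job could call `pq_export /path/to/backups` once a day; the directory shown here is a placeholder.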

Log Aggregator Service (LAS) Process

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | Any process configured to use the LAS will fail to write logs to the database. This will cause failure of these processes, including the query workers. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status log_aggregator_service

View Application Log Files:

cat /var/log/deephaven/las/LogAggregatorService.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/las/log_aggregator_service.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart log_aggregator_service

Tailer 1 Process

| Level | Impact |
| :---- | :----- |
| Sev 2 - Moderate | Users will not be directly affected, but internal Deephaven logs (including state, configuration, process and event logs) will not be written to the database. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status tailer1

View Application Log Files:

cat /var/log/deephaven/tailer/LogtailerMain1.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/tailer/tailer1.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart tailer1

Data Import Server Process

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | Binary log file data will not be written to the database. Binary store imports will fail. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status db_dis

View Application Log Files:

cat /var/log/deephaven/dis/DataImportServer.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/dis/db_dis.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart db_dis

Procedure for cleaning up corrupt intraday data

If intraday ticking data becomes corrupted, you do not need to stop the DIS. Instead, clean up the intraday data and the DIS's state. In general, that means running the following command as the dbmerge user:

rm -r /db/Intraday/[namespace]/[tablename]/[intraday partition]/[date]

For your Order/Event table, you might use:

rm -r /db/Intraday/Order/Event/*/2018-02-09
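Because `rm -r` with a wildcard is unforgiving, it can help to build the path from its parts and list what matches before deleting anything. The sketch below does only that. The function names are hypothetical, and the date check is an added safeguard rather than anything Deephaven requires.

```shell
#!/usr/bin/env bash
# Build the glob for one table's intraday data on a given date.
#   $1 namespace, $2 table name, $3 date (YYYY-MM-DD),
#   $4 intraday partition (default: * for all partitions)
intraday_glob() {
  local ns="$1" table="$2" day="$3" part="${4:-*}"
  case "$day" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "bad date: $day" >&2; return 1 ;;
  esac
  printf '/db/Intraday/%s/%s/%s/%s\n' "$ns" "$table" "$part" "$day"
}

# List the directories that a cleanup would remove; nothing is deleted here.
preview_intraday_cleanup() {
  local pattern
  pattern="$(intraday_glob "$@")" || return 1
  # Left unquoted on purpose so the shell expands the partition wildcard.
  ls -d $pattern
}
```

Running `preview_intraday_cleanup Order Event 2018-02-09` as the dbmerge user lists the directories that the `rm -r` command above would remove.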

Deephaven Merge Server Process

| Level | Impact |
| :---- | :----- |
| Sev 2 - Moderate | Persistent queries for Merges and Imports will fail. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status db_merge_server

View Application Log Files:

cat /var/log/deephaven/merge_server/db_merge_server.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/merge_server/db_merge_server.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart db_merge_server

Remote Query Dispatcher Process

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | Any running query workers will terminate, and new ones cannot be started. This includes all running persistent queries as well as interactive consoles. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status db_query_server

View Application Log Files:

cat /var/log/deephaven/query_server/RemoteQueryDispatcher.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/query_server/db_query_server.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart db_query_server

Process Shutdown

Each Deephaven process has a shutdown manager, set by the property default.processEnvironmentFactory. The shutdown manager ensures that processes terminate in an orderly and timely manner. If a process fails to terminate cleanly, the shutdown manager stops it forcefully after a timeout set by the property ShutdownManager.deephaven.shutdownTimeoutMillis. Modify the following default to change the timeout for worker and dispatcher shutdown:

# override the shutdown timeout for all workers
[service.name=dbquery|dbmerge] {
    ShutdownManager.deephaven.shutdownTimeoutMillis=60000
}

Table Data Cache Proxy Process

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | Intraday data will not be available. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status db_tdcp

View Application Log Files:

cat /var/log/deephaven/tdcp/TableDataCacheProxy.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/tdcp/db_tdcp.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart db_tdcp

Local Table Data Server Process

| Level | Impact |
| :---- | :----- |
| Sev 2 - Moderate | If the LTDS is configured in the routing, then any data it serves will not be available. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status db_ltds

View Application Log Files:

cat /var/log/deephaven/ltds/LocalTableDataServer.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/ltds/db_ltds.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart db_ltds

Status Dashboard

| Level | Impact |
| :---- | :----- |
| Sev 2 - Moderate | Status dashboard data will not be available. |

Procedures

Check Process is running with Monit:

sudo -u irisadmin monit status status_dashboard

View Application Log Files:

cat /var/log/deephaven/status_dashboard/StatusDashboard.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/status_dashboard/status_dashboard.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart status_dashboard

Web API Service Process

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | Both Web API clients and Deephaven Console GUI Users will be impacted. Users will not be able to use the Launcher and Deephaven Clients will not be able to receive any updates from the server. |

Procedures

Enable the Web API Service:

The Web API Service is disabled by default.

In the M/Monit config folder, remove the .disabled extension from the Web API Service config file name and run monit reload. This will instruct the M/Monit daemon to reread its configuration and re-initialize.

cd /etc/sysconfig/illumon.d/monit
mv web_api_service.disabled web_api_service.conf
sudo -u irisadmin monit reload

Check Process is running with Monit:

sudo -u irisadmin monit status web_api_service

View Application Log Files:

cat /var/log/deephaven/web_api_service/WebServer.log.current

List Log Files for Standard Out/Error:

ls -ltr /var/log/deephaven/web_api_service/web_api_service.log.????-??-??

Restart Procedure:

sudo -u irisadmin monit restart web_api_service

Web API Server TLS Keystore (.p12 keystore file)

The Web API Server's TLS keystore contains the certificate and private key of a TLS-enabled service. You must keep this file private and not distribute it to clients. The Web API Server's keystore file should be unique per node, with a certificate that is signed (issued) by a trusted CA.

The default self-signed key pair for the Web API Server is generated when installing the iris-config.rpm and saved to a .p12 keystore file. This default keystore will work, but the browser will give security warnings until you use your own CA-signed certificate (see below).

[-r--r----- irisadmin dbquery ] webServices-keystore.p12

The Web Server keystore file is also protected by a unique, randomly generated password, stored in base64-encoded format in a read-only hidden file that is owned by user irisadmin and readable by the dbquery group (permissions 440):

[-r--r----- irisadmin dbquery] .webapi_passphrase

Important keystore properties and files

Keystore Filename: /etc/sysconfig/illumon.d/auth/webServices-keystore.p12

Passphrase File: /db/TempFiles/irisadmin/.webapi_passphrase

Keystore Property: WebServer.tls.keystore=/etc/sysconfig/illumon.d/auth/webServices-keystore.p12

Passphrase Property: WebServer.tls.passphrase.file=/db/TempFiles/irisadmin/.webapi_passphrase
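The ownership and mode described above can be checked with a small helper. This is a minimal sketch that assumes GNU `stat` (for the `-c` format option); the function name `check_perms` is hypothetical.

```shell
#!/usr/bin/env bash
# Compare a file's mode, owner and group with the expected values.
#   $1 file path, $2 expected "MODE OWNER GROUP", e.g. "440 irisadmin dbquery"
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read by stat.
check_perms() {
  local file="$1" want="$2" got
  got="$(stat -c '%a %U %G' "$file")" || return 2
  if [ "$got" = "$want" ]; then
    echo "OK $file"
  else
    echo "MISMATCH $file: got '$got', want '$want'"
    return 1
  fi
}
```

For example, `check_perms /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 '440 irisadmin dbquery'` should print `OK` followed by the path when the keystore has the permissions listed above. Checking the passphrase file may require running as a user that can traverse its directory.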

[1] If iris-common.prop does not exist (normal for Deephaven versions 20190117 or earlier) or openapi-defaults.prop does not exist (normal for versions 20180803 or earlier):

cd /etc/sysconfig/illumon.d/resources/
# Move existing web_api_service props to openapi-defaults
cp web_api_service.prop openapi-defaults.prop
# Replace web_api_service with an includefiles on openapi-defaults
echo includefiles=openapi-defaults.prop > web_api_service.prop
# append the include to the end of the query server configuration
echo includefiles=openapi-defaults.prop >> iris-query-server.prop

Alternatively, you may wish to put your includefiles at the top of the iris-query-server.prop file, and manually delete or edit any properties from openapi-defaults.prop that are also found in iris-query-server.prop. Putting the includefiles at the end of the file is easier because it overrides other settings, but it may be confusing that a property is defined and then overridden. To keep things cleaner, remove or move any properties with a tls prefix to openapi-defaults.prop. You may also wish to move RemoteQueryDispatcher.websocket.enabled=true.

Securing the Web API Server with your CA-signed Certificate

While the default self-signed certificate is good enough for testing, it presents scary security warnings to users, and encourages users to ignore security warnings (a very bad habit), so you should always use a "real" CA-signed certificate for production use.

Obtain a TLS certificate signed by your trusted CA with the domain name matching the Deephaven server, e.g., myserver.mydomain.com.

Back up the existing keystore file:

sudo cp /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 \
/etc/sysconfig/illumon.d/auth/webServices-keystore.p12.ORG

Import your CA cert and key files to the Web API Service keystore file. For example:

STOREPASS=$(sudo cat /db/TempFiles/irisadmin/.webapi_passphrase | base64 --decode)
# This assumes you have stored your own .key and CA-provided .crt in /etc/ssl/certs/tls.* files
openssl pkcs12 -export -in /etc/ssl/certs/tls.crt -inkey /etc/ssl/certs/tls.key -name webapi -out /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 -passout "pass:$STOREPASS"

Note

If you are unfamiliar with how to generate a .key and .csr file to get a .crt from a CA, please contact your IT organization.

Set the correct permissions on the web services keystore file:

sudo chown irisadmin:dbquery \
/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
sudo chmod 440 /etc/sysconfig/illumon.d/auth/webServices-keystore.p12
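Before restarting the service, it can be worth confirming that the keystore opens with the stored passphrase and holds the certificate you expect. The sketch below uses standard `openssl pkcs12` and `openssl x509` options; the function name `keystore_cert_info` is hypothetical.

```shell
#!/usr/bin/env bash
# Print the subject and expiry date of the certificate in a PKCS#12 keystore.
#   $1 keystore path, $2 keystore password (already base64-decoded)
# Fails (non-zero exit) if the keystore cannot be opened with that password.
keystore_cert_info() {
  local keystore="$1" pass="$2"
  openssl pkcs12 -in "$keystore" -passin "pass:$pass" -nokeys -clcerts 2>/dev/null |
    openssl x509 -noout -subject -enddate
}

# Example (run as a user that can read the keystore):
#   keystore_cert_info /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 "$STOREPASS"
```

The subject printed should contain the host name the certificate was issued for, for example myserver.mydomain.com.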

Set/Verify Open API Props:

/etc/sysconfig/illumon.d/resources/iris-common.prop
WebServer.tls.keystore=/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
WebServer.tls.passphrase.file=/db/TempFiles/irisadmin/.webapi_passphrase
# Enable Web Sockets for Query Workers
RemoteQueryDispatcher.websocket.enabled=true

Update Query Server Prop File: /etc/sysconfig/illumon.d/resources/iris-common.prop:

Replace two lines of content with the following:

# Set Dispatcher hostname to match the host for your CA-signed certificate:
RemoteQueryDispatcherParameters.host=myserver.mydomain.com

The host set above can also go into iris-common.prop, but it is not required.

Restart Web API Service with monit:

sudo -u irisadmin monit restart web_api_service

Client Update Service

The Client Update Service (CUS) is a process that updates clients with server-side components, including JARs, properties, etc. By default, each Web API Service's web server will host a CUS instance.

When the Client Update Service is running, you can install and run the Launcher on client desktops. The installers for Windows, Mac and Linux desktops can be downloaded from the Client Update Service on your Deephaven Server at:

http://<WEBHOST>/launcher

CUS Reload Procedure

To make new or modified server components available to clients, reload the Client Update Service by navigating to https://<WEBHOST>/reload/

Clients (e.g., the Swing UI) must exit and restart the launcher to download new components. A client that is not restarted may have outdated code or configuration that is incompatible with the Deephaven installation.
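On a server without a browser, the reload page can be requested from the command line. This is a sketch under the assumption that a plain HTTP GET has the same effect as navigating to the page in a browser; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Build the Client Update Service reload URL for a given web host.
cus_reload_url() {
  printf 'https://%s/reload/\n' "$1"
}

# Request the reload page. -f makes curl exit non-zero on HTTP errors,
# -sS hides the progress bar but still shows error messages.
cus_reload() {
  curl -fsS "$(cus_reload_url "$1")"
}
```

For example, `cus_reload myserver.mydomain.com` requests the reload page on that host.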

etcd Process

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | Schema, persistent queries, property files, routing configuration, and optionally ACLs are stored in etcd. etcd is used as a shared store for Authentication and Dispatcher runtime processing. Without etcd, the Deephaven system cannot function. |

Procedures

Check Process is running with systemctl:

sudo systemctl status dh-etcd

Check endpoint status:

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table

View Log Files:

# Get all logs
sudo journalctl -xu dh-etcd
# Follow the logs
sudo journalctl -xefu dh-etcd

Restart Procedure:

sudo systemctl restart dh-etcd

Check connectivity using etcdctl.sh:

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh role list

The etcdctl.sh script is a thin wrapper around etcdctl that passes in the correct user name and password for a given Deephaven role. Each user is stored in a directory of the form /etc/sysconfig/deephaven/etcd/client/<user>.

By default, the script uses the root user. To change the user, you can set the DH_ETCD_USER environment variable or specify the directory manually with the DH_ETCD_DIR environment variable. For example, to get a single schema (replace DbInternal and AuditEventLog with the namespace and name of the table of interest) with the schema-ro user, the following commands are equivalent:

sudo -u irisadmin DH_ETCD_USER=schema-ro /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/schema/DbInternal/tables/AuditEventLog
sudo -u irisadmin DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/schema-ro /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/schema/DbInternal/tables/AuditEventLog
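If you look up schemas often, the key prefix can be built from the namespace and table name. The sketch below follows the key layout used in the commands above; the function names `schema_key` and `get_schema` are hypothetical.

```shell
#!/usr/bin/env bash
# Build the etcd key prefix for one table's schema.
#   $1 namespace, $2 table name
schema_key() {
  printf '/main/config/schema/%s/tables/%s\n' "$1" "$2"
}

# Read one schema using the read-only schema-ro user.
get_schema() {
  sudo -u irisadmin DH_ETCD_USER=schema-ro \
    /usr/illumon/latest/bin/etcdctl.sh get --prefix "$(schema_key "$1" "$2")"
}
```

For example, `get_schema DbInternal AuditEventLog` runs the same lookup as the first command above.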

Show current disk usage per node:

sudo /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out=table

MariaDB (MySQL) Process

If MySQL is used for ACLs, then the MySQL process is necessary for proper system function. If etcd is used for ACLs, then this process is not necessary.

| Level | Impact |
| :---- | :----- |
| Sev 1 - Critical | The Authentication Server, ACL Write Server and Deephaven Clients will be impacted. Query workers will also be affected and unable to check effective user permissions. |

Procedures

Check Process is running:

sudo systemctl status mariadb

View Log File (sudo access required):

sudo cat /var/log/mariadb/mariadb.log

Check Config File Settings:

/etc/my.cnf

Check Settings in Deephaven ACL Database: dbacl_iris

sudo mysql -e "show databases"
sudo mysql -D dbacl_iris -e "show tables"
sudo mysql -D dbacl_iris -e "select * from tableacls"

Restart Procedure:

sudo systemctl restart mariadb