Deephaven Process runbooks

This section outlines the procedures for each Deephaven process.

Incident classification key

| Severity | Description |
| :--- | :--- |
| 0 - None | Process is running (or down as scheduled). |
| 1 - Critical | Process is down when it should be up. |
| 2 - Moderate | Process is up when it should be down; or process is up but configuration is missing. |
| 3 - Low | Process is running but producing errors or performing poorly. |
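The key above can be read as a small decision procedure. The following is a minimal Python sketch of that reading. The function name `classify` and its boolean parameters are hypothetical, and the order in which the conditions are checked is an assumption, since the key does not say which rule wins when several apply.

```python
from enum import IntEnum


class Severity(IntEnum):
    NONE = 0
    CRITICAL = 1
    MODERATE = 2
    LOW = 3


def classify(running, scheduled_up, config_present=True, healthy=True):
    """Map an observed process state to a severity from the key (assumed ordering)."""
    if not running:
        # Down: critical only if the process should currently be up.
        return Severity.CRITICAL if scheduled_up else Severity.NONE
    if not scheduled_up or not config_present:
        # Up when it should be down, or up without its configuration.
        return Severity.MODERATE
    if not healthy:
        # Up and configured, but producing errors or performing poorly.
        return Severity.LOW
    return Severity.NONE
```

For example, `classify(running=False, scheduled_up=True)` yields `Severity.CRITICAL`, while a process that is down as scheduled yields `Severity.NONE`.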

Authentication Server Process

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | New users will be unable to log in or create new queries. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Check status of MariaDB/MySQL dependency (if MySQL is used to store ACLs):

Check etcd endpoint status (if etcd is used to store ACLs):

Restart Procedure:

ACL Write Server Process

| Level | Impact |
| :--- | :--- |
| Sev 2 - Moderate | Administrators will not be able to update user permissions and groups. |

Procedures

Check process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Check status of MariaDB/MySQL dependency (if MySQL is used to store ACLs):

Check etcd endpoint status (if etcd is used to store ACLs):

Restart Procedure:

Configuration Server Process

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | None of the system processes will be able to start. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Persistent Query Controller Process

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | When configured to run with multiple controllers, running Core+ queries are migrated to a running controller. Core+ queries that are not yet running are terminated. All Legacy queries, including WebClientData, are terminated. Until the WebClientData reinitializes, users are not able to load the Deephaven console. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Persistent Query Backup and Restore Process

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | The controller stores persistent queries in etcd, so it is strongly recommended that periodic backups be taken of this data. The ability to restore persistent queries is critical. |

Procedures

To export all Deephaven queries, use the following command:

To import your queries to any controller running the same Deephaven version, use the following command:

It may be useful to keep each query's serial ID so that user workspaces continue to work. To do so, add the following parameter, which keeps each query's original serial ID but skips any query whose serial ID already exists:

--retainSerial=keep

To keep the original serial IDs and also overwrite existing queries with the same IDs, instead use:

--retainSerial=replace

For full details, see the Persistent Query Controller Tool.
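The difference between the two `--retainSerial` modes can be modeled in a few lines of Python. This is only an illustrative sketch of the behavior described above, not the controller tool's implementation. The function `import_queries` and the dictionary representation of queries are hypothetical.

```python
def import_queries(existing, incoming, retain_serial):
    """Model importing queries while retaining their original serial IDs.

    existing: dict mapping serial ID -> query name already on the controller.
    incoming: list of (serial ID, query name) pairs from the export.
    retain_serial: "keep" or "replace", mirroring --retainSerial.
    """
    if retain_serial not in ("keep", "replace"):
        raise ValueError(f"unsupported mode: {retain_serial!r}")
    result = dict(existing)
    for serial, name in incoming:
        if serial in result and retain_serial == "keep":
            continue  # keep: a query with this serial already exists, so skip it
        result[serial] = name  # new serial, or replace: overwrite the existing query
    return result
```

With one existing query at serial 1 and an export containing serials 1 and 2, `keep` leaves serial 1 untouched and adds serial 2, whereas `replace` overwrites serial 1 and adds serial 2.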

Log Aggregator Service (LAS) Process

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | Any process configured to use the LAS will fail to write logs to the database. This will cause failure of these processes, including the query workers. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Tailer 1 Process

| Level | Impact |
| :--- | :--- |
| Sev 2 - Moderate | Users will not be directly affected, but internal Deephaven logs (including state, configuration, process and event logs) will not be written to the database. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Data Import Server Process

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | Binary log file data will not be written to the database. Binary store imports will fail. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Procedure for cleaning up corrupt intraday data

In the event that intraday ticking data becomes corrupted, you do not need to stop the DIS. Instead, simply clean up the intraday data and the DIS's state. In general, that means the following commands, run as the dbmerge user:

For your Order/Event table, you might use:

Deephaven Merge Server Process

| Level | Impact |
| :--- | :--- |
| Sev 2 - Moderate | Persistent queries for Merges and Imports will fail. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Remote Query Dispatcher Process

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | Any running query workers will terminate, and new ones cannot be started. This includes all running persistent queries as well as interactive consoles. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Process Shutdown

Each Deephaven process has a shutdown manager, set by the property default.processEnvironmentFactory. The shutdown manager ensures that processes terminate in an orderly and timely manner. If a process fails to terminate cleanly, the shutdown manager will stop it forcefully after a timeout set by the property ShutdownManager.deephaven.shutdownTimeoutMillis. Modify the following default to change the timeout for worker and dispatcher shutdown.
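The orderly-then-forceful pattern described above is a common one. As an illustration only, the Python sketch below applies it to an ordinary child process: request termination, wait up to a timeout, then kill. It does not use or configure Deephaven's shutdown manager, and it is not the property default mentioned above. The names `stop_process` and `demo` are hypothetical, and the timeout is passed in seconds rather than read from a property.

```python
import subprocess
import sys


def stop_process(proc, timeout_seconds):
    """Ask proc to exit; force-kill it if it is still alive after the timeout."""
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=timeout_seconds)
        return "clean"
    except subprocess.TimeoutExpired:
        proc.kill()  # forceful stop (SIGKILL on POSIX)
        proc.wait()  # reap the child so it does not linger as a zombie
        return "forced"


def demo():
    """Start a child that sleeps, then stop it with a 5-second grace period."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    return stop_process(child, timeout_seconds=5.0)
```

A longer timeout gives a slow process more time to finish cleanly; a shorter one bounds how long a stuck process can delay shutdown.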

Table Data Cache Proxy Process

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | Intraday data will not be available. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Local Table Data Server Process

For an architectural overview of this process and its typical uses, see Local Table Data Service.

| Level | Impact |
| :--- | :--- |
| Sev 2 - Moderate | If the LTDS is configured in the routing, then any data it serves will not be available. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Status Dashboard

| Level | Impact |
| :--- | :--- |
| Sev 2 - Moderate | Status dashboard data will not be available. |

Procedures

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Web API Service Process

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | Both Web API clients and Deephaven Console GUI users will be impacted. Users will not be able to use the Launcher, and Deephaven clients will not be able to receive any updates from the server. |

Procedures

Enable the Web API Service:

The Web API Service is disabled by default.

In the M/Monit config folder, remove the .disabled extension from the Web API Service config file name and run monit reload. This will instruct the M/Monit daemon to reread its configuration and re-initialize.
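The rename step can be scripted. The sketch below is a minimal Python illustration, assuming the disabled config file name simply ends in `.disabled`. The function name `enable_monit_config` and any file name passed to it are hypothetical, and the location of the config folder depends on your installation. After the rename, `monit reload` still has to be run as described above.

```python
from pathlib import Path


def enable_monit_config(disabled_path):
    """Rename '<name>.disabled' to '<name>' and return the new path."""
    source = Path(disabled_path)
    if source.suffix != ".disabled":
        raise ValueError(f"{source} does not end in .disabled")
    target = source.with_suffix("")  # drops only the trailing '.disabled'
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    source.rename(target)
    return target
```

Refusing to overwrite an existing target avoids silently replacing a config file that is already enabled.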

Check Process is running with Monit:

View Application Log Files:

List Log Files for Standard Out/Error:

Restart Procedure:

Web API Server TLS Keystore (.p12 keystore file)

The Web API Server's TLS keystore contains the certificate and private key of a TLS-enabled service. You must keep this file private and not distribute it to clients. The Web API Server's keystore file should be unique per node, with a certificate that is signed (issued) by a trusted CA.

The default self-signed key pair for the Web API Server is generated when installing the iris-config.rpm and saved to a .p12 keystore file. This default keystore will work, but the browser will give security warnings until you use your own CA-signed certificate (see below).

[-r--r----- irisadmin dbquery] webServices-keystore.p12

The Web API Server keystore file is also protected by a unique, randomly generated password. The password is stored in base64-encoded format in a read-only hidden file that is owned by user irisadmin and readable by the dbquery group, with permissions set to 440:

[-r--r----- irisadmin dbquery] .webapi_passphrase
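The expected permission bits can be checked programmatically. The following Python sketch, offered as an illustration under the assumption of a POSIX filesystem, reports whether a file has mode 440 and renders the mode in the same `ls`-style notation used above. The helper names are hypothetical, and the sketch does not check the owner or group.

```python
import os
import stat


def mode_string(path):
    """Return the ls-style mode string for path, e.g. '-r--r-----'."""
    return stat.filemode(os.stat(path).st_mode)


def is_mode_440(path):
    """True if path is readable by owner and group only (mode 440)."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o440
```

Either helper can be pointed at the keystore or passphrase file to confirm that a restore or manual edit has not loosened the permissions.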

Important keystore properties and files

Keystore Filename: /etc/sysconfig/illumon.d/auth/webServices-keystore.p12

Passphrase File: /db/TempFiles/irisadmin/.webapi_passphrase

Keystore Property: WebServer.tls.keystore=/etc/sysconfig/illumon.d/auth/webServices-keystore.p12

Passphrase Property: WebServer.tls.passphrase.file=/db/TempFiles/irisadmin/.webapi_passphrase

[1] If iris-common.prop does not exist (normal for Deephaven versions 20190117 or earlier) or openapi-defaults.prop does not exist (normal for versions 20180803 or earlier):

Alternatively, you can put your includefiles at the top of the iris-query-server.prop file, and manually delete or edit any properties from openapi-defaults.prop that also appear in iris-query-server.prop. Putting the includefiles at the end of the file is easier because they will override other settings, but it can be confusing when a property is defined and then overridden. To keep things cleaner, remove or move any properties with a tls prefix to openapi-defaults.prop. You may also wish to move RemoteQueryDispatcher.websocket.enabled=true.

Securing the Web API Server with your CA-signed Certificate

While the default self-signed certificate is adequate for testing, it triggers browser security warnings and encourages users to ignore them (a very bad habit). Always use a CA-signed certificate for production use.

Obtain a TLS certificate signed by your trusted CA with the domain name matching the Deephaven server, e.g., myserver.mydomain.com.

Back up the existing keystore file:

Import your CA cert and key files to the Web API Service keystore file. For example:

Note

If you are unfamiliar with how to generate a .key and .csr file to get a .crt from a CA, please contact your IT organization.

Set the correct permissions on the web services keystore file:

Set/Verify Open API Props:

Update Query Server Prop File: /etc/sysconfig/illumon.d/resources/iris-common.prop:

Replace two lines of content with the following:

The host set above can also go into iris-common.prop, but it is not required.

Restart Web API Service with monit:

Client Update Service

The Client Update Service (CUS) is a process that updates clients with server-side components, including JARs, properties, etc. By default, each Web API Service's web server will host a CUS instance.

When the Client Update Service is running, you can install and run the Launcher on client desktops. The installers for Windows, Mac and Linux desktops can be downloaded from the Client Update Service on your Deephaven Server at:

http://<WEBHOST>/launcher

CUS Reload Procedure

To make new or modified server components available to clients, reload the Client Update Service by navigating to https://<WEBHOST>/reload/.

Clients (e.g., the Swing UI) must exit and restart the launcher to download new components. A client that is not restarted may have outdated code or configuration that is incompatible with the Deephaven installation.

Note

The Client Update Service is hosted by the Web API Service. This service does not refresh properties before reloading the CUS. If any properties have changed that affect the CUS configuration, such as those described in the client update service customer-updatable values documentation, you must restart the Web API Service.

etcd Process

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | Schema, persistent queries, property files, routing configuration, and optionally ACLs are stored in etcd. etcd is used as a shared store for Authentication and Dispatcher runtime processing. Without etcd, the Deephaven system cannot function. |

Procedures

Check Process is running with systemctl:

Check endpoint status:

View Log Files:

Restart Procedure:

Check connectivity using etcdctl.sh:

The etcdctl.sh script is a thin wrapper around etcdctl that passes in the correct user name and password for a given Deephaven role. Each user is stored in a directory of the form /etc/sysconfig/deephaven/etcd/client/<user>.

By default, the script uses the root user. To change the user, you can set the DH_ETCD_USER environment variable or specify the directory manually with the DH_ETCD_DIR environment variable. For example, to get a single schema (replace DbInternal and AuditEventLog with the namespace and name of the table of interest) with the schema-ro user, the following commands are equivalent:
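The relationship between the two environment variables can be modeled as follows. This Python sketch is an illustration, not the logic of `etcdctl.sh` itself. The function name is hypothetical, and giving `DH_ETCD_DIR` precedence when both variables are set is an assumption that the text above does not confirm.

```python
import os

# Per-user client directories live under this root, as described above.
CLIENT_ROOT = "/etc/sysconfig/deephaven/etcd/client"


def resolve_etcd_client_dir(env=None):
    """Return the client directory implied by DH_ETCD_DIR or DH_ETCD_USER."""
    env = os.environ if env is None else env
    explicit_dir = env.get("DH_ETCD_DIR")
    if explicit_dir:
        return explicit_dir  # assumption: an explicit directory wins
    user = env.get("DH_ETCD_USER") or "root"  # root is the documented default
    return f"{CLIENT_ROOT}/{user}"
```

Under this model, setting `DH_ETCD_USER=schema-ro` and setting `DH_ETCD_DIR` to the `schema-ro` directory select the same credentials, which is why the two forms are interchangeable.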

Show current disk usage per node:

MariaDB (MySQL) Process

If MySQL is used for ACLs, then the MySQL process is necessary for proper system function. If etcd is used for ACLs, then this process is not necessary.

| Level | Impact |
| :--- | :--- |
| Sev 1 - Critical | The Authentication Server, ACL Write Server and Deephaven Clients will be impacted. Query workers will also be affected and unable to check effective user permissions. |

Procedures

Check Process is running:

Sudo access required to view Log File:

sudo cat /var/log/mariadb/mariadb.log

Check Config File Settings:

/etc/my.cnf

Check Settings in Deephaven ACL Database: dbacl_iris

Restart Procedure: