Deephaven Process runbooks
This section outlines the procedures for each Deephaven process.
Incident classification key
Severity | Description |
---|---|
0 - None | Process is running (or down as scheduled). |
1 - Critical | Process is down when it should be up. |
2 - Moderate | Process is up when it should be down; or process is up but configuration is missing. |
3 - Low | Process is running but producing errors or performing poorly. |
Authentication Server Process
Level | Impact |
---|---|
Sev 1 - Critical | New users will be unable to log in or create new queries. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status authentication_server
View Application Log Files:
cat /var/log/deephaven/authentication_server/AuthenticationServer.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/authentication_server/authentication_server.log.????-??-??
Check status of MariaDB/MySQL dependency (if MySQL is used to store ACLs):
sudo systemctl status mariadb
Check etcd endpoint status (if etcd is used to store ACLs):
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table
Restart Procedure:
sudo -u irisadmin monit restart authentication_server
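Most Monit-managed processes in this runbook follow the same pattern: a Monit service name, a log directory under /var/log/deephaven, and an application log prefix. The following is a minimal sketch of that pattern. The function name dh_runbook_cmds is hypothetical and not part of Deephaven; it only prints the commands so they can be reviewed before anything is run.

```shell
# Hypothetical helper, not part of Deephaven: print (do not run) the standard
# runbook commands for one Monit-managed process.
#   $1 = Monit service name      (e.g. authentication_server)
#   $2 = log directory name      (e.g. authentication_server)
#   $3 = application log prefix  (e.g. AuthenticationServer)
dh_runbook_cmds() {
  printf 'sudo -u irisadmin monit status %s\n' "$1"
  printf 'cat /var/log/deephaven/%s/%s.log.current\n' "$2" "$3"
  printf 'ls -ltr /var/log/deephaven/%s/%s.log.????-??-??\n' "$2" "$1"
  printf 'sudo -u irisadmin monit restart %s\n' "$1"
}

dh_runbook_cmds authentication_server authentication_server AuthenticationServer
```

Processes with extra dependencies (MariaDB or etcd for ACLs) still need the additional checks listed in their own sections.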
ACL Write Server Process
Level | Impact |
---|---|
Sev 2 - Moderate | Administrators will not be able to update user permissions and groups |
Procedures
Check process is running with Monit:
sudo -u irisadmin monit status db_acl_write_server
View Application Log Files:
cat /var/log/deephaven/acl_write_server/DbAclWriteServer.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/acl_write_server/db_acl_write_server.log.????-??-??
Check status of MariaDB/MySQL dependency (if MySQL is used to store ACLs):
sudo systemctl status mariadb
Check etcd endpoint status (if etcd is used to store ACLs):
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table
Restart Procedure:
sudo -u irisadmin monit restart db_acl_write_server
Configuration Server Process
Level | Impact |
---|---|
Sev 1 - Critical | None of the system processes will be able to start. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status configuration_server
View Application Log Files:
cat /var/log/deephaven/configuration_server/ConfigurationServer.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/configuration_server/configuration_server.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart configuration_server
Persistent Query Controller Process
Level | Impact |
---|---|
Sev 1 - Critical | When configured to run with multiple controllers, running Core+ queries are migrated to a running controller. Core+ queries that are not yet running are terminated. All Legacy queries, including WebClientData, are terminated. Until the WebClientData reinitializes, users are not able to load the Deephaven console. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status iris_controller
View Application Log Files:
cat /var/log/deephaven/iris_controller/PersistentQueryController.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/iris_controller/iris_controller.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart iris_controller
Persistent Query Backup and Restore Process
Level | Impact |
---|---|
Sev 1 - Critical | The controller stores persistent queries in etcd, so it is strongly recommended that periodic backups be taken of this data. The ability to restore persistent queries is critical. |
Procedures
To export all Deephaven queries, use the following command:
sudo /usr/illumon/latest/bin/dhconfig pq export --file /tmp/PersistentQueryBackup.xml
To import your queries to any controller running the same Deephaven version, use the following command:
sudo /usr/illumon/latest/bin/dhconfig pq import --file /tmp/PersistentQueryBackup.xml
It may be useful to keep each query's serial ID so that user workspaces will continue to work. In this case, you can add the following parameter, which will keep each query's original serial, but not import any query if a query already exists with the same serial:
--retainSerial=keep
To keep the original serial IDs and also overwrite existing queries with the same IDs, instead use:
--retainSerial=replace
For full details, see the Persistent Query Controller Tool.
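As a rough sketch of how the import options above combine, the helper below builds the import command for a backup file and an optional --retainSerial mode. The function name pq_import_cmd is hypothetical, and placing --retainSerial after --file is an assumption; check the Persistent Query Controller Tool documentation for the exact syntax. The helper only prints the command.

```shell
# Hypothetical helper: build (do not run) the dhconfig import command.
#   $1 = backup file
#   $2 = optional --retainSerial mode: keep or replace
pq_import_cmd() {
  case "${2:-}" in
    "")
      printf 'sudo /usr/illumon/latest/bin/dhconfig pq import --file %s\n' "$1" ;;
    keep|replace)
      printf 'sudo /usr/illumon/latest/bin/dhconfig pq import --file %s --retainSerial=%s\n' "$1" "$2" ;;
    *)
      # Reject anything other than the two modes described above.
      echo "unknown retainSerial mode: $2" >&2
      return 1 ;;
  esac
}

pq_import_cmd /tmp/PersistentQueryBackup.xml keep
```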
Log Aggregator Service (LAS) Process
Level | Impact |
---|---|
Sev 1 - Critical | Any process configured to use the LAS will fail to write logs to the database. This will cause failure of these processes, including the query workers. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status log_aggregator_service
View Application Log Files:
cat /var/log/deephaven/las/LogAggregatorService.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/las/log_aggregator_service.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart log_aggregator_service
Tailer 1 Process
Level | Impact |
---|---|
Sev 2 - Moderate | Users will not be directly affected, but internal Deephaven logs (including state, configuration, process and event logs) will not be written to the database. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status tailer1
View Application Log Files:
cat /var/log/deephaven/tailer/LogtailerMain1.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/tailer/tailer1.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart tailer1
Data Import Server Process
Level | Impact |
---|---|
Sev 1 - Critical | Binary log file data will not be written to the database. Binary store imports will fail. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status db_dis
View Application Log Files:
cat /var/log/deephaven/dis/DataImportServer.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/dis/db_dis.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart db_dis
Procedure for cleaning up corrupt intraday data
In the event that intraday ticking data becomes corrupted, you do not need to stop the DIS. Instead, simply clean up the intraday data and the DIS's state. In general, that means the following commands, run as the dbmerge user:
rm -r /db/Intraday/[namespace]/[tablename]/[intraday partition]/[date]
For your Order/Event table, you might use:
rm -r /db/Intraday/Order/Event/*/2018-02-09
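Because rm -r is not reversible, it can help to list the matching directories first. The sketch below assumes the directory layout shown above; the function name intraday_glob is hypothetical.

```shell
# Hypothetical helper: build the intraday glob for a namespace, table, and date.
# The * stands for every intraday partition.
intraday_glob() {
  printf '/db/Intraday/%s/%s/*/%s\n' "$1" "$2" "$3"
}

# Review what would be removed before running rm -r as dbmerge.
# $target is left unquoted on purpose so the shell expands the * component.
target=$(intraday_glob Order Event 2018-02-09)
ls -d $target 2>/dev/null || echo "no match for $target"
```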
Deephaven Merge Server Process
Level | Impact |
---|---|
Sev 2 - Moderate | Persistent queries for Merges and Imports will fail. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status db_merge_server
View Application Log Files:
cat /var/log/deephaven/merge_server/db_merge_server.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/merge_server/db_merge_server.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart db_merge_server
Remote Query Dispatcher Process
Level | Impact |
---|---|
Sev 1 - Critical | Any running query workers will terminate, and new ones cannot be started. This includes all running persistent queries as well as interactive consoles. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status db_query_server
View Application Log Files:
cat /var/log/deephaven/query_server/RemoteQueryDispatcher.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/query_server/db_query_server.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart db_query_server
Process Shutdown
Each Deephaven process has a shutdown manager, set by the property default.processEnvironmentFactory. The shutdown manager ensures that processes terminate in an orderly and timely manner.
If a process fails to terminate cleanly, the shutdown manager will stop it forcefully after a timeout set by the property ShutdownManager.deephaven.shutdownTimeoutMillis.
Modify the following default to change the timeout for worker and dispatcher shutdown.
# override the shutdown timeout for all workers
[service.name=dbquery|dbmerge] {
ShutdownManager.deephaven.shutdownTimeoutMillis=60000
}
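The timeout is expressed in milliseconds, so the 60000 above corresponds to 60 seconds. As a small sketch of the conversion (the function name shutdown_timeout_prop is hypothetical), the following prints the property line for a timeout given in seconds:

```shell
# Hypothetical helper: print the property line for a timeout given in seconds.
shutdown_timeout_prop() {
  printf 'ShutdownManager.deephaven.shutdownTimeoutMillis=%d\n' "$(( $1 * 1000 ))"
}

shutdown_timeout_prop 60
```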
Table Data Cache Proxy Process
Level | Impact |
---|---|
Sev 1 - Critical | Intraday data will not be available. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status db_tdcp
View Application Log Files:
cat /var/log/deephaven/tdcp/TableDataCacheProxy.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/tdcp/db_tdcp.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart db_tdcp
Local Table Data Server Process
Level | Impact |
---|---|
Sev 2 - Moderate | If the LTDS is configured in the routing, then any data it serves will not be available. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status db_ltds
View Application Log Files:
cat /var/log/deephaven/ltds/LocalTableDataServer.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/ltds/db_ltds.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart db_ltds
Status Dashboard
Level | Impact |
---|---|
Sev 2 - Moderate | Status dashboard data will not be available. |
Procedures
Check Process is running with Monit:
sudo -u irisadmin monit status status_dashboard
View Application Log Files:
cat /var/log/deephaven/status_dashboard/StatusDashboard.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/status_dashboard/status_dashboard.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart status_dashboard
Web API Service Process
Level | Impact |
---|---|
Sev 1 - Critical | Both Web API clients and Deephaven Console GUI Users will be impacted. Users will not be able to use the Launcher and Deephaven Clients will not be able to receive any updates from the server. |
Procedures
Enable the Web API Service:
The Web API Service is disabled by default.
In the M/Monit config folder, remove the .disabled extension from the Web API Service config file name and run monit reload. This will instruct the M/Monit daemon to reread its configuration and re-initialize.
cd /etc/sysconfig/illumon.d/monit
mv web_api_service.disabled web_api_service.conf
sudo -u irisadmin monit reload
Check Process is running with Monit:
sudo -u irisadmin monit status web_api_service
View Application Log Files:
cat /var/log/deephaven/web_api_service/WebServer.log.current
List Log Files for Standard Out/Error:
ls -ltr /var/log/deephaven/web_api_service/web_api_service.log.????-??-??
Restart Procedure:
sudo -u irisadmin monit restart web_api_service
Web API Server TLS Keystore (.p12 keystore file)
The Web API Server's TLS keystore contains the certificate and private key of a TLS-enabled service. You must keep this file private and not distribute it to clients. The Web API Server's keystore file should be unique per node, with a certificate that is signed (issued) by a trusted CA.
The default self-signed key pair for the Web API Server is generated when installing the iris-config.rpm and saved to a .p12 keystore file. This default keystore will work, but the browser will give security warnings until you use your own CA-signed certificate (see below).
[-r--r----- irisadmin dbquery ] webServices-keystore.p12
The Web Server keystore file is also protected by a unique, randomly generated password stored in base64-encoded format in a read-only hidden file owned by user irisadmin and readable by the dbquery group, with permissions set to 440:
[-r--r----- irisadmin dbquery] .webapi_passphrase
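A quick way to confirm the 440 permission described above is to compare the file's octal mode with the expected value. This is a minimal sketch: the function name has_mode is hypothetical, and it assumes GNU stat (the -c option), as found on typical Linux hosts.

```shell
# Hypothetical helper: succeed only if a file has the expected octal mode.
# Assumes GNU stat; the -c option is not available in BSD stat.
has_mode() {
  [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Example usage (run with privileges that allow reading the file's metadata):
#   has_mode /db/TempFiles/irisadmin/.webapi_passphrase 440 && echo "mode ok"
```

The same check applies to the keystore file, which is listed above with the same permissions.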
Important keystore properties and files
Keystore Filename:
/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
Passphrase File:
/db/TempFiles/irisadmin/.webapi_passphrase
Keystore Property:
WebServer.tls.keystore=/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
Passphrase Property:
WebServer.tls.passphrase.file=/db/TempFiles/irisadmin/.webapi_passphrase
[1] If iris-common.prop does not exist (normal for Deephaven versions 20190117 or earlier) or openapi-defaults.prop does not exist (normal for versions 20180803 or earlier):
cd /etc/sysconfig/illumon.d/resources/
# Move existing web_api_service props to openapi-defaults
cp web_api_service.prop openapi-defaults.prop
# Replace web_api_service with an includefiles on openapi-defaults
echo includefiles=openapi-defaults.prop > web_api_service.prop
# append the include to the end of the query server configuration
echo includefiles=openapi-defaults.prop >> iris-query-server.prop
Alternatively, you may wish to put your includefiles at the top of the iris-query-server.prop file, and manually delete or edit any properties from openapi-defaults.prop that are found in iris-query-server.prop. Putting the includefiles at the end of the file is easier because it will override other settings, but it may be confusing that a property is defined and then overridden. To keep things cleaner, move any properties with a tls prefix to openapi-defaults.prop. You may also wish to move RemoteQueryDispatcher.websocket.enabled=true as well.
Securing the Web API Server with your CA-signed Certificate
While the default self-signed certificate is good enough for testing, it presents scary security warnings to users, and encourages users to ignore security warnings (a very bad habit), so you should always use a "real" CA-signed certificate for production use.
Obtain a TLS certificate signed by your trusted CA with the domain name matching the Deephaven server, e.g., myserver.mydomain.com.
Back up the existing keystore file:
sudo cp /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 \
/etc/sysconfig/illumon.d/auth/webServices-keystore.p12.ORG
Import your CA cert and key files to the Web API Service keystore file. For example:
STOREPASS=$(sudo cat /db/TempFiles/irisadmin/.webapi_passphrase | base64 --decode)
# This assumes you have stored your own .key and CA-provided .crt in /etc/ssl/certs/tls.* files
openssl pkcs12 -export -in /etc/ssl/certs/tls.crt -inkey /etc/ssl/certs/tls.key -name webapi -out /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 -passout "pass:$STOREPASS"
Note
If you are unfamiliar with how to generate a .key and .csr file to get a .crt from a CA, please contact your IT organization.
Set the correct permissions on the web services keystore file:
sudo chown irisadmin:dbquery \
/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
sudo chmod 440 /etc/sysconfig/illumon.d/auth/webServices-keystore.p12
Set/Verify Open API Props:
/etc/sysconfig/illumon.d/resources/iris-common.prop
WebServer.tls.keystore=/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
WebServer.tls.passphrase.file=/db/TempFiles/irisadmin/.webapi_passphrase
# Enable Web Sockets for Query Workers
RemoteQueryDispatcher.websocket.enabled=true
Update Query Server Prop File: /etc/sysconfig/illumon.d/resources/iris-query-server.prop:
Replace two lines of content with the following:
# Set Dispatcher hostname to match the host for your CA-signed certificate:
RemoteQueryDispatcherParameters.host=myserver.mydomain.com
The host set above can also go into iris-common.prop, but it is not required.
Restart Web API Service with monit:
sudo -u irisadmin monit restart web_api_service
Client Update Service
The Client Update Service (CUS) is a process that updates clients with server-side components, including JARs, properties, etc. By default, each Web API Service's web server will host a CUS instance.
When the Client Update Service is running, you can install and run the Launcher on client desktops. The installers for Windows, Mac and Linux desktops can be downloaded from the Client Update Service on your Deephaven Server at:
http://<WEBHOST>/launcher
CUS Reload Procedure
To make new or modified server components available to clients, reload the Client Update Service by navigating to https://<WEBHOST>/reload/.
Clients (e.g., the Swing UI) must exit and restart the launcher to download new components. A client that is not restarted may have outdated code or configuration that is incompatible with the Deephaven installation.
etcd Process
Level | Impact |
---|---|
Sev 1 - Critical | Schema, persistent queries, property files, routing configuration, and optionally ACLs are stored in etcd. etcd is used as a shared store for Authentication and Dispatcher runtime processing. Without etcd, the Deephaven system cannot function. |
Procedures
Check Process is running with systemctl:
sudo systemctl status dh-etcd
Check endpoint status:
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table
View Log Files:
# Get all logs
sudo journalctl -xu dh-etcd
# Follow the logs
sudo journalctl -xefu dh-etcd
Restart Procedure:
sudo systemctl restart dh-etcd
Check connectivity using etcdctl.sh:
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh role list
The etcdctl.sh script is a thin wrapper around etcdctl that passes in the correct user name and password for a given Deephaven role. Each user is stored in a directory of the form /etc/sysconfig/deephaven/etcd/client/<user>.
By default, the script uses the root user. To change the user, you can set the DH_ETCD_USER environment variable or specify the directory manually with the DH_ETCD_DIR environment variable. For example, to get a single schema (replace DbInternal and AuditEventLog with the namespace and name of the table of interest) with the schema-ro user, the following commands are equivalent:
sudo -u irisadmin DH_ETCD_USER=schema-ro /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/schema/DbInternal/tables/AuditEventLog
sudo -u irisadmin DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/schema-ro /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/schema/DbInternal/tables/AuditEventLog
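To look up other tables, only the namespace and table name in the key prefix change. The sketch below builds that prefix using the key layout from the commands above; the function name schema_key is hypothetical, and the lookup command is printed rather than run.

```shell
# Hypothetical helper: build the etcd key prefix for one table's schema,
# following the key layout used in the commands above.
#   $1 = namespace, $2 = table name
schema_key() {
  printf '/main/config/schema/%s/tables/%s\n' "$1" "$2"
}

# Print (do not run) the read-only lookup command for a table.
echo "sudo -u irisadmin DH_ETCD_USER=schema-ro /usr/illumon/latest/bin/etcdctl.sh get --prefix $(schema_key DbInternal AuditEventLog)"
```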
Show current disk usage per node:
sudo /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out=table
MariaDB (MySQL) Process
If MySQL is used for ACLs, then the MySQL process is necessary for proper system function. If etcd is used for ACLs, then this process is not necessary.
Level | Impact |
---|---|
Sev 1 - Critical | The Authentication Server, ACL Write Server and Deephaven Clients will be impacted. Query workers will also be affected and unable to check effective user permissions. |
Note
See: https://mariadb.org/
Procedures
Check Process is running:
sudo systemctl status mariadb
View Log File (sudo access required):
sudo cat /var/log/mariadb/mariadb.log
Check Config File Settings:
/etc/my.cnf
Check Settings in Deephaven ACL Database: dbacl_iris
sudo mysql -e "show databases"
sudo mysql -D dbacl_iris -e "show tables"
sudo mysql -D dbacl_iris -e "select * from tableacls"
Restart Procedure:
sudo systemctl restart mariadb