Deephaven Process runbooks

This section outlines the procedures for each Deephaven process.

Incident classification key

| Severity | Description |
| --- | --- |
| 0 - None | Process is running (or down as scheduled). |
| 1 - Critical | Process is down when it should be up. |
| 2 - Moderate | Process is up when it should be down; or process is up but configuration is missing. |
| 3 - Low | Process is running but producing errors or performing poorly. |
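The key above can be expressed as a small decision function, which may be handy in monitoring scripts. This is only a sketch: the function name, its arguments, and the "up"/"down"/"errors"/"misconfigured" labels are hypothetical and are not part of Deephaven.

```shell
# classify_severity EXPECTED ACTUAL [HEALTH]
#   EXPECTED, ACTUAL: "up" or "down"
#   HEALTH: "ok" (default), "errors", or "misconfigured"
# Prints the severity label from the classification key.
classify_severity() {
    local expected="$1" actual="$2" health="${3:-ok}"
    if [ "$expected" = "up" ] && [ "$actual" = "down" ]; then
        echo "1 - Critical"
    elif [ "$expected" = "down" ] && [ "$actual" = "up" ]; then
        echo "2 - Moderate"
    elif [ "$actual" = "up" ] && [ "$health" = "misconfigured" ]; then
        echo "2 - Moderate"
    elif [ "$actual" = "up" ] && [ "$health" = "errors" ]; then
        echo "3 - Low"
    else
        echo "0 - None"
    fi
}
```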

Authentication Server Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | New users will be unable to log in or create new queries. |

Procedures

Check Process is running with Monit:

sudo monit status authentication_server

View Log File for successful startup messages:

cat /var/log/deephaven/authentication_server/AuthenticationServer.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

Check status of MariaDB/MySQL dependency:

sudo systemctl status mariadb

Restart Procedure:

sudo monit restart authentication_server
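The log check above uses cat, which prints the entire file. On a long-running process it may be more practical to look only at the most recent lines. The helper below is a minimal sketch that works for any of the log files named in this document; the function name and the default of 50 lines are assumptions.

```shell
# show_startup_log LOG_FILE [LINES]
# Prints the last LINES lines (default 50) of LOG_FILE,
# or fails with a message if the file cannot be read.
show_startup_log() {
    local log_file="$1" lines="${2:-50}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    tail -n "$lines" "$log_file"
}

# Example (run as a user who can read the log directory):
#   show_startup_log /var/log/deephaven/authentication_server/AuthenticationServer.log.current 100
```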

ACL Write Server Process

| Level | Impact |
| --- | --- |
| Sev 2 - Moderate | Administrators will not be able to update user permissions and groups. |

Procedures

Check process is running with Monit:

sudo monit status db_acl_write_server

View Log File for successful startup messages:

cat /var/log/deephaven/acl_write_server/DbAclWriteServer.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

Check status of MariaDB/MySQL dependency:

sudo systemctl status mariadb

Restart Procedure:

sudo monit restart db_acl_write_server

Configuration Server Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | None of the system processes will be able to start. |

Procedures

Check Process is running with Monit:

sudo monit status configuration_server

View Log File for successful startup messages:

cat /var/log/deephaven/configuration_server/ConfigurationServer.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

Restart Procedure:

sudo monit restart configuration_server

Persistent Query Controller Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | All persistent queries for this controller will terminate. Users will not be able to view any persistent queries in the Deephaven Console. |

Procedures

Check Process is running with Monit:

sudo monit status iris_controller

View Log File for successful startup messages:

cat /var/log/deephaven/iris_controller/PersistentQueryController.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

Restart Procedure:

sudo monit restart iris_controller

Cache Backup and Restore Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | The controller cache is the location in which persistent queries are stored, so it is strongly recommended that periodic backups be taken of this data. The ability to restore persistent queries is critical. |

Procedures

To export all Deephaven queries, use the following command:

sudo /usr/illumon/latest/bin/iris controller_tool --export

By default, the file is named controllerToolExport.xml and placed in the controller tool's workspace at:

/db/TempFiles/irisadmin/controller_tool

To import your queries to any controller running the same Deephaven version, use the following command:

sudo /usr/illumon/latest/bin/iris controller_tool --import

It may be useful to keep each query's serial ID so that user workspaces will continue to work. In this case, you can add the following parameter, which will keep each query's original serial, but not import any query if a query already exists with the same serial:

--retainSerial=keep

To keep the original serial IDs and also overwrite existing queries with the same IDs, instead use:

--retainSerial=replace

For full details, see the Persistent Query Controller Tool.
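Because each export writes to the same default file name, a periodic backup job may want to copy the export to a timestamped file after each run. The following is a minimal sketch of that step, assuming the default export file name described above; the function name and the backup directory are hypothetical.

```shell
# backup_controller_export EXPORT_DIR BACKUP_DIR
# Copies controllerToolExport.xml from EXPORT_DIR to a timestamped
# file in BACKUP_DIR and prints the path of the copy.
backup_controller_export() {
    local export_dir="$1" backup_dir="$2"
    local src="$export_dir/controllerToolExport.xml"
    local stamp dest
    if [ ! -f "$src" ]; then
        echo "no export found at $src" >&2
        return 1
    fi
    stamp="$(date +%Y%m%d-%H%M%S)"
    dest="$backup_dir/controllerToolExport-$stamp.xml"
    mkdir -p "$backup_dir" && cp -p "$src" "$dest" && echo "$dest"
}

# Example (the backup directory is a placeholder):
#   sudo /usr/illumon/latest/bin/iris controller_tool --export
#   backup_controller_export /db/TempFiles/irisadmin/controller_tool /path/to/backups
```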

Log Aggregator Service (LAS) Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | Any process configured to use the LAS will fail to write logs to the database. This will cause failure of these processes, including the query workers. |

Procedures

Check Process is running with Monit:

sudo monit status log_aggregator_service

View Log File for successful startup messages:

cat /var/log/deephaven/las/LogAggregatorService.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

Restart Procedure:

sudo monit restart log_aggregator_service

Alternative procedure

Disable the LAS.

Warning

This requires restarting the Remote Query Dispatcher, which will stop all running queries.

To disable the LAS and have processes write their logs to plain text log files, add the following properties to iris-environment.prop:

RemoteQueryProcessor.sendLogsToSystemOut=true
RemoteQueryProcessor.writeDatabaseProcessLogs=false
RemoteQueryProcessor.writeDatabaseAuditLogs=false
RemoteQueryDispatcher.writeDatabaseProcessLogs=false
PersistentQueryController.writeDatabaseAuditLogs=false
DbAclWriteServer.writeDatabaseAuditLogs=false
AuthenticationServer.writeDatabaseAuditLogs=false

Restart the affected Deephaven processes:

sudo monit restart log_aggregator_service
sudo monit restart db_acl_write_server
sudo monit restart authentication_server
sudo monit restart db_query_server
sudo monit restart db_merge_server
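When this procedure is run more than once, it helps if the property edits are idempotent. The sketch below appends each of the seven properties listed above only if the exact line is not already present. The function names are hypothetical, and the sketch assumes the properties belong at the top level of the file rather than inside a scoped block; review the file afterwards.

```shell
# ensure_prop_line FILE LINE
# Appends LINE to FILE unless that exact line is already present.
ensure_prop_line() {
    local file="$1" line="$2"
    if grep -qxF -- "$line" "$file" 2>/dev/null; then
        return 0
    fi
    # Start on a fresh line if the file does not end with a newline.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        printf '\n' >> "$file"
    fi
    printf '%s\n' "$line" >> "$file"
}

# add_las_bypass_props FILE
# Adds the properties that make processes log to plain text files.
add_las_bypass_props() {
    local file="$1" line
    for line in \
        "RemoteQueryProcessor.sendLogsToSystemOut=true" \
        "RemoteQueryProcessor.writeDatabaseProcessLogs=false" \
        "RemoteQueryProcessor.writeDatabaseAuditLogs=false" \
        "RemoteQueryDispatcher.writeDatabaseProcessLogs=false" \
        "PersistentQueryController.writeDatabaseAuditLogs=false" \
        "DbAclWriteServer.writeDatabaseAuditLogs=false" \
        "AuthenticationServer.writeDatabaseAuditLogs=false"
    do
        ensure_prop_line "$file" "$line" || return 1
    done
}

# Example: add_las_bypass_props /path/to/iris-environment.prop
```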

Tailer 1 Process

| Level | Impact |
| --- | --- |
| Sev 2 - Moderate | Users will not be directly affected, but internal Deephaven logs (including state, configuration, process and event logs) will not be written to the database. |

Procedures

Check Process is running with Monit:

sudo monit status tailer1

View Log File for successful startup messages:

cat /var/log/deephaven/tailer/LogtailerMain1.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

Restart Procedure:

sudo monit restart tailer1

Remote Table Appender (Data Import Server) Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | Intraday user data will not be available and updates cannot be written to the database. |

The Remote Table Appender is an instance of a Data Import Server, and in many cases it is the same process as the main Data Import Server Process. If this is the case, refer to Data Import Server Process.

If you have configured a separate process for RTA, you will need to refer to your system to find the service name and configuration. This documentation assumes it is db_rta.

Procedures

Check Process is running with Monit:

sudo monit status db_rta

View Log File for successful startup messages:

cat /var/log/deephaven/dis/<configured process.name>.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/iris-common.prop

Restart Procedure:

sudo monit restart db_rta

Data Import Server Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | Binary log file data will not be written to the database. Binary store imports will fail. |

Procedures

Check Process is running with Monit:

sudo monit status db_dis

View Log File for successful startup messages:

cat /var/log/deephaven/dis/DataImportServer.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

Restart Procedure:

sudo monit restart db_dis

Procedure for cleaning up corrupt intraday data

In the event that intraday ticking data becomes corrupted, you do not need to stop the DIS (since the March 2018 release). Instead, simply clean up the intraday data and the DIS's state. In general, that means the following commands, run as the dbmerge user:

rm -r /db/Intraday/[namespace]/[tablename]/[intraday partition]/[date]

rm /db/TempFiles/dbmerge/db_dis/[intraday partition]/[date]/[namespace].[tablename].userstate

rm /db/TempFiles/dbmerge/db_dis/[intraday partition]/[date]/<log file name>.header # We can probably skip this safely

rm /db/TempFiles/dbmerge/db_dis/[intraday partition].[date].[namespace].[tablename].loaderState

rm /db/Systems/[namespace]/ImportDetails/[intraday partition]/[date]/[tablename].importDetails

For your Order/Event table, you might use:

rm -r /db/Intraday/Order/Event/*/2018-02-09

rm /db/TempFiles/dbmerge/db_dis/*/2018-02-09/Order.Event.userstate

rm /db/TempFiles/dbmerge/db_dis/*/2018-02-09/*Event*bin.header

rm /db/TempFiles/dbmerge/db_dis/*.2018-02-09.Order.Event.loaderState

rm /db/Systems/Order/ImportDetails/*/2018-02-09/Event.importDetails

Note that in the latest Deephaven release, you do not need to stop the DIS, and instead simply need to run:

rm -r /db/Intraday/Order/Event/*/2018-02-09
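Since these commands delete data, it can be safer to generate and review them before running anything. The sketch below only prints the rm commands for a given namespace, table, and date, following the path templates above. The function name is hypothetical, the partition defaults to the * wildcard as in the Order/Event example, and the optional .header file is omitted because its name depends on the log file.

```shell
# print_intraday_cleanup NAMESPACE TABLENAME DATE [PARTITION]
# Prints (does not run) the cleanup commands for one table and date.
# DATE must look like YYYY-MM-DD; PARTITION defaults to "*".
print_intraday_cleanup() {
    local ns="$1" table="$2" date="$3" part="${4:-*}"
    local state_dir="/db/TempFiles/dbmerge/db_dis"
    if [ -z "$ns" ] || [ -z "$table" ]; then
        echo "namespace and table name are required" >&2
        return 1
    fi
    case "$date" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "date must be YYYY-MM-DD" >&2; return 1 ;;
    esac
    printf 'rm -r /db/Intraday/%s/%s/%s/%s\n' "$ns" "$table" "$part" "$date"
    printf 'rm %s/%s/%s/%s.%s.userstate\n' "$state_dir" "$part" "$date" "$ns" "$table"
    printf 'rm %s/%s.%s.%s.%s.loaderState\n' "$state_dir" "$part" "$date" "$ns" "$table"
    printf 'rm /db/Systems/%s/ImportDetails/%s/%s/%s.importDetails\n' "$ns" "$part" "$date" "$table"
}

# Example: print_intraday_cleanup Order Event 2018-02-09
```

Review the printed commands, then run them as the dbmerge user.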

Deephaven Merge Server Process

| Level | Impact |
| --- | --- |
| Sev 2 - Moderate | Persistent queries for Merges and Imports will fail. |

Procedures

Check Process is running with Monit:

sudo monit status db_merge_server

View Log File for successful startup messages:

cat /var/log/deephaven/merge_server/db_merge_server.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

Restart Procedure:

sudo monit restart db_merge_server

Remote Query Dispatcher Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | Any running query workers will terminate, and new ones cannot be started. This includes all running persistent queries as well as interactive consoles. |

Procedures

Check Process is running with Monit:

sudo monit status db_query_server

View Log File for successful startup messages:

cat /var/log/deephaven/query_server/RemoteQueryDispatcher.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

Restart Procedure:

sudo monit restart db_query_server

Process Shutdown

Each Deephaven process has a shutdown manager, set by the property default.processEnvironmentFactory. The shutdown manager ensures that processes terminate in an orderly and timely manner. If a process fails to terminate cleanly, the shutdown manager will stop it forcefully after a timeout set by the property ShutdownManager.deephaven.shutdownTimeoutMillis. Modify the following default to change the timeout for worker and dispatcher shutdown.

# override the shutdown timeout for all workers
[service.name=dbquery|dbmerge] {
    ShutdownManager.deephaven.shutdownTimeoutMillis=60000
}

Local Table Data Server Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | Intraday data for any dates other than currentDateNy() will not be available. |

Procedures

Check Process is running with Monit:

sudo monit status db_ltds

View Log File for successful startup messages:

cat /var/log/deephaven/ltds/LocalTableDataServer.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

Restart Procedure:

sudo monit restart db_ltds

Web API Service Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | Deephaven Console GUI users will not be affected, but Web API clients will be impacted. |

Procedures

Enable the Web API Service:

The Web API Service is disabled by default.

In the M/Monit config folder, remove the .disabled extension from the Web API Service config file name and run monit reload. This will instruct the M/Monit daemon to reread its configuration and re-initialize.

cd /etc/sysconfig/illumon.d/monit
mv web_api_service.disabled web_api_service.conf
sudo monit reload

Check Process is running with Monit:

sudo monit status web_api_service

View Log File for successful startup messages:

cat /var/log/deephaven/misc/WebServer.log.current

Check Property File Settings:

/etc/sysconfig/illumon.d/resources/*.prop

If the above file does not exist (older installations), instead check

/etc/sysconfig/illumon.d/resources/openapi-defaults.prop

On newer installations, web_api_service.prop and iris-query-server.prop will both include openapi-defaults.prop. This reflects the fact that most OpenAPI configuration is shared between the OpenAPI webserver and system query workers. See note [1] below for details.

Restart Procedure:

sudo monit restart web_api_service

Web API Server TLS Keystore (.p12 keystore file)

The Web API Server's TLS keystore contains the certificate and private key of a TLS-enabled service. You must keep this file private, and not distribute it to clients. The Web API Server's keystore file should be unique per node, with a certificate that is signed (issued) by a trusted CA.

The default self-signed key pair for the Web API Server is generated when installing the iris-config.rpm and saved to a .p12 keystore file. This default keystore will work, but the browser will give security warnings until you use your own CA-signed certificate (see below).

[-r--r----- irisadmin dbquery ] webServices-keystore.p12

The Web Server keystore file is also protected by a unique, randomly generated password, stored in base64-encoded format in a read-only hidden file owned by user irisadmin and readable by the dbquery group, with permissions set to 440:

[-r--r----- irisadmin dbquery] .webapi_passphrase

Important keystore properties and files

Keystore Filename: /etc/sysconfig/illumon.d/auth/webServices-keystore.p12

Passphrase File: /db/TempFiles/irisadmin/.webapi_passphrase

Property File: /etc/sysconfig/illumon.d/resources/iris-common.prop

Note, if this file does not exist [1], you can edit the following instead: /etc/sysconfig/illumon.d/resources/openapi-defaults.prop

For legacy installations, you can edit both of the following: /etc/sysconfig/illumon.d/resources/iris-query-server.prop /etc/sysconfig/illumon.d/resources/web_api_service.prop

Keystore Property: WebServer.tls.keystore=/etc/sysconfig/illumon.d/auth/webServices-keystore.p12

Passphrase Property: WebServer.tls.passphrase.file=/db/TempFiles/irisadmin/.webapi_passphrase

[1] If iris-common.prop does not exist (normal for Deephaven versions 20190117 or earlier) or openapi-defaults.prop does not exist (normal for versions 20180803 or earlier):

cd /etc/sysconfig/illumon.d/resources/
# Move existing web_api_service props to openapi-defaults
cp web_api_service.prop openapi-defaults.prop
# Replace web_api_service with an includefiles on openapi-defaults
echo includefiles=openapi-defaults.prop > web_api_service.prop
# append the include to the end of the query server configuration
echo includefiles=openapi-defaults.prop >> iris-query-server.prop

Alternatively, you may wish to put your includefiles at the top of the iris-query-server.prop file, and manually delete or edit any properties from openapi-defaults.prop that are also found in iris-query-server.prop. Putting the includefiles at the end of the file is easier because it will override other settings, but it may be confusing to have a property defined and then overridden. To keep things cleaner, move any properties with a tls prefix to openapi-defaults.prop. You may also wish to move RemoteQueryDispatcher.websocket.enabled=true.

Securing the Web API Server with your CA-signed Certificate

While the default self-signed certificate is good enough for testing, it presents scary security warnings to users, and encourages users to ignore security warnings (a very bad habit), so you should always use a "real" CA-signed certificate for production use.

Obtain a TLS certificate signed by your trusted CA with the domain name matching the Deephaven server, e.g., myserver.mydomain.com.

Back up the existing keystore file:

sudo cp /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 \
/etc/sysconfig/illumon.d/auth/webServices-keystore.p12.ORG

Import your CA cert and key files to the Web API Service keystore file. For example:

STOREPASS=$(sudo cat /db/TempFiles/irisadmin/.webapi_passphrase | base64 --decode)
# This assumes you have stored your own .key and CA-provided .crt in /etc/ssl/certs/tls.* files
openssl pkcs12 -export -in /etc/ssl/certs/tls.crt -inkey /etc/ssl/certs/tls.key -name webapi -out /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 -passout "pass:$STOREPASS"

Note

If you are unfamiliar with how to generate a .key and .csr file to get a .crt from a CA, contact a security professional to help you obtain a .key and .crt.

Set the correct permissions on the web services keystore file:

sudo chown irisadmin:dbquery \
/etc/sysconfig/illumon.d/auth/webServices-keystore.p12

sudo chmod 440 /etc/sysconfig/illumon.d/auth/webServices-keystore.p12
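After changing ownership and permissions, it is worth confirming the result before restarting the service. The following is a minimal sketch that compares a file's owner, group, and mode against an expected value. It assumes GNU stat (as found on typical Linux installations), and the function name is hypothetical.

```shell
# check_file_access FILE "OWNER:GROUP MODE"
# Succeeds if FILE has exactly the expected owner, group, and octal mode.
check_file_access() {
    local file="$1" expected="$2" actual
    actual="$(stat -c '%U:%G %a' "$file")" || return 2
    if [ "$actual" != "$expected" ]; then
        echo "$file is '$actual', expected '$expected'" >&2
        return 1
    fi
}

# Example (run with enough privilege to stat the file):
#   check_file_access /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 "irisadmin:dbquery 440"
```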

Set/Verify Open API Props:

# File: /etc/sysconfig/illumon.d/resources/iris-common.prop
WebServer.tls.keystore=/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
WebServer.tls.passphrase.file=/db/TempFiles/irisadmin/.webapi_passphrase
# Enable Web Sockets for Query Workers
RemoteQueryDispatcher.websocket.enabled=true

Update Query Server Prop File: /etc/sysconfig/illumon.d/resources/iris-common.prop:

Replace two lines of content with the following:

# Set Dispatcher hostname to match the host for your CA-signed certificate:
RemoteQueryDispatcherParameters.host=myserver.mydomain.com

The host set above can also go into iris-common.prop, but it is not required.

Restart Web API Service with monit:

sudo monit restart web_api_service

Client Update Service Process (Lighttpd web server)

| Level | Impact |
| --- | --- |
| Sev 2 - Moderate | Users will not be able to use the Launcher and Deephaven Clients will not be able to receive any updates from the server. |

Procedures

The Client Update Service (CUS) is powered by lighttpd to update clients with server-side components, including JARs, properties, etc.

The CUS is disabled by default for security reasons.

By default, the CUS does not require user authentication. lighttpd provides the basic and digest authentication methods described by RFC 2617.

To enable authentication with users defined in a file, edit /etc/lighttpd/client-update-service.conf and uncomment the lines for mod_auth and mod_authn_file in the server.modules section. Also uncomment the line (further down in the file) to include conf.d/iris-auth.conf.

Authorized users are stored in the htpasswd file:

/etc/lighttpd/illumon-cus.user

The htpasswd file contains the username and the crypt()'ed password separated by a colon. Each entry in the file is terminated by a single newline.

For example:

iris:$apr1$1xsLWNhw$.qiKafnbTpoNda/d6X77l.

You can use the htpasswd utility from the Apache distribution to manage htpasswd files. Note that not all versions of htpasswd default to use Apache's modified MD5 algorithm for passwords, which is required by lighttpd. You can force most to use MD5 by running:

htpasswd -nbm <user> <password>

Append the output of the above command to:

/etc/lighttpd/illumon-cus.user
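Appending works for a new user, but appending again for an existing user leaves two entries for the same name. The sketch below replaces any existing entry for the user before adding the new one. The function name is hypothetical, and the entry is expected in the user:hash form printed by htpasswd -nbm.

```shell
# upsert_htpasswd_entry FILE ENTRY
# ENTRY is "user:hash". Removes any existing line for that user,
# then appends ENTRY, so each user appears exactly once.
upsert_htpasswd_entry() {
    local file="$1" entry="$2"
    local user="${entry%%:*}" tmp
    if [ -z "$user" ] || [ "$user" = "$entry" ]; then
        echo "entry must look like user:hash" >&2
        return 1
    fi
    tmp="$(mktemp)" || return 1
    if [ -f "$file" ]; then
        # Keep every line that belongs to a different user.
        awk -F: -v u="$user" '$1 != u' "$file" > "$tmp"
    fi
    printf '%s\n' "$entry" >> "$tmp"
    # Write through the existing file so its owner and mode are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example (run as a user who can write the file):
#   upsert_htpasswd_entry /etc/lighttpd/illumon-cus.user "$(htpasswd -nbm <user> <password> | head -n 1)"
```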

More information on configuration options is available in lighttpd's documentation.

Securing the Client Update Service (CUS) with HTTPS

To securely enable the CUS on HTTPS port 443:

Obtain a TLS certificate signed by your trusted CA with the domain name matching the Deephaven server, e.g., myserver.mydomain.com.

Concatenate your .crt and .key file together into a single PEM file. For example:

cat /etc/ssl/private/lighttpd.key /etc/ssl/certs/lighttpd.crt \
> /etc/ssl/private/lighttpd.pem
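A combined PEM file contains the private key, so it should not be readable by other users, and it is easy to pass the two input files in the wrong order or to pick up the wrong file. The sketch below checks that each input looks like the expected PEM type and creates the output with a restrictive umask. The function name is hypothetical, and the checks only look at the PEM header lines.

```shell
# build_pem KEY_FILE CRT_FILE OUT_FILE
# Concatenates a PEM private key and certificate into OUT_FILE.
# A newly created OUT_FILE is readable only by its owner.
build_pem() {
    local key="$1" crt="$2" out="$3"
    if ! grep -q -- '-----BEGIN .*PRIVATE KEY-----' "$key"; then
        echo "$key does not look like a PEM private key" >&2
        return 1
    fi
    if ! grep -q -- '-----BEGIN CERTIFICATE-----' "$crt"; then
        echo "$crt does not look like a PEM certificate" >&2
        return 1
    fi
    # The subshell keeps the umask change local to this command.
    ( umask 077 && cat "$key" "$crt" > "$out" )
}

# Example (run as root):
#   build_pem /etc/ssl/private/lighttpd.key /etc/ssl/certs/lighttpd.crt /etc/ssl/private/lighttpd.pem
```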

On the Deephaven Server, edit the /etc/lighttpd/client-update-service.conf file and set the following properties:

server.port = 443
ssl.engine = "enable"
ssl.pemfile = "/etc/ssl/private/lighttpd.pem"

Update the /var/www/lighttpd/iris/iris/getdown.txt.pre file as described in the cleartext HTTP procedure below, replacing http with https. For example:

appbase = https://myserver.mydomain.com/iris/
...
#ui.install_error = http://WEBHOST/iris/error.html
ui.install_error = https://myserver.mydomain.com/iris/error.html

Restart the CUS with monit:

sudo monit restart client_update_service

The "Client Update Service" will be available at: https://myserver.mydomain.com/

Check Process is Running with Monit:

sudo monit status client_update_service

Sudo access required to view Log File for successful startup messages:

/var/log/lighttpd/cus-error.log /var/log/lighttpd/cus-access.log

Sudo access required to check Config File Settings:

/etc/lighttpd/client-update-service.conf

Sudo access required to check Files in Document Root:

/var/www/lighttpd/iris/

Restart Procedure:

sudo monit restart client_update_service

To enable the CUS on cleartext HTTP port 80: (Note: This is not recommended. Do this for testing only, on a trusted private network.)

On the Deephaven Server, edit the /var/www/lighttpd/iris/iris/getdown.txt.pre file:

Set the appbase value, replacing WEBHOST with the FQDN (or IP address) of your Deephaven Server.

For example:

#appbase = http://WEBHOST/iris/
appbase = http://myhost.domain.com/iris/
...
#ui.install_error = http://WEBHOST/iris/error.html
ui.install_error = http://myhost.domain.com/iris/error.html
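The same substitution can be scripted. This is a rough sketch that assumes the file still contains the WEBHOST placeholder on its active (uncommented) lines; it leaves comment lines untouched and keeps a .bak copy of the original. It relies on GNU sed, the function name is hypothetical, and the base URL must not contain the | or & characters.

```shell
# set_getdown_host FILE BASE_URL
# Replaces http://WEBHOST or https://WEBHOST with BASE_URL on every
# line that is not a comment. The original is saved as FILE.bak.
set_getdown_host() {
    local file="$1" base_url="$2"
    sed -i.bak -E "/^[[:space:]]*#/! s|https?://WEBHOST|${base_url}|g" "$file"
}

# Example (run as a user who can write the file):
#   set_getdown_host /var/www/lighttpd/iris/iris/getdown.txt.pre http://myhost.domain.com
```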

In the M/Monit config folder, remove the .disabled extension from the Client Update Service config file name and run monit reload. This will instruct the M/Monit daemon to reread its configuration and re-initialize.

cd /etc/sysconfig/illumon.d/monit
mv cus.conf.disabled cus.conf
sudo monit reload

Check the status of the getdown service:

sudo monit status client_update_service

Once the "Client Update Service" is up and running, you can proceed to install and run the Launcher on client desktops. The installers for Windows, Mac and Linux desktops can be downloaded from the "Client Update Service" on your Deephaven Server at:

http://<IRIS_SERVER_ADDRESS>/

MariaDB (MySQL) Process

| Level | Impact |
| --- | --- |
| Sev 1 - Critical | The Authentication Server, ACL Write Server and Deephaven Clients will be impacted. Query workers will also be affected and unable to check effective user permissions. |

Procedures

Check Process is running:

sudo systemctl status mariadb

Sudo access required to view Log File for successful startup messages:

/var/log/mariadb/mariadb.log

Check Config File Settings:

/etc/my.cnf

Check Settings in Deephaven ACL Database: dbacl_iris

sudo mysql -e "show databases"
sudo mysql -D dbacl_iris -e "show tables"
sudo mysql -D dbacl_iris -e "select * from tableacls"

Restart Procedure:

sudo systemctl restart mariadb
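During an incident it can be useful to run the Monit status check for every core process in one pass. The sketch below loops over the Monit service names used in this document. The function names are hypothetical; the optional services (web_api_service, client_update_service) and the site-specific db_rta are left out and can be added to the list as needed.

```shell
# deephaven_monit_services
# Prints the Monit service names used in this runbook, one per line.
deephaven_monit_services() {
    cat <<'EOF'
authentication_server
db_acl_write_server
configuration_server
iris_controller
log_aggregator_service
tailer1
db_dis
db_merge_server
db_query_server
db_ltds
EOF
}

# status_all [COMMAND...]
# Runs "COMMAND status SERVICE" for each service.
# COMMAND defaults to "sudo monit".
status_all() {
    local svc
    if [ "$#" -eq 0 ]; then
        set -- sudo monit
    fi
    for svc in $(deephaven_monit_services); do
        "$@" status "$svc"
    done
}

# Example: status_all
```

Read the printed status of each process rather than relying on the exit code alone.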