Deephaven Process runbooks
This section outlines the procedures for each Deephaven process.
Incident classification key
Severity | Description |
---|---|
0 - None | Process is running (or down as scheduled). |
1 - Critical | Process is down when it should be up. |
2 - Moderate | Process is up when it should be down; or process is up but configuration is missing. |
3 - Low | Process is running but producing errors or performing poorly. |
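
As a rough illustration of the key above, the following sketch maps an expected and an observed process state to a severity number. The function name `classify_severity` and the state words (`up`, `down`, `degraded`, `misconfigured`) are hypothetical labels chosen for this example; they are not Deephaven terms.

```shell
# Hypothetical helper: map expected/observed state to the severity key above.
# Usage: classify_severity <expected: up|down> <observed: up|down|degraded|misconfigured>
classify_severity() {
    expected=$1
    observed=$2
    case "$expected:$observed" in
        up:up|down:down)  echo 0 ;;  # running, or down as scheduled
        up:down)          echo 1 ;;  # down when it should be up
        down:up)          echo 2 ;;  # up when it should be down
        up:misconfigured) echo 2 ;;  # up, but configuration is missing
        up:degraded)      echo 3 ;;  # running with errors or poor performance
        *) echo "unknown state: $expected:$observed" >&2; return 1 ;;
    esac
}
```

For example, `classify_severity up down` prints `1`.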
Authentication Server Process
Level | Impact |
---|---|
Sev 1 - Critical | New users will be unable to log in or create new queries. |
Procedures
Check Process is running with Monit:
sudo monit status authentication_server
View Log File for successful startup messages:
cat /var/log/deephaven/authentication_server/AuthenticationServer.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
Check status of MariaDB/MySQL dependency:
sudo systemctl status mariadb
Restart Procedure:
sudo monit restart authentication_server
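The status and log checks above follow the same pattern for most processes in this runbook, so they can be wrapped in a small helper. This is only a sketch: `dh_check`, the `MONIT_CMD` override, and the 20-line tail are assumptions made for illustration; the service name and log path are the ones listed above.

```shell
# Hypothetical wrapper around the status and log checks above.
# MONIT_CMD can be overridden (for example, for a dry run); default is "sudo monit".
MONIT_CMD=${MONIT_CMD:-sudo monit}

# Usage: dh_check <monit service name> <log file>
dh_check() {
    service=$1
    logfile=$2
    # Report the Monit status of the service.
    $MONIT_CMD status "$service" || return 1
    # Show the most recent log lines, if the log is readable.
    if [ -r "$logfile" ]; then
        tail -n 20 "$logfile"
    else
        echo "log not readable: $logfile" >&2
        return 2
    fi
}
```

For the Authentication Server this would be called as `dh_check authentication_server /var/log/deephaven/authentication_server/AuthenticationServer.log.current`.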
ACL Write Server Process
Level | Impact |
---|---|
Sev 2 - Moderate | Administrators will not be able to update user permissions and groups |
Procedures
Check process is running with Monit:
sudo monit status db_acl_write_server
View Log File for successful startup messages:
cat /var/log/deephaven/acl_write_server/DbAclWriteServer.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
Check status of MariaDB/MySQL dependency:
sudo systemctl status mariadb
Restart Procedure:
sudo monit restart db_acl_write_server
Configuration Server Process
Level | Impact |
---|---|
Sev 1 - Critical | None of the system processes will be able to start. |
Procedures
Check Process is running with Monit:
sudo monit status configuration_server
View Log File for successful startup messages:
cat /var/log/deephaven/configuration_server/ConfigurationServer.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
Restart Procedure:
sudo monit restart configuration_server
Persistent Query Controller Process
Level | Impact |
---|---|
Sev 1 - Critical | All persistent queries for this controller will terminate. Users will not be able to view any persistent queries in the Deephaven Console. |
Procedures
Check Process is running with Monit:
sudo monit status iris_controller
View Log File for successful startup messages:
cat /var/log/deephaven/iris_controller/PersistentQueryController.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
Restart Procedure:
sudo monit restart iris_controller
Cache Backup and Restore Process
Level | Impact |
---|---|
Sev 1 - Critical | The controller cache is the location in which persistent queries are stored, so it is strongly recommended that periodic backups be taken of this data. The ability to restore persistent queries is critical. |
Procedures
To export all Deephaven queries, use the following command:
sudo /usr/illumon/latest/bin/iris controller_tool --export
By default, the file is named controllerToolExport.xml and placed in the controller tool's workspace at:
/db/TempFiles/irisadmin/controller_tool
To import your queries to any controller running the same Deephaven version, use the following command:
sudo /usr/illumon/latest/bin/iris controller_tool --import
It may be useful to keep each query's serial ID so that user workspaces will continue to work. In this case, you can add the following parameter, which will keep each query's original serial, but not import any query if a query already exists with the same serial:
--retainSerial=keep
To keep the original serial IDs and also overwrite existing queries with the same IDs, instead use:
--retainSerial=replace
For full details, see the Persistent Query Controller Tool.
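Since periodic backups are recommended, one option is to copy each export to a dated file after running the export command above. The sketch below handles only the copy step; `archive_export`, the backup directory argument, and the file naming scheme are assumptions, while the default export path is the one given above.

```shell
# Hypothetical helper: copy a controller_tool export to a dated backup file.
# Usage: archive_export <backup dir> [export file]
archive_export() {
    backup_dir=$1
    # Default export location, as described above.
    export_file=${2:-/db/TempFiles/irisadmin/controller_tool/controllerToolExport.xml}
    if [ ! -s "$export_file" ]; then
        echo "export file missing or empty: $export_file" >&2
        return 1
    fi
    mkdir -p "$backup_dir" || return 1
    target="$backup_dir/controllerToolExport-$(date +%Y%m%d-%H%M%S).xml"
    # Preserve timestamps and print the backup path on success.
    cp -p "$export_file" "$target" && echo "$target"
}
```

The printed path can be recorded in whatever backup log you already keep. Reading the default export location may require the same privileges as the export itself.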
Log Aggregator Service (LAS) Process
Level | Impact |
---|---|
Sev 1 - Critical | Any process configured to use the LAS will fail to write logs to the database. This will cause failure of these processes, including the query workers. |
Procedures
Check Process is running with Monit:
sudo monit status log_aggregator_service
View Log File for successful startup messages:
cat /var/log/deephaven/las/LogAggregatorService.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
Restart Procedure:
sudo monit restart log_aggregator_service
Alternative procedure
Disable the LAS.
Warning
This requires restarting the Remote Query Dispatcher, which will stop all running queries.
To disable the LAS and have processes write their logs to plain text log files, add the following properties to iris-environment.prop:
RemoteQueryProcessor.sendLogsToSystemOut=true
RemoteQueryProcessor.writeDatabaseProcessLogs=false
RemoteQueryProcessor.writeDatabaseAuditLogs=false
RemoteQueryDispatcher.writeDatabaseProcessLogs=false
PersistentQueryController.writeDatabaseAuditLogs=false
DbAclWriteServer.writeDatabaseAuditLogs=false
AuthenticationServer.writeDatabaseAuditLogs=false
Restart the affected Deephaven processes:
sudo monit restart log_aggregator_service
sudo monit restart db_acl_write_server
sudo monit restart authentication_server
sudo monit restart db_query_server
sudo monit restart db_merge_server
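When adding the properties listed above, it helps to avoid duplicating a key that is already set. A minimal sketch, assuming the properties are simply appended to the end of the file: `add_prop_if_missing` is a hypothetical helper, and the property file path is passed in rather than assumed.

```shell
# Hypothetical helper: append "key=value" to a property file unless the key
# is already set there. Existing values are reported and left untouched.
# Usage: add_prop_if_missing <prop file> <key> <value>
add_prop_if_missing() {
    prop_file=$1
    key=$2
    value=$3
    # Find the first uncommented line whose key (text before '=') matches exactly.
    existing=$(awk -F= -v k="$key" '$1 == k { print; exit }' "$prop_file")
    if [ -n "$existing" ]; then
        echo "already set: $existing"
        return 0
    fi
    # Make sure the file ends with a newline before appending.
    if [ -n "$(tail -c 1 "$prop_file")" ]; then
        echo >> "$prop_file"
    fi
    printf '%s=%s\n' "$key" "$value" >> "$prop_file"
    echo "added: $key=$value"
}
```

For example, `add_prop_if_missing <path to>/iris-environment.prop RemoteQueryProcessor.sendLogsToSystemOut true`. The helper does not understand scoped blocks in property files, so review the result before restarting any process.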
Tailer 1 Process
Level | Impact |
---|---|
Sev 2 - Moderate | Users will not be directly affected, but internal Deephaven logs (including state, configuration, process and event logs) will not be written to the database. |
Procedures
Check Process is running with Monit:
sudo monit status tailer1
View Log File for successful startup messages:
cat /var/log/deephaven/tailer/LogtailerMain1.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
Restart Procedure:
sudo monit restart tailer1
Remote Table Appender (Data Import Server) Process
Level | Impact |
---|---|
Sev 1 - Critical | Intraday user data will not be available and updates cannot be written to the database. |
The Remote Table Appender is an instance of a Data Import Server, and in many cases it is the same process as the main Data Import Server Process. If this is the case, refer to Data Import Server Process.
If you have configured a separate process for RTA, you will need to refer to your system to find the service name and configuration. This documentation assumes it is db_rta.
Procedures
Check Process is running with Monit:
sudo monit status db_rta
View Log File for successful startup messages:
cat /var/log/deephaven/dis/<configured process.name>.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/iris-common.prop
Restart Procedure:
sudo monit restart db_rta
Data Import Server Process
Level | Impact |
---|---|
Sev 1 - Critical | Binary log file data will not be written to the database. Binary store imports will fail. |
Procedures
Check Process is running with Monit:
sudo monit status db_dis
View Log File for successful startup messages:
cat /var/log/deephaven/dis/DataImportServer.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
Restart Procedure:
sudo monit restart db_dis
Procedure for cleaning up corrupt intraday data
In the event that intraday ticking data becomes corrupted, you do not need to stop the DIS (since the March 2018 release). Instead, simply clean up the intraday data and the DIS's state. In general, that means the following commands, run as the dbmerge user:
rm -r /db/Intraday/[namespace]/[tablename]/[intraday partition]/[date]
rm /db/TempFiles/dbmerge/db_dis/[intraday partition]/[date]/[namespace].[tablename].userstate
rm /db/TempFiles/dbmerge/db_dis/[intraday partition]/[date]/<log file name>.header # We can probably skip this safely
rm /db/TempFiles/dbmerge/db_dis/[intraday partition].[date].[namespace].[tablename].loaderState
rm /db/Systems/[namespace]/ImportDetails/[intraday partition]/[date]/[tablename].importDetails
For your Order/Event table, you might use:
rm -r /db/Intraday/Order/Event/*/2018-02-09
rm /db/TempFiles/dbmerge/db_dis/*/2018-02-09/Order.Event.userstate
rm /db/TempFiles/dbmerge/db_dis/*/2018-02-09/*Event*bin.header
rm /db/TempFiles/dbmerge/db_dis/*.2018-02-09.Order.Event.loaderState
rm /db/Systems/Order/ImportDetails/*/2018-02-09/Event.importDetails
Note that in the latest Deephaven release, you do not need to stop the DIS; you only need to run:
rm -r /db/Intraday/Order/Event/*/2018-02-09
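Because these commands delete data, it can be safer to print them for review before running anything. The sketch below only prints the commands from the general form above for a given namespace, table, and date; `print_intraday_cleanup` is a hypothetical name, and `*` stands in for the intraday partition and log file name exactly as in the Order/Event example.

```shell
# Hypothetical dry-run helper: print (do not run) the cleanup commands above.
# Usage: print_intraday_cleanup <namespace> <tablename> <date>
print_intraday_cleanup() {
    ns=$1
    table=$2
    day=$3
    state=/db/TempFiles/dbmerge/db_dis
    # The '*' is printed literally; it matches every intraday partition when run.
    printf 'rm -r /db/Intraday/%s/%s/*/%s\n' "$ns" "$table" "$day"
    printf 'rm %s/*/%s/%s.%s.userstate\n' "$state" "$day" "$ns" "$table"
    # Header pattern follows the example above (log file names containing the table name).
    printf 'rm %s/*/%s/*%s*bin.header\n' "$state" "$day" "$table"
    printf 'rm %s/*.%s.%s.%s.loaderState\n' "$state" "$day" "$ns" "$table"
    printf 'rm /db/Systems/%s/ImportDetails/*/%s/%s.importDetails\n' "$ns" "$day" "$table"
}
```

Running `print_intraday_cleanup Order Event 2018-02-09` reproduces the five example commands shown above, which can then be reviewed and run as the dbmerge user.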
Deephaven Merge Server Process
Level | Impact |
---|---|
Sev 2 - Moderate | Persistent queries for Merges and Imports will fail. |
Procedures
Check Process is running with Monit:
sudo monit status db_merge_server
View Log File for successful startup messages:
cat /var/log/deephaven/merge_server/db_merge_server.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
Restart Procedure:
sudo monit restart db_merge_server
Remote Query Dispatcher Process
Level | Impact |
---|---|
Sev 1 - Critical | Any running query workers will terminate, and new ones cannot be started. This includes all running persistent queries as well as interactive consoles. |
Procedures
Check Process is running with Monit:
sudo monit status db_query_server
View Log File for successful startup messages:
cat /var/log/deephaven/query_server/RemoteQueryDispatcher.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
Restart Procedure:
sudo monit restart db_query_server
Process Shutdown
Each Deephaven process has a shutdown manager, set by the property default.processEnvironmentFactory. The shutdown manager ensures that processes terminate in an orderly and timely manner.
If a process fails to terminate cleanly, the shutdown manager will stop it forcefully after a timeout set by the property ShutdownManager.deephaven.shutdownTimeoutMillis.
Modify the following default to change the timeout for worker and dispatcher shutdown.
# override the shutdown timeout for all workers
[service.name=dbquery|dbmerge] {
ShutdownManager.deephaven.shutdownTimeoutMillis=60000
}
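Before changing the timeout, it can be useful to see where the property is already set. A small sketch, assuming the property files live in the resources directory used throughout this runbook; `find_shutdown_timeout` is a hypothetical helper.

```shell
# Hypothetical helper: list every line that mentions the shutdown timeout property.
# Usage: find_shutdown_timeout [resources dir]
find_shutdown_timeout() {
    dir=${1:-/etc/sysconfig/illumon.d/resources}
    # -r: search recursively, -n: show line numbers, -F: match the name literally.
    grep -rnF 'ShutdownManager.deephaven.shutdownTimeoutMillis' "$dir"
}
```

Each match is printed as `file:line:content`; no output means the property is not overridden in that directory.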
Local Table Data Server Process
Level | Impact |
---|---|
Sev 1 - Critical | Intraday data for any dates other than currentDateNy() will not be available. |
Procedures
Check Process is running with Monit:
sudo monit status db_ltds
View Log File for successful startup messages:
cat /var/log/deephaven/ltds/LocalTableDataServer.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
Restart Procedure:
sudo monit restart db_ltds
Web API Service Process
Level | Impact |
---|---|
Sev 1 - Critical | Deephaven Console GUI users will not be affected, but Web API clients will be impacted. |
Procedures
Enable the Web API Service:
The Web API Service is disabled by default.
In the Monit config folder, remove the .disabled extension from the Web API Service config file name and run monit reload. This instructs the Monit daemon to reread its configuration and reinitialize.
cd /etc/sysconfig/illumon.d/monit
mv web_api_service.disabled web_api_service.conf
sudo monit reload
Check Process is running with Monit:
sudo monit status web_api_service
View Log File for successful startup messages:
cat /var/log/deephaven/misc/WebServer.log.current
Check Property File Settings:
/etc/sysconfig/illumon.d/resources/*.prop
If the above file does not exist (older installations), instead check
/etc/sysconfig/illumon.d/resources/openapi-defaults.prop
On newer installations, web_api_service.prop and iris-query-server.prop will both include openapi-defaults.prop. This reflects the fact that most OpenAPI configuration is shared between the OpenAPI webserver and system query workers. See note [1] below for details.
Restart Procedure:
sudo monit restart web_api_service
Web API Server TLS Keystore (.p12 keystore file)
The Web API Server's TLS keystore contains the certificate and private key of a TLS-enabled service. You must keep this file private and not distribute it to clients. The Web API Server's keystore file should be unique per node, with a certificate that is signed (issued) by a trusted CA.
The default self-signed key pair for the Web API Server is generated when installing the iris-config.rpm and saved to a .p12 keystore file. This default keystore will work, but the browser will give security warnings until you use your own CA-signed certificate (see below).
[-r--r----- irisadmin dbquery ] webServices-keystore.p12
The Web Server keystore file is also protected by a unique, randomly generated password stored in base64-encoded format in a read-only hidden file owned by user irisadmin and readable by the dbquery group, with permissions set to 440:
[-r--r----- irisadmin dbquery] .webapi_passphrase
Important keystore properties and files
Keystore Filename:
/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
Passphrase File:
/db/TempFiles/irisadmin/.webapi_passphrase
Property File:
/etc/sysconfig/illumon.d/resources/iris-common.prop
Note, if this file does not exist [1], you can edit the following instead:
/etc/sysconfig/illumon.d/resources/openapi-defaults.prop
For legacy installations, you can edit both of the following:
/etc/sysconfig/illumon.d/resources/iris-query-server.prop
/etc/sysconfig/illumon.d/resources/web_api_service.prop
Keystore Property:
WebServer.tls.keystore=/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
Passphrase Property:
WebServer.tls.passphrase.file=/db/TempFiles/irisadmin/.webapi_passphrase
[1] If iris-common.prop does not exist (normal for Deephaven versions 20190117 or earlier) or openapi-defaults.prop does not exist (normal for versions 20180803 or earlier):
cd /etc/sysconfig/illumon.d/resources/
# Move existing web_api_service props to openapi-defaults
cp web_api_service.prop openapi-defaults.prop
# Replace web_api_service with an includefiles on openapi-defaults
echo includefiles=openapi-defaults.prop > web_api_service.prop
# append the include to the end of the query server configuration
echo includefiles=openapi-defaults.prop >> iris-query-server.prop
Alternatively, you may wish to put your includefiles at the top of the iris-query-server.prop file, and manually delete or edit any properties from openapi-defaults.prop that are found in iris-query-server.prop. Putting the includefiles at the end of the file is easier because it will override other settings, but it may be confusing to have a property defined and then overridden. To keep things cleaner, remove or move any properties with a tls prefix to openapi-defaults.prop. You may also wish to move RemoteQueryDispatcher.websocket.enabled=true.
Securing the Web API Server with your CA-signed Certificate
While the default self-signed certificate is good enough for testing, it presents scary security warnings to users, and encourages users to ignore security warnings (a very bad habit), so you should always use a "real" CA-signed certificate for production use.
Obtain a TLS certificate signed by your trusted CA with the domain name matching the Deephaven server, e.g., myserver.mydomain.com.
Back up the existing keystore file:
sudo cp /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 \
/etc/sysconfig/illumon.d/auth/webServices-keystore.p12.ORG
Import your CA cert and key files to the Web API Service keystore file. For example:
STOREPASS=$(sudo cat /db/TempFiles/irisadmin/.webapi_passphrase | base64 --decode)
# This assumes you have stored your own .key and CA-provided .crt in /etc/ssl/certs/tls.* files
openssl pkcs12 -export -in /etc/ssl/certs/tls.crt -inkey /etc/ssl/certs/tls.key -name webapi -out /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 -passout pass:$STOREPASS
Note
If you are unfamiliar with how to generate a .key and .csr file to get a .crt from a CA, contact a security professional to help you obtain a .key and .crt.
Set the correct permissions on the web services keystore file:
sudo chown irisadmin:dbquery \
/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
sudo chmod 440 /etc/sysconfig/illumon.d/auth/webServices-keystore.p12
Set/Verify Open API Props:
/etc/sysconfig/illumon.d/resources/iris-common.prop
WebServer.tls.keystore=/etc/sysconfig/illumon.d/auth/webServices-keystore.p12
WebServer.tls.passphrase.file=/db/TempFiles/irisadmin/.webapi_passphrase
# Enable Web Sockets for Query Workers
RemoteQueryDispatcher.websocket.enabled=true
Update the Query Server Prop File (/etc/sysconfig/illumon.d/resources/iris-common.prop):
Replace two lines of content with the following:
# Set Dispatcher hostname to match the host for your CA-signed certificate:
RemoteQueryDispatcherParameters.host=myserver.mydomain.com
The host set above can also go into iris-common.prop, but it is not required.
Restart Web API Service with monit:
sudo monit restart web_api_service
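After replacing the keystore, the 440 mode described above can be confirmed with a short helper. This sketch assumes GNU `stat`, as found on typical Linux hosts; `check_mode` is a hypothetical name.

```shell
# Hypothetical helper: confirm a file has the expected octal mode.
# Assumes GNU stat (-c); BSD/macOS stat uses different options.
# Usage: check_mode <file> <expected mode, e.g. 440>
check_mode() {
    actual=$(stat -c '%a' "$1") || return 2
    if [ "$actual" = "$2" ]; then
        echo "ok: $1 has mode $actual"
    else
        echo "mismatch: $1 has mode $actual, expected $2" >&2
        return 1
    fi
}
```

For example, `check_mode /etc/sysconfig/illumon.d/auth/webServices-keystore.p12 440`. The same check applies to the passphrase file.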
Client Update Service Process (Lighttpd web server)
Level | Impact |
---|---|
Sev 2 - Moderate | Users will not be able to use the Launcher and Deephaven Clients will not be able to receive any updates from the server. |
Procedures
The Client Update Service (CUS) is powered by lighttpd and updates clients with server-side components, including JARs, properties, etc.
The CUS is disabled by default for security reasons.
By default, the CUS does not require user authentication, but lighttpd provides the basic and digest authentication methods described by RFC 2617.
To enable authentication with users defined in a file, edit /etc/lighttpd/client-update-service.conf and uncomment the lines for mod_auth and mod_authn_file in the server.modules section. Also uncomment the line (further down in the file) to include conf.d/iris-auth.conf.
Authorized users are stored in the htpasswd file:
/etc/lighttpd/illumon-cus.user
The htpasswd file contains the username and the crypt()'ed password separated by a colon. Each entry in the file is terminated by a single newline.
For example:
iris:$apr1$1xsLWNhw$.qiKafnbTpoNda/d6X77l.
You can use the htpasswd utility from the Apache distribution to manage htpasswd files. Note that not all versions of htpasswd default to use Apache's modified MD5 algorithm for passwords, which is required by lighttpd. You can force most to use MD5 by running:
htpasswd -nbm <user> <password>
Append the output of the above command to:
/etc/lighttpd/illumon-cus.user
More information on configuration options is available in lighttpd's documentation.
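Because the entries are expected in the `user:$apr1$...` form shown above, a quick format check can catch entries written with a different algorithm. This is a sketch only: `check_apr1_entries` is a hypothetical helper, and the pattern checks just the overall shape of each line (user, `$apr1$`, salt, hash), not whether a password is correct.

```shell
# Hypothetical helper: print htpasswd lines that are not user:$apr1$salt$hash.
# Returns 0 when every line matches, 1 if any line does not, 2 if unreadable.
# Usage: check_apr1_entries <htpasswd file>
check_apr1_entries() {
    [ -r "$1" ] || return 2
    # -v: select non-matching lines, -n: prefix them with line numbers.
    if grep -vnE '^[^:]+:\$apr1\$[^$]+\$[./0-9A-Za-z]+$' "$1"; then
        return 1   # grep printed at least one malformed line
    fi
    return 0
}
```

For example, `sudo sh -c '. ./check.sh; check_apr1_entries /etc/lighttpd/illumon-cus.user'`, where `check.sh` is a hypothetical file holding the function.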
Securing the Client Update Service (CUS) with HTTPS
To securely enable the CUS on HTTPS port 443:
Obtain a TLS certificate signed by your trusted CA with the domain name matching the Deephaven server, e.g., myserver.mydomain.com.
Concatenate your .crt and .key files together into a single PEM file. For example:
cat /etc/ssl/private/lighttpd.key /etc/ssl/certs/lighttpd.crt \
> /etc/ssl/private/lighttpd.pem
On the Deephaven Server, edit the /etc/lighttpd/client-update-service.conf file and set the following properties:
server.port = 443
ssl.engine = "enable"
ssl.pemfile = "/etc/ssl/private/lighttpd.pem"
Update the /var/www/lighttpd/iris/iris/getdown.txt.pre file as described in the cleartext HTTP procedure below, replacing http with https. For example:
appbase = https://myserver.mydomain.com/iris/
...
#ui.install_error = http://WEBHOST/iris/error.html
ui.install_error = https://myserver.mydomain.com/iris/error.html
Restart the CUS with monit:
sudo monit restart client_update_service
The "Client Update Service" will be available at: https://myserver.mydomain.com/
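The getdown edit above can also be scripted. A minimal sketch, assuming the file still contains the `http://WEBHOST/` placeholder on its `appbase` and `ui.install_error` lines: `render_getdown` is a hypothetical helper, and unlike the example above it rewrites the active lines rather than commenting out the originals. It prints the result instead of editing the file in place.

```shell
# Hypothetical helper: print getdown.txt.pre content with the WEBHOST
# placeholder replaced. Lines starting with '#' are left unchanged.
# Usage: render_getdown <file> <scheme: http|https> <host>
render_getdown() {
    # On every non-comment line, swap http://WEBHOST/ for <scheme>://<host>/.
    sed -e '/^#/!s|http://WEBHOST/|'"$2://$3"'/|g' "$1"
}
```

One way to use it is to redirect the output to a temporary file, review the differences, and then copy it over the original with sudo.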
Check Process is Running with Monit:
sudo monit status client_update_service
Sudo access required to view Log File for successful startup messages:
/var/log/lighttpd/cus-error.log
/var/log/lighttpd/cus-access.log
Sudo access required to check Config File Settings:
/etc/lighttpd/client-update-service.conf
Sudo access required to check Files in Document Root:
/var/www/lighttpd/iris/
Restart Procedure:
sudo monit restart client_update_service
To enable the CUS on cleartext HTTP port 80 (not recommended; do this only for testing on a trusted private network):
On the Deephaven Server, edit the /var/www/lighttpd/iris/iris/getdown.txt.pre file:
Set the appbase value, replacing WEBHOST with the FQDN (or IP address) of your Deephaven Server.
For example:
#appbase = http://WEBHOST/iris/
appbase = http://myhost.domain.com/iris/
...
#ui.install_error = http://WEBHOST/iris/error.html
ui.install_error = http://myhost.domain.com/iris/error.html
In the Monit config folder, remove the .disabled extension from the Client Update Service config file name and run monit reload. This instructs the Monit daemon to reread its configuration and reinitialize.
cd /etc/sysconfig/illumon.d/monit
sudo mv cus.conf.disabled cus.conf
sudo monit reload
Check the status of the getdown service:
sudo monit status client_update_service
Once the "Client Update Service" is up and running, you can proceed to install and run the Launcher on client desktops. The installers for Windows, Mac and Linux desktops can be downloaded from the "Client Update Service" on your Deephaven Server at:
http://<IRIS_SERVER_ADDRESS>/
MariaDB (MySQL) Process
Level | Impact |
---|---|
Sev 1 - Critical | The Authentication Server, ACL Write Server and Deephaven Clients will be impacted. Query workers will also be affected and unable to check effective user permissions. |
Note
See: https://mariadb.org/
Procedures
Check Process is running:
sudo systemctl status mariadb
Sudo access required to view Log File for successful startup messages:
/var/log/mariadb/mariadb.log
Check Config File Settings:
/etc/my.cnf
Check Settings in Deephaven ACL Database: dbacl_iris
sudo mysql -e "show databases"
sudo mysql -D dbacl_iris -e "show tables"
sudo mysql -D dbacl_iris -e "select * from tableacls"
Restart Procedure:
sudo systemctl restart mariadb