System Administration best practices

Backups

You should regularly take backups of your Deephaven installation, including data and configuration. Key areas include:

  • Configuration data. This includes configuration files, ACLs, schema files, the routing YAML, the WorkspaceData table, and possibly the whole etcd database. See the backup and restore page for further information; in particular, the backup script may be of use.
  • Persistent queries should be backed up periodically for easy restoration.
  • Historical data is written to a standard filesystem and can be very large. The system administrator should balance cost and required data availability, and different solutions (such as RAID6 or alternate-site backups) may be ideal for different types of historical data.
  • Intraday data. Intraday data cannot easily be backed up while it is being written; see the resilience planning section below for ways to write it to multiple servers at once.
  • User tables (in /db/Users and /db/IntradayUser) should be considered similar to historical data, with the system administrator determining the best solution to balance cost and availability.

Resilience Planning

Deephaven supports options to provide high availability in the event that individual nodes or processes fail. Production environments should consider the following suggestions (solutions and procedures should be verified in a test environment):

  • Multiple authentication servers provide resiliency against authentication failures.
  • Multiple configuration servers provide resiliency against configuration server issues.
  • A data import server failover group provides a remote location containing duplicated intraday data; this allows quick failover in the event of a site disaster.
  • A multi-node etcd cluster should be used to ensure that a single etcd failure does not cause issues.
  • There is no automated way to copy Persistent Queries and configuration between sites, so backups should be periodically copied from the primary site to any backup installations.

See the resilience planning page for more details.

Merge, Validate, and Delete Intraday Data

Deephaven writes internal data to intraday tables on an ongoing basis. To keep this data from growing without bound and filling the limited intraday storage, data that must be retained long-term should be merged to historical storage and then deleted from intraday storage, optionally after validating the merged data.

Several tables are critical to merge and save. For these tables:

  • A merge should be run each night after all the data for the previous day has been written. See merging data for further information, including creating Persistent Queries to do this.
  • After the merge is complete, the previous day's intraday data should be deleted, possibly after validating the merged data. See validating data for further details, including creating Persistent Queries to do this.
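
As a simple sanity check before deleting a day's intraday data, the merged historical partition can be compared to the intraday partition. This is a minimal sketch only (the linked validation documentation describes the supported approach); the date below is a placeholder, and the comparison is only meaningful once the intraday partition is no longer being written:

// Compare intraday and merged historical row counts for one AuditEventLog partition.
// The date is a placeholder; substitute the partition that was just merged.
intradayCount = db.liveTable("DbInternal", "AuditEventLog").where("Date=`2024-06-03`").size()
historicalCount = db.historicalTable("DbInternal", "AuditEventLog").where("Date=`2024-06-03`").size()
println "intraday=${intradayCount}, historical=${historicalCount}"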

The following Deephaven internal tables are important to merge and save long-term:

  • AuditEventLog contains important events regarding the use of the system, including access and table use. On production servers, this should be kept indefinitely.
  • PersistentQueryConfigurationLogV2 includes details of all Persistent Query modifications and is used if a Persistent Query is reverted to a previous version.
  • PersistentQueryStateLog contains information about every Persistent Query run (including exception details for failed queries) and is critical for debugging Persistent Query failures.
  • WorkspaceData contains details on web server workspaces and dashboards and must be retained.
  • ProcessEventLog contains detailed output from every Persistent Query and code studio. This is critical for understanding issues, but it is large and less likely to be useful after more than a few days.

On the other hand, the internal tables that record process and performance details are useful for troubleshooting but may not be worth merging and keeping long-term. Consider keeping that data in intraday storage and deleting it after a set period of time, such as 7 days. If it is kept longer (through merges), the data may be useful in debugging performance differences between releases.
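
When investigating a Persistent Query failure, the tables above can be queried directly from a Groovy console. This is a hedged example; the column names shown (Owner, Name, Status) are typical for PersistentQueryStateLog but should be confirmed against your installation's schema:

// Today's Persistent Query state transitions, most recent first.
pqStates = db.liveTable("DbInternal", "PersistentQueryStateLog")
    .where("Date=today()")
    .view("Timestamp", "Owner", "Name", "Status")
    .sortDescending("Timestamp")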

Historical data directory structure

To scale to potentially large amounts of historical data, it is essential that the directory structure is set up appropriately. The historical data methodologies page contains suggestions on how to manage this.
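
As a rough illustration only (the linked page is authoritative, and details vary by installation and version), historical data for a namespace is typically organized into numbered internal partitions, each of which can be placed on its own filesystem; additional partitions can be added as data grows:

/db/Systems/<namespace>/WritablePartitions/0   (symlink to the Partitions entry that receives new merges)
/db/Systems/<namespace>/Partitions/0/<date partition>/<table name>/...
/db/Systems/<namespace>/Partitions/1/<date partition>/<table name>/...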

Workspace Data snapshots

Deephaven provides a tool that snapshots the WorkspaceData table into the WorkspaceDataSnapshot table. Once snapshots are created, the web server no longer needs to scan the entire WorkspaceData table to discover user web data. Historical WorkspaceData partitions should not be deleted; they are not accessed by the WebClientData query in the current version of Deephaven, but they may still be useful as historical information.

As of the Grizzly release (1.20240517), a Persistent Query called WorkspaceSnapshot is automatically created to run this tool each night. Like the merge and validation Persistent Queries, it should be checked for successful completion.

More information on the WorkspaceDataSnapshot table can be found at internal tables.
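
One way to confirm that the nightly snapshot ran is to filter the PersistentQueryStateLog (shown earlier) for the WorkspaceSnapshot query. As above, the Name and Status column names are typical but should be verified against your schema:

// Today's state transitions for the WorkspaceSnapshot Persistent Query.
snapshotRuns = db.liveTable("DbInternal", "PersistentQueryStateLog")
    .where("Date=today()", "Name=`WorkspaceSnapshot`")
    .view("Timestamp", "Status")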

Monitoring

Proactive monitoring of the Deephaven installation is recommended to prevent issues from impacting production systems. The status dashboard provides one option to help with this.

Etcd health

Etcd is one of the system's most critical components; if it fails, the entire system will be unusable. The etcd section of the support page and the etcd runbook provide information on etcd monitoring.

The etcd cluster recovery guide should never be needed if etcd is properly monitored.

Deephaven processes

Several Deephaven processes are critical to the operation of the system and should be actively monitored. If these processes fail, various aspects of the Deephaven installation will not function.

  • The Authentication Server (authentication_server) is required for all Deephaven authentication. Multiple instances can be run to reduce the likelihood of failure. See the runbook for further information.
  • Processes retrieve their Deephaven configuration from the Configuration Server (configuration_server). If it fails, Deephaven processes won't start, and running processes may fail. See the runbook for further information.
  • The Persistent Query Controller (iris_controller) controls all Persistent Queries. If it fails, all Persistent Queries will fail and won't be able to run until it's restarted. See the runbook and controller configuration pages for further information.
  • The Remote Query Dispatchers (db_merge_server and db_query_server) control workers; if they fail, all of their workers are terminated. See the query server runbook and merge server runbook for further information.
  • The Data Import Server (db_dis) handles intraday data. If it fails, intraday data won't be available and many Persistent Queries will fail. See the runbook for further information.
  • The Table Data Cache Proxy (db_tdcp) serves intraday data, so if it fails that data won't be available. See the runbook for further information.
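
In addition to external process monitoring, error output can be surfaced from the ProcessEventLog for processes that log to it (workers always do). This is a hedged sketch; the Level, Process, and LogEntry column names are typical for ProcessEventLog but should be confirmed against your schema:

// Today's ERROR-level log lines from the critical processes listed above.
procErrors = db.liveTable("DbInternal", "ProcessEventLog")
    .where("Date=today()", "Level=`ERROR`",
        "Process in `iris_controller`, `db_dis`, `db_tdcp`, `authentication_server`, `configuration_server`")
    .view("Timestamp", "Host", "Process", "LogEntry")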

Data latency

Since Deephaven provides real-time data, it may be useful to monitor how current that data is. The following Groovy script produces, for each of two internal tables, a single-row table holding the most recent value in its Timestamp column.

import io.deephaven.engine.util.SortedBy
// Single-row tables containing the most recent Timestamp in today's
// ProcessEventLog and AuditEventLog partitions; they update as new rows arrive.
PELWatch = SortedBy.sortedLastBy(db.liveTable("DbInternal", "ProcessEventLog").where("Date=today()").view("Timestamp"), "Timestamp")
AELWatch = SortedBy.sortedLastBy(db.liveTable("DbInternal", "AuditEventLog").where("Date=today()").view("Timestamp"), "Timestamp")
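
If an explicit staleness measure is preferred, the result can be extended with a lag column. This is a hedged sketch assuming the standard now() and diffNanos() date-time built-ins are available to the query language; note that the value is only recomputed when the source table ticks:

// Approximate data latency in seconds for the ProcessEventLog.
PELLag = PELWatch.update("LagSeconds = diffNanos(Timestamp, now()) / 1000000000")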