System Administration best practices

Backups

You should regularly take backups of your Deephaven installation, including data and configuration. Key areas include:

  • Configuration data. This includes configuration files, ACLs, schema files, the routing YAML, the WorkspaceData table, and possibly the whole etcd database. See the backup and restore page for further information; in particular, the backup script may be of use.
  • Persistent queries should be backed up periodically for easy restoration.
  • Historical data is written to a standard filesystem and can be very large. The system administrator should balance cost and required data availability, and different solutions (such as RAID6 or alternate-site backups) may be ideal for different types of historical data.
  • Intraday data. While you can't easily back up intraday data as it's being written, see the resilience planning section below for ways to write it to multiple servers at once.
  • User tables (in /db/Users and /db/IntradayUser) should be considered similar to historical data, with the system administrator determining the best solution to balance cost and availability.
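The configuration backup in the first bullet can be sketched as a small shell script. The archive location, the example paths, and the direct etcdctl invocation below are assumptions for illustration only; where available, prefer the supplied backup script described on the backup and restore page.

```shell
#!/bin/sh
# Illustrative configuration backup. BACKUP_DIR and the example paths
# are assumptions -- adjust them to your installation.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/deephaven}"
STAMP=$(date +%Y%m%d-%H%M%S)

backup_dirs() {
    # Archive the given directories into a timestamped tarball and
    # print the archive path.
    archive="$BACKUP_DIR/deephaven-config-$STAMP.tar.gz"
    mkdir -p "$BACKUP_DIR"
    tar -czf "$archive" "$@" 2>/dev/null
    echo "$archive"
}

# Example (hypothetical configuration directory):
# backup_dirs /etc/sysconfig/deephaven

# An etcd snapshot can be taken with etcdctl v3 (requires cluster
# credentials in the environment):
# ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/etcd-$STAMP.db"
```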

Resilience planning

Deephaven supports options to provide high availability in the event of the failure of individual nodes or processes. Production environments should consider the following suggestions, and all solutions and procedures should be verified in a test environment:

  • Multiple authentication servers provide resiliency against authentication failures.
  • Multiple configuration servers provide resiliency against configuration server issues.
  • A data import server failover group can provide a remote location containing duplicated intraday data; this allows quick failover in the event of a site disaster.
  • A multi-node etcd cluster should be used to ensure that a single etcd failure does not cause issues.
  • There is no automated way to copy Persistent Queries and configuration between sites, so backups should be periodically copied from the primary site to any backup installations.

See the resilience planning page for more details.

Data lifecycle management

Internal Deephaven data requires regular lifecycle management to prevent storage issues and ensure data protection. For comprehensive guidance on data storage strategies, merge procedures, validation processes, and specific table requirements, see Data and configuration storage.

Historical data directory structure

To ensure scalability to handle potentially large amounts of historical data, it is essential that the directory structure is set up appropriately. The section on mounting strategies for historical data contains suggestions on how to manage this.
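As a rough illustration of one common convention, historical storage for a namespace is laid out as numbered partition directories, with symlinks selecting which partitions are open for writing. The Partitions/WritablePartitions layout shown here is an assumption; follow the mounting strategies section for the authoritative structure.

```shell
#!/bin/sh
# Illustrative partition layout for one namespace root. The
# Partitions/WritablePartitions convention shown here is an assumption;
# verify it against the mounting strategies documentation.
make_partitions() {
    # $1 = namespace root directory, $2 = number of partitions
    root="$1"; count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Each numbered directory would typically be a separate mount.
        mkdir -p "$root/Partitions/$i"
        i=$((i + 1))
    done
    # Symlinks under WritablePartitions mark partitions open for writing.
    mkdir -p "$root/WritablePartitions"
    ln -sfn "$root/Partitions/0" "$root/WritablePartitions/0"
}

# make_partitions /db/Systems/MyNamespace 4
```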

Monitoring

Proactive monitoring of the Deephaven installation is recommended to prevent issues impacting production systems. The status dashboard provides one option to help with this.

Etcd health

Etcd is one of the system's most critical components; if it fails, the entire system will be unusable. The support page etcd section and etcd runbook provide some information on etcd monitoring.

If etcd is properly monitored, the etcd cluster recovery guide should rarely, if ever, be needed.
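As one concrete probe, etcdctl can report per-endpoint health. The wrapper below is an illustrative sketch (not part of Deephaven): it assumes etcdctl v3 on the PATH, configured via the usual ETCDCTL_* environment variables, and treats any endpoint not reporting healthy as a failure.

```shell
#!/bin/sh
# Illustrative etcd health probe. ETCDCTL may be overridden (e.g. for
# testing); by default the etcdctl binary on the PATH is used.
check_etcd() {
    # Capture health output for every cluster endpoint; a failed
    # command or any line not reporting "is healthy" counts as failure.
    out=$("${ETCDCTL:-etcdctl}" endpoint health --cluster 2>&1) || return 1
    echo "$out" | grep -qv "is healthy" && return 1
    return 0
}

# Cron-friendly usage (alerting mechanism is a placeholder):
# check_etcd || logger -p user.err "etcd cluster unhealthy"
```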

Deephaven processes

Several Deephaven processes are critical to the operation of the system and should be actively monitored. If these processes fail, various aspects of the Deephaven installation will not function.

  • The Authentication Server (authentication_server) is required for all Deephaven authentication. Multiple instances can be run to reduce the likelihood of failure. See the runbook for further information.
  • Processes retrieve their Deephaven configuration from the Configuration Server (configuration_server). If it fails, Deephaven processes won't start, and running processes may fail. See the runbook for further information.
  • The Persistent Query Controller (iris_controller) controls all Persistent Queries. If it fails, all Persistent Queries will fail and won't be able to run until it's restarted. See the runbook and controller configuration pages for further information.
  • The Remote Query Dispatchers (db_merge_server and db_query_server) control workers; if a dispatcher fails, all of its workers are terminated. See the query server runbook and merge server runbook for further information.
  • The Data Import Server (db_dis) handles intraday data. If it fails, intraday data won't be available and many Persistent Queries will fail. See the runbook for further information.
  • The Table Data Cache Proxy (db_tdcp) serves intraday data, so if it fails, that data won't be available. See the runbook for further information.
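A minimal liveness check for these processes might look like the sketch below, using pgrep against the internal process names listed above. This is only an illustration; name matching with pgrep is approximate, and a process supervisor or the status dashboard gives stronger guarantees.

```shell
#!/bin/sh
# Illustrative liveness check using the internal process names from
# this page. Matching on names with pgrep -f is approximate; a process
# supervisor gives stronger guarantees.
CRITICAL="authentication_server configuration_server iris_controller \
db_merge_server db_query_server db_dis db_tdcp"

check_processes() {
    # Report each process in $1 with no running instance; return 1 if
    # any are down.
    missing=0
    for proc in $1; do
        if ! pgrep -f "$proc" >/dev/null 2>&1; then
            echo "DOWN: $proc"
            missing=1
        fi
    done
    return "$missing"
}

# check_processes "$CRITICAL" || echo "critical process down" >&2
```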

Data latency

Since Deephaven provides real-time data, it may be useful to monitor how current that data is. The following Groovy script produces, for a table with a Timestamp column, a single ticking row containing the most recently updated timestamp; this example watches the DbInternal ProcessEventLog and AuditEventLog tables.

import io.deephaven.engine.util.SortedBy

// Single-row ticking table holding the latest Timestamp in today's ProcessEventLog partition
PELWatch = SortedBy.sortedLastBy(db.liveTable("DbInternal", "ProcessEventLog").where("Date=today()").view("Timestamp"), "Timestamp")
// The same check for the AuditEventLog
AELWatch = SortedBy.sortedLastBy(db.liveTable("DbInternal", "AuditEventLog").where("Date=today()").view("Timestamp"), "Timestamp")