Data and configuration storage

Protecting both critical operational data and system configuration is fundamental to ensuring the resilience of a Deephaven deployment. A robust strategy involves understanding where different types of data and configuration are stored, their respective lifecycles, and the mechanisms available to safeguard them against loss or corruption. This document details Deephaven's approach to data and configuration storage, covering centralized cluster settings, application-specific UI configurations, various forms of table data (intraday, historical, user-saved), and essential system files. By outlining these storage strategies and their associated resilience mechanisms, this guide provides administrators with the knowledge to implement comprehensive data protection and ensure consistent system behavior.

Centralized cluster configuration

The core configuration of a Deephaven cluster is managed centrally to ensure consistency across all nodes.

  • Content: This includes Persistent Query (PQ) definitions, user and group information, table routing settings, and other critical system settings.
  • Storage: This information is stored in etcd, a distributed key-value store.
  • Resilience: etcd provides automatic fault tolerance. When deployed with an odd number of nodes (three or more), an etcd cluster can withstand node failures while maintaining availability and data consistency.
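
Why odd sizes: etcd stays writable only while a majority of members is healthy, so fault tolerance follows from simple integer arithmetic. A quick illustration in plain Python (no etcd or Deephaven API involved):

```python
# An etcd cluster of n members needs a majority (n // 2 + 1) of them
# healthy to accept writes, so it tolerates (n - 1) // 2 failed members.
for n in range(1, 8):
    print(f"{n} member(s): quorum={n // 2 + 1}, tolerates {(n - 1) // 2} failure(s)")
```

Four members tolerate no more failures than three, and six no more than five, so even-sized clusters add cost without adding fault tolerance.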

Application and UI configuration

User-specific configurations, such as web dashboards and UI layouts, are stored in a Deephaven table.

  • Content: User-created dashboards, table views, and other UI customizations.
  • Storage: This content is stored in the DbInternal.WorkspaceData internal table.
  • Resilience: Like other tables, WorkspaceData has both an intraday and a historical component. To ensure UI configurations are durable, merges must be run regularly on this table. This process consolidates the real-time changes into the historical, long-term storage. Failure to merge this table can result in the loss of recent UI changes if intraday data is lost.
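
One way to confirm that these merges are keeping up is to check the newest Date partition in the historical table. A monitoring sketch, assuming a Deephaven Core+ Python worker where `db` is in scope and the standard `Date` partitioning column:

```python
# Most recent merged Date partition of DbInternal.WorkspaceData; if this
# lags far behind today, WorkspaceData merges are not running or failing.
last_merged = (
    db.historical_table("DbInternal", "WorkspaceData")
    .select_distinct(["Date"])
    .sort(["Date"])
    .tail(1)
)
```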

Table data storage

Deephaven manages several types of table data, each with a distinct storage strategy.

Intraday data

  • Content: This includes real-time data from ticking tables and any intraday user tables created during a session.
  • Storage: Intraday data is written to local disk on the server that ingests it (typically a Data Import Server, or DIS). It resides in directories like /db/Intraday and /db/IntradayUser.
  • Resilience: This data is transient by nature. Resilience is achieved through redundant ingestion strategies and, most importantly, by running end-of-day merges to write the data to durable long-term storage.
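
Because unmerged intraday data accumulates on local disk, it is worth tracking how much space it consumes. A sketch in plain Python (no Deephaven API) over the directories named above:

```python
from pathlib import Path

def total_bytes(path: Path) -> int:
    """Sum the sizes of all regular files under `path`."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

# Report per-namespace usage under each intraday root.
for root in (Path("/db/Intraday"), Path("/db/IntradayUser")):
    if not root.is_dir():
        continue
    for ns in sorted(p for p in root.iterdir() if p.is_dir()):
        print(f"{ns}: {total_bytes(ns) / 1e9:.2f} GB")
```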

Historical and user table data

  • Content: This includes the output of merge operations (historical data) and any tables explicitly saved by users.
  • Storage: This data is stored in shared directories, such as /db/Systems and /db/User. These directories must be on a shared file system (like NFS) that is accessible from all query servers.
  • Resilience: The fault tolerance of this data is entirely dependent on the resilience of the underlying shared storage. It is critical that the storage solution (e.g., an NFS server or a cloud storage bucket) provides its own data protection, such as RAID, snapshots, or replication.
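
It is also worth verifying that these paths actually resolve to shared storage rather than local disk. A Linux-specific sketch in plain Python that reads /proc/mounts; the set of "shared" filesystem types is illustrative, not exhaustive:

```python
from pathlib import Path

SHARED_FS = {"nfs", "nfs4", "cifs", "fuse.s3fs"}  # illustrative list

def mount_of(target: str) -> tuple[str, str]:
    """Return (mount_point, fs_type) for the longest mount prefix of `target`."""
    best = ("/", "unknown")
    for line in Path("/proc/mounts").read_text().splitlines():
        _device, mount_point, fs_type, *_rest = line.split()
        if target.startswith(mount_point) and len(mount_point) >= len(best[0]):
            best = (mount_point, fs_type)
    return best

for path in ("/db/Systems", "/db/User"):
    mount_point, fs_type = mount_of(path)
    print(f"{path}: {mount_point} ({fs_type}), shared={fs_type in SHARED_FS}")
```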

System files and libraries

Certain files are required on the local filesystem of each Deephaven server.

  • Content: This includes custom Java libraries (.jar files), Python packages, and local configuration files (e.g., under /etc/sysconfig/deephaven).
  • Storage: These are stored on the local filesystem of each server.
  • Resilience: To ensure consistency, it is best practice to manage these files using configuration management tools like Ansible, Puppet, or Chef. Alternatively, critical directories (like java_lib/) can be mounted from a shared, read-only location to guarantee that all servers use the same libraries.
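
Where configuration management is not in place, drift between servers can at least be detected. A sketch in plain Python that prints a checksum manifest of a library directory; run it on each server and diff the results (the java_lib location is a placeholder to adjust for your install):

```python
import hashlib
from pathlib import Path

LIB_DIR = Path("java_lib")  # point this at the server's java_lib directory

def sha256(path: Path) -> str:
    """Hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# One stable line per jar: "<digest>  <relative path>"; diff across servers.
for jar in sorted(LIB_DIR.rglob("*.jar")):
    print(f"{sha256(jar)}  {jar.relative_to(LIB_DIR)}")
```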

Data lifecycle management

To maintain system health and prevent storage issues, internal Deephaven data requires regular lifecycle management through merging, validation, and deletion processes.

Critical tables requiring regular merges

Several internal tables are essential to merge and retain long-term:

  • AuditEventLog - Contains important events regarding system use, including access and table operations. Should be kept indefinitely on production servers.
  • PersistentQueryConfigurationLogV2 - Includes details of all PQ modifications and is used when reverting queries to previous versions.
  • PersistentQueryStateLog - Contains information about every PQ run, including exception details for failed queries. Critical for debugging failures.
  • WorkspaceData - Contains web server workspaces and dashboards. Must be retained for UI functionality.
  • ProcessEventLog - Contains detailed output from every PQ and Code Studio. Critical for issue analysis but large in size and typically useful for only a few days.
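
This retention guidance can be captured as data for whatever cleanup automation you build. A sketch mirroring the list above; the ProcessEventLog day count is illustrative, not a product default:

```python
# None means "retain indefinitely"; an integer is days of history to keep.
RETENTION_DAYS = {
    "AuditEventLog": None,                      # audit trail: keep forever
    "PersistentQueryConfigurationLogV2": None,  # needed to revert PQs
    "PersistentQueryStateLog": None,            # needed to debug PQ failures
    "WorkspaceData": None,                      # backs web UI workspaces
    "ProcessEventLog": 5,                       # large; useful for a few days
}
```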

Performance monitoring tables

The following tables provide process and performance details but may not require long-term retention. Consider keeping their data in intraday storage and deleting it after a set period (e.g., 7 days). Examples include:

  • QueryPerformanceLog
  • QueryOperationPerformanceLog
  • UpdatePerformanceLog
  • ProcessInfo
  • ProcessMetrics
  • ServerStateLog
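
A companion cleanup sketch in plain Python: it scans for ISO-date partition directories older than the retention window and only prints candidates, leaving actual deletion (after validation) to deliberate wiring:

```python
import re
from datetime import date, timedelta
from pathlib import Path

RETAIN_DAYS = 7  # matches the example retention period above
cutoff = (date.today() - timedelta(days=RETAIN_DAYS)).isoformat()
date_dir = re.compile(r"\d{4}-\d{2}-\d{2}$")

# ISO dates sort lexicographically, so string comparison against the cutoff
# is correct. Nothing is deleted here; candidates are only printed.
for part in Path("/db/Intraday/DbInternal").rglob("*"):
    if part.is_dir() and date_dir.match(part.name) and part.name < cutoff:
        print(f"deletion candidate: {part}")
```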

Merge and validation procedures

For critical tables:

  1. Run nightly merges after all data for the previous day has been written.
  2. Validate merged data before deleting intraday data.
  3. Delete previous day's intraday data after successful validation.

These processes should be automated using PQs. See the merging data and validating data guides for implementation details.
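
The sequence can be expressed as a small driver script. An orchestration sketch in Python; merge_table, validate_table, and delete_intraday are hypothetical placeholders for the procedures in those guides, not Deephaven APIs:

```python
from datetime import date, timedelta

CRITICAL_TABLES = [
    "AuditEventLog",
    "PersistentQueryConfigurationLogV2",
    "PersistentQueryStateLog",
    "WorkspaceData",
    "ProcessEventLog",
]

def merge_table(namespace: str, table: str, partition: str) -> None:
    """Placeholder: run the merge as described in the merging-data guide."""

def validate_table(namespace: str, table: str, partition: str) -> bool:
    """Placeholder: run validation as described in the validating-data guide."""
    return True

def delete_intraday(namespace: str, table: str, partition: str) -> None:
    """Placeholder: remove the intraday partition once validation passes."""

def nightly_maintenance(partition: str) -> None:
    """Merge, validate, and only then delete intraday data for one day."""
    for table in CRITICAL_TABLES:
        merge_table("DbInternal", table, partition)
        if not validate_table("DbInternal", table, partition):
            raise RuntimeError(f"validation failed: {table} / {partition}")
        delete_intraday("DbInternal", table, partition)

# Run for the previous day, after all of its data has been written.
nightly_maintenance((date.today() - timedelta(days=1)).isoformat())
```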

Workspace data snapshots

Deephaven provides a tool that snapshots the WorkspaceData table into the WorkspaceDataSnapshot table. This optimization lets the web server avoid scanning the entire WorkspaceData table when discovering user web data.

As of the Grizzly release (1.20240517), a PQ called WorkspaceSnapshot automatically runs this tool nightly. Monitor this query for successful completion alongside merge and validation queries.
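
This monitoring can be done from the state log itself. A sketch assuming a Deephaven Core+ Python worker where `db` is in scope; the Name and Status columns follow DbInternal conventions but should be checked against your schema:

```python
# State transitions for the WorkspaceSnapshot PQ from the intraday state
# log; add a Date filter to narrow to a specific day. A missing or failed
# terminal status by morning warrants investigation.
snapshot_runs = (
    db.live_table("DbInternal", "PersistentQueryStateLog")
    .where(["Name = `WorkspaceSnapshot`"])
    .view(["Timestamp", "Status"])
)
```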

Summary of best practices

  • Use a 3+ node etcd cluster to protect core cluster configuration.
  • Use resilient shared storage (e.g., NFS, S3) for historical and user table data, and ensure the storage itself is fault-tolerant.
  • Merge data regularly, especially the DbInternal.WorkspaceData table, to protect UI configurations and intraday data.
  • Automate system file consistency across nodes using configuration management tools or shared mounts.
  • Monitor the nightly WorkspaceSnapshot PQ to ensure snapshots are created and web server performance stays optimal.