Data and configuration storage
Protecting both critical operational data and system configuration is fundamental to ensuring the resilience of a Deephaven deployment. A robust strategy involves understanding where different types of data and configuration are stored, their respective lifecycles, and the mechanisms available to safeguard them against loss or corruption. This document details Deephaven's approach to data and configuration storage, covering centralized cluster settings, application-specific UI configurations, various forms of table data (intraday, historical, user-saved), and essential system files. By outlining these storage strategies and their associated resilience mechanisms, this guide provides administrators with the knowledge to implement comprehensive data protection and ensure consistent system behavior.
Centralized cluster configuration
The core configuration of a Deephaven cluster is managed centrally to ensure consistency across all nodes.
- Content: This includes Persistent Query (PQ) definitions, user and group information, table routing settings, and other critical system settings.
- Storage: This information is stored in `etcd`, a distributed key-value store.
- Resilience: `etcd` provides automatic fault tolerance. When deployed with an odd number of nodes (three or more), an `etcd` cluster can tolerate the failure of a minority of its members (for example, one node in a three-node cluster) while maintaining availability and data consistency.
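This guarantee follows from `etcd`'s Raft quorum arithmetic: a cluster of *n* members stays writable as long as a majority (⌊n/2⌋ + 1) remains healthy. A minimal sketch of that arithmetic (illustrative only, not part of `etcd` itself):

```python
def etcd_fault_tolerance(members: int) -> int:
    """Return how many member failures an etcd cluster of the
    given size can tolerate while keeping a Raft quorum."""
    if members < 1:
        raise ValueError("cluster must have at least one member")
    quorum = members // 2 + 1   # majority needed to commit writes
    return members - quorum     # failures that still leave a quorum

# A 3-node cluster tolerates 1 failure; 5 nodes tolerate 2.
# A 4th node does NOT help: quorum rises to 3 while tolerance
# stays at 1, which is why odd cluster sizes are recommended.
for n in (1, 3, 4, 5):
    print(n, "->", etcd_fault_tolerance(n))
```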
Application and UI configuration
User-specific configurations, such as web dashboards and UI layouts, are stored in a Deephaven table.
- Content: User-created dashboards, table views, and other UI customizations.
- Storage: This content is stored in the `DbInternal.WorkspaceData` internal table.
- Resilience: Like other tables, `WorkspaceData` has both an intraday and a historical component. To ensure UI configurations are durable, merges must be run regularly on this table. This process consolidates real-time changes into durable historical storage. Failure to merge this table can result in the loss of recent UI changes if intraday data is lost.
Table data storage
Deephaven manages several types of table data, each with a distinct storage strategy.
Intraday data
- Content: This includes real-time data from ticking tables and any intraday user tables created during a session.
- Storage: Intraday data is written to local disk on the server where the data is processed (typically a merge server). It resides in directories such as `/db/Intraday` and `/db/IntradayUser`.
- Resilience: This data is transient by nature. Resilience is achieved through redundant ingestion strategies and, most importantly, by running end-of-day merges that write the data to durable long-term storage.
Historical and user table data
- Content: This includes the output of merge operations (historical data) and any tables explicitly saved by users.
- Storage: This data is stored in shared directories, such as `/db/Systems` and `/db/User`. These directories must reside on a shared file system (such as NFS) that is accessible from all query servers.
- Resilience: The fault tolerance of this data depends entirely on the resilience of the underlying shared storage. It is critical that the storage solution (e.g., an NFS server or a cloud storage bucket) provides its own data protection, such as RAID, snapshots, or replication.
System files and libraries
Certain files are required on the local filesystem of each Deephaven server.
- Content: This includes custom Java libraries (`.jar` files), Python packages, and local configuration files (e.g., under `/etc/sysconfig/deephaven`).
- Storage: These files are stored on the local filesystem of each server.
- Resilience: To ensure consistency, it is best practice to manage these files with configuration management tools such as Ansible, Puppet, or Chef. Alternatively, critical directories (such as `java_lib/`) can be mounted from a shared, read-only location to guarantee that all servers use the same libraries.
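As one illustration of the configuration-management approach, an Ansible task can push an identical library set to every node. This is a sketch, not a supported playbook: the host group, source directory, destination path, and file owner are all placeholders you would adapt to your deployment.

```yaml
# Illustrative Ansible play; group name, paths, and owner are placeholders.
- hosts: deephaven_servers
  become: true
  tasks:
    - name: Distribute custom Java libraries identically to all nodes
      ansible.builtin.copy:
        src: files/java_lib/            # controller-side copy of the jars
        dest: /opt/deephaven/java_lib/  # hypothetical target directory
        owner: root
        mode: "0644"
```

Running the same play against every server guarantees that library drift between nodes is corrected on each deployment.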
Data lifecycle management
To maintain system health and prevent storage issues, internal Deephaven data requires regular lifecycle management through merging, validation, and deletion processes.
Critical tables requiring regular merges
Several internal tables are essential to merge and retain long-term:
- `AuditEventLog` - Contains important events regarding system use, including access and table operations. Should be kept indefinitely on production servers.
- `PersistentQueryConfigurationLogV2` - Includes details of all PQ modifications and is used when reverting queries to previous versions.
- `PersistentQueryStateLog` - Contains information about every PQ run, including exception details for failed queries. Critical for debugging failures.
- `WorkspaceData` - Contains web server workspaces and dashboards. Must be retained for UI functionality.
- `ProcessEventLog` - Contains detailed output from every PQ and Code Studio. Critical for issue analysis but large in size and typically useful for only a few days.
Performance monitoring tables
The following tables provide process and performance details but may not require long-term retention. Consider keeping data in intraday storage and deleting after a set period (e.g., 7 days):
- `ProcessInfo`
- `ProcessMetrics`
- `ProcessTelemetry`
- `QueryOperationPerformanceLogCoreV2`
- `QueryPerformanceLogCoreV2`
- `QueryUserAssignmentLog`
- `ResourceUtilization`
- `ServerStateLog`
- `UpdatePerformanceLogCoreV2`
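The retention arithmetic behind the "delete after a set period" guidance can be sketched in a few lines of Python. The function below is illustrative only: it computes which date-named partitions fall outside the retention window, while the actual deletion mechanism is site-specific and not shown.

```python
from datetime import date, timedelta

def expired_partitions(partitions, today, retention_days=7):
    """Return the YYYY-MM-DD partition names that are older than the
    retention window and therefore eligible for deletion."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if date.fromisoformat(p) < cutoff)

parts = ["2024-05-01", "2024-05-08", "2024-05-10"]
# With a 7-day window ending 2024-05-12, only 2024-05-01 has expired.
print(expired_partitions(parts, today=date(2024, 5, 12)))  # ['2024-05-01']
```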
Merge and validation procedures
For critical tables:
- Run nightly merges after all data for the previous day has been written.
- Validate merged data before deleting intraday data.
- Delete the previous day's intraday data after successful validation.
These processes should be automated using PQs. See the merging data and validating data guides for implementation details.
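The ordering of these steps matters: intraday data must never be removed until the merged copy has been validated. A minimal Python sketch of that gating logic, using stand-in callables where a real deployment would trigger its merge, validation, and deletion PQs:

```python
def run_nightly_lifecycle(table, partition, merge, validate, delete_intraday):
    """Merge a partition, then delete its intraday copy only if
    validation of the merged data succeeds."""
    merge(table, partition)
    if not validate(table, partition):
        # Keep the intraday data so nothing is lost; alert operators.
        raise RuntimeError(f"validation failed for {table}/{partition}")
    delete_intraday(table, partition)
    return "merged-validated-deleted"

# Stand-in callables for illustration; a real cluster would run PQs here.
log = []
status = run_nightly_lifecycle(
    "DbInternal.ProcessEventLog", "2024-05-11",
    merge=lambda t, p: log.append(("merge", p)),
    validate=lambda t, p: True,
    delete_intraday=lambda t, p: log.append(("delete", p)),
)
```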
Workspace data snapshots
Deephaven provides a tool to create snapshots of the `WorkspaceData` table in the `WorkspaceDataSnapshot` table. This optimization allows the web server to avoid scanning the entire `WorkspaceData` table when discovering user web data.
As of the Grizzly release (1.20240517), a PQ called `WorkspaceSnapshot` automatically runs this tool nightly. Monitor this query for successful completion alongside merge and validation queries.
Summary of best practices
- Use a 3+ node `etcd` cluster to protect core cluster configuration.
- Use resilient shared storage (e.g., NFS, S3) for historical and user table data, and ensure the storage itself is fault-tolerant.
- Merge data regularly, especially the `DbInternal.WorkspaceData` table, to protect UI configurations and intraday data.
- Automate system file consistency across nodes using configuration management tools or shared mounts.
- Monitor Workspace snapshot creation for optimal web server performance.