Services and dependencies
A resilient Deephaven cluster is built upon a collection of interdependent services, each vital to the real-time data platform's operation. Understanding the role of each service within the cluster, its dependencies, and its specific contributions to high availability is crucial for comprehensive resilience planning. This document outlines these key services, their functions, and their HA mechanisms.
Core infrastructure services
These services form the backbone of the Deephaven cluster, managing state, configuration, and worker processes.
etcd
- Role: `etcd` is a distributed key-value store that serves as the central source of truth for the entire cluster. It stores configuration, persistent query definitions, and the current state of workers.
- High Availability: `etcd` is inherently fault-tolerant. A cluster with an odd number of nodes (three or more) can withstand the loss of a minority of its members while maintaining availability (see the quorum sketch below). In a typical production Deephaven installation with three nodes, `etcd` runs on each node, allowing the cluster to survive the loss of any single node without impacting `etcd`'s operation. Three nodes is the minimum recommended configuration for production high availability.
- Configuration: See etcd configuration for detailed setup and management.
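The fault tolerance above follows directly from etcd's quorum requirement: a cluster of n members needs a majority (⌊n/2⌋ + 1) to elect a leader and commit writes, so it can lose n minus that majority. The snippet below is a minimal illustration of this arithmetic; it is not tied to any Deephaven tooling.

```python
def etcd_fault_tolerance(members: int) -> int:
    """Number of members an etcd cluster can lose while keeping quorum."""
    quorum = members // 2 + 1          # majority needed to elect a leader and commit writes
    return members - quorum

for n in (1, 3, 5):
    print(f"{n} members -> quorum {n // 2 + 1}, tolerates {etcd_fault_tolerance(n)} failure(s)")
# 1 members -> quorum 1, tolerates 0 failure(s)
# 3 members -> quorum 2, tolerates 1 failure(s)
# 5 members -> quorum 3, tolerates 2 failure(s)
```

This is also why even member counts add little: four members still tolerate only one failure, while adding a fifth raises tolerance to two.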
Controller
- Role: The Controller is responsible for managing the lifecycle of persistent queries (PQs) and the workers that execute them.
- High Availability: Only one Controller process acts as the leader at any given time. Additional Controller instances can be configured as hot spares. If the active leader fails to renew its `etcd` lease (due to a crash or network issue), the remaining spares participate in a leader election, and a new leader is chosen to take over (see the sketch below).
- Configuration: See Persistent Query Controller for detailed configuration and management.
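The failover described above is an instance of lease-based leader election on top of etcd. The sketch below illustrates the general pattern using the third-party python-etcd3 client; the key name, TTL, and instance ID are illustrative assumptions, and this is not the Controller's actual implementation.

```python
# Conceptual sketch of lease-based leader election with the python-etcd3 client.
# This is NOT the Controller's actual code; the key name and TTL are illustrative.
import time
import etcd3

ELECTION_KEY = "/example/controller/leader"   # hypothetical key
LEASE_TTL_SECONDS = 10

client = etcd3.client(host="localhost", port=2379)

def campaign(instance_id: str) -> None:
    """Try to become leader; if elected, hold leadership by refreshing the lease."""
    while True:
        lease = client.lease(LEASE_TTL_SECONDS)
        # Atomically claim the key only if no leader currently holds it.
        elected, _ = client.transaction(
            compare=[client.transactions.version(ELECTION_KEY) == 0],
            success=[client.transactions.put(ELECTION_KEY, instance_id, lease=lease)],
            failure=[],
        )
        if elected:
            print(f"{instance_id} is now the leader")
            # Leadership lasts only as long as the lease keeps being refreshed.
            # If this process crashes or is partitioned, the lease expires,
            # the key disappears, and a hot spare wins the next campaign.
            while True:
                lease.refresh()
                time.sleep(LEASE_TTL_SECONDS / 3)
        else:
            lease.revoke()
            time.sleep(LEASE_TTL_SECONDS)  # wait and try again as a hot spare

if __name__ == "__main__":
    campaign("controller-spare-1")
```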
Authentication and configuration services
- Role: The `authentication_server` handles user authentication, while the `configuration_server` provides configuration settings to other services.
- High Availability: Multiple authentication server instances can be run for fault tolerance, with clients configured to connect to any available instance (see the sketch below). The configuration server typically runs as a single instance and is not designed for multi-instance deployment.
- Configuration: See Authentication service and Configuration server for detailed setup.
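The client-side view of this redundancy is simple: try each configured authentication endpoint until one accepts a connection. The host names, port, and helper function below are hypothetical; real Deephaven clients are configured with their list of servers rather than code like this.

```python
# Conceptual sketch of client-side failover across redundant service endpoints.
# Hosts and port are hypothetical; this is not the Deephaven client API.
import socket

AUTH_ENDPOINTS = [
    ("infra-node-1.example.com", 9999),   # hypothetical hosts and port
    ("infra-node-2.example.com", 9999),
    ("infra-node-3.example.com", 9999),
]

def connect_to_any(endpoints, timeout_seconds=5.0):
    """Return a socket to the first reachable endpoint, trying each in turn."""
    last_error = None
    for host, port in endpoints:
        try:
            return socket.create_connection((host, port), timeout=timeout_seconds)
        except OSError as err:
            last_error = err            # endpoint down; fall through to the next one
    raise ConnectionError(f"No authentication server reachable: {last_error}")
```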
Data processing and query services
These services are the workhorses of the cluster, responsible for executing user code and managing data.
Query servers
- Role: Query servers execute user queries, including scripts, ticking tables, and other computations.
- High Availability: Query servers are stateless compute resources; they do not own any data. If a query server fails, the queries that were running on it can be restarted on other available query servers. A common HA strategy is the n-1 approach: provisioning enough query servers to handle the full production workload even if one server is lost (see the sizing sketch after this list).
- Best practices: Use automated server selection to ensure Persistent Queries are not configured to run in only one place. Configure PQ scheduling for automatic restart of failed queries. Deploy Persistent Query replicas and spares for redundancy.
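To make the n-1 approach concrete, size the pool so that the workload still fits after losing one server. The snippet below shows that arithmetic under the simplifying assumption that queries can be spread evenly across servers; the workload numbers are hypothetical.

```python
import math

def servers_needed(total_workload_units: float,
                   capacity_per_server: float,
                   tolerated_failures: int = 1) -> int:
    """Servers to provision so the workload still fits after `tolerated_failures` losses."""
    surviving = math.ceil(total_workload_units / capacity_per_server)
    return surviving + tolerated_failures

# Hypothetical numbers: workload of 160 units, each query server handles 64 units.
print(servers_needed(160, 64))   # ceil(160/64) = 3 surviving servers needed -> provision 4
```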
Merge servers
- Role: Merge servers are responsible for data ingestion and running merge routines that combine intraday data into historical tables.
- High Availability: While individual merge servers are not fault-tolerant, redundancy is achieved by running multiple instances that can be assigned different parts of a workflow (see the sketch below).
- Best practices: If multiple merge servers are available, the same Persistent Query failover mechanisms used for query servers can be applied to merge servers.
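As a rough illustration of splitting a merge workflow across redundant servers, the snippet below assigns hypothetical namespaces to merge servers in round-robin fashion. The names are made up, and in practice merge jobs are typically scheduled as Persistent Queries on specific merge servers rather than assigned by code like this.

```python
# Conceptual sketch: spreading merge jobs across redundant merge servers.
from itertools import cycle

MERGE_SERVERS = ["merge-server-1", "merge-server-2"]          # hypothetical servers
NAMESPACES_TO_MERGE = ["MarketData", "Orders", "RiskMetrics", "Telemetry"]

# Round-robin assignment: if one server is lost, its namespaces can be
# reassigned to the remaining servers.
assignments = dict(zip(NAMESPACES_TO_MERGE, cycle(MERGE_SERVERS)))
for namespace, server in assignments.items():
    print(f"merge job for {namespace} -> {server}")
```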
Data ingestion and access services
These services are responsible for bringing data into Deephaven from external sources.
Data Import Server (DIS)
- Role: The DIS handles ticking data streams, writing events to disk and publishing them to clients.
- High Availability: Redundant DIS instances can be configured into a failover group in the data routing setup, allowing for both round-robin load balancing and automatic failover.
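The behavior a failover group provides can be pictured as round-robin selection that skips unhealthy members. The sketch below illustrates that behavior from a consumer's point of view; the host names and port are hypothetical, and this is not the data routing configuration syntax, which is covered in the data routing setup.

```python
# Conceptual sketch of round-robin selection with failover across a group of
# redundant DIS endpoints. Hosts and port are hypothetical.
import itertools
import socket

DIS_FAILOVER_GROUP = [
    ("dis-node-1.example.com", 22222),
    ("dis-node-2.example.com", 22222),
]

_rotation = itertools.cycle(DIS_FAILOVER_GROUP)

def next_available_dis(attempts: int = len(DIS_FAILOVER_GROUP)) -> tuple:
    """Rotate through the group (round-robin) and skip members that are down."""
    for _ in range(attempts):
        host, port = next(_rotation)
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return host, port          # healthy member found
        except OSError:
            continue                       # member down; fail over to the next one
    raise ConnectionError("All DIS instances in the failover group are unavailable")
```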
Tailer
- Role: A Tailer reads from binary log files (often produced by custom logger processes) and streams the data to a DIS.
- High Availability: An individual tailer is not fault-tolerant, but it checkpoints its read position. If a tailer fails, a new one can be started and will resume from the last checkpoint (the sketch below illustrates the checkpoint-and-resume idea). Redundancy can also be achieved by having multiple tailers process the same data stream to different destinations.
- Configuration: See Tailer configuration for detailed setup and management.
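The checkpoint-and-resume behavior can be pictured as persisting the byte offset reached in a log file so a replacement process can continue from it. The file names and checkpoint format below are hypothetical assumptions for illustration and are not the Tailer's actual implementation.

```python
# Conceptual sketch of checkpoint-and-resume for a log-reading process.
import json
import os

LOG_PATH = "example.bin"                  # hypothetical binary log
CHECKPOINT_PATH = "example.checkpoint"    # hypothetical checkpoint file

def load_checkpoint() -> int:
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)["offset"]
    return 0

def save_checkpoint(offset: int) -> None:
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump({"offset": offset}, f)

def tail_once(chunk_size: int = 4096) -> None:
    """Read any new bytes past the last checkpoint and record the new position."""
    offset = load_checkpoint()
    with open(LOG_PATH, "rb") as log:
        log.seek(offset)
        while chunk := log.read(chunk_size):
            # A real tailer would forward these bytes to a DIS here.
            offset += len(chunk)
    save_checkpoint(offset)
```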
Local table data service (LTDS)
- Role: In some Deephaven installations, the LTDS provides access to table data on a specific server.
- High Availability: Redundant LTDS instances can be configured for round-robin load balancing and failover in the data routing setup, in the same way as redundant DIS instances.
Failure analysis and troubleshooting
For comprehensive analysis of failure scenarios and step-by-step troubleshooting procedures, see:
- Failure modes - Detailed categorization of hardware, software, configuration, and environmental failure scenarios
- Process startup troubleshooting - Diagnosing service startup issues and log analysis
- etcd recovery procedures - Critical infrastructure recovery steps
- System troubleshooting guides - Component-specific troubleshooting for controllers, certificates, networking, and more
Supporting and optional services
- `web_api_service`: This service is not fault-tolerant. Its loss impacts Web and OpenAPI access.
- `db_acl_writer`: This service is not fault-tolerant. Its loss prevents changes to database permissions and accounts. See ACL Troubleshooting.
- Envoy: An optional third-party proxy. It can be configured as a cluster for high availability and is often run in containers for easy replacement.