Load Balancing
Deephaven Enterprise uses different strategies to distribute work across components and handle redundancy. This guide explains how each component handles load distribution and provides configuration references for tuning behavior.
Load balancing vs high availability: This page focuses on how requests and work are distributed during normal operation. For failover behavior and recovery procedures, see Services and dependencies.
Quick reference
| Component | Type | Mechanism | Configurable |
|---|---|---|---|
| Query Replicas | Round-robin | AssignmentPolicy | Yes |
| Dispatchers | Heap utilization | SimpleServerSelectionProvider | Yes |
| Auth Server | gRPC client-side | round_robin (default) | Yes |
| Config Server | gRPC client-side | pick_first (default) | Yes |
| Controller | Leader election (high availability (HA) only) | etcd election API | Yes |
| DIS | Failover groups | Routing config | Yes |
| Envoy | External proxy | xDS dynamic routing | Yes |
| TDCP / LAS | Per-server | N/A | N/A |
| ACL Write Server | None | N/A | N/A |
Query replicas
Persistent Queries can be configured with multiple replicas to distribute user load across identical copies. Users are assigned to replica slots based on an assignment policy.
How it works:
- Default policy: round-robin assignment on first user access.
- Assignments stored in Controller memory (not persisted across restarts).
- Adding/removing replicas triggers automatic rebalancing.
- Custom policies can be implemented via the
AssignmentPolicyinterface.
Configuration:
| Property | Description | Default |
|---|---|---|
PersistentQueryController.AssignmentPolicy.<name>.class | Fully qualified class name | io.deephaven.enterprise.controller.assignment.RoundRobinAssignmentPolicy |
PersistentQueryController.AssignmentPolicy.<name>.displayName | UI display name (required) | Round Robin |
AssignmentPolicy.defaultPolicy | Policy <name> to use when none specified | RoundRobin |
Monitoring: Query the QueryUserAssignmentLog internal table to track user-to-slot assignments over time.
See Persistent Query redundancy and failover for complete configuration and sizing guidance.
Dispatcher selection
When a Persistent Query or console starts, the Controller selects a dispatcher (query server or merge server) using a Server Selection Provider. The default SimpleServerSelectionProvider uses heap utilization to balance load.
How it works:
- Compares heap utilization percentage across servers in a group.
- Tiebreaker: number of running workers.
- For replicated PQs: avoids concentrating all replicas on one server.
- Failure backoff policy prevents repeated assignment to failing servers.
Configuration:
| Property | Description | Default |
|---|---|---|
PersistentQueryController.ServerSelectionProvider | Provider class | com.illumon.iris.controller.SimpleServerSelectionProvider |
SimpleServerSelectionProvider.ActiveGroups | Comma-separated group names | AutoQuery,AutoMerge |
SimpleServerSelectionProvider.Group.<name>.ServerClass | Server class for group | Query |
SimpleServerSelectionProvider.Group.<name>.Servers | Specific servers (optional) | All in class |
SimpleServerSelectionProvider.FailureBackoffPolicy | Backoff behavior | NONE |
Monitoring: Use dhconfig pq selection-provider to view current server availability and usage.
See Automated server selection for complete configuration.
Authentication server
Clients connect to authentication servers using gRPC with client-side load balancing. Auth servers register themselves in etcd for dynamic discovery.
How it works:
- Auth servers register at startup via etcd service discovery.
- Clients use
ServiceResolverto discover available servers. - Default policy:
round_robin— distributes requests across all available servers.
Configuration:
| Property | Description | Default |
|---|---|---|
authentication.server.clientLoadBalancingPolicy | gRPC load balancing policy | round_robin |
authentication.server.rearrangeHosts | Shuffle host order, prioritizing localhost | true |
Monitoring: Check auth server logs for connection counts and authentication request distribution.
Configuration server
Clients connect to configuration servers using gRPC with configurable load balancing. Multiple configuration servers can be specified for redundancy.
How it works:
- Clients receive a list of configuration server addresses.
- Default policy:
pick_first— connects to first available, fails over on disconnect. - Different timeouts: 30s for single server, 6s for multiple (faster failover).
Configuration:
| Property | Description | Default |
|---|---|---|
configuration.server.load.balancer.policy | gRPC load balancing policy | pick_first |
configuration.server.rearrange.hosts | Shuffle host order | true |
configuration.server.connection.timeout.millis.single | Connection timeout (single server) | 30000 |
configuration.server.connection.timeout.millis.multi | Connection timeout (multiple servers) | 6000 |
See Configuration server client properties for complete reference.
Controller
The Controller uses leader election rather than load balancing. Only one Controller instance is active at any time; additional instances wait to become leader.
How it works:
- Controller starts a campaign for leadership via etcd's Election API.
- The
campaign()call blocks until leadership is acquired. - Non-leaders either wait (blocked in campaign) or exit if election fails.
- Only the leader starts its gRPC server and registers in etcd for service discovery.
- On leadership loss: Controller shuts down; orchestration (monit/k8s) restarts it as a new candidate.
Configuration:
| Property | Description | Default |
|---|---|---|
PersistentQueryController.etcdElectionLeaseTtlSeconds | Lease time-to-live | 15 |
PersistentQueryController.etcdElectionLeaseRenewSeconds | Lease renewal interval | 7 |
Monitoring: Check Controller logs for election status. The leader registers in etcd at /controller/election/campaign.
Important
Controller "followers" do not handle any client requests — they are not running services. Processes waiting for leadership are blocked in the election call and have not started their gRPC server. All clients connect to the single active leader discovered via etcd service registration.
Data Import Server (DIS)
DIS uses failover groups for redundancy, not load balancing. All DIS instances in a failover group are active, but queries typically connect to the same one.
How it works:
- Multiple DIS instances can be configured in a failover group.
- All instances are active and available; selection is arbitrary but tends to be consistent.
- Table data service automatically redirects to an available instance on failure.
- Each DIS must be sized to handle the full query load (no load distribution).
- Each DIS needs its own data storage; all DISes within a failover group must handle the same data (identical claims/filters).
Configuration: Failover groups are defined in routing_service.yml. See Failover groups for YAML examples.
Monitoring: Check DIS logs and the ProcessEventLog for connection and failover events.
Per-server components
Some components run one instance per server by design. Load is distributed across these components to the extent that work is distributed across servers.
| Component | Purpose | Notes |
|---|---|---|
| TDCP | Caches table data for local workers | All workers on a host share one TDCP |
| LAS | Aggregates logs from local processes | One per host in bare-metal deployments; see also primary LAS below |
| Tailers | Stream data from binary logs to DIS | Can split by data source, but rarely needed; scales well as a single process |
These components do not have explicit load balancing configuration. Distributing Persistent Queries across multiple servers (via automated server selection) naturally distributes load across these per-server components.
Single-instance services
Some services are designed as single instances and do not support load balancing:
- ACL Write Server: Single writer to ACL database. Other services read ACLs directly from the ACL store (etcd or MySQL); etcd replication provides HA for the store, not load distribution.
- Primary LAS: Writes to centrally-appended tables (including input tables) go through a single LAS instance.
- Status Dashboard: Observational service for metrics collection.
- Web API Service: Serves the web UI and API requests.
Loss of these services does not affect query execution. See Services and dependencies for impact assessment.
Envoy front proxy
Envoy serves as an optional front proxy that provides external load balancing, SSL/TLS termination, and intelligent routing to Deephaven services.
How it works:
- Exposes a single external port for all client traffic.
- Uses Deephaven's xDS (dynamic configuration service) to discover backend services automatically.
- Routes requests to appropriate services: Web UI, workers, APIs.
- Provides connection-level load balancing across service instances.
- Handles SSL/TLS termination, offloading encryption from backend services.
Routing and load balancing:
- Web UI and API requests: Routed to the Web API service.
- Worker connections: Routed to individual workers based on session/query.
- Configuration: Dynamically updated via xDS from the Configuration Server.
Configuration (cluster.cnf flags):
| Flag | Description | Default |
|---|---|---|
DH_CONFIGURE_ENVOY | Enable Envoy integration | false |
DH_ENVOY_FQDN | External hostname for Envoy | — |
DH_ENVOY_PORT | External port for client connections | 8000 |
Configuration (Deephaven property):
| Property | Description | Default |
|---|---|---|
envoy.xds.port | Port for xDS communication | 8124 |
Monitoring: Use the Envoy admin interface (when enabled) to check cluster health at /clusters and routing configuration at /config_dump.
Note
Envoy provides external load balancing at the network edge. Internal load balancing between Deephaven components (dispatchers, auth servers, etc.) is handled by the mechanisms described in the sections above.
See Configuring Envoy for complete setup instructions.
Kubernetes considerations
Some load balancing behaviors differ in Kubernetes deployments:
- TDCP: Single instance shared by all workers.
- Dispatchers: Single query dispatcher and single merge dispatcher; no selection among multiple instances.
See Kubernetes configuration settings for deployment-specific guidance.