Load Balancing

Deephaven Enterprise uses different strategies to distribute work across components and handle redundancy. This guide explains how each component handles load distribution and provides configuration references for tuning behavior.

Load balancing vs high availability: This page focuses on how requests and work are distributed during normal operation. For failover behavior and recovery procedures, see Services and dependencies.

Quick reference

Component	Type	Mechanism	Configurable
Query Replicas	Round-robin	`AssignmentPolicy`	Yes
Dispatchers	Heap utilization	`SimpleServerSelectionProvider`	Yes
Auth Server	gRPC client-side	`round_robin` (default)	Yes
Config Server	gRPC client-side	`pick_first` (default)	Yes
Controller	Leader election (high availability (HA) only)	etcd election API	Yes
DIS	Failover groups	Routing config	Yes
Envoy	External proxy	xDS dynamic routing	Yes
TDCP / LAS	Per-server	N/A	N/A
ACL Write Server	None	N/A	N/A

Query replicas

Persistent Queries can be configured with multiple replicas to distribute user load across identical copies. Users are assigned to replica slots based on an assignment policy.

How it works:

Default policy: round-robin assignment on first user access.
Assignments stored in Controller memory (not persisted across restarts).
Adding/removing replicas triggers automatic rebalancing.
Custom policies can be implemented via the AssignmentPolicy interface.

Configuration:

Property	Description	Default
`PersistentQueryController.AssignmentPolicy.<name>.class`	Fully qualified class name	`io.deephaven.enterprise.controller.assignment.RoundRobinAssignmentPolicy`
`PersistentQueryController.AssignmentPolicy.<name>.displayName`	UI display name (required)	`Round Robin`
`AssignmentPolicy.defaultPolicy`	Policy `<name>` to use when none specified	`RoundRobin`

Monitoring: Query the QueryUserAssignmentLog internal table to track user-to-slot assignments over time.

See Persistent Query redundancy and failover for complete configuration and sizing guidance.

Dispatcher selection

When a Persistent Query or console starts, the Controller selects a dispatcher (query server or merge server) using a Server Selection Provider. The default SimpleServerSelectionProvider uses heap utilization to balance load.

How it works:

Compares heap utilization percentage across servers in a group.
Tiebreaker: number of running workers.
For replicated PQs: avoids concentrating all replicas on one server.
Failure backoff policy prevents repeated assignment to failing servers.

Configuration:

Property	Description	Default
`PersistentQueryController.ServerSelectionProvider`	Provider class	`com.illumon.iris.controller.SimpleServerSelectionProvider`
`SimpleServerSelectionProvider.ActiveGroups`	Comma-separated group names	`AutoQuery,AutoMerge`
`SimpleServerSelectionProvider.Group.<name>.ServerClass`	Server class for group	`Query`
`SimpleServerSelectionProvider.Group.<name>.Servers`	Specific servers (optional)	All in class
`SimpleServerSelectionProvider.FailureBackoffPolicy`	Backoff behavior	`NONE`

Monitoring: Use dhconfig pq selection-provider to view current server availability and usage.

See Automated server selection for complete configuration.

Authentication server

Clients connect to authentication servers using gRPC with client-side load balancing. Auth servers register themselves in etcd for dynamic discovery.

How it works:

Auth servers register at startup via etcd service discovery.
Clients use ServiceResolver to discover available servers.
Default policy: round_robin — distributes requests across all available servers.

Configuration:

Property	Description	Default
`authentication.server.clientLoadBalancingPolicy`	gRPC load balancing policy	`round_robin`
`authentication.server.rearrangeHosts`	Shuffle host order, prioritizing localhost	`true`

Monitoring: Check auth server logs for connection counts and authentication request distribution.

Configuration server

Clients connect to configuration servers using gRPC with configurable load balancing. Multiple configuration servers can be specified for redundancy.

How it works:

Clients receive a list of configuration server addresses.
Default policy: pick_first — connects to first available, fails over on disconnect.
Different timeouts: 30s for single server, 6s for multiple (faster failover).

Configuration:

Property	Description	Default
`configuration.server.load.balancer.policy`	gRPC load balancing policy	`pick_first`
`configuration.server.rearrange.hosts`	Shuffle host order	`true`
`configuration.server.connection.timeout.millis.single`	Connection timeout (single server)	`30000`
`configuration.server.connection.timeout.millis.multi`	Connection timeout (multiple servers)	`6000`

See Configuration server client properties for complete reference.

Controller

The Controller uses leader election rather than load balancing. Only one Controller instance is active at any time; additional instances wait to become leader.

How it works:

Controller starts a campaign for leadership via etcd's Election API.
The campaign() call blocks until leadership is acquired.
Non-leaders either wait (blocked in campaign) or exit if election fails.
Only the leader starts its gRPC server and registers in etcd for service discovery.
On leadership loss: Controller shuts down; orchestration (monit/k8s) restarts it as a new candidate.

Configuration:

Property	Description	Default
`PersistentQueryController.etcdElectionLeaseTtlSeconds`	Lease time-to-live	`15`
`PersistentQueryController.etcdElectionLeaseRenewSeconds`	Lease renewal interval	`7`

Monitoring: Check Controller logs for election status. The leader registers in etcd at /controller/election/campaign.

Important

Controller "followers" do not handle any client requests — they are not running services. Processes waiting for leadership are blocked in the election call and have not started their gRPC server. All clients connect to the single active leader discovered via etcd service registration.

Data Import Server (DIS)

DIS uses failover groups for redundancy, not load balancing. All DIS instances in a failover group are active, but queries typically connect to the same one.

How it works:

Multiple DIS instances can be configured in a failover group.
All instances are active and available; selection is arbitrary but tends to be consistent.
Table data service automatically redirects to an available instance on failure.
Each DIS must be sized to handle the full query load (no load distribution).
Each DIS needs its own data storage; all DISes within a failover group must handle the same data (identical claims/filters).

Configuration: Failover groups are defined in routing_service.yml. See Failover groups for YAML examples.

Monitoring: Check DIS logs and the ProcessEventLog for connection and failover events.

Per-server components

Some components run one instance per server by design. Load is distributed across these components to the extent that work is distributed across servers.

Component	Purpose	Notes
TDCP	Caches table data for local workers	All workers on a host share one TDCP
LAS	Aggregates logs from local processes	One per host in bare-metal deployments; see also primary LAS below
Tailers	Stream data from binary logs to DIS	Can split by data source, but rarely needed; scales well as a single process

These components do not have explicit load balancing configuration. Distributing Persistent Queries across multiple servers (via automated server selection) naturally distributes load across these per-server components.

Single-instance services

Some services are designed as single instances and do not support load balancing:

ACL Write Server: Single writer to ACL database. Other services read ACLs directly from the ACL store (etcd or MySQL); etcd replication provides HA for the store, not load distribution.
Primary LAS: Writes to centrally-appended tables (including input tables) go through a single LAS instance.
Status Dashboard: Observational service for metrics collection.
Web API Service: Serves the web UI and API requests.

Loss of these services does not affect query execution. See Services and dependencies for impact assessment.

Envoy front proxy

Envoy serves as an optional front proxy that provides external load balancing, SSL/TLS termination, and intelligent routing to Deephaven services.

How it works:

Exposes a single external port for all client traffic.
Uses Deephaven's xDS (dynamic configuration service) to discover backend services automatically.
Routes requests to appropriate services: Web UI, workers, APIs.
Provides connection-level load balancing across service instances.
Handles SSL/TLS termination, offloading encryption from backend services.

Routing and load balancing:

Web UI and API requests: Routed to the Web API service.
Worker connections: Routed to individual workers based on session/query.
Configuration: Dynamically updated via xDS from the Configuration Server.

Configuration (cluster.cnf flags):

Flag	Description	Default
`DH_CONFIGURE_ENVOY`	Enable Envoy integration	`false`
`DH_ENVOY_FQDN`	External hostname for Envoy	—
`DH_ENVOY_PORT`	External port for client connections	`8000`

Configuration (Deephaven property):

Property	Description	Default
`envoy.xds.port`	Port for xDS communication	`8124`

Monitoring: Use the Envoy admin interface (when enabled) to check cluster health at /clusters and routing configuration at /config_dump.

Note

Envoy provides external load balancing at the network edge. Internal load balancing between Deephaven components (dispatchers, auth servers, etc.) is handled by the mechanisms described in the sections above.

See Configuring Envoy for complete setup instructions.

Kubernetes considerations

Some load balancing behaviors differ in Kubernetes deployments:

TDCP: Single instance shared by all workers.
Dispatchers: Single query dispatcher and single merge dispatcher; no selection among multiple instances.

See Kubernetes configuration settings for deployment-specific guidance.

Load Balancing

Quick reference

Query replicas

Dispatcher selection

Authentication server

Configuration server

Controller

Data Import Server (DIS)

Per-server components

Single-instance services

Envoy front proxy

Kubernetes considerations

Related documentation