Load Balancing

Deephaven Enterprise uses different strategies to distribute work across components and handle redundancy. This guide explains how each component handles load distribution and provides configuration references for tuning behavior.

Load balancing vs high availability: This page focuses on how requests and work are distributed during normal operation. For failover behavior and recovery procedures, see Services and dependencies.

Quick reference

ComponentTypeMechanismConfigurable
Query ReplicasRound-robinAssignmentPolicyYes
DispatchersHeap utilizationSimpleServerSelectionProviderYes
Auth ServergRPC client-sideround_robin (default)Yes
Config ServergRPC client-sidepick_first (default)Yes
ControllerLeader election (high availability (HA) only)etcd election APIYes
DISFailover groupsRouting configYes
EnvoyExternal proxyxDS dynamic routingYes
TDCP / LASPer-serverN/AN/A
ACL Write ServerNoneN/AN/A

Query replicas

Persistent Queries can be configured with multiple replicas to distribute user load across identical copies. Users are assigned to replica slots based on an assignment policy.

How it works:

  • Default policy: round-robin assignment on first user access.
  • Assignments stored in Controller memory (not persisted across restarts).
  • Adding/removing replicas triggers automatic rebalancing.
  • Custom policies can be implemented via the AssignmentPolicy interface.

Configuration:

PropertyDescriptionDefault
PersistentQueryController.AssignmentPolicy.<name>.classFully qualified class nameio.deephaven.enterprise.controller.assignment.RoundRobinAssignmentPolicy
PersistentQueryController.AssignmentPolicy.<name>.displayNameUI display name (required)Round Robin
AssignmentPolicy.defaultPolicyPolicy <name> to use when none specifiedRoundRobin

Monitoring: Query the QueryUserAssignmentLog internal table to track user-to-slot assignments over time.

See Persistent Query redundancy and failover for complete configuration and sizing guidance.

Dispatcher selection

When a Persistent Query or console starts, the Controller selects a dispatcher (query server or merge server) using a Server Selection Provider. The default SimpleServerSelectionProvider uses heap utilization to balance load.

How it works:

  • Compares heap utilization percentage across servers in a group.
  • Tiebreaker: number of running workers.
  • For replicated PQs: avoids concentrating all replicas on one server.
  • Failure backoff policy prevents repeated assignment to failing servers.

Configuration:

PropertyDescriptionDefault
PersistentQueryController.ServerSelectionProviderProvider classcom.illumon.iris.controller.SimpleServerSelectionProvider
SimpleServerSelectionProvider.ActiveGroupsComma-separated group namesAutoQuery,AutoMerge
SimpleServerSelectionProvider.Group.<name>.ServerClassServer class for groupQuery
SimpleServerSelectionProvider.Group.<name>.ServersSpecific servers (optional)All in class
SimpleServerSelectionProvider.FailureBackoffPolicyBackoff behaviorNONE

Monitoring: Use dhconfig pq selection-provider to view current server availability and usage.

See Automated server selection for complete configuration.

Authentication server

Clients connect to authentication servers using gRPC with client-side load balancing. Auth servers register themselves in etcd for dynamic discovery.

How it works:

  • Auth servers register at startup via etcd service discovery.
  • Clients use ServiceResolver to discover available servers.
  • Default policy: round_robin — distributes requests across all available servers.

Configuration:

PropertyDescriptionDefault
authentication.server.clientLoadBalancingPolicygRPC load balancing policyround_robin
authentication.server.rearrangeHostsShuffle host order, prioritizing localhosttrue

Monitoring: Check auth server logs for connection counts and authentication request distribution.

Configuration server

Clients connect to configuration servers using gRPC with configurable load balancing. Multiple configuration servers can be specified for redundancy.

How it works:

  • Clients receive a list of configuration server addresses.
  • Default policy: pick_first — connects to first available, fails over on disconnect.
  • Different timeouts: 30s for single server, 6s for multiple (faster failover).

Configuration:

PropertyDescriptionDefault
configuration.server.load.balancer.policygRPC load balancing policypick_first
configuration.server.rearrange.hostsShuffle host ordertrue
configuration.server.connection.timeout.millis.singleConnection timeout (single server)30000
configuration.server.connection.timeout.millis.multiConnection timeout (multiple servers)6000

See Configuration server client properties for complete reference.

Controller

The Controller uses leader election rather than load balancing. Only one Controller instance is active at any time; additional instances wait to become leader.

How it works:

  • Controller starts a campaign for leadership via etcd's Election API.
  • The campaign() call blocks until leadership is acquired.
  • Non-leaders either wait (blocked in campaign) or exit if election fails.
  • Only the leader starts its gRPC server and registers in etcd for service discovery.
  • On leadership loss: Controller shuts down; orchestration (monit/k8s) restarts it as a new candidate.

Configuration:

PropertyDescriptionDefault
PersistentQueryController.etcdElectionLeaseTtlSecondsLease time-to-live15
PersistentQueryController.etcdElectionLeaseRenewSecondsLease renewal interval7

Monitoring: Check Controller logs for election status. The leader registers in etcd at /controller/election/campaign.

Important

Controller "followers" do not handle any client requests — they are not running services. Processes waiting for leadership are blocked in the election call and have not started their gRPC server. All clients connect to the single active leader discovered via etcd service registration.

Data Import Server (DIS)

DIS uses failover groups for redundancy, not load balancing. All DIS instances in a failover group are active, but queries typically connect to the same one.

How it works:

  • Multiple DIS instances can be configured in a failover group.
  • All instances are active and available; selection is arbitrary but tends to be consistent.
  • Table data service automatically redirects to an available instance on failure.
  • Each DIS must be sized to handle the full query load (no load distribution).
  • Each DIS needs its own data storage; all DISes within a failover group must handle the same data (identical claims/filters).

Configuration: Failover groups are defined in routing_service.yml. See Failover groups for YAML examples.

Monitoring: Check DIS logs and the ProcessEventLog for connection and failover events.

Per-server components

Some components run one instance per server by design. Load is distributed across these components to the extent that work is distributed across servers.

ComponentPurposeNotes
TDCPCaches table data for local workersAll workers on a host share one TDCP
LASAggregates logs from local processesOne per host in bare-metal deployments; see also primary LAS below
TailersStream data from binary logs to DISCan split by data source, but rarely needed; scales well as a single process

These components do not have explicit load balancing configuration. Distributing Persistent Queries across multiple servers (via automated server selection) naturally distributes load across these per-server components.

Single-instance services

Some services are designed as single instances and do not support load balancing:

  • ACL Write Server: Single writer to ACL database. Other services read ACLs directly from the ACL store (etcd or MySQL); etcd replication provides HA for the store, not load distribution.
  • Primary LAS: Writes to centrally-appended tables (including input tables) go through a single LAS instance.
  • Status Dashboard: Observational service for metrics collection.
  • Web API Service: Serves the web UI and API requests.

Loss of these services does not affect query execution. See Services and dependencies for impact assessment.

Envoy front proxy

Envoy serves as an optional front proxy that provides external load balancing, SSL/TLS termination, and intelligent routing to Deephaven services.

How it works:

  • Exposes a single external port for all client traffic.
  • Uses Deephaven's xDS (dynamic configuration service) to discover backend services automatically.
  • Routes requests to appropriate services: Web UI, workers, APIs.
  • Provides connection-level load balancing across service instances.
  • Handles SSL/TLS termination, offloading encryption from backend services.

Routing and load balancing:

  • Web UI and API requests: Routed to the Web API service.
  • Worker connections: Routed to individual workers based on session/query.
  • Configuration: Dynamically updated via xDS from the Configuration Server.

Configuration (cluster.cnf flags):

FlagDescriptionDefault
DH_CONFIGURE_ENVOYEnable Envoy integrationfalse
DH_ENVOY_FQDNExternal hostname for Envoy
DH_ENVOY_PORTExternal port for client connections8000

Configuration (Deephaven property):

PropertyDescriptionDefault
envoy.xds.portPort for xDS communication8124

Monitoring: Use the Envoy admin interface (when enabled) to check cluster health at /clusters and routing configuration at /config_dump.

Note

Envoy provides external load balancing at the network edge. Internal load balancing between Deephaven components (dispatchers, auth servers, etc.) is handled by the mechanisms described in the sections above.

See Configuring Envoy for complete setup instructions.

Kubernetes considerations

Some load balancing behaviors differ in Kubernetes deployments:

  • TDCP: Single instance shared by all workers.
  • Dispatchers: Single query dispatcher and single merge dispatcher; no selection among multiple instances.

See Kubernetes configuration settings for deployment-specific guidance.