Scaling within a server

This page covers high-level considerations when scaling up a Deephaven system within a server.

Backend services

The following services may or may not be running on any given Deephaven node, depending on the ROLE flags set during installation. The one exception is the WebClientData service (marked with an asterisk), which runs as a Persistent Query (PQ).

Service Name              Scaling Concern
authentication_server     Does not often need to be scaled.
db_acl_write_server       Does not often need to be scaled.
db_tdcp                   Often needs to scale up.
configuration_server      Does not often need to be scaled.
iris_controller           Does not often need to be scaled.
log_aggregator_service    Sometimes needs to scale up.
tailer1                   Sometimes needs to scale up.
db_dis                    Often needs to scale up.
db_merge_server           Does not often need to be scaled.
db_query_server           Does not often need to be scaled.
web_api_service           Does not often need to be scaled.
WebClientData*            Often needs to scale up.

We recommend monitoring the heap usage of these services. Each service periodically logs its heap usage.

Increasing heap or adjusting the parameters of a service is done through the hostconfig.

Live Data and the DIS

How it works

The most common and recommended pattern for live data in Deephaven uses a Data Import Server (DIS) that processes data streams from Deephaven loggers and tailers. The DIS persists data to local disk and provides updates to subscribing workers.

When a user queries a live table using db.liveTable():

  1. The worker looks up the source of the data in the data routing configuration.
    • By default, the query worker will look for the table data in the server's Table Data Cache Proxy (TDCP).
    • The TDCP caches table data in its process memory. If the requested data is available in memory, the TDCP immediately serves it to the query worker using the table data protocol (TDP), a custom binary format delivered over TCP.
  2. If the TDCP does not have the table data in memory, it forwards the request to the appropriate DIS, which is also determined by the routing configuration.
    • This DIS may live on another node in the cluster. The DIS either has the data in its own memory cache or reads it from disk, then delivers it to the TDCP over TDP. Unless your query requires the full table, the DIS reads only the requested viewport.
  3. The TDCP then forwards the data to the worker over TDP.
    • This is true whether your query is running on the DIS node or on another node in the cluster.
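
The lookup path above behaves like a read-through cache. The following is a minimal toy model of that flow, not Deephaven code: the class names, the dictionary-backed store, and the ("Market", "Trades") key are all hypothetical, and the sketch assumes the TDCP never evicts anything, which a real memory-bounded cache would.

```python
from dataclasses import dataclass, field


@dataclass
class DataImportServer:
    """Toy DIS (hypothetical): holds the data and counts how often it is read."""
    store: dict
    reads: int = 0

    def read(self, key):
        self.reads += 1
        return self.store[key]


@dataclass
class TableDataCacheProxy:
    """Toy per-node TDCP (hypothetical): serves from memory, else asks the DIS."""
    dis: DataImportServer
    cache: dict = field(default_factory=dict)

    def read(self, key):
        if key not in self.cache:
            # Step 2: cache miss, so forward the request to the DIS.
            self.cache[key] = self.dis.read(key)
        # Steps 1 and 3: serve the worker from TDCP memory.
        return self.cache[key]


dis = DataImportServer(store={("Market", "Trades"): ["row0", "row1"]})
tdcp_node_a = TableDataCacheProxy(dis)
tdcp_node_b = TableDataCacheProxy(dis)

first = tdcp_node_a.read(("Market", "Trades"))   # miss: goes to the DIS
second = tdcp_node_a.read(("Market", "Trades"))  # hit: served from TDCP memory
third = tdcp_node_b.read(("Market", "Trades"))   # other node: a new DIS read

print(dis.reads)  # 2
```

Three worker reads produce only two DIS reads, because the repeated read on the first node is served by that node's TDCP.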

More details on the live data process can be found here: Communications - Data Import Server

Tailer

The tailer streams data from binary log files (binlogs) to the DIS.

Loaded by: The number of connections the tailer has open and the number of files it is tracking at any given point. The tailer therefore does not necessarily need to scale with data volume, but with the number of binlogs being tailed, and the number of DIS instances they are tailed to, at any one time.

To scale: Increase the tailer's heap or direct memory. You can also add tailers to separate workloads.

DIS

The DIS is responsible for ingesting and serving live data. By default, a Deephaven cluster will have one main DIS, but you can add others as needed.

Loaded by: The volume of all data being ingested through the DIS, plus the amount of unique data read on each node. The TDCP absorbs repeated reads of the same data within a node.

To scale: Increase both the max heap and the buffer pool; both are set through the hostconfig. You can also use data filters to partition data across multiple DIS instances. A common example is separating System data from User data by configuring a new DIS.
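
As a rough illustration of partitioning by data filter, the sketch below assigns tables to a DIS based on whether they are System or User data. It is a conceptual model only: the DIS names, the lookup table, and the Research.Signals table are hypothetical, and a real deployment expresses this split in the data routing configuration rather than in Python.

```python
# Hypothetical DIS names, keyed by the kind of data each one ingests.
DIS_BY_NAMESPACE_SET = {
    "System": "dis_system",
    "User": "dis_user",
}


def choose_dis(namespace_set: str) -> str:
    """Return the (hypothetical) DIS that ingests tables of the given kind."""
    try:
        return DIS_BY_NAMESPACE_SET[namespace_set]
    except KeyError:
        raise ValueError(
            f"no DIS configured for namespace set {namespace_set!r}"
        ) from None


# (kind of data, namespace, table name); the User table is made up.
tables = [
    ("System", "DbInternal", "ProcessEventLog"),
    ("User", "Research", "Signals"),
]

assignment = {f"{ns}.{name}": choose_dis(kind) for kind, ns, name in tables}
print(assignment)
```

Failing loudly on an unknown kind, instead of falling back to a default DIS, makes a gap in the partitioning visible before data is misrouted.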

TDCP

The TDCP is a caching layer for the DIS.

Loaded by: The amount of data being read on its node. Each node has its own TDCP.

To scale: Increase the max heap.

The UI and WebClientData

The WebClientData PQ is responsible for serving workspace data, including dashboards and notebooks, from internal Deephaven tables to users.

Loaded by: How often users load and save workspaces and dashboards in the web UI.

To scale: Increase the heap for the PQ via the UI. We also recommend running the snapshotting PQ at least once a week; in Grizzly, this PQ is enabled by default. In some cases, we have also seen benefits from a nightly restart.

User Data and the LAS

Users can create user tables either directly or centrally (live).

Importing user table data "directly" means the worker saves the data straight to disk under /db/Users/. If this directory is on NFS, the data is available to any worker on any node. This path does not load any of the live services.

Importing user table data "live" uses the Log Aggregator Service (LAS) on the infrastructure node to write binary log files. Once those are written, the data is tailed into the DIS and distributed like a live system table.

Loaded by: The LAS on the infrastructure server is loaded by the number of connections workers make to it. Load can spike if someone makes many appendLiveTable calls in rapid succession, or calls appendLiveTableIncremental many times without closing the connection. Protections are in place to prevent the LAS from being overloaded this way.

The LAS on each node also logs the DbInternal tables and is further loaded by the number of active worker connections.

To scale: Increase the LAS heap.