Scaling within a server
This page covers high-level considerations when scaling up a Deephaven system within a server.
Backend services
The following is a list of services that may or may not be running on any given Deephaven node. This will be based off of the ROLE flags set during installation. The one exception is the WebClientData service, which is a Persistent Query.
| Service Name | Scaling Concern |
|---|---|
authentication_server | Does not often need to be scaled. |
db_acl_write_server | Does not often need to be scaled. |
db_tdcp | Often needs to scale up. |
configuration_server | Does not often need to be scaled. |
iris_controller | Does not often need to be scaled. |
log_aggregator_service | Sometimes needs to scale up. |
tailer1 | Sometimes needs to scale up. |
db_dis | Often needs to scale up. |
db_merge_server | Does not often need to be scaled. |
db_query_server | Does not often need to be scaled. |
web_api_service | Does not often need to be scaled. |
WebClientData* | Often needs to scale up. |
It is recommended to monitor the heap usage of these services. Each service will periodically log their heap usage with a line like so:
Increasing heap or adjusting the parameters of a service is done through the hostconfig.
Live Data and the DIS
How it works
The most common and recommended pattern for live data in Deephaven involves using a Data Import Server (DIS) that processes data streams from Deephaven loggers and tailers. The DIS will persist data to local disk, and provide updates to subscribing workers.
When a user queries a live table using db.liveTable():
- The worker looks up the source of the data in the data routing configuration.
- By default, the query worker will look for the table data in the server's Table Data Cache Proxy (TDCP).
- The TDCP caches table data in its process memory. If the requested data is available in memory, the TDCP immediately serves it to the query worker using the table data protocol (TDP) — a custom binary format delivered over TCP.
- If the TDCP does not have the table data in memory, it will forward the request to the appropriate DIS (again by the routing configuration).
- This DIS may live on another node in the cluster. The DIS will either have the data in its own memory cache, or it will read the table data from disk (note that, unless your query requires it, it will not read the full table, but a specified viewport) and deliver it to the TDCP over TDP.
- The TDCP then forwards this to the worker over TDP.
- This is true whether your query is running on the DIS node or on another node in the cluster.
More details on the live data process can be found here: Communications - Data Import Server
Tailer
The tailer will tail data from the binary logs to the DIS.
Loaded by — The number of connections the tailer has open and the number of files it is tracking at any given point. Therefore, it does not necessarily need to scale with data volume, but instead with the number of binlogs that are being tailed to the number of DIS(s) at any one time.
To scale — Increase the heap or increase the direct memory. You can also add additional tailers to separate workloads.
DIS
The DIS is responsible for ingesting and serving live data. By default, a Deephaven cluster will have one main DIS, but you can add others as needed.
Loaded by — The volume of all data being ingested through the DIS. The DIS is further loaded by the amount of unique data reads per node. The TDCP handles repeated reads of the same data within a node.
To scale — Increase the max heap, but also increase its buffer pool. Both must be done through the hostconfig. You can also use data filters to partition data across multiple DIS instances. A common use case is that many teams like to separate out the System data from the User data, which you can do by configuring a new DIS.
TDCP
The TDCP is a caching layer for the DIS.
Loaded by — The amount of data being read on that node. Each node has its own TDCP.
To scale — Scaling the TDCP means increasing the max heap.
The UI and WebClientData
The WebClientData PQ is responsible for serving workspace data, including dashboards and notebooks, from internal Deephaven tables to users.
Loaded by — The amount of loading and saving of workspaces/dashboards being done by users on the web UI.
To scale — Increase the heap for the PQ via the UI. It is also recommended to run the snapshotting PQ at least once a week. In Grizzly, this PQ is enabled by default. In some cases, we have also seen benefits to a nightly restart.
User Data and the LAS
Users have the ability to create user tables either directly or centrally.
Importing user table data "directly" means the worker will save the data directly on disk under /db/Users/. If this directory is on NFS, this data will be available to any worker on any node. This does not load any of the live services.
Importing user table data "live" will use the LAS on the Infrastructure node to write out the binary log files. When those are written out, the data gets tailed into the DIS and is distributed like a live system table.
Loaded by — The LAS on the infra server will be loaded by the number of connections made from the workers to the LAS. This can happen if someone does a number of appendLiveTable calls rapidly or calls appendLiveTableIncremental many times without closing the connection. There are protections in place to prevent the LAS from being overloaded like this.
The LAS on each node also logs the DbInternal tables and is further loaded by the number of active worker connections.
To scale — Increase the heap.