Operating Deephaven 24x6

All Deephaven services are designed to run as long-lived processes. However, most queries must be restarted daily so that they initialize with newly available historical data. Some services must also be restarted when database tables are added or modified, or when other configuration changes are made.

In particular, specific Deephaven processes must be run or restarted after the following events:

  • The import and merge processes for historical tables must be started once the source data is ready. (These often occur at different times, depending on the data source.)
  • Most persistent queries must be restarted when new historical data is available (typically daily).
  • The Data Import Server must be restarted when table definitions or listeners of real-time tables are changed.
  • The log tailers must be restarted when the log tailer XML configuration file changes (for example, to add additional tables to the tailer).
  • If a property file changes, all processes that require the changed property must be restarted.
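The events above can be summarized as a lookup from type of change to the processes that must be run or restarted. The following is a minimal Python sketch. The event keys, process labels and function name are hypothetical and were chosen only for illustration. They are not Deephaven identifiers. Which processes read a given property is site-specific, so the caller supplies that mapping.

```python
# Hypothetical event keys mapped to the processes that must be run or
# restarted, following the list of events above.
RESTARTS_BY_EVENT = {
    "historical_source_data_ready": ["import", "merge"],
    "new_historical_data": ["persistent queries"],
    "realtime_table_definition_changed": ["Data Import Server"],
    "realtime_listener_changed": ["Data Import Server"],
    "tailer_xml_changed": ["log tailers"],
}


def processes_to_restart(events, changed_properties=(), property_users=None):
    """Return the sorted list of processes affected by the given events.

    property_users maps a property name to the processes that read it;
    this mapping is site-specific and must be supplied by the caller.
    """
    property_users = property_users or {}
    affected = set()
    for event in events:
        if event not in RESTARTS_BY_EVENT:
            raise ValueError(f"unknown event: {event}")
        affected.update(RESTARTS_BY_EVENT[event])
    # A property file change affects every process that reads the property.
    for prop in changed_properties:
        affected.update(property_users.get(prop, []))
    return sorted(affected)
```

For example, a tailer XML change combined with a listener change yields ["Data Import Server", "log tailers"].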

When a Deephaven installation serves only a single region, the system typically reaches a quiescent state overnight. Any pending configuration or schema changes are deployed at that point, and then historical data ingestion begins. An installation that supports continuous data ingestion and analysis across multiple regions may never have such an overnight lull. For those installations, configuration changes and maintenance must be scheduled deliberately, as described in the following sections.

Maintenance Window

Although Deephaven does not require daily restarts, it is best practice to schedule a maintenance window at least weekly. During this window, Deephaven can be brought to a quiescent state and restarted, maintenance can be performed, and software can be upgraded if necessary.
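Automated maintenance jobs can be made to refuse to run outside the agreed window. The following is a minimal Python sketch. It assumes a hypothetical window of Saturday 02:00 to 08:00 UTC. The constants and the function name are illustrative only and are not Deephaven settings.

```python
from datetime import datetime, time, timezone

# Hypothetical weekly window: Saturday 02:00-08:00 UTC. Pick a window in
# which every region served by the installation is inactive.
WINDOW_WEEKDAY = 5  # Monday=0 ... Saturday=5
WINDOW_START = time(2, 0)
WINDOW_END = time(8, 0)


def in_maintenance_window(moment: datetime) -> bool:
    """Return True if a timezone-aware datetime falls inside the window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    return utc.weekday() == WINDOW_WEEKDAY and WINDOW_START <= utc.time() < WINDOW_END
```

Expressing the window in UTC keeps the check unambiguous when operators in several regions schedule jobs.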

Schema deployment and new logger/listener generation should be performed during this maintenance window. While the Deephaven processes are stopped, the process log files described in the Deephaven Operations Guide may be cleaned up.

Persistent Queries

Persistent queries whose scripts refer to specific dates typically must be restarted daily. In practice, this applies to nearly all queries, since data tables are almost always partitioned by date. The appropriate time to restart a query depends on the region it serves and the schedules of the data imports it relies on. It is possible to use different times and time zones for each query.

In most cases, a single query can serve multiple regions, as long as there are overlapping periods of inactivity. Otherwise, separate queries should be used for each region.
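Whether one query can serve several regions comes down to whether their inactive periods overlap once they are converted to a common clock. The following is a minimal Python sketch that converts each region's nightly inactive period to UTC and intersects the results. The function names and the example hours are hypothetical. The sketch uses fixed UTC offsets, so it ignores daylight saving time, and it considers only one night per region.

```python
from datetime import date, datetime, time, timedelta, timezone


def inactive_interval_utc(day, start, end, utc_offset_hours):
    """Return a region's inactive period as a (start, end) pair in UTC.

    The period starts at `start` local time on `day` and ends at `end`
    local time, on the following day if `end` <= `start`.
    Assumption: a fixed UTC offset, so daylight saving time is ignored.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    begin = datetime.combine(day, start, tzinfo=tz)
    finish = datetime.combine(day, end, tzinfo=tz)
    if finish <= begin:
        finish += timedelta(days=1)
    return begin.astimezone(timezone.utc), finish.astimezone(timezone.utc)


def common_inactivity(intervals):
    """Intersect UTC intervals; return None if they do not all overlap."""
    latest_start = max(s for s, _ in intervals)
    earliest_end = min(e for _, e in intervals)
    return (latest_start, earliest_end) if latest_start < earliest_end else None


day = date(2024, 1, 10)
new_york = inactive_interval_utc(day, time(20), time(4), -5)
london = inactive_interval_utc(day, time(22), time(6), 0)
hong_kong = inactive_interval_utc(day, time(18), time(8), 8)
print(common_inactivity([new_york, london]))             # 01:00-06:00 UTC on Jan 11
print(common_inactivity([new_york, london, hong_kong]))  # None
```

With these example hours, New York and London share a quiet period, so one query could serve both. Adding Hong Kong leaves no overlap, which is the case where separate queries per region are needed.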

For example, a query script might contain the following code:

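// Evaluated once, when the script runs: the previous NYSE business day
myDate = calendar("USNYSE").previousBusinessDay()
// Retrieve the table and filter it to that single date
t = db.t("MyNamespace", "MyTable").where("Date=myDate")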

When this script runs, calendar("USNYSE").previousBusinessDay() is evaluated and its value is stored in the myDate variable. The table is then retrieved and filtered to that date. This happens exactly once, so the results do not change even if the query worker keeps running for days. A persistent query that uses this script and is intended to always show data for "yesterday" from the user's perspective must therefore be restarted each day, so that this code is re-executed.
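The once-only evaluation can be illustrated outside Deephaven. The following is a minimal Python sketch in which previous_weekday and QueryWorker are hypothetical stand-ins. Unlike the USNYSE calendar, the sketch treats only weekends as non-business days and does not account for exchange holidays.

```python
from datetime import date, timedelta


def previous_weekday(today: date) -> date:
    """Return the most recent weekday before `today`.

    Simplification: only Saturday and Sunday are skipped; a real
    exchange calendar such as USNYSE also skips holidays.
    """
    d = today - timedelta(days=1)
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d


class QueryWorker:
    """Stand-in for a query worker that runs its script once at startup."""

    def __init__(self, startup_day: date):
        # Like myDate in the script, this is computed once and then fixed.
        self.my_date = previous_weekday(startup_day)


worker = QueryWorker(date(2024, 1, 3))     # started on a Wednesday
# However long the worker runs, it keeps the date computed at startup.
print(worker.my_date)                      # 2024-01-02
restarted = QueryWorker(date(2024, 1, 8))  # "restarted" the following Monday
print(restarted.my_date)                   # 2024-01-05 (the previous Friday)
```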

Process Restarts

Deephaven Data Labs recommends restarting all Deephaven processes during the weekly maintenance window, except the Persistent Query Controller and the Remote Query Dispatchers (query server and merge server). Regular restarts reveal any change to the system environment that would prevent a process from starting properly, rather than leaving it undetected until an unplanned restart. When upgrading the Deephaven system (i.e., installing a new Deephaven RPM), Deephaven Data Labs recommends restarting all Deephaven processes.

Customers may additionally choose to restart the Persistent Query Controller and Remote Query Dispatchers; however, any running queries will be terminated whenever these processes are restarted.

  • The Persistent Query Controller may need restarting to pick up certain configuration changes. If the controller is restarted, all queries it started are stopped, and those within their schedules are restarted once the controller is running again. Several configuration parameters can be changed without a controller restart by using the PersistentQueryControllerTool, including:
    • The list of database servers
    • Default scheduling parameters
    • Temporary queues
  • The Remote Query Dispatchers (also known as the "query server" and "merge server") may need restarting to pick up changed parameter values, such as maximum concurrent queries or maximum total allowed heap. If a dispatcher is restarted, all queries started by that dispatcher will be stopped, and will not be automatically restarted until their next scheduled start time is reached.

The following processes should be restarted during the maintenance window. Note that many existing installations already restart the Data Import Server and Remote User Table Server every night, and if a weekly maintenance window is to be used instead, these installations will need their cron entries adjusted accordingly.

Authentication Servers

When multiple authentication servers are configured, they should be restarted in a staggered fashion. The first authentication server should be stopped and started, and then after a delay of at least 5 minutes, subsequent authentication servers may be restarted. This ensures one authentication server is available at all times.
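A staggered restart can be scripted so that the minimum delay is enforced rather than remembered. The following is a minimal Python sketch. The restart callable is a hypothetical, caller-supplied function that stops and starts one authentication server. The sketch restarts each server one at a time with the full delay in between, which is a conservative reading of the guidance above.

```python
import time
from typing import Callable, Iterable, List

# Minimum gap between restarts, per the guidance above.
MIN_DELAY_SECONDS = 5 * 60


def staggered_restart(
    servers: Iterable[str],
    restart: Callable[[str], None],
    delay_seconds: float = MIN_DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart servers one at a time, waiting `delay_seconds` between them.

    `restart` is a hypothetical caller-supplied function that stops and
    starts a single authentication server.
    """
    if delay_seconds < MIN_DELAY_SECONDS:
        raise ValueError("delay must be at least 5 minutes")
    done: List[str] = []
    for i, server in enumerate(servers):
        if i > 0:
            # Give the previously restarted server time to come back up.
            sleep(delay_seconds)
        restart(server)
        done.append(server)
    return done
```

Passing sleep as a parameter lets the schedule be tested without actually waiting.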

ACL Write Server

The ACL Write Server should be restarted weekly to ensure cached resources are released.

DIS (Data Import Server)

The Data Import Server must be restarted during the maintenance window to read newly deployed schemas and newly generated listeners.

Additionally, the DIS maintains a cache of recently used resources, including operating system file handles. Restarting the DIS ensures those resources are released, and standard operating system cleanup (such as the removal of files marked for deletion) may proceed.
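Held file handles block cleanup because of ordinary POSIX file semantics. Removing a file deletes its directory entry, but the operating system cannot reclaim the file's storage until every open handle is closed. The following is a minimal Python sketch (not Deephaven code) that demonstrates this. It assumes a POSIX system such as Linux. On Windows, the unlink fails while the file is open.

```python
import os
import tempfile


def unlinked_but_open_demo() -> bytes:
    """Show that a deleted file remains readable while a handle is open."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"partition data")
        os.unlink(path)  # the directory entry is removed...
        assert not os.path.exists(path)
        # ...but the data, and its disk space, remain while fd is open.
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 100)
    finally:
        os.close(fd)  # only now can the OS reclaim the storage
```

On Linux, lsof +L1 lists open files whose link count is zero, which can help identify a long-running process that is holding deleted files.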

The DIS may require restarts outside the maintenance window for the following reasons:

  • If a new schema is deployed, the DIS will not be able to access it until it is restarted.
  • If a listener class is changed or a new listener class is deployed, the DIS will not be able to access it until it is restarted.

Log Aggregator Service

The Log Aggregator Service should be restarted weekly to ensure cached resources are released.

LTDS (Local Table Data Server)

As with the DIS, the Local Table Data Server should be restarted weekly to ensure cached resources are released.

RTA (Data Import Server for User Data)

The Remote Table Appender is a Data Import Server process that handles user data. In default installations, it is the same process as the DIS, and the DIS maintenance procedures above apply.

The RTA maintains a cache of recently used resources, including operating system file handles. Restarting the RTA ensures those resources are released, and standard operating system cleanup (such as the removal of files marked for deletion) may proceed.

Client Update Service

The Client Update Service must be restarted after upgrades to the Deephaven system and after changes to customer-provided modules or site-specific configuration files.

Tailer

The tailer must be restarted to pick up changes to its configuration files (the XML files that specify which binary log files it searches for). Any such change that must take effect immediately therefore requires a tailer restart outside the maintenance window. A tailer restart during business hours causes a brief pause in the ingestion of real-time data.
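A deployment script can detect whether the tailer's XML configuration has changed since the tailer last started, and therefore whether a restart is needed. The following is a minimal Python sketch. The digest-recording approach and the function names are assumptions for illustration and are not part of the tailer itself. Recording the digests when the tailer starts is left to the site's own tooling.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def tailer_restart_needed(config_files, recorded):
    """Return the config files whose contents differ from `recorded`.

    `recorded` maps each path (as a string) to the digest captured when
    the tailer was last started; a file missing from it counts as changed.
    """
    return [
        str(p) for p in config_files
        if recorded.get(str(p)) != file_digest(Path(p))
    ]
```

Comparing contents rather than modification times avoids false positives when a file is touched but not changed.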