Worker launch

A Deephaven worker is the primary place within a Deephaven system where "work" is done. A worker process may be interactive, receiving individual commands from a user or downstream process, or may run a Persistent Query (PQ), which is pre-defined and may be set to execute on a schedule. A worker may be used to manipulate and display data to users, ingest data into the system, or perform any number of other tasks.

Architecture

Worker processes are spawned by the RemoteQueryDispatcher via a shell script with several parameters. These RemoteQueryDispatcher processes are identified as the db_query_server process and the db_merge_server process. A dispatcher process runs on each node within the cluster that is assigned a QUERY and/or MERGE role during installation. In a Kubernetes environment, these processes run as the query-server and merge-server deployments. A worker process runs as the same user as the RemoteQueryDispatcher that launched it, unless the per-user workers feature is enabled.

A db_query_server and a db_merge_server are very similar, but the processes are run by different users. By default, a db_query_server instance is run as the dbquery user, and a db_merge_server instance is run as the dbmerge user. These default users may be defined differently during initial system installation. The dbmerge user is able to write historical data, so access to a db_merge_server process should be limited to administrative and system users. The dbquery workers spawned by the db_query_server are able to read intraday and historical data, but cannot write historical data.

Each worker is given a unique ProcessInfoId by the RemoteQueryDispatcher. This identifier is useful for troubleshooting. During worker startup, stdout and stderr are captured by the RemoteQueryDispatcher and sent to the Process Event Log (PEL) on behalf of the worker, identified by the ProcessInfoId. Once the worker has started, it can write to the PEL on its own by writing binary logs.

Troubleshooting

There are a number of reasons why a worker may fail to start. In many cases, the cause can be determined by examining a stack trace for the worker-spawn attempt. For a failed PQ, the stack trace may be found in the ExceptionDetails column in the Query Monitor / Query Config tab of the IDE; the worker's ProcessInfoId is often listed in the ProcessInfoId column as well. For an interactive worker session, the exception appears in the Code Studio / Console that launched the attempt.
A problem with the RemoteQueryDispatcher process itself may also prevent successful worker launches. In that case, information may be found in a plain-text logfile for the RemoteQueryDispatcher process within the /var/log/deephaven/query_server or /var/log/deephaven/merge_server directory of the node where the worker launch was attempted.
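When hunting for dispatcher-side failures, a shell session like the following can help locate the relevant logfile. This is a sketch: the exact logfile names vary by installation, LOG_DIR is just an illustrative variable, and the ProcessInfoId shown is a placeholder.

```shell
# Directory for query-dispatcher logs; use merge_server for the merge dispatcher.
LOG_DIR=${LOG_DIR:-/var/log/deephaven/query_server}

# Show the most recently modified dispatcher logfiles.
ls -lt "$LOG_DIR" 2>/dev/null | head -n 5

# Search the logs for a specific worker's ProcessInfoId (placeholder ID shown).
grep -r "some-process-info-id" "$LOG_DIR" 2>/dev/null | tail -n 20
```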

If the worker process has started but crashes during initialization, additional details may be found in the ProcessEventLog. See Finding errors for details on troubleshooting via the ProcessEventLog.

Common worker startup issues

Note

This is not intended to be a complete list of possible errors.

Causes and troubleshooting steps:
Cause: RemoteQueryDispatcher or LogAggregatorService is not started.

Troubleshooting: The status of these services should be checked on the node where the worker is being launched. Run the command /usr/illumon/latest/bin/dh_monit summary to check the status of these processes.

If the appropriate RemoteQueryDispatcher is not started (db_query_server and/or db_merge_server), then the worker cannot be launched. If the LogAggregatorService is not started (log_aggregator_service), then logs for the worker will not be captured.

Start the RemoteQueryDispatcher service(s) and/or LogAggregatorService with sudo -u irisadmin /usr/illumon/latest/bin/dh_monit start ... for the appropriate service (system administrative privileges will be required).
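The checks above can be sketched in a shell session. This assumes the standard install path; the service names passed to dh_monit start are the process names mentioned above.

```shell
DH_MONIT=/usr/illumon/latest/bin/dh_monit

if [ -x "$DH_MONIT" ]; then
    # Check the status of all Deephaven processes on this node.
    "$DH_MONIT" summary

    # Start the query dispatcher and log aggregator if they are down
    # (administrative privileges are required).
    sudo -u irisadmin "$DH_MONIT" start db_query_server
    sudo -u irisadmin "$DH_MONIT" start log_aggregator_service
else
    echo "dh_monit not found at $DH_MONIT; is this a Deephaven node?"
fi
```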
Cause: Worker requested too much heap.

Troubleshooting: There is a default cumulative heap limit per RemoteQueryDispatcher, defined by the RemoteQueryDispatcher.maxTotalQueryProcessorHeapMB property, which may be overridden per RemoteQueryDispatcher instance in iris-environment.prop. Even if there is sufficient memory installed in the system, cumulative worker memory cannot exceed the value defined by this property.

Similarly, the RemoteQueryDispatcher.maxPerWorkerHeapMB property may be defined, limiting the maximum heap allowed per worker. By default, this property is not defined, and a given worker may allocate memory up to the total RemoteQueryDispatcher.maxTotalQueryProcessorHeapMB value configured.

Ensure that the properties are set correctly and that the worker is not requesting more heap than is available.
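As an illustrative iris-environment.prop fragment (the values below are examples, not recommendations), the two limits might look like:

```properties
# Cap the combined heap of all workers launched by one dispatcher, in MB.
RemoteQueryDispatcher.maxTotalQueryProcessorHeapMB=65536

# Optional per-worker cap, in MB; unset by default, in which case a single
# worker may request up to the full maxTotalQueryProcessorHeapMB value.
RemoteQueryDispatcher.maxPerWorkerHeapMB=16384
```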
Cause: Worker crashes during initialization.

Troubleshooting: If the worker process has started but is unable to complete initialization, the reason should be identified in the ProcessEventLog. This may be caused by syntax errors in the script (for a PQ), by missing resources on the classpath (plugins not installed or activated on the particular node), etc.