System logs

This guide outlines how to review and search system logs for troubleshooting purposes. It covers text logs and binary logs.

Text logs

Deephaven utilities and processes that are controlled by M/Monit generate text logs which are usually a good starting point when a server process is failing or is not operating correctly.

These logs are written to sub-directories under /var/log/deephaven.

DbInternal binary logs

Some Deephaven processes, especially workers (which are used in consoles and also behind persistent queries), log their details to binary log files, which are then loaded by the data import server into DbInternal tables. See Deephaven internal tables for more details about these tables.

When the system is functional and a console can be started, users can query the ProcessEventLog table to view log output from their own workers. Administrative access is required to view logs from other users' workers (see Deephaven permissions). When consoles are not functional, the iriscat utility can be used to view binary log files.

DbInternal table view

Workers are identified by a UUID (Universally Unique ID), stored in the ProcessInfoId column of various DbInternal log tables. The controller process assigns these IDs at startup. A truncated version of the UUID is stored in the WorkerName column. The WorkerName starts with the string worker_, followed by the first section of the ProcessInfoId. For example, a ProcessInfoId of 6604dd7e-054d-4ce5-9249-93c5dffa739b would have a WorkerName of worker_6604dd7e.

Note

A UUID (Universally Unique ID) is a value generated using unique hardware details of the machine on which a generator algorithm is running, in conjunction with the date and time at which the ID is generated. The UUID ProcessInfoId within Deephaven's scope always uniquely identifies one specific run of a worker process.

When troubleshooting, the most commonly used internal table is the ProcessEventLog (referred to as PEL). The PEL is useful when a Persistent Query has failed, but the exception displayed in the Query Monitor panel does not provide sufficient information about the cause of the failure. It is also useful when a console fails without a useful exception, for example if a process runs out of heap or encounters a fatal exception.

To view the PEL table for a process you must determine a unique identifier for it, preferably the ProcessInfoId. For a Persistent Query, the ProcessInfoId is found in a column in the Query Monitor panel. For interactive workers, the ProcessInfoId can be found by hovering over the ⓘ menu in the console. The ID can be easily copied to the clipboard by clicking on the icon next to the ID.

img

For a Persistent Query that crashed and no longer shows the ProcessInfoId, use the PersistentQueryStateLog table to find the ProcessInfoId.

psl = db.liveTable("DbInternal", "PersistentQueryStateLog").where("Date=today()", "Name=`Import 12`")
psl = db.live_table("DbInternal", "PersistentQueryStateLog").where(
    ["Date=today()", "Name=`Import 12`"]
)

In the query above, the psl table has a ProcessInfoId column containing the unique identifier for all executions of the Import 12 Persistent Query for today.

If the WorkerName for a console is known, the ProcessInfoId can be obtained by querying the AuditEventLog. For example, to get the worker startup audit details for query worker worker_6604dd7e:

ael = db.liveTable("DbInternal", "AuditEventLog").where("Date=today()", "Event=`Starting worker`", "Process=`db_query_server`", "Details.contains(`worker_6604dd7e`)")
ael = db.live_table("DbInternal", "AuditEventLog").where(
    [
        "Date=today()",
        "Event=`Starting worker`",
        "Process=`db_query_server`",
        "Details.contains(`worker_6604dd7e`)",
    ]
)

Once you know the ProcessInfoId, you can query the PEL to get the detailed worker logs from startup through shutdown. If the worker crashed or shut down unexpectedly during startup, there may be limited information available in the log. When this happens you need to inspect logs from the dispatcher on the host your query was running on.

pel = db.liveTable("DbInternal", "ProcessEventLog").where("Date=today()", "ProcessInfoId=`6604dd7e-054d-4ce5-9249-93c5dffa739b3`").sort("Timestamp")
pel = (
    db.live_table("DbInternal", "ProcessEventLog")
    .where(["Date=today()", "ProcessInfoId=`6604dd7e-054d-4ce5-9249-93c5dffa739b`"])
    .sort("Timestamp")
)

This query gets all PEL entries for all workers that were assigned the ProcessInfoId of 6604dd7e-054d-4ce5-9249-93c5dffa739b for today.

The .sort("Timestamp") is required because a process may have multiple loggers writing to the PEL, and those messages will be appended to the log in the order in which they arrived. Especially in the case of a fatal exception handler, the messages may arrive before the main log queue is flushed to the binary log file.

DbInternal iriscat view

When it is not possible to start a console to query the PEL, the iriscat utility can be used to view binary log files. iriscat is used to dump binary log files from disk to a text format where they can be read and parsed using standard *NIX command-line tools. The binary logs on each Deephaven server are written to /var/log/deephaven/binlogs/pel.

For example, to dump the contents of a particular PEL binary log file to stdout:

/usr/illumon/latest/bin/iriscat -l /var/log/deephaven/binlogs/pel/DbInternal.ProcessEventLog.System.deephaven-query-1.2023-02-14.bin.2023-02-14.214340.599+0000

The -l directive formats the data as a log file, including formatting long epoch offsets to datetime values for easier reading.

A common practice is to search the formatted log file for a particular ProcessInfoId and sort the result.

/usr/illumon/latest/bin/iriscat -l /var/log/deephaven/binlogs/pel/DbInternal.ProcessEventLog.System.*.2023-02-14* | grep '6604dd7e-054d-4ce5-9249-93c5dffa739b' | sort | less