Deephaven Production Log Monitoring
This article serves as an introduction to much of the content detailed in full in the Deephaven Operations Guide and Process Runbooks. It is not intended as a comprehensive guide.
Standard Log Layouts
Note
Most production Deephaven processes have multiple log files:
Startup Script Log
In addition to the startup script's own output, this file captures the stdout and stderr streams from the process.
This file appears in the process’s log directory, with the following name format:
process_name.log.yyyy-MM-dd
Where the date is the process startup date.
This file is not rotated. Be careful with automated log cleanup for long-running processes to ensure that this file is not deleted while still in use.
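For example, a cleanup job along the following lines can avoid removing a startup script log that is still being written. This is a sketch only; the directory, retention window, and idle-time heuristic are illustrative assumptions, not Deephaven defaults.

```python
# Sketch: prune old startup script logs without touching one that is still in
# use. The directory, retention window, and "idle" heuristic are illustrative
# assumptions, not Deephaven defaults.
import os
import re
import time

LOG_DIR = "/var/log/deephaven/example_process"             # hypothetical log directory
RETENTION_DAYS = 30                                        # hypothetical retention policy
STARTUP_LOG = re.compile(r"^.+\.log\.\d{4}-\d{2}-\d{2}$")  # process_name.log.yyyy-MM-dd

now = time.time()
for name in os.listdir(LOG_DIR):
    if not STARTUP_LOG.match(name):
        continue
    path = os.path.join(LOG_DIR, name)
    idle_days = (now - os.path.getmtime(path)) / 86400
    # The file is written for the life of the process and never rotated, so a
    # recent mtime suggests the process is still running; only remove files
    # that have been idle longer than the retention window.
    if idle_days > RETENTION_DAYS:
        print(f"would remove {path} (idle {idle_days:.0f} days)")
        # os.remove(path)  # deliberately left commented out in this sketch
```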
Garbage Collection Log
Captures detailed information from the JVM’s garbage collector.
This file appears in the process’s log directory, with the following name format:
process_name.log.gc_detailed.yyyy-MM-dd_HH-mm-SS
Where the timestamp is the process startup time.
This file is not rotated. Be careful with automated log cleanup for long-running processes to ensure that this file is not deleted while still in use.
The format of records in this file may change with different JVM versions or garbage collector settings.
Application Logs
Capture data explicitly logged via the application-level logging framework.
These files appear in the process’s log directory, with the following name format:
ProcessMainClassName.log.yyyy-MM-dd-HHmmss.SSS-ZZZZ
Where the timestamp is the file’s creation time, including a time zone.
These files are rotated every 15 minutes. The most recent file in a sequence is linked as:
ProcessMainClassName.log.current
Records in these files include timestamps, as well as a “level” describing the relative importance and type of each log message.
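For file-based monitoring, the .current link is a convenient handle that survives rotation. The following minimal sketch follows it and re-resolves the link whenever it is repointed; the path and the FATAL substring filter are placeholders, not Deephaven conventions.

```python
# Sketch: follow a process's application log across the 15-minute rotations by
# re-resolving the .log.current link whenever it is repointed. The path is a
# placeholder for the process being monitored.
import os
import time

CURRENT_LINK = "/var/log/deephaven/example/ExampleMainClass.log.current"

def follow(link_path):
    """Yield new lines, switching files when the link points to a new log."""
    target = os.path.realpath(link_path)
    handle = open(target, "r", errors="replace")
    handle.seek(0, os.SEEK_END)                          # start at the end, like tail -f
    while True:
        line = handle.readline()
        if line:
            yield line
        elif os.path.realpath(link_path) != target:      # rotation happened
            handle.close()
            target = os.path.realpath(link_path)
            handle = open(target, "r", errors="replace")
        else:
            time.sleep(1.0)

for record in follow(CURRENT_LINK):
    if "FATAL" in record:                                # crude filter; adjust as needed
        print(record, end="")
```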
Worker Logs (Including In-Worker System Processes)
Deephaven workers do not produce the log files described above. Worker logs are sent to the Process Event Log table directly or via the RemoteQueryDispatcher, depending on the type of log event.
This has implications for monitoring system processes running as workers. Such monitoring must either be done by querying the Process Event Log table, or by consuming the binary logs written by the LogAggregatorService as an intermediate step.
Process Event Log
Many processes, including Deephaven query workers (RemoteQueryProcessor), omit writing the above log files and instead send the same data to a Deephaven table, DbInternal/ProcessEventLog. This is configurable.
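From a Deephaven Python console, the table can be inspected directly. The sketch below assumes the legacy db.i intraday accessor and the Date, Timestamp, Level, Process, and LogEntry columns; verify both against your installation before relying on it.

```python
# Minimal sketch: today's FATAL-level events from the Process Event Log, run in
# a Deephaven Python console. Assumes the legacy db.i intraday accessor and the
# Date, Timestamp, Level, Process, and LogEntry columns; verify both against
# your installation.
pel = db.i("DbInternal", "ProcessEventLog")

fatal_today = (
    pel.where("Date = currentDateNy()")
       .where("Level = `FATAL`")
       .view("Timestamp", "Process", "Level", "LogEntry")
)
```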
Common Startup and Shutdown Logging
Most or all Deephaven processes use an internal library for logging their lifecycle, which can be monitored externally. Monitoring tasks can look for the following strings:
EventLogProcessName INITIALIZING
- This is logged very early in startup, as the logging framework is initialized.
ProcessMainClassName starting up
- This is also logged very early, as the environment is initialized.
EventLogProcessName RUNNING
- This is logged when process-specific initialization is complete and the process is ready.
ProcessMainClassName shutting down
- This is logged when environment shutdown is triggered.
EventLogProcessName SHUTTING_DOWN
- This is logged as one of the initial operations during shutdown processing.
Shutdown processing has a little more structure to it, should it be necessary to monitor it more closely. Each process has three ordered groups of shutdown tasks, denoted first, middle, and last. The logging around this sequence looks like:
- Initiating shutdown processing
- Starting to invoke FIRST shutdown tasks
- Done invoking FIRST shutdown tasks
- Starting to invoke MIDDLE shutdown tasks
- Done invoking MIDDLE shutdown tasks
- Starting to invoke LAST shutdown tasks
- Done invoking LAST shutdown tasks
- Finished shutdown processing
During shutdown processing, Deephaven processes attempt to log as much usable information as possible in order to report on exceptional conditions or errors during cleanup. This becomes error-prone toward the end, however, because the logger framework itself must ultimately be shut down during the last phase, which can truncate some of the final messages.
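For external monitoring of these markers, a simple scanner can reduce a process's application log to its most recently logged lifecycle state. The sketch below uses marker substrings taken from this section; the log path and state names are placeholders.

```python
# Sketch: reduce a process's application log to its most recently logged
# lifecycle marker. Marker substrings come from this section; the log path
# and state names are placeholders.
MARKERS = (
    ("INITIALIZING", "INITIALIZING"),
    ("STARTING", "starting up"),
    ("RUNNING", "RUNNING"),
    ("SHUTTING_DOWN", "shutting down"),
    ("SHUTDOWN_COMPLETE", "Finished shutdown processing"),
)

def lifecycle_state(log_path):
    """Return the name of the last lifecycle marker seen in the log, if any."""
    state = None
    with open(log_path, "r", errors="replace") as handle:
        for line in handle:
            for name, marker in MARKERS:
                if marker in line:
                    state = name
    return state

print(lifecycle_state("/var/log/deephaven/example/ExampleMainClass.log.current"))
```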
Exceptional Conditions to Monitor
Many Deephaven processes are by design resilient to a wide variety of errors. That said, some issues (e.g. resource exhaustion) are unrecoverable. In general, all Deephaven processes reserve the log level “FATAL” for issues that should result in exceptional termination.
Common Data Pipeline Errors
The Deephaven data pipeline consists of the graph of LogTailer, LogAggregatorService, DataImportServer, LocalTableDataServer, and TableDataCacheProxy processes involved in ingesting and serving data.
Data Buffer Pool
The DataImportServer, LocalTableDataServer, TableDataCacheProxy, and query worker processes (including those used for merge) all operate around an internal pool of 64KB binary buffers used to read, write, and cache binary data. While buffer size is technically configurable, it must be globally consistent for the entire data pipeline, and Deephaven has found that 64KB strikes a good balance between read/write throughput and read/cache amplification.
This pool is highly configurable. See Data Buffer Pool Configuration for documentation of the options available. Most processes only adjust the total size of the pool, which is controlled by one of the following properties (or by the Data Memory Ratio field for a Persistent Query):
DataBufferPool.sizeInBytes
- deprecated in Deephaven v1.20200331 and after
DataBufferConfiguration.poolSize
- allows more fluent specification, e.g. 24g, rather than simply a raw number of bytes
Pool sizing is very important for cache performance in workers, especially for the merge operations that re-write intraday data in a historical format.
Pool sizing is even more important at DataImportServer instances. A DIS must be able to allocate one buffer per column file for each open partition (i.e., partitions with a connected tailer). If a DIS cannot make progress because it has completely exhausted the buffer pool, and it cannot free up any buffer space with a synchronous cleanup, the process will terminate.
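As a rough capacity check, the fluent pool size can be translated into a count of 64KB buffers and compared against the number of column files the DIS must keep open. The suffix parsing and the workload figures below are simplifying assumptions, not documented Deephaven behavior.

```python
# Rough sizing sketch: translate a fluent pool size into 64KB buffers and
# compare it with the number of column files a DIS must keep open. The suffix
# parsing and the workload figures are simplifying assumptions.
BUFFER_SIZE = 64 * 1024                      # 64KB data buffers
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def parse_fluent_size(text):
    """Parse values like '24g', or a plain byte count, into bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)

pool_buffers = parse_fluent_size("24g") // BUFFER_SIZE

# Hypothetical workload: 40 actively tailed partitions with ~50 column files each.
open_partitions, column_files_per_partition = 40, 50
buffers_needed = open_partitions * column_files_per_partition

print(f"pool holds {pool_buffers} buffers; "
      f"~{buffers_needed} needed just to keep every open column file writable")
```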
Key log lines to look for:
AutoReclaimingObjectPool-DataBufferPool: Failed to take() an item for thread=<thread name>, initiating synchronous cleanup
- This means a thread was unable to acquire a buffer in a lock-free manner, and will wait and attempt to reclaim space on the current thread. It generally means that the pool is under pressure, and may indicate that the configuration should be changed to increase the pool size or the frequency and aggressiveness of concurrent cleanup.
AutoReclaimingObjectPool-DataBufferPool: Unable to take() an item for thread=<thread name>, yielding (<#>/<#>)
- This means that a thread was unable to acquire a buffer even after performing synchronous cleanup: all buffers are actively in use and none could be freed. The thread will yield the CPU in order to allow other threads to make progress and, hopefully, release one or more buffers.
AutoReclaimingObjectPool-DataBufferPool: Unable to take() an item for thread=<thread name> after <#> yields
- This means that a thread was unable to acquire a buffer after multiple attempts to synchronously clean up and yield the CPU. This is a fatal error, and will be followed by an ObjectPoolExhaustedError with message AutoReclaimingObjectPool-DataBufferPool: Yield limit encountered - no pooled objects available! Subsequent threads that encounter this scenario during shutdown will instead throw an ObjectPoolExhaustedException with message AutoReclaimingObjectPool-DataBufferPool: Yield limit previously encountered - no pooled objects available!
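A monitoring task can treat these three messages as escalating levels of pool pressure. The sketch below simply counts them in an application log; the log path is a placeholder.

```python
# Sketch: count the three buffer pool messages above as escalating levels of
# pool pressure. The log path is a placeholder.
import collections

POOL_TAG = "AutoReclaimingObjectPool-DataBufferPool"
PATTERNS = {
    "synchronous_cleanup": "Failed to take() an item",  # pool under pressure
    "yielding": ", yielding (",                         # all buffers in use
    "exhausted": "Yield limit encountered",             # fatal exhaustion
}

counts = collections.Counter()
with open("/var/log/deephaven/example/db_dis.log.current", errors="replace") as log:
    for line in log:
        if POOL_TAG not in line:
            continue
        for label, pattern in PATTERNS.items():
            if pattern in line:
                counts[label] += 1

print(dict(counts))
# Any "exhausted" hit means the process has already terminated or is about to;
# a rising "yielding" count is a strong signal to grow the pool first.
```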
TableDataException Logs
Many errors in reading data will result in a log message including a TableDataException. While these will often occur in workers, it may be worthwhile to monitor for them in all system processes that are part of the read flow in the data pipeline, i.e., DataImportServer, LocalTableDataServer, and TableDataCacheProxy.
Generally these will include the name of the service that could not provide data, an explanation of the exception, and details that may aid in investigations. They almost always point to a problem in the system, either due to data buffer pool exhaustion, invalid caches due to data deletion outside of Deephaven-internal mechanisms, software issues, or underlying issues with hardware or file systems.
The key to investigating these issues and reaching a speedy resolution is checking the logs of the process that originated the chain of errors. Often for intraday data reads, this is the DataImportServer.
Resolving cache invalidity issues, which often show up as failures to read an offset past the end of a file, usually requires a restart of the impacted processes. This typically means the DIS or LTDS serving the data, any intervening TDCPs, and the query worker(s) accessing the data.
An example of such an error follows:
Caused by: com.illumon.iris.db.v2.locations.TableDataException: RemoteTableDataService-Remote_db_ltds: Received rejection for request (requestKey=12279): FileDataBufferStore[OffsetFileAccessor[LocalFileAccessor[/db/Intraday/Namespace/TableName/IntradayPartition/2019-04-08/TableName/Symbol.dat],131072]]: Read failure: startOffset=0, minimumLength=6872, localSize=6872, size=6872, bufferIndex=0, minimumLimit=6872, sizeImpliedLimit=6872, requiredLimit=6872
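When triaging such messages in bulk, it can help to pull out the originating service and the file involved. The sketch below does this for the example above; the regular expressions are tuned to that one message shape and may need adjustment for other TableDataException variants.

```python
# Sketch: extract the originating service and file path from a
# TableDataException message like the example above, to direct the
# investigation to the right process. Tuned to that one example only.
import re

MESSAGE = (
    "com.illumon.iris.db.v2.locations.TableDataException: "
    "RemoteTableDataService-Remote_db_ltds: Received rejection for request "
    "(requestKey=12279): FileDataBufferStore[OffsetFileAccessor[LocalFileAccessor["
    "/db/Intraday/Namespace/TableName/IntradayPartition/2019-04-08/TableName/Symbol.dat"
    "],131072]]: Read failure: startOffset=0, minimumLength=6872"
)

service = re.search(r"TableDataException: ([^:]+):", MESSAGE)
path = re.search(r"LocalFileAccessor\[([^\]]+)\]", MESSAGE)
print("service:", service.group(1) if service else "?")
print("file:", path.group(1) if path else "?")
```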
Schema Synchronization Errors
At startup, the DIS reads schemas from the configuration service, and loads the listener classes needed to parse binary log data received from tailers. If the schema or a listener for a tailed data stream changes while the DIS is running, that stream will stop receiving updates. The details of the problem will be listed in the DIS and tailer logs, and will usually be in the form of IllegalStateExceptions, RuntimeExceptions, or IllegalArgumentExceptions stating that a column did not match a name, did not match a type, or was not found, or that the version number of a binary log file did not match the version numbers of the available listener classes.
Correcting such a situation may require:
- Deploying updated schema
- Regenerating loggers and listeners
- Updating remote loggers to use latest schemata and logger classes
- Restarting loggers
- Restarting the DIS process(es)
The most common scenario is that schemas must be deployed, the DIS needs to be restarted, or the schema change needs to be managed so that it does not happen intra-partition.
Log messages related to this will generally include text along the lines of:
Channel handling error while processing stream: java.lang.IllegalArgumentException: Unknown column
- This means that a column in the data received from the tailer is not found in the schema that the DIS has available to it. This likely needs to be resolved with a schema deployment.
Channel handling error while processing stream: java.lang.RuntimeException: Can not find schema Listener element
- This means the DIS cannot find the listener code; either it was not deployed, or the DIS must be restarted to find it.
Channel handling error while processing stream: java.lang.IllegalStateException:
- Often this is an error wherein some binary log files for a partition were written with a newer schema version, and cannot be imported to the same partition without manual intervention.
That said, not all channel handling errors are severe; the impact depends on the nature of the error and the importance of the data.
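To keep track of how often each variety occurs, channel handling errors can be bucketed by exception type. The sketch below does this over a DIS application log; the log path is a placeholder.

```python
# Sketch: bucket "Channel handling error" lines in a DIS log by exception type
# so that schema-deployment issues stand out from listener/versioning issues.
# The log path is a placeholder.
import collections
import re

EXCEPTION = re.compile(r"Channel handling error while processing stream: (\S+):")

buckets = collections.Counter()
with open("/var/log/deephaven/example/db_dis.log.current", errors="replace") as log:
    for line in log:
        match = EXCEPTION.search(line)
        if match:
            buckets[match.group(1)] += 1   # e.g. java.lang.IllegalArgumentException

for exception, count in buckets.most_common():
    print(f"{count:6d}  {exception}")
```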
Common Garbage Collection Errors
Note
See also: GC Log Patterns in the Deephaven Operations Guide
This is up to date for the garbage collectors commonly used with Deephaven in Java 8. This type of monitoring is suitable for all system components, although care should be taken not to over-prioritize non-essential query workers in this way.
Of these, concurrent mode failure and Full GC are the least ambiguous; they generally point to a process that is having difficulties making enough free heap available to proceed.
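A simple scan of the detailed GC log for these two phrases is often sufficient as a first alarm. The sketch below is illustrative only; the path is a placeholder, and the exact phrasing in the GC log can vary with JVM version and collector settings.

```python
# Sketch: flag the least ambiguous GC problems by scanning the detailed GC log
# for the two phrases called out above. The path is a placeholder, and the GC
# log format varies with JVM version and collector settings.
import collections

GC_LOG = "/var/log/deephaven/example/db_dis.log.gc_detailed.2024-01-01_00-00-00"
PHRASES = ("concurrent mode failure", "Full GC")

hits = collections.Counter()
with open(GC_LOG, errors="replace") as log:
    for line in log:
        for phrase in PHRASES:
            if phrase in line:
                hits[phrase] += 1

for phrase, count in hits.items():
    print(f"{phrase}: {count}")
```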