Merge Optimization

Merge Heap Guidelines

The merge process takes imported data in its raw form, typically partitioned by source and ordered as it was published, and rewrites it persistently across historical storage partitions, optionally grouping the data on a subset of columns and sorting within those groups.

This is fundamentally a memory-intensive process. Merge must first assign each input row to an output partition, then order the data for each output partition according to the specified grouping and sorting, and finally read and rewrite each column's data in output order.

There are three primary consumers of memory in a merge:

  1. Indexes and other data structures used to determine ordering.
  2. Symbol tables for the output columns.
  3. Data buffers used to cache column data as it is read.

A good rule of thumb for your merge heap size is to double the data buffer pool size:

buffer pool size = max_over_all_columns(column file data size) * nWritingThreads / nOutputPartitions / 0.85

heap size = 2 * buffer pool size
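
For example, with illustrative figures (not recommendations): suppose the largest column's file data is 10GB, with 4 writing threads and 8 output partitions. Then:

buffer pool size = 10GB * 4 / 8 / 0.85 ≈ 5.9GB

heap size = 2 * 5.9GB ≈ 12GB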

Ordering Information

From a memory optimization standpoint, there is not a great deal to be done about the size of ordering information (indexes, etc.). The implementation disables some performance optimizations when output partitions exceed 1 billion rows, choosing a potentially more compressed representation (ordered sub-indexes) over the default (expanded arrays of input row keys). In the worst case, total ordering information should never significantly exceed 8 bytes per input row once the ordering step is complete. The worst case occurs when data must be substantially re-ordered, such as when the grouping column values repeat in order ("A", "B", "C", "A", "B", "C"), or when sorting rearranges many rows.
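
As an illustrative worst-case estimate: a merge over 2 billion input rows would need at most roughly 2,000,000,000 rows * 8 bytes ≈ 16GB of ordering information.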

Symbol Tables

Symbol tables allow for significant compression of output data, but are often stored fully in memory during the merge process, depending on the String caching settings described in the next section. They require storage that can be approximated as (20 bytes per unique string + 2 bytes per character across all unique strings).

Note

We currently store symbol manager data in memory, as it is computed, for the duration of the merge.
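
As an illustrative estimate of that approximation: a symbol column with 1 million unique strings averaging 12 characters each would need roughly (20 bytes * 1,000,000) + (2 bytes * 12 * 1,000,000) = 20MB + 24MB = 44MB.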

Data Buffer Pool

The Data Buffer Pool caches table and index data as it is read from disk. Its configuration is relevant for all query workers and everything in the data pipeline, in particular DIS instances. The DataImportServer, LocalTableDataServer, TableDataCacheProxy, and query worker processes (including those used for merge) all operate around an internal pool of 64KB binary buffers used to read, write, and cache binary data.

This is by far the most important driver of merge heap usage and merge performance. The data buffer pool will be automatically increased or decreased to be at least 10% and at most 60% of the process heap size, and then rounded down to the nearest multiple of the block size (default 64KB). While the block size can be modified, and the buffer pool can be disabled entirely, neither is advisable.
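
A minimal sketch of that sizing arithmetic, assuming the default 64KB block size (the class and method names are illustrative, not the actual implementation):

    // Illustrative sketch of the pool sizing described above; not product code.
    public final class PoolSizingSketch {
        private static final long BLOCK_SIZE = 64 * 1024; // default 64KB block size

        // Clamp the configured pool size to [10%, 60%] of the process heap,
        // then round down to the nearest multiple of the block size.
        static long effectivePoolSize(final long configuredBytes, final long heapBytes) {
            final long min = (long) (0.10 * heapBytes);
            final long max = (long) (0.60 * heapBytes);
            final long clamped = Math.max(min, Math.min(max, configuredBytes));
            return (clamped / BLOCK_SIZE) * BLOCK_SIZE;
        }
    }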

Users should consider the number of columns that can fit fully in memory and the possible speedup achievable via concurrent writing when tuning this size. In the ideal case, the number of writing threads available should be a multiple of the number of output partitions, corresponding to buffer pool size as follows:

buffer pool size = max_over_all_columns(column file data size) * nWritingThreads / nOutputPartitions / 0.85

In Deephaven versions prior to v1.20180326, or in low heap usage mode, the / nOutputPartitions term should be disregarded, because in both cases at most one destination per column is written concurrently. In throughput mode, merge currently writes all destinations for each column concurrently when possible.
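
The rule of thumb, including the single-destination variant, can be expressed as a hypothetical helper (the names and the boolean flag below are assumptions for illustration, not any Deephaven API):

    // Hypothetical sizing helper following the rule of thumb above.
    public final class MergeSizingSketch {
        static long recommendedBufferPoolBytes(final long maxColumnFileBytes,
                                               final int nWritingThreads,
                                               final int nOutputPartitions,
                                               final boolean oneDestinationPerColumn) {
            // Pre-v1.20180326 and low heap usage mode write at most one destination
            // per column at a time, so the output-partition divisor is dropped.
            final int divisor = oneDestinationPerColumn ? 1 : nOutputPartitions;
            return (long) (maxColumnFileBytes * (double) nWritingThreads / divisor / 0.85);
        }

        // heap size = 2 * buffer pool size
        static long recommendedHeapBytes(final long bufferPoolBytes) {
            return 2 * bufferPoolBytes;
        }
    }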

Some general tuning recommendations:

  • For merges, follow the merge throughput guidelines in the Merging Data documentation.
  • For specific queries, the QueryPerformanceLog and UpdatePerformanceLog provide information about the number and duration of repeated data reads. If you see a large number of repeated reads, you may want to increase the size of your pool. If there are no repeated reads, you may be able to reduce the size of your pool.
  • For DIS instances, the pool must be big enough to hold one block for each file of each partition; our recommended rule of thumb is nPartitions * nColumns * 1.2, assuming most columns are not object columns (see the worked example below).
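
For instance, with illustrative figures: a DIS hosting 10 partitions of a 50-column table would need at least 10 * 50 * 1.2 = 600 buffers, or about 37.5MB at the default 64KB block size.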

Methods

The methods below provide information about the current configuration.

Method | Description
getMaximumPoolToHeapSizeRatio() | Get the configured maximum decimal ratio between the total size of the data buffer pool and the JVM heap size. This is determined by the DataBufferConfiguration.heapMaxPoolToHeapSizeRatio or DataBufferConfiguration.directMaxPoolToHeapSizeRatio property, depending on the value of DataBufferConfiguration.useDirectMemory.
getMinimumPoolToHeapSizeRatio() | Get the configured minimum decimal ratio between the total size of the data buffer pool and the JVM heap size. This is determined by the DataBufferConfiguration.minPoolToHeapSizeRatio property.

Properties

The following properties influence the configuration of the data buffer pool. Properties without default values are mandatory and must be set either in the Deephaven configuration or with -D at the command line.

The only properties that should be changed from their defaults are:

  • DataBufferConfiguration.poolSize=<total number of bytes in the pool>
  • iris.concurrentWriteThreads=<number of threads to use when ordering or writing data>
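
For example, a merge worker could be launched with both set at the command line (the values here are illustrative, not recommendations):

-DDataBufferConfiguration.poolSize=12g -Diris.concurrentWriteThreads=8
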
Property | Default | Description
DataBufferConfiguration.bufferSize | N/A | Specifies the size of each buffer used for storing and communicating Deephaven-format binary data. Supports "-Xmx"-style units, i.e., <size>[g|G|m|M|k|K], but must be a positive integer less than 2^30 when converted to bytes.
DataBufferConfiguration.useDirectMemory | false | Specifies whether the buffers used for storing and communicating Deephaven-format binary data should be allocated in direct memory rather than heap memory.
DataBufferConfiguration.poolEnabled | N/A | Specifies whether a pool should be used to constrain the number of buffers in use for Deephaven-format data.
DataBufferConfiguration.poolSize | N/A | The total size of the memory allocated to the data buffer pool. Supports "-Xmx"-style units, i.e., <size>[g|G|m|M|k|K]. This value is clamped between DataBufferConfiguration.minPoolToHeapSizeRatio and the applicable maximum ratio (DataBufferConfiguration.heapMaxPoolToHeapSizeRatio or DataBufferConfiguration.directMaxPoolToHeapSizeRatio) to prevent the data buffer pool from consuming an inordinate amount of memory, which could adversely affect stability. The resulting number of buffers must be less than Integer.MAX_VALUE. Disregarded if buffer pooling is disabled.
DataBufferConfiguration.minPoolToHeapSizeRatio | 0.1 | Specifies the minimum ratio (as a decimal number) between the data buffer pool size and the JVM heap size. Disregarded if buffer pooling is disabled.
DataBufferConfiguration.heapMaxPoolToHeapSizeRatio | 0.6 | Specifies the maximum ratio (as a decimal number) between the data buffer pool size and the JVM heap size when heap memory is used. Disregarded if buffer pooling is disabled.
DataBufferConfiguration.directMaxPoolToHeapSizeRatio | 2.0 | Specifies the maximum ratio (as a decimal number) between the data buffer pool size and the JVM heap size when direct memory is used. Disregarded if buffer pooling is disabled.
DataBufferConfiguration.poolCleanupThresholdRatioUsed | 0.9 | Maximum occupancy of the data buffer pool, as a decimal ratio, before concurrent cleanup is performed. Must be between 0.0 and 1.0. Disregarded if buffer pooling is disabled.
DataBufferConfiguration.poolCleanupTargetRatioUsed | 0.6 | Target occupancy of the data buffer pool as a decimal ratio. Must be between 0.0 and DataBufferConfiguration.poolCleanupThresholdRatioUsed. Disregarded if buffer pooling is disabled.
DataBufferConfiguration.poolCleanupIntervalMillis | 60000 (60s) | Interval between checks for concurrent cleanup. Must be greater than zero. Disregarded if buffer pooling is disabled.
DataBufferConfiguration.poolClockIntervalMillis | 10000 (10s) | Interval between ticks of the data buffer pool's logical clock, used to stamp an approximate last-used time on outstanding buffers as input to cleanup. Must be greater than zero. Disregarded if buffer pooling is disabled.