Merge Optimization
Merge Heap Guidelines
The merge process takes imported data in its raw form, typically partitioned by source and ordered by publication, and rewrites it persistently across historical storage partitions, optionally grouping the data on a subset of columns and sorting within those groups.
This is fundamentally a memory-intensive process. Merge must first assign each input row to an output partition, then must order the data for each output partition according to the specified grouping and sorting. Finally, merge must read and rewrite the data for each column in output order.
There are three primary users of memory in a merge:
- Indexes and other data structures used to determine ordering
- Symbol tables for the output columns
- Data buffers used to cache column data as it is read
A good rule of thumb for determining the best value for your merge heap size is to double the data buffer pool size:

buffer pool size = max_over_all_columns(column file data size)
                   * nWritingThreads / nOutputPartitions / 0.85

heap size = 2 * buffer pool size
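The rule of thumb above can be expressed as a small calculation. This is a minimal sketch; the function and parameter names are illustrative, not part of Deephaven's API:

```python
def merge_heap_estimate(max_column_file_bytes, n_writing_threads, n_output_partitions):
    """Estimate merge heap size per the rule of thumb:
    size the buffer pool for the largest column file, scaled by writer
    concurrency, with ~15% headroom (the / 0.85 term); heap is double that."""
    buffer_pool = max_column_file_bytes * n_writing_threads / n_output_partitions / 0.85
    return 2 * buffer_pool

# Example: largest column file is 4 GiB, 8 writing threads, 4 output partitions.
heap = merge_heap_estimate(4 * 2**30, 8, 4)  # ~18.8 GiB
```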
Ordering Information
From a memory optimization standpoint, there is not a great deal to be done about the size of ordering information (indexes, etc.). The implementation disables some performance optimizations when output partitions exceed one billion rows, choosing a potentially more compressed representation (ordered sub-indexes) rather than the default (expanded arrays of input row keys). In the worst case, total ordering information should never significantly exceed 8 bytes per input row once the ordering step is complete. This worst case occurs when data must be substantially re-ordered, such as when the grouping column values repeat in order ("A", "B", "C", "A", "B", "C"), or when sorting rearranges many rows.
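The 8-bytes-per-row bound makes a quick worst-case estimate possible. A sketch, with an illustrative function name:

```python
def worst_case_ordering_bytes(n_input_rows):
    # Worst case: roughly 8 bytes of ordering information per input row.
    return 8 * n_input_rows

# e.g. one billion input rows -> about 8 GB of ordering information
ordering_budget = worst_case_ordering_bytes(10**9)
```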
Symbol Tables
Symbol tables allow significant compression of output data, but are often held fully in memory during the merge process, depending on the String caching settings described in the next section. They require storage that can be approximated as 20 bytes per unique string, plus 2 bytes per character across all unique strings.
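That approximation can be checked with a short calculation. A sketch under the stated assumptions (20 bytes of per-string overhead, 2 bytes per character); the function name is illustrative:

```python
def symbol_table_bytes(unique_strings):
    """Approximate symbol table memory: ~20 bytes of overhead per unique
    string plus ~2 bytes per character across all unique strings."""
    return sum(20 + 2 * len(s) for s in unique_strings)

# e.g. two 4-character symbols -> 2*20 + 2*8 = 56 bytes
estimate = symbol_table_bytes(["AAPL", "GOOG"])
```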
Note
Symbol manager data is currently kept in memory, as it is computed, for the duration of the merge.
Data Buffer Pool
This is by far the most important driver of merge heap usage and merge performance. The data buffer pool is automatically sized to be at least 10% and at most 60% of the process heap size, then rounded down to the nearest multiple of the block size (64KB by default). While the block size can be modified, and the buffer pool can be disabled entirely, neither is advisable.
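The clamp-and-round behavior described above can be sketched as follows. This is an illustration of the stated sizing rule, not Deephaven code; names are hypothetical:

```python
BLOCK_SIZE = 64 * 1024  # default block size, 64KB

def effective_pool_size(requested_bytes, heap_bytes, block_size=BLOCK_SIZE):
    """Clamp a requested pool size to [10%, 60%] of the process heap,
    then round down to a whole number of blocks."""
    clamped = min(max(requested_bytes, heap_bytes * 0.10), heap_bytes * 0.60)
    return int(clamped // block_size) * block_size

# e.g. requesting 8 GiB against a 10 GiB heap is capped at 60% = 6 GiB
pool = effective_pool_size(8 * 2**30, 10 * 2**30)
```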
Users should consider the number of columns that can fit fully in memory and the possible speedup achievable via concurrent writing when tuning this size. In the ideal case, the number of writing threads available should be a multiple of the number of output partitions, corresponding to buffer pool size as follows:
buffer pool size = max_over_all_columns(column file data size)
                   * nWritingThreads / nOutputPartitions / 0.85
In Deephaven versions prior to v.1.20180326, or in low heap usage mode, the / nOutputPartitions term should be disregarded, because in both cases at most one destination per column is written concurrently. In throughput mode, merge currently writes all destinations for each column concurrently when possible.
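The two sizing cases can be captured in one sketch. The function and its flag are illustrative, not Deephaven API; the flag distinguishes throughput mode (all destinations per column written concurrently) from low heap usage mode or pre-v.1.20180326 versions:

```python
def buffer_pool_size(max_column_file_bytes, n_writing_threads,
                     n_output_partitions, concurrent_destinations=True):
    """Buffer pool sizing: in throughput mode, divide by the number of
    output partitions; in low heap usage mode (or older versions), at most
    one destination per column is written at a time, so that term is dropped."""
    size = max_column_file_bytes * n_writing_threads / 0.85
    if concurrent_destinations:
        size /= n_output_partitions
    return size
```

With four output partitions, the low-heap / pre-v.1.20180326 figure is four times the throughput-mode figure, since the divisor is dropped.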
The only properties that should be changed from their defaults are:
DataBufferConfiguration.bufferSize=<total number of bytes in the pool>
iris.concurrentWriteThreads=<number of threads to use when ordering or numbering data>