How to iterate a table using chunks

To retrieve data from a Deephaven table most efficiently, you should use the Deephaven chunk API. The chunk API reads several values from a column at once rather than reading individual values one at a time. This reduces allocation and virtual method calls, both of which can hurt performance when processing large data sets.
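The difference between the two access styles can be sketched in plain Java. This is only an illustration and does not use the Deephaven API: LongSource, ArrayLongSource, and this fillChunk signature are hypothetical names invented for the sketch. The point is that the chunked loop makes one interface call per chunk and reuses a single buffer, while the other loop makes one interface call per value.

```java
final class ChunkedReadSketch {
    /** Hypothetical column abstraction for illustration; not a Deephaven interface. */
    interface LongSource {
        long size();
        long get(long key);                         // one interface call per value
        int fillChunk(long firstKey, long[] dest);  // one interface call per chunk
    }

    static final class ArrayLongSource implements LongSource {
        private final long[] data;

        ArrayLongSource(long[] data) {
            this.data = data;
        }

        public long size() {
            return data.length;
        }

        public long get(long key) {
            return data[(int) key];
        }

        // Copies up to dest.length values starting at firstKey; returns how many were copied.
        public int fillChunk(long firstKey, long[] dest) {
            final int n = (int) Math.min(dest.length, data.length - firstKey);
            System.arraycopy(data, (int) firstKey, dest, 0, n);
            return n;
        }
    }

    static long sumOneAtATime(LongSource source) {
        long total = 0;
        for (long key = 0; key < source.size(); ++key) {
            total += source.get(key);
        }
        return total;
    }

    static long sumChunked(LongSource source, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        final long[] buffer = new long[chunkSize]; // allocated once, reused for every chunk
        long total = 0;
        long first = 0;
        while (first < source.size()) {
            final int n = source.fillChunk(first, buffer);
            for (int ii = 0; ii < n; ++ii) {
                total += buffer[ii]; // plain array access inside the hot loop
            }
            first += n;
        }
        return total;
    }

    public static void main(String[] args) {
        final LongSource source = new ArrayLongSource(new long[] {1, 2, 3, 4, 5, 6, 7});
        System.out.println(sumOneAtATime(source));
        System.out.println(sumChunked(source, 3));
    }
}
```

Both methods compute the same sum; only the number of calls that cross the interface boundary differs.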

To retrieve data using chunks, you must first create a Context. The ColumnSources provide two types of contexts:

  • GetContext - A GetContext includes an internal chunk that is either allocated or points directly to the backing store. If you need read-only access to the data, then GetChunk is preferred.

  • FillContext - A FillContext is used with FillChunk, which copies data into a WritableChunk that you allocate within user code. If you need to mutate your data, then you should prefer FillChunk.

Each context is created with a size, which is the largest chunk that can be read in a single call.
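The trade-off between the two styles can be illustrated without the Deephaven classes. In the sketch below, Column and its getChunk and fillChunk methods are hypothetical stand-ins: the "get" style hands back a read-only view that points directly at the backing store, while the "fill" style copies into an array owned by the caller, which the caller is then free to modify.

```java
import java.nio.LongBuffer;
import java.nio.ReadOnlyBufferException;

final class GetVersusFillSketch {
    /** Hypothetical column for illustration; not a Deephaven ColumnSource. */
    static final class Column {
        private final long[] store;

        Column(long[] store) {
            this.store = store;
        }

        // "Get" style: a read-only view over the backing store, with no copy made.
        LongBuffer getChunk(int start, int length) {
            return LongBuffer.wrap(store, start, length).slice().asReadOnlyBuffer();
        }

        // "Fill" style: copy into a caller-owned array whose length caps the chunk size.
        int fillChunk(int start, long[] dest) {
            final int n = Math.min(dest.length, store.length - start);
            System.arraycopy(store, start, dest, 0, n);
            return n;
        }
    }

    public static void main(String[] args) {
        final Column column = new Column(new long[] {10, 20, 30, 40});

        final LongBuffer view = column.getChunk(1, 2);
        System.out.println(view.get(0) + ", " + view.get(1));
        try {
            view.put(0, -1L); // the view cannot be written through
        } catch (ReadOnlyBufferException e) {
            System.out.println("view is read-only");
        }

        final long[] mine = new long[3];
        final int n = column.fillChunk(2, mine);
        mine[0] = -1L; // mutating the copy leaves the column untouched
        System.out.println(n + " values copied; column still holds " + column.getChunk(2, 1).get(0));
    }
}
```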

We must ensure that we release (close) contexts after use; otherwise, we might leak file handles (in the case of disk-backed column sources). Most Deephaven code that uses these objects follows the try-with-resources pattern shown in the example below, which ensures that the objects are properly released even under exceptional circumstances.
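As a reminder of what try-with-resources guarantees, here is a small standalone sketch. Ctx is a hypothetical AutoCloseable that only records when it is opened and closed; it stands in for a context object and is not a Deephaven class. Resources are closed in the reverse order of their declaration, and they are closed even when the body throws.

```java
import java.util.ArrayList;
import java.util.List;

final class ReleaseSketch {
    static final List<String> LOG = new ArrayList<>();

    /** Hypothetical resource that records its lifecycle; stands in for a context. */
    static final class Ctx implements AutoCloseable {
        private final String name;

        Ctx(String name) {
            this.name = name;
            LOG.add("open " + name);
        }

        @Override
        public void close() {
            LOG.add("close " + name);
        }
    }

    static void readWithFailure() {
        try (Ctx shared = new Ctx("shared");
             Ctx column = new Ctx("column")) {
            // Simulate a failure in the middle of reading.
            throw new IllegalStateException("simulated read failure");
        }
    }

    public static void main(String[] args) {
        try {
            readWithFailure();
        } catch (IllegalStateException e) {
            LOG.add("caught");
        }
        // Prints: [open shared, open column, close column, close shared, caught]
        System.out.println(LOG);
    }
}
```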

The example below first creates a SharedContext, which is then passed to each of the ColumnSource makeGetContext calls. The SharedContext allows columns that have identical backing information to reuse data between the getChunk calls. For example, a sorted column can reuse redirection lookups for the same ordered keys.
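The idea of reusing a redirection lookup can be sketched in plain Java. Everything here is hypothetical and much simpler than the real SharedContext: Shared caches the positions computed for the current chunk of keys, and two RedirectedColumn instances that share the same redirection reuse that result. The sketch also shows why the cache has to be reset before the next chunk, since otherwise the positions computed for the previous keys would be reused.

```java
final class SharedLookupSketch {
    /** Hypothetical shared cache; a stand-in for the idea, not Deephaven's SharedContext. */
    static final class Shared {
        int[] cachedPositions; // result of the redirection lookup for the current chunk
        int lookups;           // how many times the lookup actually ran

        void reset() {
            cachedPositions = null;
        }
    }

    /** Hypothetical column whose values are reached through a redirection, such as a sort. */
    static final class RedirectedColumn {
        private final int[] redirection;
        private final long[] store;

        RedirectedColumn(int[] redirection, long[] store) {
            this.redirection = redirection;
            this.store = store;
        }

        long[] read(int[] keys, Shared shared) {
            if (shared.cachedPositions == null) {
                // First column to read this chunk pays for the lookup.
                final int[] positions = new int[keys.length];
                for (int ii = 0; ii < keys.length; ++ii) {
                    positions[ii] = redirection[keys[ii]];
                }
                shared.cachedPositions = positions;
                shared.lookups++;
            }
            final long[] out = new long[keys.length];
            for (int ii = 0; ii < keys.length; ++ii) {
                out[ii] = store[shared.cachedPositions[ii]];
            }
            return out;
        }
    }

    public static void main(String[] args) {
        final int[] redirection = {2, 0, 1};
        final RedirectedColumn a = new RedirectedColumn(redirection, new long[] {10, 20, 30});
        final RedirectedColumn b = new RedirectedColumn(redirection, new long[] {100, 200, 300});
        final Shared shared = new Shared();

        final int[] firstChunk = {0, 1};
        a.read(firstChunk, shared);
        b.read(firstChunk, shared); // reuses the positions computed for column a
        System.out.println("lookups after first chunk: " + shared.lookups);

        shared.reset(); // required before reading a different set of keys
        a.read(new int[] {2}, shared);
        System.out.println("lookups after second chunk: " + shared.lookups);
    }
}
```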

The Index provides an ordered list of longs that are valid locations in the table. We use the OrderedKeys.Iterator to go through the Index in chunks, creating a new OrderedKeys for each chunk that we would like to read from the table. This smaller slice of OrderedKeys serves as the argument to the GetChunk call.
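The slicing step can be pictured with a plain array of keys. The sketch below is not the Deephaven iterator; slices is a hypothetical helper that cuts an ordered array of keys into consecutive pieces of at most chunkSize elements. Note that the keys themselves need not be contiguous, and the last slice may be shorter than the others.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KeySliceSketch {
    // Hypothetical helper: splits ordered keys into slices of at most chunkSize elements.
    static List<long[]> slices(long[] orderedKeys, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        final List<long[]> result = new ArrayList<>();
        for (int start = 0; start < orderedKeys.length; start += chunkSize) {
            final int end = Math.min(start + chunkSize, orderedKeys.length);
            result.add(Arrays.copyOfRange(orderedKeys, start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        final long[] keys = {0, 1, 2, 10, 11, 40, 41};
        for (long[] slice : slices(keys, 3)) {
            System.out.println(Arrays.toString(slice));
        }
    }
}
```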

In the following example, we read data from two column sources and accumulate values in an ArrayList and a MutableLong. In practical usage, you should avoid the use of ArrayList and ImmutablePair for primitive types, because boxing the primitives creates additional garbage.

final ArrayList<ImmutablePair<String, Long>> values = new ArrayList<>();
final MutableLong total = new MutableLong();

// if our table is smaller than our chunk size, there is no need to allocate a bigger context
final int chunkSize = (int)Math.min(CHUNK_SIZE, index.size());

try (final SharedContext sharedContext = SharedContext.makeSharedContext();
     final ChunkSource.GetContext d2context = dim2ColumnSource.makeGetContext(chunkSize, sharedContext);
     final ChunkSource.GetContext rvcontext = rvColumnSource.makeGetContext(chunkSize, sharedContext);
     final OrderedKeys.Iterator okit = index.getOrderedKeysIterator()) {

    // The hasMore call is equivalent to "hasNext" on a single-item iterator.
    while (okit.hasMore()) {
        // Before proceeding with the get operations, we must reset the SharedContext; otherwise we may
        // incorrectly use shared context values from the prior iteration.
        sharedContext.reset();

        // We retrieve a child OrderedKeys from the iterator, that will have at most chunkSize elements
        final OrderedKeys chunkOk = okit.getNextOrderedKeysWithLength(chunkSize);

        // We get a chunk containing the values of our "D2" String column.
        final ObjectChunk<String, ? extends Values> d2chunk = dim2ColumnSource.getChunk(d2context, chunkOk).asObjectChunk();

        // And do the same for the random values chunk.
        final LongChunk<? extends Values> rvchunk = rvColumnSource.getChunk(rvcontext, chunkOk).asLongChunk();

        // We now iterate the chunks in parallel.  The chunk element access is non-virtual.  You can only
        // call the get method on a typed Chunk.  If you need to handle many data types in a complex way,
        // then you must write a kernel for each data type.
        //
        // The com.illumon.iris.db.v2.utils.ChunkBoxer converts data to an object, but further operations
        // come at a significant performance penalty compared to using native Java data types.
        for (int ii = 0; ii < d2chunk.size(); ++ii) {
            values.add(new ImmutablePair<>(d2chunk.get(ii), rvchunk.get(ii)));
            total.add(rvchunk.get(ii));
        }
    }
}
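
For the long values, one way to follow the advice about avoiding boxed collections is to accumulate into a growable primitive array. The sketch below is a minimal illustration in plain Java; LongList is a hypothetical helper written for this example, not a Deephaven or JDK class, and the chunk is represented by a plain long[] rather than a LongChunk.

```java
import java.util.Arrays;

final class PrimitiveAccumulatorSketch {
    /** Hypothetical growable list of primitive longs; no Long objects are created. */
    static final class LongList {
        private long[] data = new long[16];
        private int size;

        // Appends the first `length` values of `chunk` in one bulk copy.
        void addAll(long[] chunk, int length) {
            if (size + length > data.length) {
                data = Arrays.copyOf(data, Math.max(data.length * 2, size + length));
            }
            System.arraycopy(chunk, 0, data, size, length);
            size += length;
        }

        long get(int index) {
            if (index < 0 || index >= size) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            return data[index];
        }

        int size() {
            return size;
        }
    }

    public static void main(String[] args) {
        final LongList values = new LongList();
        long total = 0; // a primitive accumulator instead of a boxed Long
        final long[] chunk = {5, 7, 11};
        values.addAll(chunk, chunk.length);
        for (int ii = 0; ii < chunk.length; ++ii) {
            total += chunk[ii];
        }
        System.out.println(values.size() + " values, total " + total);
    }
}
```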