How to iterate a table using chunks

To retrieve data from a Deephaven table most efficiently, use the Deephaven chunk API. The chunk API reads several values from a column all at once rather than reading individual values one at a time. This reduces allocation and virtual method calls, both of which can hurt performance when processing large data sets.
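The difference between reading one value at a time and reading in bulk can be sketched with plain Java. This is only an illustration: ToyLongColumn, ArrayBackedColumn, and ChunkedReadSketch are hypothetical names and are not part of the Deephaven API.

```java
/** Hypothetical stand-in for a column of longs; not a Deephaven interface. */
interface ToyLongColumn {
    /** Reads one value: one interface call per element. */
    long get(int position);

    /** Reads length values at once: one interface call per chunk. */
    void fill(int firstPosition, long[] destination, int length);
}

final class ArrayBackedColumn implements ToyLongColumn {
    private final long[] data;

    ArrayBackedColumn(long[] data) {
        this.data = data;
    }

    @Override
    public long get(int position) {
        return data[position];
    }

    @Override
    public void fill(int firstPosition, long[] destination, int length) {
        System.arraycopy(data, firstPosition, destination, 0, length);
    }
}

final class ChunkedReadSketch {
    /** Element-at-a-time read: size interface calls. */
    static long sumOneByOne(ToyLongColumn column, int size) {
        long total = 0;
        for (int ii = 0; ii < size; ++ii) {
            total += column.get(ii);
        }
        return total;
    }

    /** Chunked read: roughly size / chunkSize interface calls and one reusable buffer. */
    static long sumChunked(ToyLongColumn column, int size, int chunkSize) {
        final long[] buffer = new long[chunkSize];
        long total = 0;
        for (int start = 0; start < size; start += chunkSize) {
            final int length = Math.min(chunkSize, size - start);
            column.fill(start, buffer, length);
            // The inner loop touches a plain array, so no interface dispatch happens here.
            for (int ii = 0; ii < length; ++ii) {
                total += buffer[ii];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        final ToyLongColumn column = new ArrayBackedColumn(new long[] {1, 2, 3, 4, 5});
        System.out.println(sumOneByOne(column, 5));
        System.out.println(sumChunked(column, 5, 2));
    }
}
```

Both methods compute the same total; the chunked version simply makes fewer calls through the interface and reuses a single buffer.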

To retrieve data using chunks, you must first create a Context. The ColumnSources provide two types of contexts:

  • GetContext - A GetContext includes an internal chunk that is either allocated or points directly to the backing store. If you need read-only access to the data, then getChunk is preferred.

  • FillContext - If you need to mutate your data, then you should prefer fillChunk and allocate a WritableChunk within user code.

Each context is created with a size, which is the largest chunk that can be read in a single call.

To create a GetContext or a FillContext, call the ColumnSource's makeGetContext or makeFillContext method, respectively. The context creation methods take a size; you cannot call getChunk or fillChunk with an OrderedKeys that contains more elements than the size of the context you created. These contexts may only be used with the ColumnSource that created them.
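As a rough illustration of the distinction between the two access styles, the sketch below models a read-only view over the backing store, a copy into a caller-owned buffer, and the context size limit. ToyChunkSource and its nested Context class are hypothetical and only mimic the behavior described above. Throwing IllegalArgumentException on an oversized request is an assumption of this sketch, not documented Deephaven behavior.

```java
import java.nio.LongBuffer;

/** Hypothetical stand-in for a column source; not a Deephaven class. */
final class ToyChunkSource {
    /** Hypothetical context that only remembers the largest request it supports. */
    static final class Context {
        final int capacity;

        Context(int capacity) {
            this.capacity = capacity;
        }
    }

    private final long[] backing;

    ToyChunkSource(long[] backing) {
        this.backing = backing;
    }

    Context makeContext(int capacity) {
        return new Context(capacity);
    }

    /** "get" style: returns a read-only view that points directly at the backing array. */
    LongBuffer getChunk(Context context, int start, int length) {
        checkSize(context, length);
        return LongBuffer.wrap(backing, start, length).slice().asReadOnlyBuffer();
    }

    /** "fill" style: copies into a buffer owned by the caller, who may then mutate it. */
    void fillChunk(Context context, long[] destination, int start, int length) {
        checkSize(context, length);
        System.arraycopy(backing, start, destination, 0, length);
    }

    // Assumption of this sketch: an oversized request is rejected with an exception.
    private static void checkSize(Context context, int length) {
        if (length > context.capacity) {
            throw new IllegalArgumentException(
                    "requested " + length + " values, but the context size is " + context.capacity);
        }
    }

    public static void main(String[] args) {
        final ToyChunkSource source = new ToyChunkSource(new long[] {10, 20, 30, 40});
        final Context context = source.makeContext(2);

        final LongBuffer view = source.getChunk(context, 1, 2);
        System.out.println(view.get(0) + ", " + view.get(1));

        final long[] mine = new long[2];
        source.fillChunk(context, mine, 2, 2);
        mine[0] = -1; // safe: only the caller's copy changes
        System.out.println(source.getChunk(context, 2, 1).get(0));
    }
}
```

The view returned by getChunk cannot be written to, while the array passed to fillChunk belongs to the caller and can be changed freely without affecting the source.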

We must release (close) contexts after use; otherwise, we might leak resources such as file handles (in the case of disk-backed column sources). Most Deephaven code that uses these objects relies on the try-with-resources pattern to ensure that they are properly released even under exceptional circumstances.
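The release guarantee comes from the Java language rather than from Deephaven, so it can be shown with a minimal sketch. ToyContext and ReleaseSketch are hypothetical names; the context here only records when it is opened and closed.

```java
import java.util.ArrayList;
import java.util.List;

final class ReleaseSketch {
    /** Hypothetical context that logs its lifecycle; not a Deephaven class. */
    static final class ToyContext implements AutoCloseable {
        private final String name;
        private final List<String> log;

        ToyContext(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Opens two contexts, fails while "reading", and returns the recorded events. */
    static List<String> readWithFailure() {
        final List<String> log = new ArrayList<>();
        try (ToyContext first = new ToyContext("first", log);
             ToyContext second = new ToyContext("second", log)) {
            throw new IllegalStateException("simulated read failure");
        } catch (IllegalStateException e) {
            // Both contexts have already been closed by the time we get here.
            log.add("caught");
        }
        return log;
    }

    public static void main(String[] args) {
        // prints [open first, open second, close second, close first, caught]
        System.out.println(readWithFailure());
    }
}
```

Resources are closed in the reverse of the order in which they were declared, and they are closed before any catch block runs.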

Sometimes there is information that can be shared while reading data that doesn't belong to a specific column source. For example, a sorted column can reuse redirection lookups for the same ordered keys, or an ungrouped column can reuse the sizes of the underlying arrays. To enable reusing these results, we use a SharedContext structure. We create a SharedContext, which is then passed to each ColumnSource's makeGetContext or makeFillContext call. When transitioning between chunks, you must call the reset() method on the SharedContext so that cached values are discarded and the query engine does not incorrectly use them on the next chunk. Internally, the SharedContext contains a map so that columns with the same structure share values, but those that have different structures do not.
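The caching idea can be sketched as follows. ToySharedContext and ToyRedirectedColumn are hypothetical classes that only imitate the behavior described above: two columns that share one redirection array compute the redirected positions once per chunk, and reset() discards them before the next chunk. Keying the map by the redirection array's identity is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical shared cache; not the Deephaven SharedContext. */
final class ToySharedContext implements AutoCloseable {
    private final Map<Object, long[]> cache = new HashMap<>();
    int computations = 0; // counts cache misses, for demonstration only

    long[] getOrCompute(Object structureKey, Supplier<long[]> compute) {
        long[] cached = cache.get(structureKey);
        if (cached == null) {
            cached = compute.get();
            cache.put(structureKey, cached);
            computations++;
        }
        return cached;
    }

    /** Must be called between chunks, otherwise the previous chunk's lookups are reused. */
    void reset() {
        cache.clear();
    }

    @Override
    public void close() {
        cache.clear();
    }
}

/** Hypothetical column whose rows are reached through a redirection array. */
final class ToyRedirectedColumn {
    private final long[] redirection; // may be shared by several columns
    private final long[] data;

    ToyRedirectedColumn(long[] redirection, long[] data) {
        this.redirection = redirection;
        this.data = data;
    }

    long[] read(ToySharedContext shared, long[] keys) {
        // Columns with the same redirection array share this lookup result.
        final long[] inner = shared.getOrCompute(redirection, () -> {
            final long[] result = new long[keys.length];
            for (int ii = 0; ii < keys.length; ++ii) {
                result[ii] = redirection[(int) keys[ii]];
            }
            return result;
        });
        final long[] out = new long[inner.length];
        for (int ii = 0; ii < inner.length; ++ii) {
            out[ii] = data[(int) inner[ii]];
        }
        return out;
    }
}
```

If reset() were skipped, the second chunk would be read with the redirected positions computed for the first chunk, which is exactly the incorrect reuse the text warns about.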

The Index provides an ordered list of longs that are valid locations in the table. We use an OrderedKeys.Iterator to go through the Index in chunks, creating a new OrderedKeys for each chunk that we would like to read from the table. Each smaller OrderedKeys slice serves as the argument to the getChunk or fillChunk call.
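The slicing behavior can be pictured with a small stand-in. ToyKeyIterator is a hypothetical class that walks a sorted array of long keys and hands out slices of at most a given length. It copies each slice for simplicity, which is an assumption of this sketch and says nothing about how the real iterator is implemented.

```java
import java.util.Arrays;

/** Hypothetical iterator over an ordered list of keys; not the Deephaven OrderedKeys.Iterator. */
final class ToyKeyIterator {
    private final long[] keys;
    private int position = 0;

    ToyKeyIterator(long[] keys) {
        this.keys = keys;
    }

    /** True while at least one key has not yet been handed out. */
    boolean hasMore() {
        return position < keys.length;
    }

    /** Returns the next slice, containing at most maxLength keys. */
    long[] nextWithLength(int maxLength) {
        final int end = Math.min(keys.length, position + maxLength);
        final long[] slice = Arrays.copyOfRange(keys, position, end);
        position = end;
        return slice;
    }

    public static void main(String[] args) {
        final ToyKeyIterator iterator = new ToyKeyIterator(new long[] {1, 4, 5, 9, 12});
        while (iterator.hasMore()) {
            // prints [1, 4] then [5, 9] then [12]
            System.out.println(Arrays.toString(iterator.nextWithLength(2)));
        }
    }
}
```

The final slice may be shorter than the requested length, so code that consumes a chunk should loop over the chunk's actual size rather than the chunk size it asked for.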

The SharedContext, GetContext, FillContext, and OrderedKeys.Iterator may all allocate internal resources. When you have finished using them, you must close() them. Deephaven recommends using the try-with-resources pattern for these objects.

In the following example we read data from two column sources and accumulate values in an ArrayList and a MutableLong. In practical usage, you should avoid the use of ArrayList and ImmutablePair for primitive types as boxing the primitives creates additional garbage.

final ArrayList<ImmutablePair<String, Long>> values = new ArrayList<>();
final MutableLong total = new MutableLong();

// if our table is smaller than our chunk size, there is no need to allocate a bigger context
final int chunkSize = (int)Math.min(CHUNK_SIZE, index.size());

try (final SharedContext sharedContext = SharedContext.makeSharedContext();
     final ChunkSource.GetContext d2context = dim2ColumnSource.makeGetContext(chunkSize, sharedContext);
     final ChunkSource.GetContext rvcontext = rvColumnSource.makeGetContext(chunkSize, sharedContext);
     final OrderedKeys.Iterator okit = index.getOrderedKeysIterator()) {

    // The hasMore call is equivalent to "hasNext" on a single-item iterator.
    while (okit.hasMore()) {
        // Before proceeding with the get operations, we must reset the SharedContext; otherwise we may
        // incorrectly use shared context values from the prior iteration.
        sharedContext.reset();

        // We retrieve a child OrderedKeys from the iterator, that will have at most chunkSize elements
        final OrderedKeys chunkOk = okit.getNextOrderedKeysWithLength(chunkSize);

        // We get a chunk containing the values of our "D2" String column.
        final ObjectChunk<String, ? extends Values> d2chunk = dim2ColumnSource.getChunk(d2context, chunkOk).asObjectChunk();

        // And do the same for the random values chunk.
        final LongChunk<? extends Values> rvchunk = rvColumnSource.getChunk(rvcontext, chunkOk).asLongChunk();

        // We now iterate the chunks in parallel.  The chunk element access is non-virtual.  You can only
        // call the get method on a typed Chunk.  If you need to handle many data types in a complex way,
        // then you must write a kernel for each data type.
        //
        // The com.illumon.iris.db.v2.utils.ChunkBoxer converts data to an object, but further operations
        // come at a significant performance penalty compared to using native Java data types.
        for (int ii = 0; ii < d2chunk.size(); ++ii) {
            values.add(new ImmutablePair<>(d2chunk.get(ii), rvchunk.get(ii)));
            total.add(rvchunk.get(ii));
        }
    }
}
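One way to follow the advice about avoiding boxed values is to accumulate into parallel arrays instead of an ArrayList of pairs. The sketch below is a minimal, hypothetical PrimitivePairBuffer written against the Java standard library only; it is not a Deephaven class, and the initial capacity of 16 is an arbitrary choice.

```java
import java.util.Arrays;

/** Hypothetical growable buffer that keeps long values unboxed. */
final class PrimitivePairBuffer {
    private String[] names = new String[16];
    private long[] amounts = new long[16];
    private int size = 0;

    void add(String name, long amount) {
        if (size == names.length) {
            // Double both arrays together so they always stay the same length.
            names = Arrays.copyOf(names, size * 2);
            amounts = Arrays.copyOf(amounts, size * 2);
        }
        names[size] = name;
        amounts[size] = amount;
        size++;
    }

    int size() {
        return size;
    }

    String nameAt(int position) {
        return names[position];
    }

    long amountAt(int position) {
        return amounts[position];
    }

    long total() {
        long total = 0;
        for (int ii = 0; ii < size; ++ii) {
            total += amounts[ii];
        }
        return total;
    }

    public static void main(String[] args) {
        final PrimitivePairBuffer buffer = new PrimitivePairBuffer();
        buffer.add("a", 1L);
        buffer.add("b", 2L);
        System.out.println(buffer.size() + " entries, total " + buffer.total());
    }
}
```

In the loop above, values.add(new ImmutablePair<>(...)) would become a call such as buffer.add(d2chunk.get(ii), rvchunk.get(ii)), which allocates no Long or pair object per row.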

In the next example, we use a WritableChunk passed into the fillChunk method instead of a getChunk call. The semantics are identical to the prior example, but the chunk can be modified with the set method, or by passing it to a function (e.g., sort) that expects a WritableChunk.

final int chunkSize = (int)Math.min(CHUNK_SIZE, index.size());

try (final SharedContext sharedContext = SharedContext.makeSharedContext();
     final ChunkSource.FillContext d2context = dim2ColumnSource.makeFillContext(chunkSize, sharedContext);
     final ChunkSource.FillContext rvcontext = rvColumnSource.makeFillContext(chunkSize, sharedContext);
     final OrderedKeys.Iterator okit = index.getOrderedKeysIterator();
     final WritableObjectChunk<String, Values> writableObjectChunk = WritableObjectChunk.makeWritableChunk(chunkSize);
     final WritableLongChunk<Values> writableLongChunk = WritableLongChunk.makeWritableChunk(chunkSize)) {

    // The hasMore call is equivalent to "hasNext" on a single-item iterator.
    while (okit.hasMore()) {
        // Before proceeding with the fill operations, we must reset the SharedContext; otherwise we may
        // incorrectly use shared context values from the prior iteration.
        sharedContext.reset();

        // We retrieve a child OrderedKeys from the iterator, that will have at most chunkSize elements
        final OrderedKeys chunkOk = okit.getNextOrderedKeysWithLength(chunkSize);

        // We fill our writable chunk with the values of our "D2" String column.
        dim2ColumnSource.fillChunk(d2context, writableObjectChunk, chunkOk);

        // And do the same for the random values chunk.
        rvColumnSource.fillChunk(rvcontext, writableLongChunk, chunkOk);

        // We now iterate the chunks in parallel.  The chunk element access is non-virtual.  You can only
        // call the get method on a typed Chunk.  If you need to handle many data types in a complex way,
        // then you must write a kernel for each data type.
        //
        // The com.illumon.iris.db.v2.utils.ChunkBoxer converts data to an object, but further operations
        // come at a significant performance penalty compared to using native Java data types.
        for (int ii = 0; ii < writableObjectChunk.size(); ++ii) {
            values.add(new ImmutablePair<>(writableObjectChunk.get(ii), writableLongChunk.get(ii)));
            total.add(writableLongChunk.get(ii));
        }
    }
}
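To illustrate why a caller-owned writable buffer is useful, the sketch below fills one reusable long[] per chunk and sorts it in place to find each chunk's median. FillAndSortSketch and its methods are hypothetical and use plain arrays rather than Deephaven chunks. Taking the element at index length / 2 as the median (the upper median for even lengths) is a choice made for this sketch.

```java
import java.util.Arrays;

final class FillAndSortSketch {
    /** Copies up to buffer.length values starting at start and returns how many were copied. */
    static int fill(long[] source, int start, long[] buffer) {
        final int length = Math.min(buffer.length, source.length - start);
        System.arraycopy(source, start, buffer, 0, length);
        return length;
    }

    /** Returns the median of each chunk, sorting only the reusable buffer. */
    static long[] chunkMedians(long[] source, int chunkSize) {
        final int chunkCount = (source.length + chunkSize - 1) / chunkSize;
        final long[] medians = new long[chunkCount];
        final long[] buffer = new long[chunkSize]; // allocated once, reused for every chunk
        int chunkIndex = 0;
        for (int start = 0; start < source.length; start += chunkSize) {
            final int length = fill(source, start, buffer);
            // Sort only the filled prefix; the last chunk may be shorter than the buffer.
            Arrays.sort(buffer, 0, length);
            medians[chunkIndex++] = buffer[length / 2];
        }
        return medians;
    }

    public static void main(String[] args) {
        final long[] source = {5, 1, 4, 9, 2, 7, 8};
        // prints [4, 7, 8]
        System.out.println(Arrays.toString(chunkMedians(source, 3)));
    }
}
```

Because the sort happens in the caller's buffer, the source data keeps its original order.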