How to iterate a table using chunks
To retrieve data from a Deephaven table as efficiently as possible, use the Deephaven chunk API. The chunk API reads several values from a column at once rather than reading individual values one at a time. This reduces allocation and virtual method calls, both of which can hurt performance when processing large data sets.
To retrieve data using chunks, you must first create a Context. ColumnSources provide two types of contexts:
- GetContext: A GetContext includes an internal chunk that is either allocated or points directly to the backing store. If you need read-only access to the data, then getChunk is preferred.
- FillContext: If you need to mutate your data, then you should prefer fillChunk and allocate a WritableChunk within user code.
Each context is created with a size, which is the largest chunk that can be read in a single call.
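The difference between the two access styles can be pictured without any Deephaven classes. The sketch below is a plain-Java model, not the Deephaven API: GetVersusFillSketch, getSlice, and fillSlice are hypothetical names, and the model only covers the case where a get-style read points directly at the backing store.

```java
import java.nio.LongBuffer;

class GetVersusFillSketch {
    // Hypothetical stand-in for a column's backing store.
    static final long[] BACKING = {10, 20, 30, 40, 50};

    // "get" style: return a read-only view that shares memory with the store (no copy).
    static LongBuffer getSlice(int offset, int length) {
        return LongBuffer.wrap(BACKING, offset, length).slice().asReadOnlyBuffer();
    }

    // "fill" style: copy values into a buffer that the caller owns and may mutate.
    static int fillSlice(int offset, int length, long[] destination) {
        System.arraycopy(BACKING, offset, destination, 0, length);
        return length;
    }

    public static void main(String[] args) {
        LongBuffer view = getSlice(1, 3);
        System.out.println(view.get(0)); // prints 20

        long[] mine = new long[3];
        fillSlice(1, 3, mine);
        mine[0] = -1; // safe: only the caller's copy changes
        System.out.println(BACKING[1]); // prints 20
    }
}
```

The read-only view avoids a copy but cannot be modified, while the filled array costs a copy and can then be sorted or overwritten freely.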
To create a GetContext or a FillContext, call the ColumnSource's makeGetContext or makeFillContext method, respectively. The context creation methods take a size; you cannot call getChunk or fillChunk with an OrderedKeys that contains more elements than the size of the context you created. These contexts may only be used with the ColumnSource that created them.
We must release (close) contexts after use; otherwise we might leak resources such as file handles (in the case of disk-backed column sources). Most Deephaven code that uses these objects creates them in a try-with-resources statement, so that they are properly released even under exceptional circumstances.
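The cleanup pattern can be sketched with stand-in classes. ToyContext below is a hypothetical placeholder for any closeable context, not a Deephaven type; the point is only that every resource declared in the try header is closed, in reverse order of creation, even when the body throws.

```java
import java.util.ArrayList;
import java.util.List;

class ContextCleanupSketch {
    // Records the order in which the stand-in contexts are closed.
    static final List<String> CLOSED = new ArrayList<>();

    // Hypothetical stand-in for a context that holds resources until it is closed.
    static final class ToyContext implements AutoCloseable {
        private final String name;

        ToyContext(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            CLOSED.add(name);
        }
    }

    // Opens two contexts, then optionally fails in the middle of the "read".
    static void readWithContexts(boolean failMidway) {
        try (ToyContext first = new ToyContext("first");
             ToyContext second = new ToyContext("second")) {
            if (failMidway) {
                throw new IllegalStateException("simulated read failure");
            }
        }
    }

    public static void main(String[] args) {
        try {
            readWithContexts(true);
        } catch (IllegalStateException expected) {
            // The exception still propagates, but both contexts were closed first.
        }
        System.out.println(CLOSED); // prints [second, first]
    }
}
```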
Sometimes there is information that can be shared while reading data that doesn't belong to a specific column source. For example, a sorted column can reuse redirection lookups for the same ordered keys, or an ungrouped column can reuse the sizes of the underlying arrays. To enable reusing these results, we use a SharedContext structure. We create a SharedContext, which is then passed to each ColumnSource's makeGetContext or makeFillContext call. When transitioning between chunks, you must call the reset() method on the SharedContext so that cached values are discarded and the query engine does not incorrectly use them on the next chunk. Internally, the SharedContext contains a map so that columns with the same structure share values, but those that have different structures do not.
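The caching behavior described above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not Deephaven's implementation: ToySharedContext, getOrCompute, and the "sortedBy:Sym" structure key are all hypothetical. It shows why two columns with the same structure pay for a lookup only once per chunk, and why the cache must be cleared before the next chunk.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class SharedCacheSketch {
    // Hypothetical stand-in for a shared context: results are cached per structure key.
    static final class ToySharedContext implements AutoCloseable {
        private final Map<String, long[]> cache = new HashMap<>();
        int computations = 0;

        long[] getOrCompute(String structureKey, Supplier<long[]> compute) {
            return cache.computeIfAbsent(structureKey, key -> {
                computations++; // counts how often the expensive lookup really runs
                return compute.get();
            });
        }

        // Discards results that were only valid for the previous chunk.
        void reset() {
            cache.clear();
        }

        @Override
        public void close() {
            cache.clear();
        }
    }

    // Simulates two columns with the same structure reading the given number of chunks.
    static int countComputations(int chunks) {
        try (ToySharedContext shared = new ToySharedContext()) {
            for (int chunk = 0; chunk < chunks; chunk++) {
                shared.reset();
                final long base = chunk * 10L;
                // Both columns request the same lookup; only the first request computes it.
                shared.getOrCompute("sortedBy:Sym", () -> new long[] {base, base + 1});
                shared.getOrCompute("sortedBy:Sym", () -> new long[] {base, base + 1});
            }
            return shared.computations;
        }
    }

    public static void main(String[] args) {
        System.out.println(countComputations(3)); // prints 3: one lookup per chunk, not six
    }
}
```

Without the reset() call, the second and third chunks would silently reuse the lookup computed for the first chunk, which is the stale-value bug the text warns about.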
The Index provides an ordered list of longs that are valid locations in the table. We use an OrderedKeys.Iterator to go through the Index in chunks, creating a new OrderedKeys for each chunk that we would like to read from the table. This smaller slice of OrderedKeys serves as the argument to the getChunk or fillChunk calls.
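The slicing step can be pictured with plain arrays. The sketch below is an illustration only: it uses a long[] as a stand-in for the Index and copies each slice, whereas the real iterator works on Deephaven's own key structures. IndexSlicingSketch and slices are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IndexSlicingSketch {
    // Splits sorted row keys into consecutive slices of at most chunkSize keys.
    static List<long[]> slices(long[] rowKeys, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<long[]> result = new ArrayList<>();
        int start = 0;
        while (start < rowKeys.length) {
            // The last slice may be shorter than chunkSize.
            int length = Math.min(chunkSize, rowKeys.length - start);
            result.add(Arrays.copyOfRange(rowKeys, start, start + length));
            start += length;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] rowKeys = {0, 1, 2, 10, 11, 40, 41};
        for (long[] slice : slices(rowKeys, 3)) {
            System.out.println(Arrays.toString(slice));
        }
        // prints [0, 1, 2], then [10, 11, 40], then [41]
    }
}
```

Note that the row keys need not be contiguous; each slice is bounded by the number of keys, not by the range of key values.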
The SharedContext, GetContext, FillContext, and OrderedKeys.Iterator may all allocate internal resources. When you have finished using them, you must call close() on them. Deephaven recommends using the try-with-resources pattern for these objects.
In the following example we read data from two column sources and accumulate values in an ArrayList and a MutableLong. In practical usage, you should avoid the use of ArrayList and ImmutablePair for primitive types as boxing the primitives creates additional garbage.
final ArrayList<ImmutablePair<String, Long>> values = new ArrayList<>();
final MutableLong total = new MutableLong();
// If our table is smaller than our chunk size, there is no need to allocate a bigger context.
final int chunkSize = (int) Math.min(CHUNK_SIZE, index.size());
try (final SharedContext sharedContext = SharedContext.makeSharedContext();
     final ChunkSource.GetContext d2context = dim2ColumnSource.makeGetContext(chunkSize, sharedContext);
     final ChunkSource.GetContext rvcontext = rvColumnSource.makeGetContext(chunkSize, sharedContext);
     final OrderedKeys.Iterator okit = index.getOrderedKeysIterator()) {
    // The hasMore call is equivalent to "hasNext" on a single-item iterator.
    while (okit.hasMore()) {
        // Before proceeding with the get operations, we must reset the SharedContext; otherwise we may
        // incorrectly use shared context values from the prior iteration.
        sharedContext.reset();
        // We retrieve a child OrderedKeys from the iterator that will have at most chunkSize elements.
        final OrderedKeys chunkOk = okit.getNextOrderedKeysWithLength(chunkSize);
        // We get a chunk with the values of our "D2" string column.
        final ObjectChunk<String, ? extends Values> d2chunk = dim2ColumnSource.getChunk(d2context, chunkOk).asObjectChunk();
        // And do the same for the random values chunk.
        final LongChunk<? extends Values> rvchunk = rvColumnSource.getChunk(rvcontext, chunkOk).asLongChunk();
        // We now iterate the chunks in parallel. The chunk element access is non-virtual. You can only
        // call the get method on a typed Chunk. If you need to handle many data types in a complex way,
        // then you must write a kernel for each data type.
        //
        // The com.illumon.iris.db.v2.utils.ChunkBoxer converts data to an object, but further operations
        // come at a significant performance penalty compared to using native Java data types.
        for (int ii = 0; ii < d2chunk.size(); ++ii) {
            values.add(new ImmutablePair<>(d2chunk.get(ii), rvchunk.get(ii)));
            total.add(rvchunk.get(ii));
        }
    }
}
In the next example, we use a WritableChunk passed into the fillChunk method instead of a getChunk call. The semantics are identical to the prior example, but the chunk can be modified with the set method, or by passing it to a function (e.g., sort) that expects a WritableChunk.
// The values list and total accumulator are declared as in the prior example.
final int chunkSize = (int) Math.min(CHUNK_SIZE, index.size());
try (final SharedContext sharedContext = SharedContext.makeSharedContext();
     final ChunkSource.FillContext d2context = dim2ColumnSource.makeFillContext(chunkSize, sharedContext);
     final ChunkSource.FillContext rvcontext = rvColumnSource.makeFillContext(chunkSize, sharedContext);
     final OrderedKeys.Iterator okit = index.getOrderedKeysIterator();
     final WritableObjectChunk<String, Values> writableObjectChunk = WritableObjectChunk.makeWritableChunk(chunkSize);
     final WritableLongChunk<Values> writableLongChunk = WritableLongChunk.makeWritableChunk(chunkSize)) {
    // The hasMore call is equivalent to "hasNext" on a single-item iterator.
    while (okit.hasMore()) {
        // Before proceeding with the fill operations, we must reset the SharedContext; otherwise we may
        // incorrectly use shared context values from the prior iteration.
        sharedContext.reset();
        // We retrieve a child OrderedKeys from the iterator that will have at most chunkSize elements.
        final OrderedKeys chunkOk = okit.getNextOrderedKeysWithLength(chunkSize);
        // We fill our writable chunk with the values of our "D2" string column.
        dim2ColumnSource.fillChunk(d2context, writableObjectChunk, chunkOk);
        // And do the same for the random values chunk.
        rvColumnSource.fillChunk(rvcontext, writableLongChunk, chunkOk);
        // We now iterate the chunks in parallel. The chunk element access is non-virtual. You can only
        // call the get method on a typed Chunk. If you need to handle many data types in a complex way,
        // then you must write a kernel for each data type.
        //
        // The com.illumon.iris.db.v2.utils.ChunkBoxer converts data to an object, but further operations
        // come at a significant performance penalty compared to using native Java data types.
        for (int ii = 0; ii < writableObjectChunk.size(); ++ii) {
            values.add(new ImmutablePair<>(writableObjectChunk.get(ii), writableLongChunk.get(ii)));
            total.add(writableLongChunk.get(ii));
        }
    }
}
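Both examples above box every long into an ImmutablePair, which the text advises against in practical usage. One possible alternative, sketched here with JDK classes only, is to accumulate into parallel growable arrays so the long values stay primitive. PairBuffer and its methods are hypothetical names for illustration, not part of Deephaven.

```java
import java.util.Arrays;
import java.util.Objects;

class PrimitiveAccumulatorSketch {
    // Growable parallel arrays: labels stay as objects, values stay as primitive longs.
    static final class PairBuffer {
        private String[] labels = new String[16];
        private long[] values = new long[16];
        private int size = 0;
        private long total = 0;

        void add(String label, long value) {
            if (size == values.length) {
                // Double the capacity of both arrays together so they stay aligned.
                labels = Arrays.copyOf(labels, size * 2);
                values = Arrays.copyOf(values, size * 2);
            }
            labels[size] = label;
            values[size] = value;
            size++;
            total += value;
        }

        int size() {
            return size;
        }

        long total() {
            return total;
        }

        String label(int position) {
            return labels[Objects.checkIndex(position, size)];
        }

        long value(int position) {
            return values[Objects.checkIndex(position, size)];
        }
    }

    public static void main(String[] args) {
        PairBuffer buffer = new PairBuffer();
        buffer.add("a", 5);
        buffer.add("b", 7);
        System.out.println(buffer.total()); // prints 12
    }
}
```

Inside the chunk loop, a call such as buffer.add(d2chunk.get(ii), rvchunk.get(ii)) would then replace both the values.add and total.add calls without allocating a Long or a pair object per row.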