How to iterate a table using chunks
To retrieve data from a Deephaven table as efficiently as possible, use the Deephaven chunk API. The chunk API reads several values from a column at once rather than reading individual values one at a time. This reduces allocation and virtual method calls, both of which can hurt performance when processing large data sets.
To retrieve data using chunks, you must first create a Context. The ColumnSources provide two types of contexts:
- GetContext - A GetContext includes an internal chunk that is either allocated or points directly to the backing store. If you need read-only access to the data, then getChunk is preferred.
- FillContext - If you need to mutate your data, then you should prefer fillChunk and allocate a WritableChunk within user code.
Each context is created with a size, which is the largest chunk that can be read in a single call.
To create a GetContext or a FillContext, call the ColumnSource's makeGetContext or makeFillContext method, respectively. The context creation methods take a size; you cannot call getChunk or fillChunk with an OrderedKeys that contains more elements than the size of the context you created. These contexts may only be used with the ColumnSource that created them.
We must release (close) contexts after use; otherwise, we might leak resources such as file handles (in the case of disk-backed column sources). Most Deephaven code that uses these objects follows a try-with-resources pattern to ensure that they are properly released even under exceptional circumstances.
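The sketch below illustrates the release pattern and the size limit described above. It is self-contained and does not use the Deephaven API: SketchContext, checkRequest, and run are hypothetical stand-ins for a context, a getChunk or fillChunk call, and calling code. It only shows that try-with-resources closes every context, in reverse order of creation, whether or not the read succeeds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not Deephaven classes.
final class ContextLifecycleSketch {

    /** Stand-in for a GetContext or FillContext: fixed capacity, must be closed. */
    static final class SketchContext implements AutoCloseable {
        final String name;
        final int capacity;
        final List<String> log;

        SketchContext(String name, int capacity, List<String> log) {
            this.name = name;
            this.capacity = capacity;
            this.log = log;
            log.add("open " + name);
        }

        /** Rejects requests larger than the capacity the context was created with. */
        void checkRequest(int keyCount) {
            if (keyCount > capacity) {
                throw new IllegalArgumentException(
                        keyCount + " keys exceed context capacity " + capacity);
            }
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Opens two contexts, makes one request against each, and returns the event log. */
    static List<String> run(int capacity, int keyCount) {
        final List<String> log = new ArrayList<>();
        try (SketchContext first = new SketchContext("first", capacity, log);
             SketchContext second = new SketchContext("second", capacity, log)) {
            first.checkRequest(keyCount);
            second.checkRequest(keyCount);
            log.add("read " + keyCount);
        } catch (IllegalArgumentException e) {
            // Both contexts have already been closed by the time this block runs.
            log.add("failed");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 3)); // request fits within the context size
        System.out.println(run(4, 5)); // request is larger than the context size
    }
}
```

In both calls the log records "close second" followed by "close first". In the oversized case the closes appear before "failed", because the resources are released before the catch block runs.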
Sometimes there is information that can be shared while reading data that doesn't belong to a specific column source. For example, a sorted column can reuse redirection lookups for the same ordered keys, or an ungrouped column can reuse the sizes of the underlying arrays. To enable reusing these results, we use a SharedContext structure. We create a SharedContext, which is then passed to each ColumnSource's makeGetContext or makeFillContext call. When transitioning between chunks, you must call the reset() method on the SharedContext so that cached values are discarded and the query engine does not incorrectly use them on the next chunk. Internally, the SharedContext contains a map so that columns with the same structure share values, but those that have different structures do not.
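To make the caching behavior concrete, here is a minimal sketch of a map-backed shared cache. It does not use the Deephaven SharedContext class: SharedContextSketch, getOrCompute, redirect, and the string keys are all hypothetical names chosen for illustration, and the real internals may differ. It shows columns with the same structure key sharing one lookup per chunk, and what goes wrong if reset() is skipped.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for illustration only; this is not the Deephaven SharedContext class.
final class SharedContextSketch implements AutoCloseable {
    private final Map<String, long[]> cache = new HashMap<>();
    private int computations = 0;

    /** Returns the cached result for a structure key, computing it at most once until reset. */
    long[] getOrCompute(String structureKey, Supplier<long[]> lookup) {
        long[] cached = cache.get(structureKey);
        if (cached == null) {
            cached = lookup.get();
            cache.put(structureKey, cached);
            computations++;
        }
        return cached;
    }

    /** Discards cached values; call this when moving on to the next chunk. */
    void reset() {
        cache.clear();
    }

    int computations() {
        return computations;
    }

    @Override
    public void close() {
        cache.clear();
    }

    /** Toy redirection lookup: maps each key to key * 10. */
    static long[] redirect(long[] keys) {
        final long[] out = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            out[i] = keys[i] * 10;
        }
        return out;
    }

    /** Returns {first redirected value seen for the last chunk, number of lookups computed}. */
    static long[] demo(boolean resetBetweenChunks) {
        final long[][] chunks = {{0, 1, 2}, {3, 4, 5}};
        long firstRedirected = -1;
        try (SharedContextSketch shared = new SharedContextSketch()) {
            for (final long[] keys : chunks) {
                if (resetBetweenChunks) {
                    shared.reset();
                }
                // Two columns with the same structure share one lookup per chunk.
                final long[] forColumnA = shared.getOrCompute("sortRedirection", () -> redirect(keys));
                final long[] forColumnB = shared.getOrCompute("sortRedirection", () -> redirect(keys));
                // A column with a different structure gets its own cache entry.
                shared.getOrCompute("ungroupSizes", () -> new long[keys.length]);
                firstRedirected = forColumnA == forColumnB ? forColumnA[0] : -1;
            }
            return new long[] {firstRedirected, shared.computations()};
        }
    }
}
```

With demo(true), the second chunk gets a fresh lookup whose first value is 30. With demo(false), the stale entry from the first chunk is reused and the first value is 0, which is the kind of error that calling reset() between chunks prevents.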
The Index provides an ordered list of longs that are valid locations in the table. We use the OrderedKeysIterator to go through the Index in chunks, creating a new OrderedKeys for each chunk that we would like to read from the table. This smaller slice of OrderedKeys serves as the argument to the getChunk or fillChunk calls.
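The slicing step can be sketched in plain Java as follows. This is not the Deephaven iterator: KeyIterator, hasMore, nextWithLength, and slices are hypothetical names, and the index is modeled as a sorted long array. The sketch only shows how an ordered list of row keys is walked in slices of at most a fixed size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not Deephaven classes.
final class KeyChunkingSketch {

    /** Stand-in for an iterator that hands out consecutive slices of an ordered key list. */
    static final class KeyIterator implements AutoCloseable {
        private final long[] keys;
        private int position = 0;

        KeyIterator(long[] keys) {
            this.keys = keys;
        }

        boolean hasMore() {
            return position < keys.length;
        }

        /** Returns the next slice, containing at most maxLength keys. */
        long[] nextWithLength(int maxLength) {
            final int end = Math.min(position + maxLength, keys.length);
            final long[] slice = Arrays.copyOfRange(keys, position, end);
            position = end;
            return slice;
        }

        @Override
        public void close() {
            // Nothing to release in this sketch; a real iterator may hold resources.
        }
    }

    /** Splits the ordered keys into slices no larger than chunkSize. */
    static List<long[]> slices(long[] index, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        final List<long[]> result = new ArrayList<>();
        try (KeyIterator it = new KeyIterator(index)) {
            while (it.hasMore()) {
                // Each slice would be the argument to one getChunk or fillChunk call.
                result.add(it.nextWithLength(chunkSize));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (long[] slice : slices(new long[] {1, 4, 5, 9, 12}, 2)) {
            System.out.println(Arrays.toString(slice));
        }
    }
}
```

The last slice may be shorter than the chunk size, so code that consumes a chunk should loop over the slice length rather than over the context size.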
The SharedContext, GetContext, FillContext, and OrderedKeysIterator all may allocate internal resources. When you have finished using them, you must close() them. Deephaven recommends using the try-with-resources pattern for these contexts.
In the following example we read data from two column sources and accumulate values in an ArrayList and a MutableLong. In practical usage, you should avoid the use of ArrayList and ImmutablePair for primitive types as boxing the primitives creates additional garbage.
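As a rough illustration of the read-only pattern, the sketch below reads two columns chunk by chunk and accumulates into primitive long variables, in line with the advice about avoiding boxed values. It is a self-contained model and not Deephaven code: LongColumn, LongGetContext, the offset-based getChunk signature, and totals are hypothetical stand-ins, and the SharedContext is left out for brevity.

```java
// Hypothetical stand-ins for illustration only; none of these are Deephaven classes.
final class GetChunkSketch {

    /** Stand-in for a GetContext: owns a reusable internal buffer of fixed capacity. */
    static final class LongGetContext implements AutoCloseable {
        final long[] buffer;

        LongGetContext(int capacity) {
            this.buffer = new long[capacity];
        }

        @Override
        public void close() {
            // Nothing to release in this sketch; a real context may hold resources.
        }
    }

    /** Stand-in for a column source of longs, addressed by row key. */
    static final class LongColumn {
        private final long[] data;

        LongColumn(long... data) {
            this.data = data;
        }

        LongGetContext makeGetContext(int capacity) {
            return new LongGetContext(capacity);
        }

        /**
         * Reads the values at keys[offset .. offset + length) into the context's
         * internal buffer and returns it. Only the first length entries are valid,
         * and callers should treat the result as read-only.
         */
        long[] getChunk(LongGetContext context, long[] keys, int offset, int length) {
            if (length > context.buffer.length) {
                throw new IllegalArgumentException("request exceeds context size");
            }
            for (int i = 0; i < length; i++) {
                context.buffer[i] = data[(int) keys[offset + i]];
            }
            return context.buffer;
        }
    }

    /** Returns {sum of sizes, sum of size * price} over the given row keys. */
    static long[] totals(LongColumn sizes, LongColumn prices, long[] rowKeys, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long totalSize = 0;
        long totalNotional = 0;
        try (LongGetContext sizeContext = sizes.makeGetContext(chunkSize);
             LongGetContext priceContext = prices.makeGetContext(chunkSize)) {
            for (int offset = 0; offset < rowKeys.length; offset += chunkSize) {
                final int length = Math.min(chunkSize, rowKeys.length - offset);
                final long[] sizeChunk = sizes.getChunk(sizeContext, rowKeys, offset, length);
                final long[] priceChunk = prices.getChunk(priceContext, rowKeys, offset, length);
                for (int i = 0; i < length; i++) {
                    totalSize += sizeChunk[i];
                    totalNotional += sizeChunk[i] * priceChunk[i];
                }
            }
        }
        return new long[] {totalSize, totalNotional};
    }
}
```

Each column gets its own context because a context may only be used with the column that created it. The inner loop touches only primitive arrays, so no objects are allocated per row.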
In the next example, we use a WritableChunk passed into the fillChunk method instead of a getChunk call. The semantics are identical to the prior example, but the chunk can be modified with the set method, or by passing it to a function (e.g., sort) that expects a WritableChunk.
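The fill variant can be sketched in the same style. Again this is a plain-Java model under stated assumptions, not the Deephaven API: LongColumn, the offset-based fillChunk signature, and chunkMinimums are hypothetical, the writable chunk is modeled as a caller-owned long array, and the FillContext is omitted to keep the example short. The point is that the caller owns the destination and is free to modify it, here by sorting it in place.

```java
import java.util.Arrays;

// Hypothetical stand-ins for illustration only; none of these are Deephaven classes.
final class FillChunkSketch {

    /** Stand-in for a column source of longs, addressed by row key. */
    static final class LongColumn {
        private final long[] data;

        LongColumn(long... data) {
            this.data = data.clone();
        }

        /** Copies the values at keys[offset .. offset + length) into the caller's array. */
        void fillChunk(long[] destination, long[] keys, int offset, int length) {
            if (length > destination.length) {
                throw new IllegalArgumentException("request exceeds destination size");
            }
            for (int i = 0; i < length; i++) {
                destination[i] = data[(int) keys[offset + i]];
            }
        }

        long get(int key) {
            return data[key];
        }
    }

    /** Sorts each chunk in place and returns the smallest value of each chunk. */
    static long[] chunkMinimums(LongColumn column, long[] rowKeys, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // The caller allocates the writable destination once and reuses it.
        final long[] writable = new long[chunkSize];
        final long[] minimums = new long[(rowKeys.length + chunkSize - 1) / chunkSize];
        int chunkNumber = 0;
        for (int offset = 0; offset < rowKeys.length; offset += chunkSize) {
            final int length = Math.min(chunkSize, rowKeys.length - offset);
            column.fillChunk(writable, rowKeys, offset, length);
            // Safe to mutate: this array is a copy owned by the caller.
            Arrays.sort(writable, 0, length);
            minimums[chunkNumber++] = writable[0];
        }
        return minimums;
    }
}
```

Sorting the destination does not affect the column's stored values, because the fill copies them out. That is why the fill variant is the safer choice whenever the chunk will be modified.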