How to iterate a table using chunks

To retrieve data from a Deephaven table most efficiently, use the Deephaven chunk API. The chunk API reads several values from a column at once rather than reading individual values one at a time. This reduces allocations and virtual method calls, both of which can hurt performance when processing large data sets.

To retrieve data using chunks, you must first create a Context. The ColumnSources provide two types of contexts:

  • GetContext - A GetContext includes an internal chunk that is either allocated or points directly to the backing store. If you need read-only access to the data, then GetChunk is preferred.

  • FillContext - If you need to mutate the retrieved data, prefer FillChunk and allocate a WritableChunk in user code.

Each context is created with a size, which is the largest chunk that can be read in a single call.
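The difference between the two access styles can be illustrated with plain Java. This is a minimal sketch, not the Deephaven API: ToyLongColumn and its getChunk, fillChunk, and valueAt methods are hypothetical stand-ins. The "get" style returns a read-only view over the backing array, while the "fill" style copies into a buffer that the caller owns and may modify.

```java
import java.nio.LongBuffer;

// Hypothetical stand-in for a column of longs; this is NOT a Deephaven class.
final class ToyLongColumn {
    private final long[] backing;

    ToyLongColumn(long... values) {
        this.backing = values.clone();
    }

    // "Get" style: returns a read-only view that points at the backing array.
    // No copy is made, and the caller cannot write through the view.
    LongBuffer getChunk(int start, int length) {
        return LongBuffer.wrap(backing, start, length).asReadOnlyBuffer();
    }

    // "Fill" style: the caller allocates the destination and may mutate it freely.
    // Returns the number of values copied, at most dest.length.
    int fillChunk(int start, long[] dest) {
        int n = Math.min(dest.length, backing.length - start);
        System.arraycopy(backing, start, dest, 0, n);
        return n;
    }

    // Direct single-value access, used here only to show the backing data is unchanged.
    long valueAt(int i) {
        return backing[i];
    }
}
```

In this sketch, the length of the destination array plays the role of the context size: it bounds how many values a single call can return.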

Contexts must be released (closed) after use; otherwise, resources such as file handles may leak (in the case of disk-backed column sources). Most Deephaven code that uses these objects follows a try-with-resources pattern so that they are properly released even when an exception is thrown.
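As an illustration of why try-with-resources is used, the sketch below opens two resources and optionally fails in the middle of the read. ToyContext and ContextDemo are hypothetical names, not Deephaven classes; the only assumption is that a context is AutoCloseable. Both resources are closed in reverse order of creation, whether or not the body throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical context that records its lifecycle; stands in for any AutoCloseable context.
final class ToyContext implements AutoCloseable {
    private final String name;
    private final List<String> log;

    ToyContext(String name, List<String> log) {
        this.name = name;
        this.log = log;
        log.add("open " + name);
    }

    @Override
    public void close() {
        log.add("close " + name);
    }
}

final class ContextDemo {
    // Opens two contexts, optionally fails during the read, and returns the lifecycle log.
    static List<String> run(boolean fail) {
        List<String> log = new ArrayList<>();
        try (ToyContext a = new ToyContext("a", log);
             ToyContext b = new ToyContext("b", log)) {
            if (fail) {
                throw new IllegalStateException("read failed");
            }
            log.add("read");
        } catch (IllegalStateException e) {
            // Both contexts have already been closed by the time this block runs.
            log.add("caught");
        }
        return log;
    }
}
```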

We create a SharedContext, which is then passed to each of the ColumnSource makeGetContext calls. The SharedContext allows columns that have identical backing information to reuse data between the getChunk calls. For example, a sorted column can reuse redirection lookups for the same ordered keys.
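The reuse described above can be modeled with a small cache. This is a sketch under stated assumptions, not Deephaven's implementation: ToyRedirection, ToySharedContext, and ToyRedirectedColumn are hypothetical, and the cache assumes that every column reading through it uses the same redirection. Two columns read the same chunk of keys, and the key-to-position lookup is resolved only once.

```java
// Hypothetical mapping from row keys to storage positions; counts how often it is resolved.
final class ToyRedirection {
    int resolveCalls = 0;
    private final int[] positions; // positions[key] is the slot in storage

    ToyRedirection(int... positions) {
        this.positions = positions.clone();
    }

    int[] resolve(int[] keys) {
        resolveCalls++;
        int[] out = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            out[i] = positions[keys[i]];
        }
        return out;
    }
}

// Hypothetical shared state: caches the resolved positions for the current chunk of keys.
// Assumes all columns that use it share one redirection.
final class ToySharedContext {
    private int[] cachedKeys;
    private int[] cachedPositions;

    int[] positionsFor(ToyRedirection redirection, int[] keys) {
        if (cachedKeys != keys) { // identity check: is this the same chunk of keys?
            cachedPositions = redirection.resolve(keys);
            cachedKeys = keys;
        }
        return cachedPositions;
    }

    // Drops the cached lookup; called before moving on to the next chunk of keys.
    void reset() {
        cachedKeys = null;
        cachedPositions = null;
    }
}

// Hypothetical column whose values are reached through a redirection.
final class ToyRedirectedColumn {
    private final long[] storage;
    private final ToyRedirection redirection;

    ToyRedirectedColumn(long[] storage, ToyRedirection redirection) {
        this.storage = storage.clone();
        this.redirection = redirection;
    }

    long[] read(int[] keys, ToySharedContext shared) {
        int[] pos = shared.positionsFor(redirection, keys);
        long[] out = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            out[i] = storage[pos[i]];
        }
        return out;
    }
}
```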

The Index provides an ordered list of longs that are valid locations in the table. We use the OrderedKeyIterator to go through the Index in chunks, creating a new OrderedKeys for each chunk that we want to read from the table. This smaller slice of OrderedKeys serves as the argument to the GetChunk call.
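The slicing step can be sketched independently of Deephaven. In the example below, a sorted long[] stands in for the Index and each returned array stands in for one slice of keys; KeyChunker and slices are hypothetical names. Every slice holds at most chunkSize keys, and the last one may be shorter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KeyChunker {
    // Splits an ordered array of keys into consecutive slices of at most chunkSize keys.
    static List<long[]> slices(long[] orderedKeys, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<long[]> out = new ArrayList<>();
        int start = 0;
        while (start < orderedKeys.length) {
            // The final slice is shorter when the key count is not a multiple of chunkSize.
            int len = Math.min(chunkSize, orderedKeys.length - start);
            out.add(Arrays.copyOfRange(orderedKeys, start, start + len));
            start += len;
        }
        return out;
    }
}
```

This sketch materializes every slice up front for clarity; an iterator that yields one slice at a time would avoid holding them all in memory.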

In the following example, we read data from two column sources and accumulate values in an ArrayList and a MutableLong. In practical usage, you should avoid ArrayList and ImmutablePair for primitive types, because boxing the primitives creates additional garbage.
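A minimal sketch of that shape, using plain Java stand-ins rather than the Deephaven classes: two arrays play the role of the column sources, SimpleImmutableEntry stands in for ImmutablePair, and a long field stands in for MutableLong. The class name TwoColumnRead and its members are hypothetical. The chunk buffers are allocated once and reused for every chunk.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TwoColumnRead {
    final List<Map.Entry<String, Long>> rows = new ArrayList<>();
    long total = 0;

    // Reads both "columns" in chunks of at most chunkSize rows and accumulates the results.
    static TwoColumnRead read(String[] symbols, long[] sizes, int chunkSize) {
        if (symbols.length != sizes.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        TwoColumnRead result = new TwoColumnRead();
        // Buffers are allocated once, sized to the largest chunk, and reused.
        String[] symbolChunk = new String[chunkSize];
        long[] sizeChunk = new long[chunkSize];
        int start = 0;
        while (start < symbols.length) {
            int len = Math.min(chunkSize, symbols.length - start);
            System.arraycopy(symbols, start, symbolChunk, 0, len);
            System.arraycopy(sizes, start, sizeChunk, 0, len);
            for (int i = 0; i < len; i++) {
                // Storing the pair boxes the long into a Long, which creates garbage.
                result.rows.add(new SimpleImmutableEntry<String, Long>(symbolChunk[i], sizeChunk[i]));
                // The running total stays primitive and allocates nothing.
                result.total += sizeChunk[i];
            }
            start += len;
        }
        return result;
    }
}
```

To avoid the boxing noted in the comment, the sizes could be collected into a growable long[] instead of a list of pairs.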