UpdateBy

Note

For Core+ workers, see Community Core documentation for update_by.

UpdateBy provides users with a way to apply operations that must account for the values of other rows within the same table in an efficient manner that properly takes updates into account.

The updateBy method applies a set of user-specified operations to all rows within the table, producing a new table with the same index as the parent. Users may provide one or more operations to compute at once.

// For a single operation.
result = source.updateBy(@NotNull final UpdateByClause operations, final String... byColumns)

// For multiple operations,  Clauses may be grouped together with UpdateByClause.of()
result = source.updateBy(@NotNull final Collection<UpdateByClause> operations, final String... byColumns)

Note

UpdateByClause must be imported into groovy sessions before use with import com.illumon.iris.db.v2.updateby.UpdateByClause

Example

The following will perform a forward fill on one column and a cumulative sum on another grouped by Sym.

result = parent.updateBy(UpdateByClause.of(
                         UpdateByClause.fill("Price"),
                         UpdateByClause.sum("VolSum=Volume")), "Sym")

from deephaven import UpdateBy as ubc
result = parent.updateBy(ubc.of(\
                         ubc.fill("Price")\
                         ubc.sum("VolSum=Volume")), "Sym)

Forward Fill

Forward fill will replace any null values found in the current column and group with the most recent non-null value that occurred before each row. If no columns are passed, all non-grouping columns will be filled.

UpdateByClause.fill(String... columnsToFill)

ubc.fill(columns_to_fill)

Cumulative Sum

Cumulative Sum computes the sum of each row and all previous rows. If no columns are passed, all non-grouping columns will be computed. Only numerical types such as byte, short, int, long ,float, double, BigInteger and BigDecimal may be summed The result column for integral types such as byte, short, int and long will be long. The result column for float, double, BigDecimal and BigInteger will be the same type.

UpdateByClause.sum(String... columnsToSum)

ubc.sum(columns_to_sum)

Cumulative Product

Cumulative Product computes the product of each row and all previous rows. If no columns are passed, all non-grouping columns will be computed. Only numerical types such as byte, short, int, long ,float, double, BigInteger and BigDecimal may participate in a cumulative product. The result column for integral types such as byte, short, int and long will be long. The result column for floating point types float and double, as well as BigDecimal and BigInteger will be the same type.

UpdateByClause.prod(String... columns)

ubc.prod(columns)

Cumulative Minimum and Maximum

Cumulative Minimum and Maximum computes the minimum or maximum value of a column at each row. If no columns are passed, all non-grouping columns will be computed.

UpdateByClause.min(String... columns)
UpdateByClause.max(String... columns)

ubc.min(columns)
ubc.max(columns)

Exponential Moving Average (EMA)

Compute an EMA for a column. Deephaven supports computing EMAs using either time or ticks as the base time unit. The formula used for the tick-based version is:

a = e^{-1/\tau}

Where the time decay unit, $\tau$ , is the number of ticks.

The formula for the time-based version is:

a = e^{-\Delta t/\tau}

Where the time decay unit, $\tau$ , is time in nanoseconds.

The EMA function is:

EMA(col_n) = a * EMA(col_{n-1}) + (1 - a) * col_n

You can also control how the EMA function handles different forms of bad input data, including null values, null timestamps, negative timestamps, NaNs and $\Delta t = 0$ using the EmaControl object.

See the Javadoc for more details.

UpdateByClause.ema("Timestamp",
        30 * MINUTE,
        EmaControl.get().onNullValue(BadDataBehavior.Reset),
        "Price")

from deephaven import UpdateBy as ubc
from deephaven.UpdateBy import EmaControl as emac

ubc.ema(
    "Timestamp", 30 * MINUTE, emac.get().onNullValue(BadDataBehavior.Reset), "Price"
)

Delta

The Delta operation provides an efficient way to compute the cell-to-cell delta for a column:

\Delta Col = Col_i - Col_{i-1}

You can control how Delta handles the case of $Col_i - null$ using the DeltaControl object as follows.

NullBehavior	Effect
ValueDominates (default)	$Col_i - null = Col_i$
NullDominates	$Col_i - null = null$

Note

Normally using the Col_[i-1] operator requires that subsequent rows belong to the same group, and care must be taken when transitioning from one group to another. For example, if a table is sorted by instrument you must take care when transitioning from one instrument to another not to inappropriately mix data from two distinct instruments. Instead UpdateBy, and by extension the delta operation, is group aware, so there is no need to expend any effort organizing the input table.

UpdateByClause.delta(String... columns)
UpdateByClause.delta(DeltaControl control, String... columns)

from deephaven import UpdateBy as ubc
from deephaven.UpdateBy import DeltaControl as dc

ubc.delta(control, columns)

All updateBy operations support fully dynamic tables. However, performance will be largely dependent on the characteristics of the table updates. For example, tables that are globally append-only, or append-only within the groups that exist within the grouping columns, updateBy will perform well. The updateBy operation will be linear in the number of input rows in the update (which will also be the number of rows in the output update). On the other hand, if the operation changes rows that already have been computed in a bucket (e.g. by adding, removing, or modifying rows at the beginning of a bucket), the updateBy operation's result will cascade through the entire bucket (potentially requiring a full table rescan). This also implies that any downstream update, must necessarily modify those changed rows resulting in tick expansion.