Each Deephaven Core release brings new features to get excited about. One of this year's biggest and most exciting is the update_by
table operation. Its introduction to the table API has transformed what used to be difficult and tedious doable in a single function call.
What is update_by
?
update_by
adds new capability for computing inter-row statistics. This enables efficient and expressive cross-row calculations, such as the differences between rows, or an exponential moving average. It can perform two types of operations: cumulative and rolling.
- Cumulative operations are applied to an entire table, such as a cumulative sum.
- Rolling operations are applied to a subset of rows in a table, such as a rolling sum over the previous ten rows or the previous ten seconds.
update_by
is highly efficient and will only recompute the absolute minimum amount of data required for each calculation. It retains the current table rows and simply appends the aggregating columns. It's also parallelized - concurrency can be achieved even in Python, which is typically limited by the GIL.
Why is it called update_by
?
This operation's name is derived from the combination of two of Deephaven's most commonly used operations - update
and agg_by
.
- It's like
update
because it updates table data, creating new columns with specified calculations. - It's like
agg_by
because the calculations implemented byupdate_by
are done on grouped (aggregated) sets of data. As with other aggregations, they can be done on a per-group basis, or applied to an entire table.
Examples
Examples are more illustrative of capabilities than a long list. So, here are a few that show what update_by
can do.
Rolling averages
The following example calculates row-based and time-based rolling averages of a column with numeric data. No grouping columns are given, so the calculations are applied to the whole table. The row-based moving average calculates an average given the previous row, current row, and next row. The same is true for the time-based average, which includes all rows within plus or minus 1.5 seconds from the current row.
from deephaven.updateby import rolling_avg_tick, rolling_avg_time
from deephaven.time import parse_instant
from deephaven import empty_table
base_time = parse_instant("2023-04-01T10:00:00 ET")
source = empty_table(25).update(["Timestamp = base_time + i * SECOND", "X = i"])
row_based_avg = rolling_avg_tick(cols=["RowAvgX = X"], rev_ticks=2, fwd_ticks=1)
time_based_avg = rolling_avg_time(ts_col="Timestamp", cols=["TimeAvgX = X"], rev_time="PT1.5S", fwd_time="PT1.5S")
result = source.update_by(ops=[row_based_avg, time_based_avg], by=[])
EMAs grouped by a symbol
The following example calculates row and time-based EMAs of the same data as the example above. This time around, the Symbol
column is given as the grouping column. It contains alternating symbols Tau
and Epsilon
, so the EMAs are calculated on a per-symbol basis.
from deephaven.updateby import ema_tick, ema_time
from deephaven.time import parse_instant
from deephaven import empty_table
base_time = parse_instant("2023-04-01T10:00:00 ET")
source = empty_table(25).update(["Timestamp = base_time + i * SECOND", "Symbol = (i % 2 == 0) ? `Tau` : `Epsilon`", "X = i"])
row_based_ema = ema_tick(decay_ticks=5, cols=["RowEmaX = X"])
time_based_ema = ema_time(ts_col="Timestamp", decay_time="PT5S", cols=["TimeEmaX = X"])
result = source.update_by(ops=[row_based_ema, time_based_ema], by=["Symbol"])
result_tau = result.where(["Symbol = `Tau`"])
result_epsilon = result.where(["Symbol = `Epsilon`"])
Rolling group in real-time
The following example calculates a rolling group on real-time data.
from deephaven.updateby import rolling_group_tick, rolling_group_time
from deephaven.time import parse_instant
from deephaven import time_table
base_time = parse_instant("2023-04-01T10:00:00 ET")
source = time_table("PT0.5S").update(["Symbol = (i % 2 == 0) ? `Tau` : `Epsilon`", "X = i"])
row_group_op = rolling_group_tick(cols=["RowGroupX = X"], rev_ticks=4, fwd_ticks=5)
time_group_op = rolling_group_time(ts_col="Timestamp", cols=["TimeGroupX = X"], rev_time="PT5S", fwd_time="PT3S")
result = source.update_by(ops=[row_group_op, time_group_op], by=["Symbol"])
result_tau = result.where(["Symbol = `Tau`"])
result_epsilon = result.where(["Symbol = `Epsilon`"])
What calculations can be performed with update_by
?
A lot. Here's the full list:
Operation name | Description |
---|---|
cum_max | Cumulative maximum |
cum_min | Cumulative minimum |
cum_prod | Cumulative product |
cum_sum | Cumulative Sum |
delta | Difference between adjacent rows |
ema_tick | Row-based EMA |
ema_time | Time-based EMA |
emmax_tick | Row-based exponential moving maximum |
emmax_time | Time-based exponential moving maximum |
emmin_tick | Row-based exponential moving minimum |
emmin_time | Time-based exponential moving mimnimum |
ems_tick | Row-based exponential moving sum |
ems_time | Time-based exponential moving sum |
forward_fill | Forward-fill null values |
rolling_avg_tick | Row-based moving average |
rolling_avg_time | Time-based moving average |
rolling_count_tick | Row-based moving count |
rolling_count_time | Time-based moving count |
rolling_group_tick | Row-based moving group |
rolling_group_time | Time-based moving group |
rolling_std_tick | Row-based moving standard deviation |
rolling_std_time | Time-based moving standard deviation |
rolling_sum_tick | Row-based moving sum |
rolling_sum_time | Time-based moving sum |
rolling_wavg_tick | Row-based moving weighted average |
rolling_wavg_time | Time-based moving weighted average |
Reach out
Have any questions for us? Reach out on Slack!