Deephaven's version 0.33 release introduced several new features and improvements. Two methods that have the potential to save users a lot of time are the new Rolling Formula
update_by
operations: rolling_formula_tick
and rolling_formula_time
. These operations support user-defined formulas, which can be used to calculate rolling aggregations over a specified window of rows or time.
This is exciting for a few reasons.
First of all, support for user-defined formulas opens up a whole new world of versatility for update_by
. Rather than being limited to a single update_by
operation, such as a simple Rolling Sum
, users can define a custom operation that is as complex as their use cases require.
Second, Rolling Formula
operations perform significantly better than calling a Rolling Group
operation followed by update
, which is how a user would have created a custom rolling aggregation in Deephaven version 0.32 and earlier.
rolling_formula_tick
vs rolling_group_tick
and update
Let's put the new rolling_formula_tick
operation to the test and see just how much faster it is than the old method. In the following example, we will create a table that will run a test a given number of times, time the runs, throw out the fastest and slowest result, and print the average for each test. We will then run the test with various table sizes and unique values.
# Run tests
from deephaven.updateby import rolling_group_tick, rolling_formula_tick, rolling_sum_tick
from deephaven import empty_table
from time import time
n_rows = 1_000_000
n_unique_vals = 10_000
n_tests = 5
times_group_update = [0] * n_tests
times_formula = [0] * n_tests
times_sum = [0] * n_tests
for test_num in range(n_tests):
test_table = empty_table(n_rows).update(f"IntCol = randomInt(0, {n_unique_vals})")
# `rolling_group_tick` + `update` test
start = time()
t_grouped_updated = test_table.update_by(ops=rolling_group_tick(cols="X = IntCol", rev_ticks=1000), by=None).update("X = sum(X)")
end = time()
times_group_update[test_num] = end - start
# `rolling_formula_tick` test
start = time()
t_formula = test_table.update_by(ops=rolling_formula_tick(formula="sum(each)", formula_param="each", cols="X = IntCol", rev_ticks=1000), by=None)
end = time()
times_formula[test_num] = end - start
# Print output
times_group_update.sort()
times_group_update = times_group_update[1:-1]
times_formula.sort()
times_formula = times_formula[1:-1]
print("Rolling Group + Update")
print(f"\tMin: {min(times_group_update):.4f} seconds.")
print(f"\tMax: {max(times_group_update):.4f} seconds.")
print(f"\tAvg: {(sum(times_group_update) / len(times_group_update)):.4f} seconds.")
print("Rolling Formula")
print(f"\tMin: {min(times_formula):.4f} seconds.")
print(f"\tMax: {max(times_formula):.4f} seconds.")
print(f"\tAvg: {(sum(times_formula) / len(times_formula)):.4f} seconds.")
rolling_formula_tick
was consistently much faster thanrolling_group_tick
followed byupdate
- nearly four times faster.
This is because with rolling_formula_tick
, data in the window is loaded and available for quick access by the provided formula. However, with both rolling_group_tick
and rolling_formula_tick
, the formula is applied to every value in the window for every ouput row, which means the performance is 0(m*n)
, where m
is the average size of the window and n
is the number of output rows.
Rolling Formula
vs dedicated aggregations
Rolling Formula
operations are a significant improvement over Rolling Group
operations followed by update
. However, Rolling Formula
operations still aren't the fastest way to perform the operation in the previous example. When a dedicated aggregation exists, it's fastest to use the dedicated aggregation - in this case, rolling_sum_tick
. The code block is the same as shown above, except we replace our previous test lines with a test of rolling_sum_tick
.
To view the entire example in one code block, click here.
# Run tests
from deephaven.updateby import rolling_group_tick, rolling_formula_tick, rolling_sum_tick
from deephaven import empty_table
from time import time
n_rows = 1_000_000
n_unique_vals = 10_000
n_tests = 5
times_group_update = [0] * n_tests
times_formula = [0] * n_tests
times_sum = [0] * n_tests
for test_num in range(n_tests):
test_table = empty_table(n_rows).update(f"IntCol = randomInt(0, {n_unique_vals})")
# `rolling_sum_tick` test
start = time()
t_sum = test_table.update_by(ops=rolling_sum_tick(cols="X = IntCol", rev_ticks=1000), by=None)
end = time()
times_sum[test_num] = end - start
# Print output
times_sum.sort()
times_sum = times_sum[1:-1]
print("Rolling Sum")
print(f"\tMin: {min(times_sum):.4f} seconds.")
print(f"\tMax: {max(times_sum):.4f} seconds.")
print(f"\tAvg: {(sum(times_sum) / len(times_sum)):.4f} seconds.")
That's 25x faster than
rolling_formula_tick
!
Dedicated aggregations are optimized to perform the given operation as efficiently as possible and should be used whenever possible. If there is no dedicated aggregation that meets your needs, Rolling Formula
operations are the next best option.
Reach out
Our Community documentation has all of the resources you need to become a Deephaven power user. Our Slack community continues to grow, and we'd love to have you join us! If you have any questions, comments, or suggestions, please reach out to us there.