# Choose the right selection method for your query

The Deephaven API offers various methods for selecting, updating, eliminating, modifying, and creating columns of data in tables. Your choice of method can have a big impact on query performance. Thus, it's important to understand which to choose for your query and why.

Before we dive into the methods, consider the following questions for a table operation:

- Do I need some or all of the data from a source table in a new table?
- How much new data will I create from my table operation?
- How expensive are the calculations used to create a new column in the new table?
- How many calculations will I need to perform downstream that use data created by one of these methods?
- How expensive are the downstream calculations I will need to perform on the new data in the new table?

The following table showcases Deephaven's five selection methods and highlights the differences between them.

| Method name | Source columns in new table | New column type |
|---|---|---|
| `select` | Subset | In-memory |
| `view` | Subset | Formula |
| `update` | All | In-memory |
| `update_view` | All | Formula |
| `lazy_update` | All | Memoized |

# Source columns in the new table

## Subset

The user chooses which source columns to include in the new table. `select` and `view` both allow the user to choose a subset of columns to take from the source table. Both also allow new columns to be added.

```python
from deephaven import empty_table

source = empty_table(5).update(formulas=["X = ii", "Y = sqrt(ii)", "Z = ii * ii"])

result_select = source.select(formulas=["X", "Y", "A = X + Y"])
result_view = source.view(formulas=["X", "Z", "B = X + Z"])
```

- source
- result_select
- result_view

## All

All columns from the source table are in the new table. `update`, `update_view`, and `lazy_update` all include every source column in the new table, plus the new columns specified in the method call.

```python
from deephaven import empty_table

source = empty_table(5).update(formulas=["X = ii"])

result_update = source.update(formulas=["Y = X * X"])
result_update_view = source.update_view(formulas=["Y = sqrt(X)"])
result_lazy_update = source.lazy_update(formulas=["Y = 2 * X"])
```

- source
- result_update
- result_update_view
- result_lazy_update

# Column type

## In-memory

When an in-memory column is created, all cell values are computed and stored in memory. Any time cells in an in-memory column are accessed, the values are simply retrieved from memory.

The following methods create in-memory columns:

- `select`
- `update`

In the example below, a table called `source` is created, followed by two new tables called `result_select` and `result_update`. In `result_select`, the new columns (`X` and `XSquared`) are in-memory columns. In `result_update`, the new columns (`XSquared` and `YSquared`) are also in-memory columns.

```python
from deephaven import empty_table

source = empty_table(5).update(formulas=[
    "X = ii",
    "Y = 9 - ii"
])

result_select = source.select(formulas=[
    "X",
    "XSquared = X * X"
])

result_update = source.update(formulas=[
    "XSquared = X * X",
    "YSquared = Y * Y"
])
```

- source
- result_select
- result_update

## Formula

A formula column does not store actual column values in memory. Rather, only the formulas needed to calculate new cell values are stored. Every time a cell is accessed, the calculations specified by the formula are performed to obtain the cell value.
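The difference between an in-memory column and a formula column can be sketched in plain Python. This is an illustration of the concept only, not Deephaven's actual implementation; the `FormulaColumn` class here is hypothetical.

```python
import math

class FormulaColumn:
    """Stores only a formula; recomputes the value on every access."""

    def __init__(self, formula, source):
        self.formula = formula   # callable applied to a backing value
        self.source = source     # backing column values
        self.compute_count = 0   # track how often we recompute

    def __getitem__(self, i):
        self.compute_count += 1
        return self.formula(self.source[i])

col = FormulaColumn(math.sqrt, [0, 1, 4, 9])

# Nothing has been computed yet -- creating the column is essentially free.
assert col.compute_count == 0

# Each access triggers a fresh calculation, even for the same cell.
assert col[2] == 2.0
assert col[2] == 2.0
assert col.compute_count == 2
```

Creation is cheap because no values are materialized, but every access pays the full cost of the formula.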

The following methods create formula columns:

- `view`
- `update_view`

In the example below, a table called `source` is created, followed by two new tables called `result_view` and `result_update_view`. In `result_view`, the new columns (`X` and `XSquared`) are formula columns. In `result_update_view`, the new columns (`XSquared` and `YSquared`) are also formula columns.

```python
from deephaven import empty_table

source = empty_table(5).update(formulas=[
    "X = ii",
    "Y = 9 - ii"
])

result_view = source.view(formulas=[
    "X",
    "XSquared = X * X"
])

result_update_view = source.update_view(formulas=[
    "XSquared = X * X",
    "YSquared = Y * Y"
])
```

- source
- result_view
- result_update_view

## Memoized

A memoized column stores the formula needed to calculate new cell values. Every time a cell is accessed, the value is retrieved from a cache of already computed values. If the result is not present in the cache, the cell value is computed and stored in the cache.

The column cache is a hash map, where keys are input column values. As a result, cache size is proportional to the number of unique input column values. If there are few input column values, the cache size will be small. If every row has unique input column values, the cache will be larger than using an in-memory column.

Memoized columns should not be used with non-deterministic functions. For example, if a random number generator is used with a memoized column, the initial result is cached and returned for all future calls.
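Both properties — cache size proportional to the number of unique inputs, and the non-determinism pitfall — can be sketched in plain Python. This is an illustration of the concept only, not Deephaven's actual implementation; the `MemoizedColumn` class here is hypothetical.

```python
import random

class MemoizedColumn:
    """Caches results keyed by the input value, like lazy_update."""

    def __init__(self, formula, source):
        self.formula = formula
        self.source = source
        self.cache = {}  # input value -> computed result

    def __getitem__(self, i):
        key = self.source[i]
        if key not in self.cache:
            self.cache[key] = self.formula(key)
        return self.cache[key]

# 10 rows but only 3 unique input values -> the cache holds just 3 entries.
source = [i % 3 for i in range(10)]
col = MemoizedColumn(lambda x: x * x, source)
for i in range(10):
    _ = col[i]
assert len(col.cache) == 3

# Pitfall: a non-deterministic formula is evaluated once per key, so
# every later access for that key returns the cached first result.
rand_col = MemoizedColumn(lambda x: random.random(), source)
assert rand_col[0] == rand_col[3]  # rows 0 and 3 share input value 0
```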

The following method creates memoized columns:

- `lazy_update`

In the example below, a table called `source` is created, followed by a new table called `result_lazy_update`. In `result_lazy_update`, the new columns (`XSqrt` and `XOffset`) are memoized columns. Initially, the caches are empty, but as cells are accessed, they are populated. Eventually, the `XSqrt` cache will contain the mappings `(0) -> sqrt(0)`, `(1) -> sqrt(1)`, and `(2) -> sqrt(2)`, and the `XOffset` cache will contain the mappings `(0) -> 11`, `(1) -> 12`, and `(2) -> 13`.

```python
from deephaven import empty_table

source = empty_table(5).update(formulas=[
    "X = ii % 3"
])

result_lazy_update = source.lazy_update(formulas=[
    "XSqrt = sqrt(X)",
    "XOffset = X + 11"
])
```

- source
- result_lazy_update

# Performance analysis

Column types can have a significant impact on query performance. In this section, we will dive deeper into what happens with different column types and how they impact query performance. To illustrate how queries are affected, we will time how long some table operations take to execute. For simplicity, we will time execution using `time` in Python.

There are two cases studied here that illustrate how inputs to selection operations can affect performance:

- The first involves creating and operating on a table with no repeating column values.
- The second contains only a few unique values repeated many times.

## Case 1: Unique input values

### Make the initial table

This code creates an initial table with two in-memory columns of 10,000,000 rows each. The two columns contain the row index, and the row index + 1, respectively. All values are calculated once and stored in system memory.

```python
from deephaven import empty_table
import time

t = empty_table(10_000_000)

start = time.time()
initial_table = t.update(formulas=[
    "X = ii",
    "Y = ii + 1"
])
end = time.time()

print(f"setup - Elapsed time: {end - start} seconds.")
```

- Log

It takes about two seconds to calculate and store 20 million values in system memory.

### In-memory columns vs formula columns vs memoized columns

The following code creates three new tables:

- The first has two new in-memory columns.
- The second has two new formula columns.
- The third has two new memoized columns.

Each operation is timed and the execution time is printed to the console.

```python
start = time.time()
updated_table = initial_table.update(formulas=[
    "SqrtX = sqrt(X)",
    "SqrtY = sqrt(Y)"
])
end = time.time()

print(f"update - Elapsed time: {end - start} seconds.")

start = time.time()
update_viewed_table = initial_table.update_view(formulas=[
    "SqrtX = sqrt(X)",
    "SqrtY = sqrt(Y)"
])
end = time.time()

print(f"update_view - Elapsed time: {end - start} seconds.")

start = time.time()
lazy_updated_table = initial_table.lazy_update(formulas=[
    "SqrtX = sqrt(X)",
    "SqrtY = sqrt(Y)"
])
end = time.time()

print(f"lazy_update - Elapsed time: {end - start} seconds.")
```

- Log

None of these operations take very long, but what accounts for the differences in speed?

- The in-memory columns are the slowest to create because 20 million new double values must be calculated and stored in system memory.
- The formula and memoized columns are very fast to create since only the formulas are stored in memory.

### Downstream operations

The code below performs a downstream `select` to turn all columns into in-memory columns in the resulting table.

This is performed on all three types of columns, and the execution time for each is printed to the console.

```python
start = time.time()
downstream_on_update = updated_table.select()
end = time.time()

print(f"select on update - Elapsed time: {end - start} seconds.")

start = time.time()
downstream_on_update_view = update_viewed_table.select()
end = time.time()

print(f"select on update_view - Elapsed time: {end - start} seconds.")

start = time.time()
downstream_on_lazy_update = lazy_updated_table.select()
end = time.time()

print(f"select on lazy_update - Elapsed time: {end - start} seconds.")
```

- Log

Here we see that the downstream `select` is fastest on in-memory columns. Not only that, but the downstream `select` on memoized columns is extremely slow. How come?

- When performing `select` on in-memory columns, almost nothing has to be done, since the in-memory columns already exist.
- When performing `select` on formula columns, the formulas must be evaluated on every source column.
- When performing `select` on memoized columns, all source columns must be memoized. Once they are memoized, they are converted into in-memory columns. Memoizing millions of unique values is much slower than simply calculating 20 million values.

## Case 2: Repeating input values

### Make the initial table

This code creates an initial table with two in-memory columns of 10,000,000 rows each. The two columns contain values from 0 to 4 and values from 0 to 2, respectively. Other than the input data values, Case 2 is identical to Case 1.

```python
from deephaven import empty_table
import time

t = empty_table(10_000_000)

start = time.time()
initial_table = t.update(formulas=[
    "X = ii % 5",
    "Y = ii % 3"
])
end = time.time()

print(f"setup - Elapsed time: {end - start} seconds.")
```

- Log

It takes about two seconds to calculate and store 20 million values in system memory.

### In-memory columns vs formula columns vs memoized columns

The following code creates three new tables:

- The first has two new in-memory columns.
- The second has two new formula columns.
- The third has two new memoized columns.

Each operation is timed and the execution time is printed to the console.

```python
start = time.time()
updated_table = initial_table.update(formulas=[
    "SqrtX = sqrt(X)",
    "SqrtY = sqrt(Y)"
])
end = time.time()

print(f"update - Elapsed time: {end - start} seconds.")

start = time.time()
update_viewed_table = initial_table.update_view(formulas=[
    "SqrtX = sqrt(X)",
    "SqrtY = sqrt(Y)"
])
end = time.time()

print(f"update_view - Elapsed time: {end - start} seconds.")

start = time.time()
lazy_updated_table = initial_table.lazy_update(formulas=[
    "SqrtX = sqrt(X)",
    "SqrtY = sqrt(Y)"
])
end = time.time()

print(f"lazy_update - Elapsed time: {end - start} seconds.")
```

- Log

None of these operations take very long, but what accounts for the differences in speed?

- The in-memory columns are the slowest to create because 20 million new double values must be calculated and stored in system memory.
- The formula columns are very fast to create because only the two formulas are stored in system memory.
- The memoized columns are the fastest to create because no results have been memoized yet.

### Downstream operations

```python
start = time.time()
downstream_on_update = updated_table.select()
end = time.time()

print(f"select on update - Elapsed time: {end - start} seconds.")

start = time.time()
downstream_on_update_view = update_viewed_table.select()
end = time.time()

print(f"select on update_view - Elapsed time: {end - start} seconds.")

start = time.time()
downstream_on_lazy_update = lazy_updated_table.select()
end = time.time()

print(f"select on lazy_update - Elapsed time: {end - start} seconds.")
```

- Log

Here we see again that the downstream `select` is fastest on in-memory columns. This time, however, the downstream `select` on memoized columns is significantly faster than it was in the first case. How come?

- When performing `select` on in-memory columns, almost nothing has to be done, since the in-memory columns already exist.
- When performing `select` on formula columns, the formulas must be evaluated on every source column.
- When performing `select` on memoized columns, all source columns must be memoized. Once they are memoized, they are converted into in-memory columns. In this case, there are only 8 unique input values, so only 8 values are memoized. Even so, it is still the slowest of the three here; for more complex formulas with repeated data, memoized columns can outperform the other column types.

# Summary

Let's summarize the key takeaways from what's been presented:

## Which columns do you want from the source table?

### Subset

`select` and `view` both allow users to specify which columns from the source table to include in the result table. Both methods also allow users to create new columns derived from columns in the source table.

### All

`update`, `update_view`, and `lazy_update` all include every column from the source table in the result table. All three methods allow users to create new columns derived from columns in the source table.

## What column types do you want to create?

### In-memory

`select` and `update` create new in-memory columns. The column values are calculated once and stored in system memory.

In-memory columns are best used when content is expensive to evaluate and accessed many times in downstream operations.

### Formula

`view` and `update_view` create new formula columns. Formula columns store the formulas used to compute values.

Values are calculated on-demand as they are accessed.

Formula columns are best used when the formula is fast to compute or only small portions of column data are needed.

### Memoized

`lazy_update` creates new memoized columns. Memoized columns store the formulas used to compute values.

Values are retrieved from a cache. If values are not present in the cache, they are computed and cached. The formula inputs are used as the cache key, so the same inputs are never computed more than once.

Memoized columns are best used on data with a small number of unique input values.