Select and create columns

You will often want to select, create, or modify columns in tables. There are five table operations available for this:

All five operations allow you to address column selection and projection on real-time (updating), static, and snapshotted tables.

In all cases, the syntax follows a mostly similar pattern. All five of the scripts below produce the same resultant table, but how and when the engine manifests the results is different in each case.

from deephaven import new_table
from deephaven.column import int_col

source = new_table([int_col("Column_1", [1, 1, 2, 3, 4])]).update(
    formulas=["Column_2 = i*2"]
)

table_select = source.select(
    formulas=["Column_1", "Column_2", "New_Column_A = Column_1 + Column_2^2"]
)
table_view = source.view(
    formulas=["Column_1", "Column_2", "New_Column_A = Column_1 + Column_2^2"]
)

table_update = source.update(formulas=["New_Column_A = Column_1 + Column_2^2"])
table_update_view = source.update_view(["New_Column_A = Column_1 + Column_2^2"])

table_lazy_update = source.lazy_update(["New_Column_A = Column_1 + Column_2^2"])

Within these selection and update operations you can use query strings to transform data; do math; use common operators, literals, objects, and special variables; cast data types; parse and manipulate strings; handle arrays; use built-in functions; operate on time; bin data; introduce ternaries; and address other vital use cases by generating a new column, manipulating an existing column, or combining and decorating multiple columns from a table. You can also use Java methods, Python, and user-defined and 3rd-party library functions within these table operations.

For example, below is an inelegant, but demonstrative script of using query strings within an update operation, as representative of the other four methods. The projections below are just just a taste of what you can do with these operations and query strings.

# Create a sample table
from deephaven import new_table
from deephaven.column import string_col, int_col

source = new_table(
    [
        string_col("X", ["A", "B", "C", "D", "E", "F", "G"]),
        int_col("Y", [1, -2, 3, -4, 5, -6, 7]),
        int_col("Z", [2, 3, 1, 2, 3, 1, 2]),
    ]
)


# Basic example formula
def f(a, b):
    return a * b


# Demonstrate some projections
result = source.update(
    formulas=[
        "Col_Special_Char = i",
        "Col_Operator = Y*Z/3",
        "Col_Cast = (int)Col_Operator",
        "Col_BuilIn_Fcn = absAvg(Y, Z)",
        "Col_Formula = (int)f(Y, Z)",
        "Col_Ternary = Col_Cast <= 1 ? pow(Col_Formula,2) : pow(Col_Formula,3)",
        "Col_String_Stuff = X + `_hello`",
    ]
)

Choose the right column selection method

Your decision about which of the five methods to use largely hinges, case by case, on how you answer the following two questions.

"Do you want all columns in the original table or only the ones you itemized in your script?"

select and view return only the columns you itemize.
update, update_view, and lazy_update return all of the original columns of the table, as well as the columns you itemize.

Notice below how t_select and t_view return only Column_2, whereas the others return the original Column_1, as well as the new Column_2.

from deephaven import empty_table

source = empty_table(5).update(formulas=["Column_1 = i"])

# select() and view() have identical syntax
# These create a table only with Column_2
t_select = source.select(formulas=["Column_2 = Column_1^2"])
t_view = source.view(formulas=["Column_2 = Column_1^2"])

# update(), update_view(), and lazy_update() have identical syntax
# These create a table with the original set of columns (i.e., Column_1) plus Column_2
t_update = source.update(formulas=["Column_2 = Column_1^2"])
t_update_view = source.update_view(formulas=["Column_2 = Column_1^2"])
t_lazy_update = source.lazy_update(formulas=["Column_2 = Column_1^2"])

"Do you want to write new columns to memory, or calculate them on demand?"

In this context, on demand implies that the new columns are not stored in memory. Rather, a formula is stored that is used to recalculate each cell every time it is accessed by a downstream node in the DAG, client, or other consumer.

Both view and update_view handle column calculation in this way. (As noted previously, update_view returns the table plus additionally articulated columns, whereas view returns only the selected columns.)
select and update calculate columns incrementally, writing the results to memory.
view, update_view, and lazy_update will calculate new columns' cells only on-demand.

It is recommended you use update_view or view when your use case suggests one, some, or all of the following are true:

the column formula is fast to compute,
only a small portion of the set of rows is being accessed,
cells are accessed very few times, or
memory usage must be minimized.

For other cases, consider using select or update. (lazy_update) is more rarely used than the other four.

Caution

When using view or update_view, non-deterministic methods (e.g., random numbers, current time, or mutable structures) produce unstable results. Downstream operations on these results produce undefined behavior. Non-deterministic methods should use select or update instead.

Extreme cases #1-4 from the note above are easy to determine. The challenge in determinging the fitness of update_view or view (over their update or select counterparts) comes when use case conditions are less obvious. Here are some good rules of thumb for when to prefer the update_view or view over the alternatives:

The table is the last node in the DAG and it isn't obvious how the calculated column will be used.
The table updates in real-time, the new column will therefore be ticking, and clients/comsumers thereof will be sporadic.
The table is big, the column is intermediate and only used once in downstream calculations.
The only consumers of the new column are humans using GUIs (and therefore restricted to the relatively small scale of viewports in their UI).

Note

Further reading An example exists in a blog post that shows the performance differences in suporting just a UI versus a downstream DAG calculation for a table with 135 million rows. It articulates the performance of respective selection method choices.

When to use `lazy_update`

The lazy_update method creates a new table containing a new cached formula column for each argument.

Similar to update_view, with lazy_update column formulas are computed on-demand, deferring computation and memory until it is required.

When performing a lazy_update, cell values are stored in memory in a cache. Because results are cached (memoized) for the set of input values, the same input values will never be computed twice. Existing results are referenced without additional memory allocation. THis improves performance when the number of distinct inputs are low relative to the number of rows in the table.

Note

The syntax for the lazy_update, update_view, and update methods is identical, as is the resulting table.

lazy_update is recommended for small sets of unique input values. In this case, lazy_update uses less memory than update and requires less computation than update_view. However, if there are many unique input values, update will be more efficient because lazy_update stores the formula inputs and result in a map, whereas update stores the values more compactly in an array.

Here is an example of lazy_update. Because only it contains two values exist in column C (2 and 5), sqrt(2) is computed exactly one time, and sqrt(5) is computed exactly one time. The values are cached for future use, so the subsequent two calculations of sqrt(5) are free from a compute and memory allocation perspective. This is most appropriately used when the distinct set of computational results is small relative to the row count.

from deephaven import new_table
from deephaven.column import string_col, int_col

source = new_table(
    [
        string_col("A", ["The", "At", "Is", "On"]),
        int_col("B", [1, 2, 3, 4]),
        int_col("C", [5, 2, 5, 5]),
    ]
)

result = source.lazy_update(formulas=["Y = sqrt(C)"])

Summary of selection methods

The following table showcases Deephaven's five selection methods and provides a quick visual reference for the differences between them.

	Source columns in new table		New column type
Method Name	Subset	All	In-memory	Formula	Memoized
select
view
update
update_view
lazy_update