Select and create columns
You will often want to select, create, or modify columns in tables. There are five table operations available for this:
All five operations allow you to address column selection and projection on real-time (updating), static, and snapshotted tables.
In all cases, the syntax follows a mostly similar pattern. All five of the scripts below produce the same resultant table, but how and when the engine manifests the results is different in each case.
source = newTable(intCol("Column1", 1, 1, 2, 3, 4)).update("Column2 = i*2")
tableSelect = source.select("Column1", "Column2", "NewColumnA = Column1 + Column2^2"
)
tableView = source.view("Column1", "Column2", "NewColumnA = Column1 + Column2^2")
tableUpdate = source.update("NewColumnA = Column1 + Column2^2")
tableUpdateView = source.updateView("NewColumnA = Column1 + Column2^2")
tableLazyUpdate = source.lazyUpdate("NewColumnA = Column1 + Column2^2")
- source
- tableSelect
- tableView
- tableUpdate
- tableUpdateView
- tableLazyUpdate
Within these selection and update operations you can use query strings to transform data; do math; use common operators, literals, objects, and special variables; cast data types; parse and manipulate strings; handle arrays; use built-in functions; operate on time; bin data; introduce ternaries; and address other vital use cases by generating a new column, manipulating an existing column, or combining and decorating multiple columns from a table. You can also use Java methods, Groovy, and user-defined and 3rd-party library functions within these table operations.
For example, below is an inelegant, but demonstrative script of using query strings within an update
operation, as representative of the other four methods. The projections below are just just a taste of what you can do with these operations and query strings.
// Create a sample table
source = newTable(
stringCol("X", "A", "B", "C", "D", "E", "F", "G"),
intCol("Y", 1, -2, 3, -4, 5, -6, 7),
intCol("Z", 2, 3, 1, 2, 3, 1, 2),
)
// Basic example formula
f = { a, b ->
return a + b
}
// Demonstrate some projections
result = source.update(
"ColSpecialChar = i",
"ColOperator = Y*Z/3",
"ColCast = (int)ColOperator",
"ColBuilInFcn = absAvg(Y, Z)",
"ColFormula = (int)f(Y, Z)",
"ColTernary = ColCast <= 1 ? pow(ColFormula,2) : pow(ColFormula,3)",
"ColStringStuff = X + `_hello`",
)
- source
- result
Choose the right column selection method
Your decision about which of the five methods to use largely hinges, case by case, on how you answer the following two questions.
"Do you want all columns in the original table or only the ones you itemized in your script?"
select
andview
return only the columns you itemize.update
,updateView
, andlazyUpdate
return all of the original columns of the table, as well as the columns you itemize.
Notice below how tSelect
and tView
return only Column2
, whereas the others return the original Column1
, as well as the new Column2
.
source = emptyTable(5).update("Column1 = i")
// select() and view() have identical syntax
// These create a table only with Column2
tSelect = source.select("Column2 = Column1^2")
tView = source.view("Column2 = Column1^2")
// update(), updateView(), and lazyUpdate() have identical syntax
// These create a table with the original set of columns (i.e., Column1) plus Column2
tUpdate = source.update("Column2 = Column1^2")
tUpdateView = source.updateView("Column2 = Column1^2")
tLazyUpdate = source.lazyUpdate("Column2 = Column1^2")
- source
- tSelect
- tView
- tUpdate
- tUpdateView
- tLazyUpdate
"Do you want to write new columns to memory, or calculate them on demand?"
In this context, on demand implies that the new columns are not stored in memory. Rather, a formula is stored that is used to recalculate each cell every time it is accessed by a downstream node in the DAG, client, or other consumer.
- Both
view
andupdateView
handle column calculation in this way. (As noted previously,updateView
returns the table plus additionally articulated columns, whereasview
returns only the selected columns.) select
andupdate
calculate columns incrementally, writing the results to memory.view
,updateView
, andlazyUpdate
will calculate new columns' cells only on-demand.
It is recommended you use updateView
or view
when your use case suggests one, some, or all of the following are true:
- the column formula is fast to compute,
- only a small portion of the set of rows is being accessed,
- cells are accessed very few times, or
- memory usage must be minimized.
For other cases, consider using select
or update
. (lazyUpdate
is more rarely used than the other four.)
When using view
or updateView
, non-deterministic methods (e.g., random numbers, current time, or mutable structures) produce unstable results. Downstream operations on these results produce undefined behavior. Non-deterministic methods should use select
or update
instead.
Extreme cases #1-4 from the note above are easy to determine. The challenge in determinging the fitness of updateView
or view
(over their update
or select
counterparts) comes when use case conditions are less obvious. Here are some good rules of thumb for when to prefer the updateView
or view
over the alternatives:
- The table is the last node in the DAG and it isn't obvious how the calculated column will be used.
- The table updates in real-time, the new column will therefore be ticking, and clients/consumers thereof will be sporadic.
- The table is big, the column is intermediate and only used once in downstream calculations.
- The only consumers of the new column are humans using GUIs (and therefore restricted to the relatively small scale of viewports in their UI).
An example exists in a blog post that shows the performance differences in suporting just a UI versus a downstream DAG calculation for a table with 135 million rows. It articulates the performance of respective selection method choices.
When to use lazyUpdate
The lazyUpdate
method creates a new table containing a new cached formula column for each argument.
Similar to updateView
, with lazyUpdate
column formulas are computed on-demand, deferring computation and memory until it is required.
When performing a lazyUpdate
, cell values are stored in memory in a cache. Because results are cached (memoized) for the set of input values, the same input values will never be computed twice. Existing results are referenced without additional memory allocation. This improves performance when the number of distinct inputs are low relative to the number of rows in the table.
The syntax for the lazyUpdate
, updateView
, and update
methods is identical, as is the resulting table.
lazyUpdate
is recommended for small sets of unique input values. In this case, lazyUpdate
uses less memory than update
and requires less computation than updateView
. However, if there are many unique input values, update
will be more efficient because lazyUpdate
stores the formula inputs and result in a map, whereas update
stores the values more compactly in an array.
Here is an example of lazyUpdate
. Because only it contains two values exist in column C (2 and 5), sqrt(2) is computed exactly one time, and sqrt(5) is computed exactly one time. The values are cached for future use, so the subsequent two calculations of sqrt(5) are free from a compute and memory allocation perspective. This is most appropriately used when the distinct set of computational results is small relative to the row count.
source = newTable(
stringCol("A", "The", "At", "Is", "On"),
intCol("B", 1, 2, 3, 4),
intCol("C", 5, 2, 5, 5),
)
result = source.lazyUpdate("Y = sqrt(C)")
- source
- result
Summary of selection methods
The following table showcases Deephaven's five selection methods and provides a quick visual reference for the differences between them.
Source columns in new table | New column type | ||||
---|---|---|---|---|---|
Method Name | Subset | All | In-memory | Formula | Memoized |
select | |||||
view | |||||
update | |||||
updateView | |||||
lazyUpdate |