Skip to main content
Version: Java (Groovy)

Select and create columns

You will often want to select, create, or modify columns in tables. There are five table operations available for this:

All five operations allow you to address column selection and projection on real-time (updating), static, and snapshotted tables.

In all cases, the syntax follows a mostly similar pattern. All five of the scripts below produce the same resultant table, but how and when the engine manifests the results is different in each case.

source = newTable(intCol("Column1", 1, 1, 2, 3, 4)).update("Column2 = i*2")

tableSelect = source.select("Column1", "Column2", "NewColumnA = Column1 + Column2^2"
)
tableView = source.view("Column1", "Column2", "NewColumnA = Column1 + Column2^2")

tableUpdate = source.update("NewColumnA = Column1 + Column2^2")

tableUpdateView = source.updateView("NewColumnA = Column1 + Column2^2")

tableLazyUpdate = source.lazyUpdate("NewColumnA = Column1 + Column2^2")

Within these selection and update operations you can use query strings to transform data; do math; use common operators, literals, objects, and special variables; cast data types; parse and manipulate strings; handle arrays; use built-in functions; operate on time; bin data; introduce ternaries; and address other vital use cases by generating a new column, manipulating an existing column, or combining and decorating multiple columns from a table. You can also use Java methods, Groovy, and user-defined and 3rd-party library functions within these table operations.

For example, below is an inelegant, but demonstrative script of using query strings within an update operation, as representative of the other four methods. The projections below are just just a taste of what you can do with these operations and query strings.

// Create a sample table
source = newTable(
stringCol("X", "A", "B", "C", "D", "E", "F", "G"),
intCol("Y", 1, -2, 3, -4, 5, -6, 7),
intCol("Z", 2, 3, 1, 2, 3, 1, 2),
)

// Basic example formula
f = { a, b ->
return a + b
}

// Demonstrate some projections
result = source.update(
"ColSpecialChar = i",
"ColOperator = Y*Z/3",
"ColCast = (int)ColOperator",
"ColBuilInFcn = absAvg(Y, Z)",
"ColFormula = (int)f(Y, Z)",
"ColTernary = ColCast <= 1 ? pow(ColFormula,2) : pow(ColFormula,3)",
"ColStringStuff = X + `_hello`",
)

Choose the right column selection method

Your decision about which of the five methods to use largely hinges, case by case, on how you answer the following two questions.

"Do you want all columns in the original table or only the ones you itemized in your script?"

Notice below how tSelect and tView return only Column2, whereas the others return the original Column1, as well as the new Column2.

source = emptyTable(5).update("Column1 = i")

// select() and view() have identical syntax
// These create a table only with Column2
tSelect = source.select("Column2 = Column1^2")
tView = source.view("Column2 = Column1^2")

// update(), updateView(), and lazyUpdate() have identical syntax
// These create a table with the original set of columns (i.e., Column1) plus Column2
tUpdate = source.update("Column2 = Column1^2")
tUpdateView = source.updateView("Column2 = Column1^2")
tLazyUpdate = source.lazyUpdate("Column2 = Column1^2")

"Do you want to write new columns to memory, or calculate them on demand?"

In this context, on demand implies that the new columns are not stored in memory. Rather, a formula is stored that is used to recalculate each cell every time it is accessed by a downstream node in the DAG, client, or other consumer.

  • Both view and updateView handle column calculation in this way. (As noted previously, updateView returns the table plus additionally articulated columns, whereas view returns only the selected columns.)
  • select and update calculate columns incrementally, writing the results to memory.
  • view, updateView, and lazyUpdate will calculate new columns' cells only on-demand.

It is recommended you use updateView or view when your use case suggests one, some, or all of the following are true:

  1. the column formula is fast to compute,
  2. only a small portion of the set of rows is being accessed,
  3. cells are accessed very few times, or
  4. memory usage must be minimized.

For other cases, consider using select or update. (lazyUpdate) is more rarely used than the other four.

caution

When using view or updateView, non-deterministic methods (e.g., random numbers, current time, or mutable structures) produce unstable results. Downstream operations on these results produce undefined behavior. Non-deterministic methods should use select or update instead.

Extreme cases #1-4 from the note above are easy to determine. The challenge in determinging the fitness of updateView or view (over their update or select counterparts) comes when use case conditions are less obvious. Here are some good rules of thumb for when to prefer the updateView or view over the alternatives:

  • The table is the last node in the DAG and it isn't obvious how the calculated column will be used.
  • The table updates in real-time, the new column will therefore be ticking, and clients/comsumers thereof will be sporadic.
  • The table is big, the column is intermediate and only used once in downstream calculations.
  • The only consumers of the new column are humans using GUIs (and therefore restricted to the relatively small scale of viewports in their UI).
Further reading

An example exists in a blog post that shows the performance differences in suporting just a UI versus a downstream DAG calculation for a table with 135 million rows. It articulates the performance of respective selection method choices.

When to use lazyUpdate

The lazyUpdate method creates a new table containing a new cached formula column for each argument.

Similar to updateView, with lazyUpdate column formulas are computed on-demand, deferring computation and memory until it is required.

When performing a lazyUpdate, cell values are stored in memory in a cache. Because results are cached (memoized) for the set of input values, the same input values will never be computed twice. Existing results are referenced without additional memory allocation. THis improves performance when the number of distinct inputs are low relative to the number of rows in the table.

note

The syntax for the lazyUpdate, updateView, and update methods is identical, as is the resulting table.

lazyUpdate is recommended for small sets of unique input values. In this case, lazyUpdate uses less memory than update and requires less computation than updateView. However, if there are many unique input values, update will be more efficient because lazyUpdate stores the formula inputs and result in a map, whereas update stores the values more compactly in an array.

Here is an example of lazyUpdate. Because only it contains two values exist in column C (2 and 5), sqrt(2) is computed exactly one time, and sqrt(5) is computed exactly one time. The values are cached for future use, so the subsequent two calculations of sqrt(5) are free from a compute and memory allocation perspective. This is most appropriately used when the distinct set of computational results is small relative to the row count.

source = newTable(
stringCol("A", "The", "At", "Is", "On"),
intCol("B", 1, 2, 3, 4),
intCol("C", 5, 2, 5, 5),
)

result = source.lazyUpdate("Y = sqrt(C)")

Summary of selection methods

The following table showcases Deephaven's five selection methods and provides a quick visual reference for the differences between them.

Source columns in new tableNew column type
Method NameSubsetAllIn-memoryFormulaMemoized
select
view
update
update_view
lazy_update