Add columns and perform calculations

Deephaven offers several methods for selecting (or deselecting) columns of data in a query. These methods also enable users to eliminate or modify existing columns of data, or create new columns of data. However, choosing the right method for your query depends on a number of factors - do you want to keep all the columns from the source table in your result? How big is your dataset? Will your analyses run faster by storing the data in memory or by computing calculations "on-the-fly"?

select and view

The select and view methods are typically used to select which columns from a particular source table you'd like to retain in a new table. You can also use formulas to define and create new columns; however, the important point is that only the columns specifically included in the argument will appear in the new table.

For example, in the previous section we wrote a query that reduced the number of columns to four:

t = db.t("LearnDeephaven" , "StockTrades")\
    .where("Date=`2017-08-23`")\
    .sort("Sym")\
    .renameColumns("Symbol = Sym")\
    .view("Timestamp", "Symbol", "Last", "Size")\
    .head(10)

Using select instead would give the same results:

t = db.t("LearnDeephaven" , "StockTrades")\
    .where("Date=`2017-08-23`")\
    .sort("Sym")\
    .renameColumns("Symbol = Sym")\
    .head(10)\
    .select("Timestamp", "Symbol", "Last", "Size")

The difference between the two is that view() is lazy, while select() is eager. In this context, this means that view() retrieves data values from on-disk column files just-in-time, when they are needed for display or for use in another query, while select() retrieves all values from the selected columns into memory when the query is executed.

If a query's results will be displayed in the UI in a grid, and the user may only view a small subset of the available rows, then view is an excellent choice, because it minimizes disk access and memory utilization by the query. On the other hand, if all the data from the column(s) will be used, and, especially if they will be accessed multiple times, select will provide an overall more efficient set of queries by reading the data once and having it quickly accessible for subsequent uses.

You'll notice that we moved select to the end of the query: having the .select() before the .head() says to read all of the matching values, but then discard all but the first 10 rows. Putting the .head() before the .select() says to get the row indexes of the first 10 rows, and then retrieve the values for them into memory. This simple change makes a big difference in the amount of work the query needs to do.

In some cases, such as when a table's data will be exported to an external application, like Python or R, it will be necessary to force all of the table's data into memory. This is called coalescing the table. It can be accomplished by adding .select() to the end of a query.

update, updateView, and lazyUpdate

Deephaven's user interfaces allow users a great deal of flexibility to sort, filter, and otherwise alter the presentation of data. In conjunction with this, the UI is designed to retrieve only the data being currently displayed. If you scroll through a grid showing a large table, you will notice that the data doesn't "paint" in until you stop scrolling. This just-in-time data retrieval makes the system much more responsive when users are using grids to interactively analyze and browse data.

To support this approach when manipulating data in a query, Deephaven provides three commands that apply the same expressions or formulae in different ways:

  • update - applies the expression and stores the result in a new column.
  • updateView - stores the expression, and applies it just-in-time when retrieving data to display or to use in another expression.
  • lazyUpdate - works like updateView, but stores each result when it's calculated.

Try this example:

t = db.t("LearnDeephaven" , "StockQuotes")\
     .where("Date=`2017-08-25`")\
     .updateView("Spread = Ask - Bid")

This query uses updateView, which is fast to get initial results from a large table being displayed in a grid, because it only needs to calculate enough rows to show to the user. In the example above, Spread was only calculated for the 20 or so rows shown in the UI. However, if the user scrolls to other parts of the table, and then back to the first rows, Spread for those rows will need to be calculated a second time. So, updateView is good when you will usually only be working with a small subset of the result set, and when you usually won't be retrieving the same rows multiple times.

update typically takes the most time and memory. It takes time to do the update for all rows at once, and it takes heap space to store the results; e.g., if the formula produces a double value, update will take 8 bytes per row and the expression will be evaluated for all rows in the table when the query is executed.

Also keep in mind that if we were to write another query:

t2 = t.update("SpreadPct = Spread / Ask * 100.0")

that uses Spread, this query is also going to "touch" the rows and cause Spread to be calculated. Because t2 uses update, it will force the Spread expression to be evaluated for each row. lazyUpdate provides a compromise, by storing the just-in-time results. If lazyUpdate had been used in the example above, then the user scrolling away from the first rows, and back, would not require Spread to be calculated a second time. The caveat here is that it takes more memory and time to keep track of whether a row has had a value calculated for it already, so, if all the rows will end up eventually getting touched, lazyUpdate will take more memory, and more time overall, than update.

These examples are pretty simple, in that they do not use ticking data or aggregations. More advanced queries may end up touching underlying rows multiple times, and, as such, it's important to consider overall how many times a calculated value is likely to be used, when considering whether update, updateView, or lazyUpdate is the best option.