Skip to main content
Version: Java (Groovy)

Basic Table Operations

This section will cover some table operations that appear in almost all queries. These table operations use query strings extensively, which are discussed in detail in the next section.

Table operations are integral to the Deephaven Query Language (DQL). The previous sections of the crash course used five:

  • updateView, which adds columns to or modifies existing columns in a table.
  • sumBy, which computes the sum of all columns in a table by a grouping column.
  • dropColumns, which drops columns from a table.
  • sort, which sorts a table by the given columns from least to greatest.
  • sortDescending, which sorts a table by the given columns from greatest to least.

Table operations are an integral component of DQL. You've already seen several: updateView, sumBy, dropColumns, sort and sortDescending. You can think of these as different transformations being applied to the data in the table. This section will outline some basic table operations that make up the backbones of the most common queries.

Many of the code blocks in this notebook use the following table, t, as the root table. This is a simple table with 100 rows and contains only a Timestamp column.

t = emptyTable(100).update("Timestamp = '2015-01-01T00:00:00 ET' + 'PT1m' * ii")

Basic column manipulation

Add and modify columns

update creates in-memory columns. An in-memory column is one where the calculations are performed immediately, and results are stored in memory. The following code block adds three columns to t.

  • IntRowIndex: 32-bit integer row indices.
  • LongRowIndex: 64-bit integer row indices.
  • Group: modulo of IntRowIndex and 5.
tUpdated = t.update("IntRowIndex = i", "LongRowIndex = ii", "Group = IntRowIndex % 5")

updateView creates formula columns. A formula column is one where only the formula is stored in memory when called. Results are then computed on the fly as needed.

The following code block adds two new columns to t.

  • IntRowIndex: 32-bit integer row indices.
  • Group: modulo of IntRowIndex and 5.
tUpdateViewed = t.updateView("IntRowIndex = i", "Group = IntRowIndex % 5")

lazyUpdate creates memoized columns. A memoized column performs calculations immediately and stores them in memory, but a calculation is only performed and stored once for each unique set of input values.

The following code block adds two new columns to tUpdated.

  • GroupSqrt: square root of Group.
  • GroupSquared: square of Group.

Because Group has only five unique values, only 10 calculations are needed to populate the two new columns.

tLazyUpdated = tUpdated.lazyUpdate("GroupSqrt = sqrt(Group)", "GroupSquared = Group * Group")

Select columns

select and view both create tables containing subsets of input columns and new columns computed from input columns. The big difference between the methods is how much memory they allocate and when formulas are evaluated. Performance and memory considerations dictate the best method for a particular use case.

select, like update, creates in-memory columns. The following code block selects two existing columns from t_updated and adds a new column.

tSelected = tUpdated.select("Timestamp", "IntRowIndex", "RowPlusOne = IntRowIndex + 1")

view, like updateView, creates formula columns. The following code block selects two existing columns from tUpdated and adds a new column.

tViewed = tUpdated.view("Timestamp", "LongRowIndex", "RowPlusOne = LongRowIndex + 1")

Drop columns

dropColumns removes columns from a table.

tDropped = tUpdated.dropColumns("IntRowIndex", "LongRowIndex")

Alternatively, view and select can be used to remove columns by omission. Both tables created below drop IntRowIndex and LongRowIndex by not including them in the list of columns passed in.

tDroppedViaView = tUpdated.view("Timestamp", "Group")
tDroppedViaSelect = tUpdated.select("Timestamp", "Group")

There's a lot to consider when choosing between in-memory, formula, and memoized columns in queries. For more information, see choose the right selection method.

Filter

By condition

where filters rows from a table based on a condition. The following code block keeps only rows in tUpdated where IntRowIndex is an even number.

tFiltered = tUpdated.where("IntRowIndex % 2 == 0")

Multiple filters can be applied in a single statement.

tFiltered2 = tUpdated.where("Group == 3", "IntRowIndex % 2 == 0")

A where operation that applies multiple conditions keeps only data that meets all of the criteria set forth. To keep rows that meet one or more of the specified criteria, use where with the or operator(||).

tFiltered3 = tUpdated.where("Group == 3 || IntRowIndex % 2 == 0")

In addition to where, Deephaven offers whereIn and whereNotIn, which filter table data based on another table.

See the filtering guide for more information.

By row position

Methods that filter data by row position only keep a portion of data at the top, middle, or bottom of a table.

  • head: Keep the first n rows.
  • headPct: Keep the first n% of rows.
  • tail: Keep the last n rows.
  • tailPct: Keep the last n% of rows.
  • slice: Keep rows between start and end.
  • slicePct: Keep rows between start% and end%.
// 10 rows at the top
tHead = tUpdated.head(10)

// 10% of rows at the top
tHeadPct = tUpdated.headPct(0.1)

// 15 rows at the bottom
tTail = tUpdated.tail(15)

// 15% of rows at the bottom
tTailPct = tUpdated.tailPct(0.15)

// 20 rows in the middle
tSlice = tUpdated.slice(40, 60)

// 20% of rows in the middle
tSlicePct = tUpdated.slicePct(0.4, 0.6)

See the filtering guide for more information.

Sort

The sort method sorts a table based on one or more columns. The following code block sorts tStatic by X in ascending order.

tSorted = tUpdated.sort("Group")

Tables can be sorted by more than one column.

tSortedMultiple = tUpdated.sort("Group", "IntRowIndex")

To sort in descending order, use sortDescending.

tSortedDesc = tUpdated.sortDescending("LongRowIndex")

To sort multiple columns in different directions, use sort with SortColumn's asc and desc methods.

sortColumns = [
SortColumn.asc(ColumnName.of("Group")),
SortColumn.desc(ColumnName.of("IntRowIndex"))
]

tSortMulti = tUpdated.sort(sortColumns)

See the sorting guide for more information.

Group and aggregate data

Grouping data places rows into groups based on zero or more supplied key columns. Aggregation calculates summary statistics over a group of data. Grouping and aggregation are key components of data analysis, especially in Deephaven queries.

The examples in this section will use the following table.

t = emptyTable(100).update(
"Timestamp = '2015-01-01T00:00:00 ET' + 'PT1m' * ii",
"X = randomInt(0, 100)",
"Y = randomDouble(0, 10)",
"Group = i % 5",
"Letter = (i % 2 == 0) ? `A` : `B`",
)

Group and ungroup data

groupBy groups table data into arrays. Entire tables can be grouped.

tGrouped = t.groupBy()

Data can be grouped by one or more key columns.

tGroupedByGroup = t.groupBy("Group")
tGroupedByMultiple = t.groupBy("Group", "Letter")

ungroup is the inverse of groupBy.

tUngrouped = tGrouped.ungroup()
tUngrouped2 = tGroupedByGroup.ungroup()
tUngrouped3 = tGroupedByMultiple.ungroup()

See the grouping and ungrouping guide for more information.

Single aggregations

Single aggregations apply a single aggregation to an entire table. See here for a list of single aggregators.

The following code uses avgBy to calculate the aggregated average of columns X and Y from the table t. No grouping columns are given, so the averages are calculated over the entire table.

tAvg = t.view("X", "Y").avgBy()

Aggregations are often calculated for groups of data. The following example calculates the average of X and Y for each unique value in Group.

tAvgByGroup = t.view("Group", "X", "Y").avgBy("Group")

Single aggregations can be performed using multiple grouping columns.

tAvgByMultiple = t.view("Group", "Letter", "X", "Y").avgBy("Group", "Letter")

Multiple aggregations

To apply multiple aggregations in a single operation, pass one or more of the aggregators into aggBy.

The following code block calculates the average of X and the median of Y, grouped by Group and Letter. It renames the resultant columns AvgX and MedianY, respectively.

import static io.deephaven.api.agg.Aggregation.AggAvg
import static io.deephaven.api.agg.Aggregation.AggMed

aggList = [AggAvg("AvgX = X"), AggMed("MedianY = Y")]

tMultipleAggs = t.view("Group", "Letter", "X", "Y").aggBy(
aggList, "Group", "Letter")

Rolling aggregations

Most platforms offer aggregation functionality similar to the dedicated and multiple aggregations presented above (though none will work so easily on real-time data). However, Deephaven is unique and powerful in its vast library of cumulative, moving, and windowed calculations, facilitated by the updateBy table operation and the io.deephaven.api.updateby module.

The following code block calculates the cumulative sum of X in t.

tCumSum = t.view("X").updateBy(CumSum("SumX = X"))

Aggregations with updateBy show the running total as it progresses through the table.

updateBy can also limit these summary statistics to subsets of table data defined by a number of rows or amount of time backward, forward, or both. The following code block calculates the sum of the prior 10 rows in column X of table t.

tWindowedSum = t.view("X").updateBy(RollingSum(10, "TenRowSumX = X"))

updateBy also supports performing these calculations over groups of data. The following code block performs the same calculations as above, but groups the data by Group and Letter.

updateByOps = [RollingSum(10, "TenRowSumX = X"), CumSum("SumX = X")]

tUpdatedByGrouped = t.updateBy(updateByOps, "Group", "Letter")

Additionally, calculations can be windowed by time. The following code block calculates a 16-second rolling average of X, grouped by Group.

tRollingAvgTime = t.updateBy(RollingAvg("Timestamp", parseDuration("PT16s"), "AvgX = X"), "Group")

Windows can look backward, forward, or both ways. The following example calculates the rolling average of the following windows:

  • The previous 9 seconds.
  • The current row and the previous 8 rows.
  • The current row, the previous 10 rows, and the next 10 rows.
  • The next 8 rows.
updateByOps = [
RollingAvg("Timestamp", parseDuration("PT9s"), "BackwardTimeAvgX = X"),
RollingAvg(9, "BackwardRowAvgX = X"),
RollingAvg(11, 10, "CenteredRowAvgX = X"),
RollingAvg(0, 8, "ForwardRowAvgX = X"),
]

tWindowed = t.updateBy(updateByOps)

NOTE: A backward-looking window counts the current row as the first row backward. A forward-looking window counts the row ahead of the current row as the first row forward.

See the updateBy user guide to learn more.

Combine tables

There are two different ways to combine tables in Deephaven: merging and joining. Merging tables can be visualized as a vertical stacking of tables, whereas joining is more horizontal in nature, appending rows from one table to another based on common columns.

Each subsection below defines its own tables to demonstrate merging and joining tables in Deephaven.

Merge tables

merge combines an arbitrary number of tables provided they all have the same schema.

t1 = emptyTable(10).update("Table = 1", "X = randomDouble(0, 10)", "Y = randomBool()")
t2 = emptyTable(6).update("Table = 2", "X = 1.1 * i", "Y = randomBool()")
t3 = emptyTable(3).update("Table = 3", "X = sin(0.1 * i)", "Y = true")

tMerged = merge(t1, t2, t3)

Join tables

Joining tables combines two tables based on one or more key columns. The key column(s) define data that is commonly shared between the two tables. The table on which the join operation is called is the left table, and the table passed as an argument is the right table.

Consider the following three tables.

t1 = newTable(
stringCol("Letter", "A", "B", "C", "B", "C", "F"),
intCol("Value", 5, 9, 19, 3, 11, 1),
booleanCol("Truth", true, true, true, false, false, true),
)

t2 = newTable(
stringCol("Letter", "C", "A", "B", "D", "E", "F"),
stringCol("Color", "Blue", "Blue", "Green", "Yellow", "Red", "Orange"),
intCol("Count", 35, 19, 12, 20, 26, 7),
intCol("Remaining", 5, 21, 28, 20, 14, 8),
)

t3 = newTable(
stringCol("Letter", "A", "E", "D", "B", "C"),
stringCol("Color", "Blue", "Red", "Yellow", "Green", "Black"),
intCol("Value", 5, 9, 19, 3, 11),
booleanCol("Truth", true, true, true, false, false),
)

The tables t1 and t2 have the common column Letter. Moreover, Letter contains matching values in both tables. So, these tables can be joined on the Letter column.

tJoined = t1.naturalJoin(t2, "Letter")

Joins can use more than one key column. The tables t2 and t3 have both the Letter and Color columns in common, and they both have matching values. The following code block joins the two tables on both columns.

tJoined2 = t2.naturalJoin(t3, "Letter, Color")

By default, every join operation in Deephaven appends all columns from the right table onto the left table. An optional third argument can be used to specify which columns to append. The following code block joins t2 and t3 on the Letter column, but only appends the Value column from t3.

tJoinedSubset = t2.naturalJoin(t3, "Letter", "Value")

t2 and t3 share the Color column, so any attempt to append that onto t2 results in a name conflict error. This can be avoided by either renaming the column, or by using the joins argument to specify which columns to append.

The following example renames Color in t3 to Color2 when joining the tables.

tJoinedRename = t2.naturalJoin(t3, "Letter", "Color2 = Color, Value")

Join operations in Deephaven come in two distinct flavors: exact and relational joins and time series and range joins.

Exact and relational joins

Exact and relational joins combine data from two tables based on exact matches in one or more related key columns.

Exact joins

Exact joins keep all rows from a left table, and append columns from a right table onto the left table.

Consider the following tables.

tLeft1 = newTable(
stringCol("Color", "Blue", "Magenta", "Yellow", "Magenta", "Beige", "Blue"),
intCol("Count", 5, 0, 2, 3, 7, 1),
)

tRight1 = newTable(
stringCol("Color", "Beige", "Yellow", "Blue", "Magenta", "Green"),
doubleCol("Weight", 2.3, 0.9, 1.4, 1.6, 3.0),
)

tLeft1 and tRight1 have a column of the same name and data type called Color. In tRight1, Color has no duplicates. Additionally, all colors in tLeft1 have an exact match in tRight1. In cases like this, exactJoin is the most appropriate join operation.

tExactJoined = tLeft1.exactJoin(tRight1, "Color")

Consider the following tables, which are similar to the previous example. However, in this case, tLeft2 contains the color Purple, which is not in tRight2.

tLeft2 = newTable(
stringCol("Color", "Blue", "Magenta", "Yellow", "Magenta", "Beige", "Green"),
intCol("Count", 5, 0, 2, 3, 7, 1),
)

tRight2 = newTable(
stringCol("Color", "Beige", "Yellow", "Blue", "Magenta", "Red"),
doubleCol("Weight", 2.3, 0.9, 1.4, 1.6, 3.0),
)

In this case, an exactJoin will fail. Instead, use naturalJoin, which appends a null value where no match exists.

tNaturalJoined = tLeft2.naturalJoin(tRight2, "Color")

Relational joins

Relational joins are similar to SQL joins.

Consider the following tables.

tLeft3 = newTable(
stringCol("Color", "Blue", "Yellow", "Magenta", "Beige", "Black"),
intCol("Count", 5, 2, 3, 7, 6),
)

tRight3 = newTable(
stringCol("Color", "Beige", "Yellow", "Blue", "Magenta", "Green", "Red", "Yellow", "Magenta"),
doubleCol("Weight", 2.3, 0.9, 1.4, 1.6, 3.0, 0.5, 1.1, 2.8),
)

join includes only rows where key columns in both tables contain an exact match, including multiple exact matches.

tJoined = tLeft3.join(tRight3, "Color")

leftOuterJoin includes all rows from the left table as well as rows from the right table where an exact match exists. Null values are inserted where no match exists.

import io.deephaven.engine.util.OuterJoinTools

tLeftOuterJoined = OuterJoinTools.leftOuterJoin(tLeft3, tRight3, "Color")

fullOuterJoin includes all rows from both tables, regardless of whether an exact match exists. Null values are inserted where no match exists.

import io.deephaven.engine.util.OuterJoinTools

tFullOuterJoined = OuterJoinTools.fullOuterJoin(tLeft3, tRight3, "Color")

Time-series (inexact) joins

Time-series (inexact) joins are joins where the key column(s) used to join the tables may not match exactly. Instead, the closest value is used to match the data when no exact match exists.

Consider the following tables, which contain quotes and trades for two different stocks.

trades = newTable(
stringCol("Ticker", "AAPL", "AAPL", "AAPL", "IBM", "IBM"),
instantCol(
"Timestamp",
parseInstant("2021-04-05T09:10:00 ET"),
parseInstant("2021-04-05T09:31:00 ET"),
parseInstant("2021-04-05T16:00:00 ET"),
parseInstant("2021-04-05T16:00:00 ET"),
parseInstant("2021-04-05T16:30:00 ET"),
),
doubleCol("Price", 2.5, 3.7, 3.0, 100.50, 110),
intCol("Size", 52, 14, 73, 11, 6),
)

quotes = newTable(
stringCol("Ticker", "AAPL", "AAPL", "IBM", "IBM", "IBM"),
instantCol(
"Timestamp",
parseInstant("2021-04-05T09:11:00 ET"),
parseInstant("2021-04-05T09:30:00 ET"),
parseInstant("2021-04-05T16:00:00 ET"),
parseInstant("2021-04-05T16:30:00 ET"),
parseInstant("2021-04-05T17:00:00 ET"),
),
doubleCol("Bid", 2.5, 3.4, 97, 102, 108),
intCol("BidSize", 10, 20, 5, 13, 23),
doubleCol("Ask", 2.5, 3.4, 105, 110, 111),
intCol("AskSize", 83, 33, 47, 15, 5),
)

aj joins row values in the left table with the closest in the right table without going over. To see the quote at the time of a trade, use aj.

resultAj = trades.aj(quotes, "Ticker, Timestamp")

raj joins row values in the left table with the closest in the right table without going under. To see the first quote that comes after a trade, use raj.

resultRaj = trades.raj(quotes, "Ticker, Timestamp")

More about joins

Every join operation presented in this notebook works on real-time data. Don't believe us? Try it for yourself! For more information about joins, see: