Group data and perform aggregations

One of the powerful capabilities of any database is its ability to quickly associate similar data and perform various aggregations on respective subsets.

Grouping and aggregation are fundamental operations.

Grouping data

Data within a table can be grouped based on any similar attribute, or key.

  • “I want to group a tradelog by Account... or by Symbol... or by Account & Symbol.”
  • “I want to group signals by Factor... or by PM-Group & Factor.”
  • “I want to group slippage calculations by Sector, Vintage, and Exchange destination.”

In Deephaven, one needs only to use the by() method. Very simply:

sourceTable.by(“GroupingKey”)

When a user groups data, all fields other than the grouping column will be presented as arrays of the data from the corresponding rows in the source.

Example

Create a simple table of 5 records for each of 3 Symbols:

table = db.t("LearnDeephaven", "StockTrades")\
     .where("USym in `AAPL`, `MSFT`, `GOOG`")\
     .headBy(5, "USym")\
     .view("USym", "Exchange", "Last", "Size")

img

Group the data by USym:

groupUSym = table.by("USym")

img

Group the data by USym and Exchange:

groupUSymExchange = table.by("USym", "Exchange")

img

Un-grouping data

As you would expect, ungroup() performs the opposite operation. It scours the table for any array fields and expands them into a respective row for each array element. All other columns (that are not arrays) will remain the same in the expanded table.

Example

Create a single-row table:

newTable = emptyTable(1).update("ParentStrategy = `Auto-Algo`",\
"Strategy = `Account_123`", "USym = new String[] {`TSLA`, `QQQ`, `AMD`, `BA`}",\
"Position = new int[] {100, 200, -300, 400}",\
"PNL = new double[] {12.12, -11.11, -10.55, 99.99}")

img

Ungroup that newTable to create a row for each data record.

Caution

The array sizes must be the same for respective columns.

ungroupedTable = newTable.ungroup()

img

To come full circle in these examples, the script below ungroups the data that we grouped earlier (remember, using a by()), bringing us back to the original 15-row table.

ungroupedIt = groupUSymExchange.ungroup()

img

Performing Aggregations

Aggregations are essentially an extension of grouping. They allow you to easily find sums, counts, variances, etc. of records within a group.

Use cases:

  • “I want the sum of TradeSize by Account... or by Symbol... or by Account & Symbol.” “I want the max of SignalMagnitude by Factor... or by PM-Group & Factor.” “I want the average of slippage by Sector, AgeOfTrade, and Exchange.”

Aggregation operations:

  1. group the values of multiple rows based on the key,
  2. return a calculation on the arrays of all the non-key columns in those groups.

There are two ways to perform aggregations programmatically:

  • Single, dedicated aggregations
  • Combined aggregations

The syntax for each is slightly different.

Since this is Deephaven, do know all aggregations update in real time as new (and relevant) data becomes available!

Single (dedicated) aggregators

Regular aggregations are simple once one has mastered the grouping method described above -- i.e., by(“SomeKey”).

The syntax for single aggregations is very similar:

agg = sourceTable.aggtypeBy("SomeKey")

aggType can mean:

  • first
  • last
  • min
  • max
  • sum
  • avg
  • var
  • std
  • median

Caution

Since the aggregation calculation will be performed on all non-key columns, math-driven methods (like sumBy, avgBy, etc.) will trigger an error if string, char, boolean types are represented in the non-key fields.

Let’s again use that 15-record table from the grouping example:

table = db.t("LearnDeephaven", "StockTrades")\
    .where("USym in `AAPL`, `MSFT`, `GOOG`")\
    .headBy(5, "USym").view("USym", "Exchange", "Last", "Size")

img

Calculating the sum of Last (price) and Size is straightforward:

sumExample = table.sumBy("USym", "Exchange")

Note that inside the parentheses one identifies the grouping keys, then all remaining columns are aggregated.

img

All of these other aggregation operations would produce the outcome you’d expect:

  • firstExample = table.firstBy("USym", "Exchange")
  • lastExample = table.lastBy("USym", "Exchange")
  • minExample = table.minBy("USym", "Exchange")
  • maxExample = table.maxBy("USym", "Exchange")
  • avgExample = table.avgBy("USym", "Exchange")
  • medianExample = table.medianBy("USym", "Exchange")
  • stdExample = table.stdBy("USym", "Exchange")

Here is the result of varBy(), shown to articulate how single-array variances are calculated.

img

Nulls and NaNs

Nulls are generally ignored in the aggregations. So, for example, a null record will not taint the avgBy() calculation.

NaNs:

  • NaNs in the input generally pollute the output of the aggregation defining it to be a NaN.
  • Positive infinite has an average of positive infinite. Similarly with negative infinite.
  • Having both positive and negative infinities in the input produce an average of NaN.
  • There is a filter available to remove NaN’s: .where("!isNaN(FieldName)")

Good to Know: Keyless aggregations

To perform an aggregation over the whole table, one can use any of these aggBy() functions without identifying a key.

For example the max record for each column in (the above) table is:

img

Note that the USym and Exchange (string) columns return their alphabetical maximum.

Special Case: countBy()

Performing a no-key countBy() will cause an error. One needs to label a new Column name for the count. [Examples below use “N”.] These produce the output you see in the image:

noKeyCount = table.countBy("N")
usymCount = table.countBy("N", "USym")
exchCount = table.countBy("N", "Exchange")
usymExchCount = table.countBy("N", "USym", "Exchange")

img

Important

Very important to know related to countBy: The count aggregation simply counts rows. It does not in any way review the data. Therefore nulls and NaNs will be included in the count!

If you do not want nulls and NaNs to be included, simply add:

.updateView(“Bool = isNormal(FieldName)”)

followed by:

.countBy(“FieldName”, “Bool”)

More special cases: headBy() and tailBy()

As you would expect, this query

biggerTable = db.t("LearnDeephaven", "StockTrades")
headEx = biggerTable.headBy(10, "USym")
tailEx = biggerTable.tailBy(10, "USym", "Exchange")

produces a table (“headEx”) that is the first 10 rows from biggerTable for each respective USym, and a table (“tailEx”) that is the last 10 rows from "biggerTable" for each respective USym-Exchange combination.

Last special case: wavgBy()

To use the weighted average aggregation, you need to identify the weighting factor. This script gives the weighted average of Price by USym, with Size as the weighting factor.

wtedAvg = biggerTable.view("USym", "Last", "Size")\
.wavgBy("Size", "USym")

User-defined aggregations

You can create your own user-defined aggregators using applyToAllBy(), as documented here.

Combined Aggregations

Combined aggregations let you perform multiple operations at once.

The following syntax should be self-evident, producing the table below.

Important

To use these, one needs to import the Python module that supports them, delivering ComboAggregateFactory as caf.

from deephaven import *

biggerTable = db.t("LearnDeephaven", "StockTrades")

combinationAgg = biggerTable\
.updateView("Dollars = Last * Size")\
.by(caf.AggCombo(\
    caf.AggLast("LastPrice = Last","LastTimestamp = Timestamp"),\
    caf.AggCount("Size"),\
    caf.AggSum("Shares = Size", "Dollars"),\
    caf.AggFirst("OpenPrice = Last"),\
    caf.AggMax("HighPrice = Last"),\
    caf.AggMin("LowPrice = Last"),\
    caf.AggAvg("AvgPrice = Last"),\
    caf.AggWAvg("Size", "WtdAvgPrice = Last"),\
    caf.AggVar("VarPrice = Last"),\
    caf.AggStd("StdPrice = Last"),\
    caf.AggMed("MedianPrice = Last"),\
    caf.AggPct(0.75, "Perc75Price = Last"),\
), "Date", "USym")

img

A few things to note:

  • See the .by(caf.AggCombo(STUFF), "Key(s)") syntax.
  • The key(s) of the aggregation are identified at the end - in this case, "Date", "USym".
  • In using AggCombo, each field to be aggregated must be itemized.
  • A particular field from the source table can be represented in multiple aggregations.
  • You can rename the aggregation column easily.
  • Observe caf.AggPct as well, which is supported in comboAgg.