
Combined aggregations for efficient analysis

Elijah Petty
Alex Peters
Using Deephaven's Combined Aggregators

Dealing with massive amounts of data is not always straightforward. Often, a data scientist needs to trim down a large data table into a handful of key components that can be used for further analysis. For example, we may want to condense a big table of stock quotes from a specific day to show only the highest, lowest, opening, and closing prices for each stock symbol.

Streaming data like this could scale up to hundreds of millions of rows across a full year. Your computer doesn’t like this, so you must find a way to trim it down if you want a timely analysis.

This can be done with traditional single aggregations and joins, but Deephaven offers a unique solution: combined aggregators.

In Practice

For this problem, we import "CryptoTrades_20210922.csv" as a Deephaven table and construct a new table with the open, high, low, and close (OHLC) prices for each instrument. First, let's import the data and split the Timestamp column into separate Date and Time columns.

from deephaven import read_csv

# read in the CSV
crypto = read_csv("/data/examples/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv")

# Split the instant into a local date and time, then reorganize the table
crypto = (
    crypto.update(["Date = toLocalDate(Timestamp, 'ET')", "Time = toLocalTime(Timestamp, 'ET')"])
    .drop_columns("Timestamp")
    .move_columns_up(["Date", "Time"])
)

# Filter out erroneous outlier data
crypto = crypto.where("Price > 1")
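
To confirm the load and filter behaved as expected, it helps to peek at the cleaned table before aggregating. Here is a quick, optional check (size and head are standard Deephaven table operations; crypto_head is just an illustrative name):

# Optional sanity check: row count and the first few rows of the cleaned table
print(crypto.size)
crypto_head = crypto.head(5)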

This dataset is a million rows strong, and as noted above, a full year of this stream would reach hundreds of millions of rows. If you want the OHLC prices for each instrument, you could use the following code, which chains single aggregations and joins:

crypto_hloc = (
    crypto.max_by("Instrument")
    # max_by keeps the Price column name, so rename it to reflect its meaning
    .rename_columns("Hi = Price")
    .natural_join(crypto.min_by("Instrument"), "Instrument", "Lo = Price")
    .natural_join(crypto.first_by("Instrument"), "Instrument", "Open = Price")
    .natural_join(crypto.last_by("Instrument"), "Instrument", "Close = Price")
)

While logically correct, this code is quite clunky, and each of those separate aggregations and joins takes time (only nanoseconds, but every nanosecond counts with millions of rows). This type of problem is common for traders, so fortunately, we've developed a cleaner, quicker way to aggregate your data: combined aggregators. With this new approach, we can rephrase the above code block:

from deephaven import agg

crypto_hloc_combined = crypto.agg_by(
    [
        agg.max_(["Hi = Price"]),
        agg.min_(["Lo = Price"]),
        agg.first(["Open = Price"]),
        agg.last(["Close = Price"]),
    ],
    "Instrument",
)

This code is much easier to read, and it is faster: a combined aggregation computes all four results in a single pass over the table instead of running four separate aggregations and stitching them together with three joins.
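
The pattern also extends cleanly. If the table spanned multiple trading days, you could bucket by both instrument and day using the Date column we built earlier, and tack on extra statistics in the same pass. A sketch, assuming the same table (agg.avg comes from the same deephaven.agg module; crypto_daily_ohlc is an illustrative name):

# OHLC plus average price, grouped by instrument and day in one pass
crypto_daily_ohlc = crypto.agg_by(
    [
        agg.first(["Open = Price"]),
        agg.max_(["Hi = Price"]),
        agg.min_(["Lo = Price"]),
        agg.last(["Close = Price"]),
        agg.avg(["AvgPrice = Price"]),
    ],
    ["Instrument", "Date"],
)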

Conclusion

While logic is the most fundamental tool in any programmer's or data scientist's toolkit, good code is more than a logically coherent solution to the problem at hand. Good code is elegant, fast, readable, and takes advantage of the dedicated functionality available. Deephaven has already built efficient solutions to many common problems, so it would be a mistake to ignore them. Combined aggregators are just one of the many such solutions native to Deephaven that separate it from the pack. Hopefully, this short example demonstrates their effectiveness in a real-world application.