Troubleshoot query performance

In the previous section, we discussed Deephaven's query performance monitoring tools, which allow users to monitor and analyze query performance. In this section, we discuss some common causes of performance issues with real-time data and steps you can take to resolve these issues.

Steps for Reducing the Ratio

The Ratio value shows what percentage of a query's data refresh cycle is consumed by each operation. Reducing the Ratio requires reducing the amount of computation performed by the query. Steps to do so may include the following:

Using update and select

Storing the results of complex calculations using update or select instead of update_view and view may reduce the ratio. When the former methods are used in a query, Deephaven evaluates the requested data once and stores the results in memory. Storing the data is more efficient when the content is expensive to evaluate and/or when the data is accessed multiple times. However, keep in mind that storing large datasets requires a large amount of memory.
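For example, the following Python sketch (the positions table and its columns are hypothetical) stores the result of a calculation once so that downstream operations reuse the stored values:

# update evaluates the formula once per row and stores the results in memory
pnl = positions.update("Pnl = (Last - CostBasis) * Shares")

# downstream consumers read the stored values instead of recomputing them
pnlByAccount = pnl.view(["Account", "Pnl"]).sum_by("Account")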

Using snapshots

Using snapshots reduces the frequency of updates (see Over-Ticking). Similar to update and select, snapshots store data in memory.

Reordering join operations

Reordering join operations can limit the propagation of row change notifications (see Over-Ticking).

Efficient filtering

Order your filters carefully. When your table grows to thousands or millions of rows (or more), you will want to ensure you are filtering the data in the most efficient way to reduce compute expense and execution time. Filters are applied from left to right, so the order in which they are passed to the function can have a substantial impact on execution time.

Furthermore, conditional filters are not optimized like match filters. They should be placed after match filters in a given where clause. See also Composing Complex Where Clauses.
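For example, in the following Python sketch (the source table and its columns are hypothetical), the cheap match filter on Sym runs first and shrinks the set of rows the more expensive conditional filter must evaluate:

# match filter first, conditional filter second
result = source.where(["Sym = `AAPL`", "Price * Size > 100000"])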

Using the Query Scope

Variables in your formula are taken from columns or the query scope, which dictates what variables and methods are accessible within a query. Using variables in the query scope can reduce computation expense and avoid compilation, because if you use a variable rather than different literal values, then the code does not need to be reinterpreted (i.e., compiled multiple times).

The simplest way to use the query scope is to define variables and methods in the global binding, e.g.:

currentDate = "2020-01-01"

Consider that the following calls each require a separate compilation because a different constant is used in each formula:

  • .where("Date.startswith(`2018-01`)"
  • .where("Date.startswith(`2018-02`)")
  • .where("Date.startswith(`2018-03`)")

However, by defining a variable in the query scope, e.g. monthStart="2018-01", the formula in the following calls remains the same and does not need to be recompiled, even though the variable's value can change:

  • .where("Date.startsWith(monthStart)")
  • .where("Date.startsWith(monthStart)")

Common Causes of Performance Issues with Real-Time Data

Due to the inherent flexibility of Deephaven queries, there are many potential sources of performance issues. While the root causes can vary greatly, such issues typically fit into two broad categories: insufficient memory and over-ticking.

Insufficient Memory

Every worker has a predefined maximum amount of memory. When a worker's memory usage approaches its configured maximum, the query's performance can degrade significantly. If the query requires more memory than the worker's configured maximum, it will crash.

When working with live data, the amount of memory required by a query will grow over time. The extent of this growth depends on the number and type of query operations involved, as well as the growth rate of the data itself.

While high memory usage is often unavoidable when working with large datasets, you can significantly reduce memory consumption by following two simple guidelines when writing queries:

  1. use view and update_view to reduce data stored in memory, and
  2. always filter tables as much as possible before joining data.

view and update_view

As explained above, some analyses run faster when data is stored in memory. Columns defined with select and update are immediately calculated and stored in memory, causing memory requirements to vary based on the number of rows of data. If insufficient memory is a problem, use view and update_view instead. When using view and update_view, only the formulas are stored in memory — the results of the formula are never stored, and are simply calculated on demand when needed. This drastically reduces memory requirements and improves response time, particularly when working with large tables.
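As a rough sketch of the difference (Python, using a generated table; the column name is arbitrary):

from deephaven import empty_table

source = empty_table(1_000_000)

# update evaluates Value for every row immediately and stores one million results
stored = source.update("Value = ii * 2")

# update_view stores only the formula; values are computed on demand when read
deferred = source.update_view("Value = ii * 2")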

Filter before joining data

Filter tables as much as possible before joining. The size of a join operation's state depends on the number of rows in the respective left and right tables, as well as the number of unique keys in those tables. Reducing the number of rows in a table before joining it can significantly reduce the amount of memory required by the join.
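For example (the table, column, and symbol names are hypothetical), filtering both tables down to the symbols of interest before joining keeps the join's state small:

# filter first so the join only tracks state for the rows that survive
lastQuotes = quotes.where("Sym in `AAPL`, `MSFT`").last_by("Sym")
filteredTrades = trades.where("Sym in `AAPL`, `MSFT`")
joined = filteredTrades.natural_join(lastQuotes, on="Sym", joins="QuoteSize=Size")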

Over-Ticking

"Over-ticking" refers to instances where a query must process data changes (also called "ticks") at a higher rate than the worker's CPU can process.

While every CPU has a limit, there are steps you can take when designing a query to reduce the number of ticks that will occur. When working with large datasets, these steps can reduce the number of ticks that occur by many orders of magnitude.

Use Snapshots

The simplest way of reducing the rate of ticks is to limit how often a table refreshes. With time tables and snapshots, you can explicitly set the interval at which a table updates to reflect changing data.

A time table is a special type of table that adds new rows at a regular, user-defined interval. The sole column of a time table is Timestamp. The snapshot operation produces an in-memory copy of a table (the "target table"), which refreshes every time another table (the "trigger table", which is often a time table) ticks.

Example

The following code creates a time table that ticks (adding a new row) every five seconds:

timeTable = timeTable("PT00:00:05")
from deephaven import time_table

timeTable = time_table("PT00:00:05")

The following code demonstrates how to use timeTable as the trigger table to create a snapshot of a target table, quotes. The value from the Timestamp column in the newest row of timeTable will be included in all rows of the resulting quotesSnapshot table.

quotesSnapshot = quotes.snapshotWhen(timeTable)
quotesSnapshot = quotes.snapshot_when(timeTable)

Whenever timeTable ticks (in this case, every five seconds), quotesSnapshot will refresh to reflect the current version of the quotes table. The query engine will recalculate the values for all cells in the quotes table (whether the rows have ticked or not) and store them in memory; further, downstream table operations will process ticks for all rows in the quotesSnapshot table.

This will incur the cost of running all deferred calculations from view and update_view operations, as well as updating the state of downstream operations (such as joins and aggregations). However, quotesSnapshot will not change again until the next tick in timeTable, regardless of how much the quotes table changes in the interim. Because all ticks in quotes are replaced with a user-defined five-second interval, the frequency at which quotes ticks no longer significantly affects performance.

Incremental Snapshots

The standard snapshot operation will recalculate values for all rows in the target table, even if those rows in the target table have not ticked. Additionally, it will force downstream operations to process a tick for all rows. This is often desirable when a query's calculations depend on time, such as formulas referencing a current time method or a timestamp from a time table.

In other cases, not all formulas must be refreshed at a regular interval. In these situations, it may be more expedient to refresh only the rows in the result table for which the corresponding rows in the target table have ticked. This can be accomplished with incremental snapshots.

With incremental snapshots, the query engine will track which rows of the target table have changed between ticks of the trigger table. When the trigger table ticks, the query engine will only update the result table for the rows that changed in the target table, reducing the amount of processing that must occur.

Since there is an added cost to tracking ticks in the target table, snapshot should be preferred over incremental snapshots, unless it is likely that only a portion of the target table changes between ticks of the trigger table. (This is rarely the case for tables containing market data.)

Below is an example of an incremental snapshot using the timeTable from the previous examples:

import io.deephaven.api.snapshot.SnapshotWhenOptions

// of(initial, incremental, history): enable only incremental snapshots
snapshotOpts = SnapshotWhenOptions.of(false, true, false)

quotesSnapshot = quotes.snapshotWhen(timeTable, snapshotOpts)
quotesSnapshot = quotes.snapshot_when(trigger_table=timeTable, incremental=True)

The fundamental difference between a regular snapshot and an incremental snapshot is that a regular snapshot is applied to all rows regardless of changes, whereas an incremental snapshot applies only to the changed rows (i.e. added or modified).

Use Dynamic Filters – where_in and where_not_in

Another technique for reducing ticks is to apply where_in and where_not_in filters. These methods allow you to filter one table based on the values contained in another table. By defining one table with values of interest — such as the highest-volume stocks, or all stocks traded in a particular account — you can often filter out a large portion of the source data tables. When the values in the values-of-interest table change, the result of the dynamic where filter will update. The changes will then propagate through downstream query operations.
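A minimal Python sketch (symsOfInterest is a hypothetical values-of-interest table; quotes is carried over from the earlier examples):

from deephaven import new_table
from deephaven.column import string_col

# a small values-of-interest table listing the symbols to keep
symsOfInterest = new_table([string_col("Sym", ["AAPL", "MSFT"])])

# keep only the rows of quotes whose Sym appears in symsOfInterest;
# if symsOfInterest ticks, the filtered result updates to match
filteredQuotes = quotes.where_in(symsOfInterest, ["Sym"])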

Order Joins to Reduce Row Change Notifications

When ordering joins in your query, make sure that the live tables that add rows most frequently are joined last.

Example

If you have a table of historical stock prices, your table never "ticks" or updates. So, if you run a formula like .update("PriceSquared = HistPrice * HistPrice") it only has to be evaluated once. However, if you are working with ticking intraday tables, the order matters because a formula may need to be re-evaluated every time a table ticks. For example, let's say you have two tables: myLastTrades shows the last price at which you traded a stock, and mktLastTrades shows the last price at which anyone traded a stock in the entire stock market. Since the mktLastTrades table accounts for everyone who is trading, not just you, it will tick much more frequently.

Let's take a look at two similar queries:

tradePriceDiffs = histPrices
 .naturalJoin(mktLastTrades, "Sym", "MktLast=Price")
 .naturalJoin(myLastTrades, "Sym", "MyLast=Price")
 .update("MyChangeSinceYest = MyLast - HistPrice")
tradePriceDiffs = (
    histPrices.natural_join(mktLastTrades, "Sym", "MktLast=Price")
    .natural_join(myLastTrades, "Sym", "MyLast=Price")
    .update("MyChangeSinceYest = MyLast - HistPrice")
)

tradePriceDiffs = histPrices
 .naturalJoin(myLastTrades, "Sym", "MyLast=Price")
 .update("MyChangeSinceYest = MyLast - HistPrice")
 .naturalJoin(mktLastTrades, "Sym", "MktLast=Price")
tradePriceDiffs = (
    histPrices.natural_join(myLastTrades, "Sym", "MyLast=Price")
    .update("MyChangeSinceYest = MyLast - HistPrice")
    .natural_join(mktLastTrades, "Sym", "MktLast=Price")
)

In the first case, the update operation comes after joining in the data from mktLastTrades, which means the row for a given Sym will be marked as modified for the update operation not only when myLastTrades ticks, but also when mktLastTrades ticks. This means extra processing to re-evaluate the formula for MyChangeSinceYest for a Sym every time anyone in the market trades that stock.

Any time a row is modified, Deephaven checks whether the columns used in formulas were modified and, if they were, re-evaluates those formulas for the modified rows. That means that, in the second case, the update operation only needs to perform this check when a Sym ticks in myLastTrades (that is, only when you trade it). If the Price changes in mktLastTrades, the update operation does not have to apply this check, since that data is joined on afterward.

Avoid Tick Expansion

Tick expansion occurs when a tick in one table triggers multiple ticks in another table. As a result, if every row in one table depends on the same row of another table, every single row in the first table will have to be evaluated if the row in the second table changes.

Example

currentTime = timeTable("PT00:00:01").lastBy()
stockTrades = db.liveTable("LearnDeephaven", "StockTrades").where("Date=today()")
timeSinceTrade = stockTrades.naturalJoin(currentTime, "", "Now=Timestamp").update("TimeSinceTrade = Timestamp - Now")
from deephaven import time_table

currentTime = time_table("PT00:00:01").last_by()
stockTrades = db.live_table("LearnDeephaven", "StockTrades").where("Date=today()")
timeSinceTrade = stockTrades.natural_join(currentTime, "", "Now=Timestamp").update(
    "TimeSinceTrade = Timestamp - Now"
)

The currentTime table contains one column (Timestamp) and one row, which will update every second. This value is joined onto every row in stockTrades as the first step of producing the timeSinceTrade table. After joining the tables, changes in currentTime (which occur every second) will cause data to change in every row of timeSinceTrade. Since every row will register changes each second, the formula for TimeSinceTrade will need to be re-evaluated for every row. This constitutes tick expansion because a single tick in one table (currentTime) leads to many ticks in another table (timeSinceTrade).

Use of group_by and ungroup

Tick expansion is most common when using group_by and ungroup. Because group_by produces an array of values for each key, the entire array for a key must update any time any value with that key changes.

Note

The other aggregations, including arithmetic and indexing aggregators such as sum_by, avg_by, std_by, first_by, and last_by, do not incur this performance penalty.

For example, when grouping data by stock ticker symbol, a table created by calling .groupBy("Symbol") will tick for a given symbol whenever any row with the same symbol ticks in the source table.

Example

trades = db.liveTable("LearnDeephaven", "StockTrades").where("Date=today()").view("Symbol=Sym", "Price=Last", "Size")
tradesBySym = trades.groupBy("Symbol")
tradesUngrouped = tradesBySym.ungroup()
tradesWithDollars_1 = trades.update("Dollars=Price*Size")
tradesWithDollars_2 = tradesUngrouped.update("Dollars=Price*Size")
trades = (
    db.live_table("LearnDeephaven", "StockTrades")
    .where("Date=today()")
    .view(["Symbol=Sym", "Price=Last", "Size"])
)
tradesBySym = trades.group_by("Symbol")
tradesUngrouped = tradesBySym.ungroup()
tradesWithDollars_1 = trades.update("Dollars=Price*Size")
tradesWithDollars_2 = tradesUngrouped.update("Dollars=Price*Size")

The tradesUngrouped table will contain the same rows as the trades table (though they will be in a different order). Likewise, tradesWithDollars_1 and tradesWithDollars_2 will contain the same data. However, despite their similarities, the two tables have very different performance profiles.

When a new row is appended to the source StockTrades table, and thus the trades table in the query, all other rows in that table remain unchanged — the new row is the only "tick" to be processed. From here, the paths to tradesWithDollars_1 and tradesWithDollars_2 diverge.

Updating the tradesWithDollars_1 table requires evaluating the formula for Dollars only once, since the new row is the only tick in trades.

Updating tradesBySym is more work. Each row of tradesBySym contains arrays of values for each Symbol. When a new row is appended to trades, new arrays must be created for the Price and Size columns, featuring data from the new row. Recreating these arrays requires revisiting every row in trades that has the same key (Symbol) as the new row. This constitutes tick expansion because a tick in one row has the same effect as a tick in all rows with the same key.

Tick expansion propagates further when ungrouping tradesBySym to create tradesUngrouped. Each time a row in tradesBySym ticks, all rows affected by ungrouping that row will tick as well. As a result, when producing tradesWithDollars_2, a single new row in StockTrades leads the formula for Dollars to be rerun for every row with the same Symbol as the new row.

The difference is stark. While tradesWithDollars_1 will process updates in constant time (the amount of processing per new row is always the same), the use of group_by and ungroup means the cost of updating tradesWithDollars_2 will increase with the number of rows for each Symbol.

Care should be taken when using join operations on live data, since natural_join and join both perform a group_by on the right table; join performs an ungroup in addition to a group_by.