Troubleshoot query performance
In the previous section, we discussed Deephaven's query performance monitoring tools, which allow users to monitor and analyze query performance. In this section, we discuss some common causes of performance issues with real-time data and steps you can take to resolve these issues.
Steps for Reducing Ratio
The Ratio value shows what percentage of a query's data refresh cycle is consumed by each operation. Reducing the ratio value requires reducing the amount of computation performed by the query. Steps to do so may include the following:
Using update and select
Storing the results of complex calculations using update or select instead of update_view and view may reduce the ratio because when the former methods are used in a query, Deephaven will access and evaluate the data requested and then store it in memory. Storing the data is more efficient when the content is expensive to evaluate and/or when the data is accessed multiple times. However, keep in mind that storing large datasets requires a large amount of memory.
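For instance, here is a minimal Python sketch of the trade-off (the table, column names, and formula are hypothetical):
from deephaven import empty_table

# Hypothetical source table with a column used in an expensive formula.
source = empty_table(1_000_000).update_view(["X = ii"])

# update_view: only the formula is stored; it is re-evaluated each time Y is read.
deferred = source.update_view(["Y = Math.pow(Math.sin(X), 2)"])

# update: the formula is evaluated once per row and the results are stored in memory,
# which pays off when Y is expensive and read many times downstream.
materialized = source.update(["Y = Math.pow(Math.sin(X), 2)"])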
Using snapshots
Using snapshots reduces the frequency of updates (see Over-Ticking). Similar to update and select, snapshots store data in memory.
Reordering join operations
Reordering join operations to limit propagation of row change notifications (see Over-Ticking).
Efficient filtering
Ordering your filters carefully. When your table size grows to thousands or millions of rows (or more), you will want to ensure you are filtering the data in the most efficient way to reduce compute expense and execution time. Filters are applied from left to right, so the order in which they are passed to the function can have a substantial impact on execution time.
Furthermore, conditional filters are not optimized like match filters; they should be placed after match filters in a given where clause. See also Composing Complex Where Clauses.
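As a rough sketch (the trades table and its columns are hypothetical), placing cheap match filters before the conditional filter means the conditional expression is only evaluated on rows that survive the earlier filters:
# Match filters first, conditional filter last; filters are applied left to right.
filtered = trades.where(
    ["Sym in `AAPL`, `MSFT`", "Size > 1000", "Price * Size > 100000"]
)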
Using the Query Scope
Variables in your formula are taken from columns or the query scope, which dictates what variables and methods are accessible within a query. Using variables in the query scope can reduce computation expense and avoid compilation, because if you use a variable rather than different literal values, then the code does not need to be reinterpreted (i.e., compiled multiple times).
The simplest way to use the query scope is to define variables and methods in the global binding, e.g.:
currentDate = "2020-01-01"
Consider that the following calls require a compilation for each formula because a constant is used:
.where("Date.startsWith(`2018-01`)")
.where("Date.startsWith(`2018-02`)")
.where("Date.startsWith(`2018-03`)")
However, by defining a variable in the query scope, e.g. monthStart = "2018-01", the formula in the following calls remains the same and does not need to be recompiled, although the variable can be updated:
.where("Date.startsWith(monthStart)")
.where("Date.startsWith(monthStart)")
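A comparable Python sketch, assuming a hypothetical prices table with a string Date column: the formula text never changes, so it is compiled once, and only the value bound to month_start differs between calls.
# month_start is resolved from the query scope, so the formula below compiles once.
month_start = "2018-01"
jan = prices.where("Date.startsWith(month_start)")

month_start = "2018-02"
feb = prices.where("Date.startsWith(month_start)")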
Common Causes of Performance Issues with Real-Time Data
Due to the inherent flexibility of Deephaven queries, there are many potential sources of performance issues. While the root causes can vary greatly, such issues typically fit into two broad categories: insufficient memory and over-ticking.
Insufficient Memory
Every worker has a predefined maximum amount of memory. When a worker's memory usage approaches its configured maximum, the query's performance can degrade significantly. If the query requires more memory than the worker's configured maximum, it will crash.
When working with live data, the amount of memory required by a query will grow over time. The extent of this growth depends on the number and type of query operations involved, as well as the growth rate of the data itself.
While high memory usage is often unavoidable when working with large datasets, you can significantly reduce memory consumption by following two simple guidelines when writing queries:
- use view and update_view to reduce data stored in memory, and
- always filter tables as much as possible before joining data.
view and update_view
As explained above, some analyses run faster when data is stored in memory. Columns defined with select and update are immediately calculated and stored in memory, causing memory requirements to vary based on the number of rows of data. If insufficient memory is a problem, use view and update_view instead. When using view and update_view, only the formulas are stored in memory; the results of the formula are never stored, and are simply calculated on demand when needed. This drastically reduces memory requirements and improves response time, particularly when working with large tables.
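For example (hypothetical table and columns), the two results below contain identical data; the difference is that the first stores every computed value, while the second stores only the formula:
# update: Notional is computed immediately and held in memory for every row.
stored = trades.update(["Notional = Price * Size"])

# update_view: only the formula is kept; Notional is computed when it is read.
on_demand = trades.update_view(["Notional = Price * Size"])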
Filter before joining data
Filter tables as much as possible before joining. The size of a join operation's state depends on the number of rows in the respective left and right tables, as well as the number of unique keys in those tables. Reducing the number of rows in a table before joining it can significantly reduce the amount of memory required by the join.
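As an illustrative sketch (trades and quotes are hypothetical ticking tables keyed on Sym), filtering both sides down to the symbols of interest before the join keeps the join's state small:
# Less efficient: join everything, then filter; the join tracks state for every Sym.
slow = trades.natural_join(quotes, on="Sym", joins=["Bid", "Ask"]).where(
    "Sym in `AAPL`, `MSFT`"
)

# More efficient: filter first, so the join only tracks the remaining keys.
fast = trades.where("Sym in `AAPL`, `MSFT`").natural_join(
    quotes.where("Sym in `AAPL`, `MSFT`"), on="Sym", joins=["Bid", "Ask"]
)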
Over-Ticking
"Over-ticking" refers to instances where a query must process data changes (also called "ticks") at a higher rate than the worker's CPU can process.
While every CPU has a limit, there are steps you can take when designing a query to reduce the number of ticks that will occur. When working with large datasets, these steps can reduce the number of ticks that occur by many orders of magnitude.
Use Snapshots
The simplest way of reducing the rate of ticks is to limit how often a table refreshes. With time tables and snapshots, you can explicitly set the interval at which a table updates to reflect changing data.
A time table is a special type of table that adds new rows at a regular, user-defined interval. The sole column of a timeTable
is Timestamp. The snapshot operation produces an in-memory copy of a table (the "target table"), which refreshes every time another table (the "trigger table", which is often a time table) ticks.
Example
The following line of code is an example of a time table that ticks (adding a new row) every five seconds:
timeTable = timeTable("PT00:00:05")
from deephaven import time_table
timeTable = time_table("PT00:00:05")
The following line of code demonstrates how to use timeTable as the trigger table to create a snapshot of a target table, quotes
. The value from the Timestamp column in the newest row of timeTable will be included in all rows of the resulting quotesSnapshot
table.
quotesSnapshot = quotes.snapshotWhen(timeTable)
quotesSnapshot = quotes.snapshot_when(timeTable)
Whenever timeTable ticks, in this case, every five seconds, quotesSnapshot
will refresh to reflect the current version of the quotes
table. The query engine will recalculate the values for all cells in the quotes
table (whether the rows have ticked or not) and store them in memory; further, downstream table operations will process ticks for all rows in the quotesSnapshot
table.
This will incur the cost of running all deferred calculations from view and update_view
operations, as well as updating the state of downstream operations (such as joins and aggregations). However, quotesSnapshot
will not change again until the next tick in timeTable, regardless of how much the quotes
table changes in the interim. By replacing all ticks in quotes
with a user-defined five-second interval, the frequency at which quotes
ticks will no longer significantly affect performance.
Incremental Snapshots
The standard snapshot
operation will recalculate values for all rows in the target table, even if those rows in the target table have not ticked. Additionally, it will force downstream operations to process a tick for all rows. This is often desirable when a query's calculations depend on time, such as formulas referencing a current time method or a timestamp from a time table.
In other cases, not all formulas must be refreshed at a regular interval. In these situations, it may be more expedient to refresh only the rows in the result table for which the corresponding rows in the target table have ticked. This can be accomplished with incremental snapshots.
With incremental snapshots, the query engine will track which rows of the target table have changed between ticks of the trigger table. When the trigger table ticks, the query engine will only update the result table for the rows that changed in the target table, reducing the amount of processing that must occur.
Since there is an added cost to tracking ticks in the target table, snapshot
should be preferred over incremental snapshots, unless it is likely that only a portion of the target table changes between ticks of the trigger table. (This is rarely the case for tables containing market data.)
Below is an example of incremental snapshot usage using the timeTable
from previous examples:
import io.deephaven.api.snapshot.SnapshotWhenOptions
snapshotOpts = SnapshotWhenOptions.of(false, true, false)
quotesSnapshot = quotes.snapshotWhen(timeTable, snapshotOpts)
quotesSnapshot = quotes.snapshot_when(trigger_table=timeTable, incremental=True)
The fundamental difference between a regular snapshot and an incremental snapshot is that a regular snapshot is applied to all rows regardless of changes, whereas an incremental snapshot applies only to the changed rows (i.e. added or modified).
Use Dynamic Filters – where_in and where_not_in
Another technique for reducing ticks is to apply where_in
and where_not_in
filters. These methods allow you to filter one table based on the values contained in another table. By defining one table with values of interest — such as the highest-volume stocks, or all stocks traded in a particular account — you can often filter out a large portion of the source data tables. When the values in the values-of-interest table change, the result of the dynamic where filter will update. The changes will then propagate through downstream query operations.
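A brief sketch (the trades table is hypothetical): the filter table is built by hand here, but in practice it is often another ticking table, and the filtered result updates whenever it changes.
from deephaven import new_table
from deephaven.column import string_col

# Table of values of interest; this could itself be a ticking table.
symbols_of_interest = new_table([string_col("Sym", ["AAPL", "MSFT"])])

# Keep only trades whose Sym appears in symbols_of_interest.
filtered_trades = trades.where_in(symbols_of_interest, "Sym")

# Keep only trades whose Sym does not appear in symbols_of_interest.
other_trades = trades.where_not_in(symbols_of_interest, "Sym")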
Note
See also: where_in and where_not_in
Order Joins to Reduce Row Change Notifications
When ordering joins in your query, make sure that live tables that add rows the most frequently are the last ones to be joined.
Example
If you have a table of historical stock prices, your table never "ticks" or updates. So, if you run a formula like .update("PriceSquared = HistPrice * HistPrice")
it only has to be evaluated once. However, if you are working with ticking intraday tables, the order matters because a formula may need to be re-evaluated every time a table ticks. For example, let's say you have two tables: myLastTrades
shows the last price at which you traded a stock, and mktLastTrades
shows the last price at which anyone traded a stock in the entire stock market. Since the mktLastTrades
table accounts for everyone who is trading, not just you, it will tick much more frequently.
Let's take a look at two similar queries:
tradePriceDiffs = histPrices
.naturalJoin(mktLastTrades, "Sym", "MktLast=Price")
.naturalJoin(myLastTrades, "Sym", "MyLast=Price")
.update("MyChangeSinceYest = MyLast - HistPrice")
tradePriceDiffs = (
histPrices.natural_join(mktLastTrades, "Sym", "MktLast=Price")
.natural_join(myLastTrades, "Sym", "MyLast=Price")
.update("MyChangeSinceYest = MyLast - HistPrice")
)
tradePriceDiffs = histPrices
.naturalJoin(myLastTrades, "Sym", "MyLast=Price")
.update("MyChangeSinceYest = MyLast - HistPrice")
.naturalJoin(mktLastTrades, "Sym", "MktLast=Price")
tradePriceDiffs = (
histPrices.natural_join(myLastTrades, "Sym", "MyLast=Price")
.update("MyChangeSinceYest = MyLast - HistPrice")
.natural_join(mktLastTrades, "Sym", "MktLast=Price")
)
In the first case, the update
operation comes after joining in the data from mktLastTrades
, which means the row for a given Sym will be marked as modified for the update operation not only when myLastTrades ticks, but also when mktLastTrades ticks, meaning there is extra processing related to re-evaluating the formula for MyChangeSinceYest for a Sym every time anyone in the U.S. trades that stock.
Any time a row is modified, Deephaven checks if the columns used in formulas were modified, and if they were, re-evaluates said formulas for said modified rows. That means that, in the second case, the update
operation only bothers checking if the columns in its formula were modified when Sym ticks in myLastTrades
— that is, only when you trade it. If the Price changes in mktLastTrades
, the update
operation does not have to apply this check, since that data is joined on afterward.
Avoid Tick Expansion
Tick expansion occurs when a tick in one table triggers multiple ticks in another table. For example, if every row in one table depends on the same row of another table, every row in the first table must be re-evaluated whenever that row in the second table changes.
Example
currentTime = timeTable("PT00:00:01").lastBy()
stockTrades = db.liveTable("LearnDeephaven", "StockTrades").where("Date=today()")
timeSinceTrade = stockTrades.naturalJoin(currentTime, "", "Now=Timestamp").update("TimeSinceTrade = Timestamp - Now")
from deephaven import time_table
currentTime = time_table("PT00:00:01").last_by()
stockTrades = db.live_table("LearnDeephaven", "StockTrades").where("Date=today()")
timeSinceTrade = stockTrades.natural_join(currentTime, "", "Now=Timestamp").update(
"TimeSinceTrade = Timestamp - Now"
)
The currentTime
table contains one column (Timestamp) and one row, which will update every second. This value is joined onto every row in stockTrades
as the first step of producing the timeSinceTrade
table. After joining the tables, changes in currentTime
— which occur every second — will cause data to change in every row of timeSinceTrade
. Since every row will register changes each second, the formula for TimeSinceTrade will need to be reevaluated for every row. This constitutes tick expansion because a single tick in one table (currentTime
) leads to many ticks in another table (timeSinceTrade
).
Use of group_by and ungroup
Tick expansion is most common when using group_by
and ungroup
. Because group_by
produces an array of values for each key, the entire array for a key must update any time any value with that key changes.
Note
The other aggregations, including arithmetic and indexing aggregators such as sum_by
, avg_by
, std_by
, first_by
, and last_by
, do not incur this performance penalty.
For example, when grouping data by stock ticker symbol, a table created by calling .groupBy("Symbol")
will tick for a given symbol whenever any row with the same symbol ticks in the source table.
Example
trades = db.liveTable("LearnDeephaven", "StockTrades").where("Date=today()").view("Symbol=Sym", "Price=Last", "Size")
tradesBySym = trades.groupBy("Symbol")
tradesUngrouped = tradesBySym.ungroup()
tradesWithDollars_1 = trades.update("Dollars=Price*Size")
tradesWithDollars_2 = tradesUngrouped.update("Dollars=Price*Size")
trades = (
db.live_table("LearnDeephaven", "StockTrades")
.where("Date=today()")
.view(["Symbol=Sym", "Price=Last", "Size"])
)
tradesBySym = trades.group_by("Symbol")
tradesUngrouped = tradesBySym.ungroup()
tradesWithDollars_1 = trades.update("Dollars=Price*Size")
tradesWithDollars_2 = tradesUngrouped.update("Dollars=Price*Size")
The tradesUngrouped
table will contain the same rows as the trades
table (though they will be in a different order). Likewise, tradesWithDollars_1
and tradesWithDollars_2
will contain the same data. However, despite their similarities, the two tables have very different performance profiles.
When a new row is appended to the source StockTrades table, and thus the trades
table in the query, all other rows in that table remain unchanged — the new row is the only "tick" to be processed. From here, the paths to tradesWithDollars_1
and tradesWithDollars_2
diverge.
Updating the tradesWithDollars_1
table requires evaluating the formula for Dollars only once, since the new row is the only tick in trades
.
Updating tradesBySym
is more work. Each row of tradesBySym
contains arrays of values for each Symbol. When a new row is appended to trades
, new arrays must be created for the Price and Size columns, featuring data from the new row. Recreating these arrays requires revisiting every row in trades
that has the same key (Symbol) as the new row. This constitutes tick expansion because a tick in one row has the same effect as a tick in all rows with the same key.
Tick expansion propagates further when ungrouping tradesBySym
to create tradesUngrouped
. Each time a row in tradesBySym
ticks, all rows affected by ungrouping that row will tick as well. As a result, when producing tradesWithDollars_2
, a single new row in StockTrades leads the formula for Dollars to be rerun for every row with the same Symbol as the new row.
The difference is stark. While tradesWithDollars_1
will process updates in constant time (the amount of processing per new row is always the same), the use of group_by
and ungroup
means the cost of updating tradesWithDollars_2
will increase with the number of rows for each Symbol.
Care should be taken when using the join operations on live data, since they perform a group_by on the right table. The join operation will perform an ungroup in addition to a group_by.