A simple (but not obvious) way to improve query performance

Every programming language has unique ways to improve efficiency. However, some strategies are universal -- for instance, not performing the same calculation any more than necessary.

Eliminate unnecessary calculations in Deephaven with a simple rule of thumb.

The engine

Deephaven tables are powerful. The engine has optimized many of the operations users can perform to be lightning-fast without any legwork. Nevertheless, there are still some considerations that need to be made to ensure queries run efficiently. For instance, using built-in query language methods in query strings is faster than using Python methods.

The query below uses built-in query language methods. Still, it could be improved with a simple fix. See if you can spot how.

from deephaven import read_csv

crypto_trades = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv")

trades_after_3 = crypto_trades.where("toLocalTime(Timestamp, timeZone(`ET`)) >= parseLocalTime(`15:00:00`)")

In the example above, the slowdown occurs on the right-hand side of the query string in the where filter. In the case of this query, parseLocalTime() is being called for every row in the table, despite the fact that it is always given the same input and returns the same value. So, this query is performing the same calculation over, and over, and over, and... well, you get the point.

Thankfully, the fix is simple. Deephaven's new time library allows the local time to be compared to a string enclosed by single quotes (').

from deephaven import read_csv

crypto_trades = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv")

trades_after_3 = crypto_trades.where("toLocalTime(Timestamp, timeZone(`ET`)) >= '15:00:00'")

By using a string literal, rather than a method call, the 15:00:00 gets parsed exactly once, and the resulting variable is used to calculate all rows.

Just how much faster is the second query than the first? Let's find out by timing each.

from deephaven import read_csv
from time import time

crypto_trades = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv")

start = time()
for idx in range(100):
    trades_after_3 = crypto_trades.where("toLocalTime(Timestamp, timeZone(`ET`)) >= parseLocalTime(`15:00:00`)")
end = time()

print(f"Compute every time: {((end - start)/100):3f} seconds.")

from deephaven import read_csv
from time import time

crypto_trades = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv")

start = time()
for idx in range(100):
    trades_after_3 = crypto_trades.where("toLocalTime(Timestamp, timeZone(`ET`)) >= '15:00:00'")
end = time()

print(f"Compare against string: {((end - start)/100):.3f} seconds.")

That's more than 3x faster!

That kind of performance difference could make or break a query.

The rule of thumb

Remember this rule of thumb:

If a calculation run in a query string will always return the same value, use a constant instead.

Whether you want to use a string, or the Python method outside of a query string, your query will always be faster.

This holds true for all of Deephaven's APIs.

Reach out

Have any questions for us? Reach out on Slack!

Avoid unnecessary recalculations in query strings

The engine

The rule of thumb

Reach out