Every programming language has unique ways to improve efficiency. However, some strategies are universal -- for instance, not performing the same calculation any more than necessary.
The engine
Deephaven tables are powerful. The engine has optimized many of the operations users can perform to be lightning-fast without any legwork. Nevertheless, there are still some considerations that need to be made to ensure queries run efficiently. For instance, using built-in query language methods in query strings is faster than using Python methods.
The query below uses built-in query language methods. Still, it could be improved with a simple fix. See if you can spot how.
from deephaven import read_csv
crypto_trades = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv")
trades_after_3 = crypto_trades.where("toLocalTime(Timestamp, timeZone(`ET`)) >= parseLocalTime(`15:00:00`)")
- crypto_trades
- trades_after_3
In the example above, the slowdown occurs on the right-hand side of the query string in the where
filter. In the case of this query, parseLocalTime()
is being called for every row in the table, despite the fact that it is always given the same input and returns the same value. So, this query is performing the same calculation over, and over, and over, and... well, you get the point.
Thankfully, the fix is simple. Deephaven's new time library allows the local time to be compared to a string enclosed by single quotes ('
).
from deephaven import read_csv
crypto_trades = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv")
trades_after_3 = crypto_trades.where("toLocalTime(Timestamp, timeZone(`ET`)) >= '15:00:00'")
- crypto_trades
- trades_after_3
By using a string literal, rather than a method call, the 15:00:00
gets parsed exactly once, and the resulting variable is used to calculate all rows.
Just how much faster is the second query than the first? Let's find out by timing each.
from deephaven import read_csv
from time import time
crypto_trades = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv")
start = time()
for idx in range(100):
trades_after_3 = crypto_trades.where("toLocalTime(Timestamp, timeZone(`ET`)) >= parseLocalTime(`15:00:00`)")
end = time()
print(f"Compute every time: {((end - start)/100):3f} seconds.")
- crypto_trades
- trades_after_3
- Log
from deephaven import read_csv
from time import time
crypto_trades = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv")
start = time()
for idx in range(100):
trades_after_3 = crypto_trades.where("toLocalTime(Timestamp, timeZone(`ET`)) >= '15:00:00'")
end = time()
print(f"Compare against string: {((end - start)/100):.3f} seconds.")
- crypto_trades
- trades_after_3
- Log
That's more than 3x faster!
That kind of performance difference could make or break a query.
The rule of thumb
Remember this rule of thumb:
- If a calculation run in a query string will always return the same value, use a constant instead.
Whether you want to use a string, or the Python method outside of a query string, your query will always be faster.
This holds true for all of Deephaven's APIs.
Reach out
Have any questions for us? Reach out on Slack!