Programmatically generate query strings with Python
The Deephaven Query Language allows users to write very powerful queries to filter and modify tables of data. Consider the following query, which uses an assignment formula to add a new column to a table and a filter formula to filter the resulting table.
from deephaven import empty_table
from deephaven.column import int_col
source = empty_table(7).update(["Value = i"])
result = source.update("X = sqrt(Value) + Value").where("X > 2 && X < 8")
- result
- source
Deephaven query strings are passed into table operations as Python strings. As such, all of the power of Python can be used to generate query strings. This can be convenient when working with complex queries. Let's work though a few examples that are simplified by using Python to generate query strings.
This guide assumes you are familiar with the use of strings, f-strings, loops, and list comprehension in Python. If not, please refer to the Python documentation for more information.
Many columns
In practice, queries may have a large number of inputs, making it inconvenient to type in each column name. Other times, the input column names are determined by user inputs and are not known when the query is written. Both of these situations can be addressed by using a list of column names to generate queries.
In the following example, an f-string and str.join
are used to create a query string to sum up all of the columns and then take the square root.
from deephaven import new_table
from deephaven.column import int_col
cols = ["A", "B", "C", "D"]
source = new_table([int_col(c, [0, 1, 2, 3, 4, 5, 6]) for c in cols])
result = source.update(f"X = sqrt(sum({','.join(cols)}))")
- result
- source
If the list of columns changes, the query string programmatically adapts:
from deephaven import new_table
from deephaven.column import int_col
cols = ["A", "B", "C", "D", "E"]
source = new_table([int_col(c, [0, 1, 2, 3, 4, 5, 6]) for c in cols])
result = source.update(f"X = sqrt(sum({','.join(cols)}))")
- result
- source
Repeated logic
Some queries repeat the same logic -- with minor tweaks. For example, a query may add columns containing data from 1, 5, and 10 minutes ago. Generated query strings can also help simplify these situations.
In the following example, an f-string is used to create columns of data from 1, 5, and 10 rows before.
from deephaven import empty_table
source = empty_table(100).update("X = ii")
offsets = [1, 5, 10]
result = source
for offset in offsets:
result = result.update(f"X{offset} = X_[ii-{offset}]")
- result
- source
This can be simplified further by using a list comprehension.
from deephaven import empty_table
source = empty_table(100).update("X = ii")
offsets = [1, 5, 10]
result = source.update([f"X{offset} = X_[ii-{offset}]" for offset in offsets])
- result
- source
Data analysis, particularly in finance, often involves binning data into time buckets for analysis. These queries rarely use a single time bucket to analyze the data - they often use several or more. Python's f-strings make queries shorter and more readable. Consider first, a query that places data into 9 different temporal buckets without f-strings:
from deephaven import empty_table
source = empty_table(100).update(
[
"Timestamp = '2024-03-15T09:30:00 ET' + i * MINUTE",
"Price = randomDouble(0, 100)",
"Size = randomInt(0, 25)",
]
)
result = source.update(
[
"Bin3Min = lowerBin(Timestamp, 3 * MINUTE)",
"Bin5Min = lowerBin(Timestamp, 5 * MINUTE)",
"Bin7Min = lowerBin(Timestamp, 7 * MINUTE)",
"Bin10Min = lowerBin(Timestamp, 10 * MINUTE)",
"Bin15Min = lowerBin(Timestamp, 15 * MINUTE)",
"Bin20Min = lowerBin(Timestamp, 20 * MINUTE)",
"Bin30Min = lowerBin(Timestamp, 30 * MINUTE)",
"Bin45Min = lowerBin(Timestamp, 45 * MINUTE)",
"Bin60Min = lowerBin(Timestamp, 60 * MINUTE)",
]
)
- result
- source
Not only was that query tedious to write, but the formatting is long and repetitive, and doesn't take advantage of Python's power. Consider the following query, which does the same thing, but with f-strings and list comprehension.
from deephaven import empty_table
source = empty_table(100).update(
[
"Timestamp = '2024-03-15T09:30:00 ET' + i * MINUTE",
"Price = randomDouble(0, 100)",
"Size = randomInt(0, 25)",
]
)
bin_sizes = [3, 5, 7, 10, 15, 20, 30, 45, 60]
result = source.update(
[
f"Bin{bin_size}Min = lowerBin(Timestamp, {bin_size} * MINUTE)"
for bin_size in bin_sizes
]
)
- result
- source
This query is shorter, faster to write, and easier to read for a Python programmer. It also makes future updates easier to write, and changes to the bin_sizes
list are automatically reflected in the result
table.
Be creative!
Programatically generating query strings works for all Deephaven operations, not just update
. For example, this case uses multiple programatically generated query strings while performing a join.
from deephaven import empty_table
source = empty_table(100).update(["X = ii", "Y = X", "Z = sqrt(X)"])
offsets = [1, 5, 10]
result = source
for offset in offsets:
result = result.natural_join(
source.update(f"XOffset = X+{offset}"),
on="X=XOffset",
joins=[f"Y{offset}=Y", f"Z{offset}=Z"],
)
- result
- source