Using Deephaven's QueryScope
At Deephaven Data Labs, we highlight the importance of hiring the best developers to deliver the best product to our client. So what makes a top tier developer? Many things make a great developer, such as experience within the career field, knowledge of multiple coding languages, and the ability to create the most efficient code. Creating efficient code is extremely important when it comes to writing programs with long run times. While using Deephaven’s database, it is common to create programs that will analyze millions - if not billions - of rows of data. Coding with big data can lead to long runtimes - luckily for Deephaven users, our developers anticipated these challenges and built-in a toolset within the query language to help optimize and speed up code. One such function is the QueryScope function, which allows users to add variables to the query language.
Real-Life Example
I recently created a program that stored the end of day options and stock prices from the stock market. This program analyzed over 35 billion rows of data and narrowed the data down to only the relevant 10 million rows. The runtime to sort and save all of this data was over 15 hours, but if I had not optimized my code, the run time would have been over three times as long.
First Program:
def endOfDayTableMaker(options, stocks, start, end):
for day in calendar.businessDaysInRange(start, end):
print(day)
QueryScope.addParam("day", day)
stocksEOD = stocks.where("Date = day", "Timestamp < convertDateTime(day + `T16:00:00 NY`)")
db.replaceTablePartition("Correlation", "StocksEndOfDay", day, stocksEOD)
optionsEOD = options.where("Date = day", "Timestamp < convertDateTime(day + `T16:00:00 NY`)")
db.replaceTablePartition("Correlation", "EndOfDay", day, optionsEOD)
return optionsEOD
On my first attempt, I successfully created a program that did exactly what I wanted it to do - store the end of day values. However, it took over 7 minutes to analyze one day’s worth of data and save that data. I needed my final program to analyze around a year and a half of stock and options data, and at 7 minutes per day, it would have to run for over 2 days. Desperate to reduce run time, I looked back to my code to see how to optimize it. Upon further inspection, it became clear that the convertDateTime function was being called for every row within the table that I was sorting. The purpose of the convertDateTime function is to take a string input and convert it to DBDateTime format; however, this conversion is slow when being called 35 billion times. In my first program, I put the convertDateTime
function within the where function: because .where()
filters each row based on the parameters within the function, convertDateTime
was being called for every row, up to 100 million times for just one day’s worth of data This was redundant because the convertDateTime
function returned the same value during the same day. As it turned out, optimizing this code was simple: all I had to do was utilize Deephaven's QueryScope function and move the convertDateTime function out of the where function.
Within the Deephaven Database, users can choose between using Groovy or Python for coding. Regardless of which one a user picks, the Deephaven Query Language is always used when manipulating tables with functions such as where and update. The QueryScope function works by taking a variable from Python or Groovy and making that variable available to the Query language. For example, in my first program, I used the QueryScope function to add the variable “day” to the query scope, which allowed me to use that variable from within the where function. My second program utilizes the QueryScope function to create a variable within the query scope that is the result of the convertDateTime
function.
Second Program:
def endOfDayTableMaker(options, stocks, start, end):
for day in calendar.businessDaysInRange(start, end):
print(day)
QueryScope.addParam("day", day)
QueryScope.addParam("time", convertDateTime(day + "T16:00:00 NY"))
stocksEOD = stocks.where("Date = day", "Timestamp < time")
db.replaceTablePartition("Correlation", "StocksEndOfDay", day, stocksEOD)
optionsEOD = options.where("Date = day", "Timestamp < time")
db.replaceTablePartition("Correlation", "EndOfDay", day, optionsEOD)
return optionsEOD
This only required the addition of one line of code, yet the difference is significant: now, the convertDateTime
function will only be called once per day and this variable can be accessed within the query language with the variable “time”. With this small change, the run time for one day decreased to under 2 minutes. This is a perfect example of the difference between code that works and code that is optimized.
Summary
Just because a program works doesn’t mean that it's the best that it can be. Oftentimes there are ways to decrease runtime and to make the code easier for readers to follow along. Ideally, code will not call functions multiple times if this can be avoided, and Deephaven's database provides an easy tool to work with tables efficiently as functions can be saved as variables using the QueryScope. Always keep an eye out for ways to improve your code and your coding skills.