Join
Joining data from one table to another is vital in any data system. Deephaven provides several join methods to choose from depending on your specific needs. In Deephaven, joins update in real time as new data becomes available:
- Natural Join -
naturalJoin()
- As-of Join / Reverse-as-of Join -
aj()
andraj()
- Exact Join -
exactJoin()
- Left Join -
leftJoin()
- Join -
join ()
The syntax of all joins follows this pattern (where the joinMethod
appears in the list above):
leftTable.joinMethod(rightTable, "Key(s)", "Column(s)ToJoinIn")
You may specify multiple ColumnsToJoinIn
. Sets have quotes around the whole set, not each element:
- Use double quotes around a single key -
“Key”
- or around a set -“Key1, Key2, Key3”
. - Use double quotes around a single
"ColumnToJoinIn”
or around the set of“Col1, Col2, Col3”
.
We'll discuss these general syntax components below.
Concepts
Left Table and Right Table
The leftTable
is the starting base table to which data is joined, and the rightTable
is the table from which data comes.
Tip
Join data from the right to the left.
Join Keys
The join key or keys establishes the identifier in the leftTable
that will be used to find data in the rightTable
. Any data type can be chosen as a key, and there can be multiple keys.
Think of keys as “columns to match” in all Deephaven Joins.
-
Keys are chosen based on use cases: “I want to bring data to the table todayTrades, from the table yestTrades, based on the USym of each row (in todayTrades).”
leftTable = todayTrades
rightTable = yestTrades
key = USym
-
“I want to bring data to the table risk, from the table factorData, based on the Sector and FactorName of each row (in risk).”
leftTable = risk
rightTable = factorData
keys = USym, FactorName
Keys can have different column names in the two tables. To rename columns within your query, use one of the following options:
leftTable.joinMethod(rightTable, "KeyLeftName = KeyRightName")
leftTable.joinMethod(rightTable, "Key1, Key2LeftName=Key2RightName")
left.joinMethod(right, "KeyLeft1=KeyRight1, KeyLeft2=KeyRight2")
Warning
Queries will error out if the identified Key columns do not exist in the respective tables.
Columns to Join In
You can choose what columns to bring from the rightTable to the leftTable by populating the ColumnsToJoinIn
argument.
- Bring all the non-Key columns:
leftTable.joinMethod(rightTable, "Key")
- Bring only the ones I select:
leftTable.joinMethod(rightTable, "Key", "Col1")
- Bring several that I select:
leftTable.joinMethod(rightTable, "Key", "Col1, Col2, ColZ")
- Bring and rename the ones I select:
leftTable.joinMethod(rightTable, "Key", "New1=Col1, New2=Col2")
Warning
Queries will error out if either:
- One of the
columnsToJoinIn
from therightTable
already exists in theleftTable
(i.e., same name). - The identified column from the
rightTable
does not exist.
Examples
Continue on to learn about each distinct join method and their particularities.
Natural Join
The naturalJoin
is somewhat like an Excel vlookup or SQL leftJoin:
- The leftTable always retains the same number of rows with which it started.
- If there are no matching values for a row, the appended cell(s) from the rightTable will contain NULL values.
- The right side table can only have one row for each key of the join.
Syntax
leftTable.naturalJoin(rightTable, "Key", "ColumnsToJoinIn")
One-key naturalJoin()
leftTable = db.t("LearnDeephaven", "StockTrades").where("Date = `2017-08-25`" , "isBusinessTime(ExchangeTimestamp)").lastBy("USym").view("Date", "USym", "Last")
rightTable = db.t("LearnDeephaven", "StockTrades").where("Date = `2017-08-24`", "isBusinessTime(ExchangeTimestamp)").lastBy("USym").view("Date", "USym", "Last")
njThem = leftTable.naturalJoin(rightTable,"USym", "ThursLast = Last")
Multi-key naturalJoin()
# get Vwap prices by hour
leftTable2 = db.t("LearnDeephaven", "StockTrades").where("Date = `2017-08-25`" , "isBusinessTime(ExchangeTimestamp)")\
.view("USym", "TimeBin = upperBin(ExchangeTimestamp, HOUR)", "Size", "Dollars = Last * Size")\
.sumBy("USym", "TimeBin").updateView("Vwap = Dollars / Size")
# get the last price of each hour
rightTable2 = db.t("LearnDeephaven", "StockTrades").where("Date = `2017-08-25`" , "isBusinessTime(ExchangeTimestamp)")\
.view("USym", "TimeBin = upperBin(ExchangeTimestamp, HOUR)", "Last", "Size")\
.lastBy("USym", "TimeBin")
njThem2 = leftTable2.naturalJoin(rightTable2, "USym, TimeBin",\
"Last, Last_Size = Size")
As-of Join and Reverse-as-of Join
Unique to Deephaven, the As-of Join and the Reverse As-Of Join methods enable users to join data from tables that do not have an exact match in the key argument. This is especially useful when analyzing temporal data.
- As-of Join allows you to match on the closest value that is less than or equal to.
- Reverse As-of Join allows you to match on the closest value that is equal to or greater than.
Syntax
As-of joins follow the same syntax as other joins, but the method usually involves two or more matching columns.
.aj(rightTable, "Key, LastKey", "ColumnsToJoinIn")
The initial key(s) is used to perform an exact match in both the left table and the right table, while the last key does not need to be an exact match. Instead, for that key, Deephaven finds the value in the right table that is closest to the corresponding value in the left table without going over (i.e. “same or just-earlier”). Technically, any sorted column can be used as the (last) join key, but in most cases some version of timestamp will be employed.
One-key aj()
leftTable3 = db.t("LearnDeephaven", "StockTrades")\
.where("Date=`2017-08-25`")\
.view("Sym", "ExchangeTimestamp", "Last")
rightTable3 = db.t("LearnDeephaven", "StockTrades")\
.where("Date=`2017-08-25`", "USym = `GOOG`")\
.view("ExchangeTimestamp", "Last_GOOG = Last")
ajThem3 = leftTable3.aj(rightTable3, "ExchangeTimestamp")
Multi-key aj()
leftTable4 = db.t("LearnDeephaven", "StockTrades")\
.where("Date=`2017-08-25`")\
.view("USym", "ExchangeTimestamp", "TradePrice = Last")
rightTable4 = db.t("LearnDeephaven", "StockQuotes")\
.where("Date=`2017-08-25`")\
.view("USym", "ExchangeTimestamp", "Bid", "Ask")
ajThem4 = leftTable4.aj(rightTable4, "USym, ExchangeTimestamp",\
"QuoteTime = ExchangeTimestamp, Bid, Ask")\
.updateView("Mid = (Bid + Ask) / 2")
raj()
a table on itself
leftTable5 = db.t("LearnDeephaven", "StockTrades")\
.where("Date=`2017-08-25`")\
.view("USym", "ExchangeTimestamp", "TradePrice = Last")
# same table just lagged 5 minutes
rightTable5 = leftTable5\
.updateView("TimeLag5min = ExchangeTimestamp - 5 * MINUTE")
ajOnSameTable = leftTable5.raj(rightTable5,\
"USym, ExchangeTimestamp = TimeLag5min",\
"TimeLater = TimeLag5min, PriceLater = TradePrice")
Exact Join
The Exact Join method:
- Requires exactly one match from the rightTable for each of the rows in the leftTable.
- If there are zero or multiple matches, the operation will fail.
Syntax
leftTable.exactJoin(rightTable,"Key","ColumnsToJoinIn")
from deephaven import *
leftTable6 = db.t("LearnDeephaven", "StockTrades")\
.where("Date=`2017-08-25`")\
.headBy(3, "Sym")\
.view("Sym", "Last", "Size")
rightTable6 = leftTable6.by(caf.AggCombo(caf.AggAvg("AvgPrice=Last"), caf.AggSum("TotalShares=Size")), "Sym").update("TotalShares=(long)TotalShares")
ejThem =leftTable6.exactJoin(rightTable6, "Sym", "AvgPrice, TotalShares")
Join
When using the join
method, Deephaven compares the leftTable to the rightTable, and for every instance that matches the key argument for both tables, Deephaven will then "join" the data requested from the right table to the left table.
- If there is no match for a key from the leftTable in the rightTable, the row from the leftTable is removed from the result. (Note that for
naturalJoin()
, the joined columns are null.) - If there are multiple matches for a key between the two tables, multiple rows are produced; i.e., the leftTable row is repeated for each of the rightTable matches. (Note that for
naturalJoin()
, this query would fail.) - It is possible, therefore, for the results of a
join()
operation to have less rows or more rows than are present in the left table.
Syntax
leftTable.join(rightTable, "Key", "ColumnsToJoinIn", numRightBitsToReserve)
Note the optional argument numRightBitsToReserve
. This is the number of bits to reserve for the right table rows. The default is 10; see the OutOfKeySpaceException section below for more details.
One special case of the join
method is a cross-join, or cartesian product, which takes the following form:
crossJoin = leftTable.join(rightTable,"", "Col1,Col2")
Note that the second argument - the key - is empty. The result of this query is a table with every possible combination of rows from the two tables.
Note
The query below uses "rightTable6" from the previous example.
from deephaven import *
leftTable7 = db.t("LearnDeephaven", "StockTrades")\
.where("Date=`2017-08-25`")\
.headBy(3, "Sym")\
.view("Sym", "Last", "Size")
rightTable7 = rightTable6.tail(9)
j1 = leftTable7.join(rightTable7, "Sym", "AvgPrice, TotalShares")
duplicate = ttools.emptyTable(1).updateView("Sym=`MSFT`", "AvgPrice=72.9", "TotalShares=(long)23")
# Note: using (long)23 ensures 23 is interpreted as type long
rtDup = ttools.merge(rightTable6, duplicate)
j2 = leftTable7.join(rtDup, "Sym", "AvgPrice, TotalShares")
Troubleshooting an OutOfKeySpace Exception
To efficiently produce updates, the bits that represent a key for a given result row are split into two. Unless specified, join
reserves 10 bits to represent a right row. When there are too few bits to represent all of the right rows for a given aggregation group the table will shift a bit from the left side to the right side. The default of 10 bits was carefully chosen because it results in an efficient implementation to process live updates.
An OutOfKeySpaceException
is thrown when the total number of bits needed to express the result table exceeds that needed to represent Long.MAX_VALUE
. The exception provides details such as how much key-space the left and right tables are using, how sparse the tables appear to be, and how many of the bits in the result key-space are being used for each table.
There are a few workarounds:
- If the leftTable is sparse, consider flattening the leftTable. To flatten a table invoke the no-argument
flatten()
method. - If there are no key-columns and the rightTable is sparse, consider flattening the rightTable. (Note: when there are key-columns, flattening the rightTable will not affect bits allocated to right rows.)
- If the maximum size of a rightTable's group is small, you can reserve fewer bits by setting
numRightBitsToReserve
on initialization.
Left Join
- The
leftJoin()
method returns the exact rows of the leftTable. - The data joined in from the rightTable are grouped into arrays of data.
Syntax
.leftJoin(rightTable, "Key", "ColumnsToJoinIn")
For example, a leftTable of end-of-day trade-count could be leftJoin-ed by USym with a rightTable of trades where the Price column would be an array.
leftTable = trades.countBy("TradeCount", "USym")
rightTable = trades\
.where("isBusinessTime(ExchangeTimestamp)")\
.renameColumns("Price = Last")
joinThem = leftTable.leftJoin(rightTable, "USym", "Price")