How to join two tables together
This guide discusses how to join two tables in Deephaven. There are several methods to choose from; while the basic syntax is the same, this article demonstrates the differences between each join. Please see our conceptual guide, Choose a join method, to learn more about which join is right for your use case.
Frequently, related data is stored in multiple tables. For example, a company might save its employee data in one table, and its department data in another, as shown below.
from deephaven import new_table
from deephaven.column import string_col, int_col
from deephaven.constants import NULL_INT
employee_table = new_table([
string_col("LastName", ["Rafferty", "Jones", "Steiner", "Robins", "Smith", "Rogers"]),
int_col("DeptID", [31, 33, 33, 34, 34, NULL_INT]),
string_col("Telephone", ["(303) 555-0162", "(303) 555-0149", "(303) 555-0184", "(303) 555-0125", None, None])
])
department_table = new_table([
int_col("DeptID", [31, 33, 34, 35]),
string_col("DeptName", ["Sales", "Engineering", "Clerical", "Marketing"]),
])
- employee_table
- department_table
However, there may be instances where you would like to combine - or join - certain columns of data from multiple, related tables. For example, you might want a single table that lists employee phone numbers along with their associated department names and department phone numbers.
Deephaven provides several join methods that cater to common use cases.
Join Type | Method |
---|---|
As-of Join | aj() |
Exact Join | exact_join() |
Full Outer Join | full_outer_join() |
Join | join() |
Left Outer Join | left_outer_join() |
Natural Join | natural_join() |
Range Join | range_join() |
Reverse As-of Join | raj() |
Syntax
Most join methods combine data from two tables. These tables are usually referred to as the "left table" and the "right table":
- The left table is the base table data is added to.
- The right table is the source of data added to the left table.
In most cases, there will be one or more columns (ColumnsToMatch
) used as keys to match data between the left and right tables. This format is fundamental for writing join statements in Deephaven. However, there are variations in the syntax for different circumstances.
Here is the basic format and syntax for all join methods:
result = left_table.join_method(table = right_table, on = ["ColumnsToMatch, ColumnsToAdd"])
result = left_table.join_method(table=right_table, on=["ColumnsToMatch"], joins=["ColumnsToJoin"])
The left table is followed by the specific join method we want to use. The arguments of the join method are:
- Right table, expressed as
table = Table
, - the column or columns on which to match, expressed as
on = ["ColumnToMatch1, ColumnToMatch2, ..."]
, and - (Optional) the column or columns in the right table to join to the left table, expressed as
joins = ["ColumnToJoin1, ColumnToJoin2, ..."]
. If excluded, all columns are joined.
Add all columns
Using the tables shown below, we want to get data from the department_table
(the right table) and join it to the employee_table
(the left table). Both tables have DeptID
as a common column which we can use as a key to join the content from the right table to the left table.
If we want to join all the columns from the right table, we use the two-argument join method:
result_table = left_table.join_method(table=right_table, on=["ColumnsToMatch"])
from deephaven import new_table
from deephaven.column import string_col, int_col
from deephaven.constants import NULL_INT
employee_table = new_table([
string_col("LastName", ["Rafferty", "Jones", "Steiner", "Robins", "Smith", "Rogers"]),
int_col("DeptID", [31, 33, 33, 34, 34, NULL_INT]),
string_col("Telephone", ["(303) 555-0162", "(303) 555-0149", "(303) 555-0184", "(303) 555-0125", None, None]),
])
department_table = new_table([
int_col("DeptID", [31, 33, 34, 35]),
string_col("DeptName", ["Sales", "Engineering", "Clerical", "Marketing"]),
string_col("DeptManager", ["Martinez", "Williams", "Garcia", "Lopez"]),
int_col("DeptGarage", [33, 52, 22, 45])
])
combined = employee_table.natural_join(table=department_table, on=["DeptID"])
- employee_table
- department_table
- combined
However, to reduce the number of columns in the output table, we can specify columns to add via the third argument (ColumnsToJoin
). ColumnsToJoin
is a comma-separated list (an array) of column names:
left_table.join_method(table = right_table, on = ["ColumnsToMatch"] , joins = ["ColumnToJoin1, ColumnToJoin2"])
from deephaven import new_table
from deephaven.column import string_col, int_col
from deephaven.constants import NULL_INT
employee_table = new_table([
string_col("LastName", ["Rafferty", "Jones", "Steiner", "Robins", "Smith", "Rogers"]),
int_col("DeptID", [31, 33, 33, 34, 34, NULL_INT]),
string_col("Telephone", ["(303) 555-0162", "(303) 555-0149", "(303) 555-0184", "(303) 555-0125", None, None]),
])
department_table = new_table([
int_col("DeptID", [31, 33, 34, 35]),
string_col("DeptName", ["Sales", "Engineering", "Clerical", "Marketing"]),
string_col("DeptManager", ["Martinez", "Williams", "Garcia", "Lopez"]),
int_col("DeptGarage", [33, 52, 22, 45])
])
combined = employee_table.natural_join(table=department_table, on=["DeptID"], joins=["DeptName, DeptManager"])
- employee_table
- department_table
- combined
Use multiple match columns
It is possible to join tables on multiple key columns. This is done by listing all of the key columns within the ColumnsToMatch
argument as comma-separated values.
For a join with two different ColumnsToMatch
, the format would look like this, with each matching column listed and separated by commas within the quotes containing the second argument:
left_table.join_method(table = right_table, on = ["ColumnToMatch1, ColumnToMatch2, ..."], joins = ["ColumnsToJoin"])
In the example below, only one element matches for both DeptID
and DeptGarage
:
from deephaven import new_table
from deephaven.column import string_col, int_col
from deephaven.constants import NULL_INT
employee_table = new_table([
string_col("LastName", ["Rafferty", "Jones", "Steiner", "Robins", "Smith", "Rogers"]),
int_col("DeptID", [31, 33, 33, 34, 34, NULL_INT]),
string_col("Telephone", ["(303) 555-0162", "(303) 555-0149", "(303) 555-0184", "(303) 555-0125", None, None]),
int_col("DeptGarage", [33, 33, 33, 52, 52, 22])
])
department_table = new_table([
int_col("DeptID", [31, 33, 34, 35]),
string_col("DeptName", ["Sales", "Engineering", "Clerical", "Marketing"]),
string_col("DeptManager", ["Martinez", "Williams", "Garcia", "Lopez"]),
int_col("DeptGarage", [33, 52, 22, 45])
])
combined = employee_table.natural_join(table = department_table, on =["DeptID, DeptGarage"])
- employee_table
- department_table
- combined
Match columns with different names
When joining data from two different tables, you may need to match on columns that do not have the same name. For example, in the tables below, the column representing the department number has a different name in each table.
To join the two tables based on this common column, we need to use the equals sign (=
) to indicate which pairs of columns should be matched. DeptID
is the name of the column containing the matching criteria in the left table, and DeptNumber
is the name of the column containing the matching criteria in the right table. Therefore, the following argument would be used for the ColumnsToMatch
argument: "DeptID = DeptNumber"
.
The basic syntax looks like this:
result = left_table.join_method(table = right_table, on = ["ColumnToMatchLeft = ColumnToMatchRight"], joins = ["ColumnsToJoin"])
from deephaven import new_table
from deephaven.column import string_col, int_col
from deephaven.constants import NULL_INT
employee_table = new_table([
string_col("LastName", ["Rafferty", "Jones", "Steiner", "Robins", "Smith", "Rogers"]),
int_col("DeptNumber", [31, 33, 33, 34, 34, NULL_INT]),
string_col("Telephone", ["(303) 555-0162", "(303) 555-0149", "(303) 555-0184", "(303) 555-0125", None, None]),
])
department_table = new_table([
int_col("DeptID", [31, 33, 34, 35]),
string_col("DeptName", ["Sales", "Engineering", "Clerical", "Marketing"]),
])
combined = employee_table.natural_join(table = department_table, on = ["DeptNumber=DeptID"])
- employee_table
- department_table
- combined
Rename appended columns
When joining data from two different tables, you may want to rename a column from the right table before adding it to the left table. For example, both tables below have a column labeled Telephone
. When adding data from the right table, Telephone
can be renamed to DeptTelephone
to avoid a conflict with the Telephone
column inherited from the left table.
from deephaven import new_table
from deephaven.column import string_col, int_col
from deephaven.constants import NULL_INT
employee_table = new_table([
string_col("LastName", ["Rafferty", "Jones", "Steiner", "Robins", "Smith", "Rogers"]),
int_col("DeptID", [31, 33, 33, 34, 34, NULL_INT]),
string_col("Telephone", ["(303) 555-0162", "(303) 555-0149", "(303) 555-0184", "(303) 555-0125", None, None]),
])
department_table = new_table([
int_col("DeptID", [31, 33, 34, 35]),
string_col("DeptName", ["Sales", "Engineering", "Clerical", "Marketing"]),
string_col("Telephone", ["(303) 555-0136", "(303) 555-0162", "(303) 555-0175", "(303) 555-0171"])
])
combined = employee_table.natural_join(table = department_table, on = ["DeptID"], joins = ["DeptName, DeptTelephone=Telephone"])
- employee_table
- department_table
- combined