Deephaven Community Core version 0.38.0 is out now! It's been a few months since the last Community release, so there's a lot to cover. Let's get into it!
New features
Enhanced natural join behavior
You can now specify how a natural_join handles duplicate matches in the right table via the new type argument. Available choices are:
- ERROR_ON_DUPLICATE: Raises an error if a key has more than one match in the right table. This is the default behavior of the operation if type is not given.
- FIRST_MATCH: Equivalent to running a first_by on the right table prior to the join.
- LAST_MATCH: Equivalent to running a last_by on the right table prior to the join.
- EXACTLY_ONE_MATCH: Equivalent to running an exact_join operation.
The following code shows all four of these options in action:
from deephaven import new_table
from deephaven.column import string_col, int_col
from deephaven.table import NaturalJoinType
source_left = new_table(
[
string_col("LastName", ["Rafferty", "Jones", "Steiner", "Robins", "Smith"]),
int_col("DeptID", [31, 33, 33, 34, 34]),
string_col(
"Telephone",
[
"(303) 555-0162",
"(303) 555-0149",
"(303) 555-0184",
"(303) 555-0125",
"",
],
),
]
)
source_right = new_table(
[
int_col("DeptID", [31, 33, 34, 35]),
string_col("DeptName", ["Sales", "Engineering", "Clerical", "Marketing"]),
string_col(
"DeptTelephone",
["(303) 555-0136", "(303) 555-0162", "(303) 555-0175", "(303) 555-0171"],
),
]
)
# Default
result_error_on_duplicates = source_left.natural_join(table=source_right, on=["DeptID"], type=NaturalJoinType.ERROR_ON_DUPLICATE)
# First match
result_first_match = source_left.natural_join(table=source_right, on=["DeptID"], type=NaturalJoinType.FIRST_MATCH)
# Last match
result_last_match = source_left.natural_join(table=source_right, on=["DeptID"], type=NaturalJoinType.LAST_MATCH)
# Exactly one match
result_exact_match = source_left.natural_join(table=source_right, on=["DeptID"], type=NaturalJoinType.EXACTLY_ONE_MATCH)
The count_where aggregation
Both the Python and Groovy APIs now support the count_where/countWhere aggregation. This operation counts the number of rows where one or more filter conditions are true. The filter conditions can be either conjunctive (AND) or disjunctive (OR). As with other aggregations, the calculation can be bucketed by one or more grouping columns. Here's an example in Python:
from deephaven import empty_table
from deephaven.agg import count_where
source = empty_table(100).update(["X = i", "Y = randomDouble(0, 1)", "Z = i % 3", "String = i % 2 == 0 ? `even` : `odd`"])
# An aggregation with no keys (grouping columns). The filters here are conjunctive (AND).
result_zerokeys = source.agg_by(aggs=count_where(col="count", filters=["X < 42", "Y >= 0.58"]))
# An aggregation with two grouping columns. The filters here are disjunctive (OR).
result_twokeys = source.agg_by(aggs=count_where(col="count", filters="X >= 29 || Y < 0.7"), by=["Z", "String"])
This feature is also available for update_by in both cumulative and rolling contexts:
from deephaven import empty_table
from deephaven.updateby import cum_count_where, rolling_count_where_tick
source = empty_table(100).update(["Key=randomInt(0,5)", "IntCol=randomInt(0,100)"])
# zero-key
result_zerokeys = source.update_by([
cum_count_where(col="CumulativeCountOver50", filters="IntCol > 50"),
rolling_count_where_tick(rev_ticks=50, col="RollingCountOver50", filters="IntCol > 50"),
])
# bucketed
result_onekey = source.update_by([
cum_count_where(col="CumulativeCountOver50", filters="IntCol > 50"),
rolling_count_where_tick(rev_ticks=50, col="RollingCountOver50", filters="IntCol > 50"),
], by="Key")
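The rolling variant comes in a time-based flavor as well. Here's a minimal sketch, assuming a Timestamp column on the source table and a trailing ten-second window; the column names are illustrative:

from deephaven import empty_table
from deephaven.updateby import rolling_count_where_time

# Build a source table with a timestamp column for the time-based window to key on.
source = empty_table(100).update([
    "Timestamp = '2025-01-01T00:00:00Z' + i * SECOND",
    "IntCol = randomInt(0, 100)",
])

# Count rows where IntCol > 50 over a trailing ten-second window.
result = source.update_by(
    rolling_count_where_time(ts_col="Timestamp", rev_time="PT10s", col="RollingCountOver50", filters="IntCol > 50")
)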
Additional features
JS API
- A custom gRPC transport layer for JS API consumers.
- JS API support for creating and consuming shared tickets.
Python
- A new systemic_obj_tracker Python module that allows users to enable or disable systemic object marking.
- Partitioned table support in the Python Table Data Service.
- An is_failed property on tables in Python to check whether a table has failed and is no longer usable (see the sketch after this list).
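For example, checking the new property on a simple ticking table might look like this. A minimal sketch; the time_table source is just illustrative:

from deephaven import time_table

# A simple ticking table that emits one row per second.
t = time_table("PT1s")

# is_failed is False while the table is healthy; it becomes True if the
# table errors out and can no longer be used.
print(t.is_failed)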
Parquet and Iceberg
- Custom resolution of Parquet file columns into Deephaven table columns based on arbitrary criteria.
- Support for reading Parquet data and Iceberg tables from URIs that use the s3a and s3n schemes (see the sketch below).
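For instance, reading Parquet data from an s3a URI might look like the following sketch. The bucket, path, and region here are hypothetical, and anonymous access is assumed:

from deephaven import parquet
from deephaven.experimental import s3

# Read a Parquet file from an s3a URI. The bucket and path are placeholders.
source = parquet.read(
    "s3a://example-bucket/data/table.parquet",
    special_instructions=s3.S3Instructions(
        region_name="us-east-1",
        anonymous_access=True,
    ),
)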
The engine
- The ability to disable data index usage in the engine.
Improvements
Improvements include both general enhancements and bug fixes. Here are some of the most notable ones.
General improvements
Iceberg
- Deephaven now verifies supported data types before writing to Iceberg, failing early if it encounters any unsupported types.
- The Iceberg writing API has been simplified, making S3-specific instructions optional by defaulting to settings from the catalog.
- Sort order for Iceberg tables is now properly recognized and handled.
JS API
- Median is now available to the JS API as an aggregation option.
The engine
- Better Data Index performance.
Bug fixes
The bug fixes in the subsections below are not comprehensive but cover the most significant issues resolved in this release.
Iceberg and Parquet
- Iceberg tables partitioned by date can now be read properly.
- Deephaven no longer resolves credentials on every S3 read if no credentials were given.
Python
- The Python gRPC client no longer fails calls that had half-closed successfully.
- The Python Table Data Service now properly handles Optional parameters in callback signatures.
- Deephaven now properly handles Python's shape typing in UDF parsing.
UI
- Plots on aggregated tables will now tick properly in lock-step with the source table.
- An error should no longer occur when switching between UI tabs with Deephaven Express plots.
- Charts via both Deephaven Express and the built-in plotting API should no longer fail on certain table types.
- Null boolean cells now accurately portray the underlying data in the UI.
Integrations
- The Flight SQL server now properly acquires the shared lock, enabling all table operations against refreshing tables.
General
- Left and full outer joins now work properly when the right-hand table is initially empty.
- Minimum and maximum update_by operations now return column types that match the input data.
- Performing a count_where on a rollup table will no longer improperly perform a regular count.
- Rollup tables will no longer print DEPENDENCY_RELEASED errors to the console when groupings are added.
- update_by will no longer incorrectly order resultant columns.
- Snapshotting a sorted rollup table will now produce the correct results.
Reach out
Our Slack community continues to grow! Join us there for updates and help with your queries.