deephaven.experimental.iceberg

This module adds Iceberg table support into Deephaven.

class IcebergCatalogAdapter(j_object)[source]

Bases: JObjectWrapper

This class provides an interface for interacting with Iceberg catalogs. It allows listing namespaces, tables and snapshots, as well as reading Iceberg tables into Deephaven tables.

j_object_type

alias of IcebergCatalogAdapter

load_table(table_identifier)[source]

Load the table from the catalog.

Parameters:

table_identifier (str) – the table to read.

Returns:

the table read from the catalog.

Return type:

Table

namespaces(namespace=None)[source]

Returns information on the namespaces in the catalog as a Deephaven table. If a namespace is specified, the tables in that namespace are listed; otherwise the top-level namespaces are listed.

Parameters:

namespace (Optional[str]) – the higher-level namespace from which to list namespaces; if omitted, the top-level namespaces are listed.

Return type:

Table

Returns:

a table containing the namespaces.

tables(namespace)[source]

Returns information on the tables in the specified namespace as a Deephaven table.

Parameters:

namespace (str) – the namespace from which to list tables.

Return type:

Table

Returns:

a table containing the tables in the provided namespace.

class IcebergReadInstructions(table_definition=None, data_instructions=None, column_renames=None, update_mode=None, snapshot_id=None)[source]

Bases: JObjectWrapper

This class specifies the instructions for reading an Iceberg table into Deephaven. These include column rename instructions and table definitions, as well as special data instructions for loading data files from the cloud.

Initializes the instructions using the provided parameters.

Parameters:
  • table_definition (Optional[TableDefinitionLike]) – the table definition; if omitted, the definition is inferred from the Iceberg schema. Setting a definition guarantees the returned table will have that definition. This is useful for specifying a subset of the Iceberg schema columns.

  • data_instructions (Optional[s3.S3Instructions]) – Special instructions for reading data files, useful when reading files from a non-local file system, like S3.

  • column_renames (Optional[Dict[str, str]]) – A dictionary of old to new column names that will be renamed in the output table.

  • update_mode (Optional[IcebergUpdateMode]) – The update mode for the table. If omitted, the default update mode of IcebergUpdateMode.static() is used.

  • snapshot_id (Optional[int]) – the snapshot id to read; if omitted the most recent snapshot will be selected.

Raises:

DHError – If unable to build the instructions object.

j_object_type

alias of IcebergReadInstructions

class IcebergTable(j_table)[source]

Bases: Table

IcebergTable is a subclass of Table that allows users to dynamically update the table with new snapshots from the Iceberg catalog.

abs_sum_by(by=None)

The abs_sum_by method creates a new table containing the absolute sum for each group.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

agg_all_by(agg, by=None)

The agg_all_by method creates a new table containing grouping columns and grouped data. The resulting grouped data is defined by the aggregation specified.

Note, because agg_all_by applies the aggregation to all the columns of the table, it will ignore any column names specified for the aggregation.

Parameters:
  • agg (Aggregation) – the aggregation

  • by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

agg_by(aggs, by=None, preserve_empty=False, initial_groups=None)

The agg_by method creates a new table containing grouping columns and grouped data. The resulting grouped data is defined by the aggregations specified.

Parameters:
  • aggs (Union[Aggregation, Sequence[Aggregation]]) – the aggregation(s)

  • by (Union[str, Sequence[str]]) – the group-by column name(s), if not provided, all rows from this table are grouped into a single group of rows before the aggregations are applied to the result, default is None.

  • preserve_empty (bool) – whether to keep result rows for groups that are initially empty or become empty as a result of updates. Each aggregation operator defines its own value for empty groups. Default is False.

  • initial_groups (Table) – a table whose distinct combinations of values for the group-by column(s) should be used to create an initial set of aggregation groups. All other columns are ignored. This is useful in combination with preserve_empty=True to ensure that particular groups appear in the result table, or with preserve_empty=False to control the encounter order for a collection of groups and thus their relative order in the result. Changes to this table are not expected or handled; if this table is a refreshing table, only its contents at instantiation time will be used. Default is None, the result will be the same as if a table is provided but no rows were supplied. When it is provided, the ‘by’ argument must be provided to explicitly specify the grouping columns.

Return type:

Table

Returns:

a new table

Raises:

DHError

aj(table, on, joins=None)

The aj (as-of join) method creates a new table containing all the rows and columns of the left table, plus additional columns containing data from the right table. For columns appended to the left table (joins), row values equal the row values from the right table where the keys from the left table most closely match the keys from the right table without going over. If there is no matching key in the right table, appended row values are NULL.

Parameters:
  • table (Table) – the right-table of the join

  • on (Union[str, Sequence[str]]) – the column(s) to match, can be a common name or a match condition of two columns, e.g. ‘col_a = col_b’. The first ‘N-1’ matches are exact matches. The final match is an inexact match. The inexact match can use either ‘>’ or ‘>=’. If a common name is used for the inexact match, ‘>=’ is used for the comparison.

  • joins (Union[str, Sequence[str]], optional) – the column(s) to be added from the right table to the result table, can be renaming expressions, i.e. “new_col = col”; default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

attributes()

Returns all the attributes defined on the table.

Return type:

Dict[str, Any]

avg_by(by=None)

The avg_by method creates a new table containing the average for each group.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

await_update(timeout=None)

Waits until either this refreshing Table is updated or the timeout elapses if provided.

Parameters:

timeout (int) – the maximum time to wait in milliseconds, default is None, meaning no timeout

Return type:

bool

Returns:

True when the table is updated or False when the timeout has been reached.

Raises:

DHError

coalesce()

Returns a coalesced child table.

Return type:

Table

property column_names

The column names of the table.

property columns

The column definitions of the table.

count_by(col, by=None)

The count_by method creates a new table containing the number of rows for each group.

Parameters:
  • col (str) – the name of the column to store the counts

  • by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

property definition

The table definition.

drop_columns(cols)

The drop_columns method creates a new table with the same size as this table but omits any of specified columns.

Parameters:

cols (Union[str, Sequence[str]) – the column name(s)

Return type:

Table

Returns:

a new table

Raises:

DHError

exact_join(table, on, joins=None)

The exact_join method creates a new table containing all the rows and columns of this table plus additional columns containing data from the right table. For columns appended to the left table (joins), row values equal the row values from the right table where the key values in the left and right tables are equal.

Parameters:
  • table (Table) – the right-table of the join

  • on (Union[str, Sequence[str]]) – the column(s) to match, can be a common name or an equal expression, i.e. “col_a = col_b” for different column names

  • joins (Union[str, Sequence[str]], optional) – the column(s) to be added from the right table to the result table, can be renaming expressions, i.e. “new_col = col”; default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

first_by(by=None)

The first_by method creates a new table containing the first row for each group.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

flatten()

Returns a new version of this table with a flat row set, i.e. from 0 to number of rows - 1.

Return type:

Table

format_column_where(col, cond, formula)

Applies color formatting to a column of the table conditionally.

Parameters:
  • col (str) – the column name

  • cond (str) – the condition expression

  • formula (str) – the formatting string in the form of assignment expression “column=color expression” where color_expression can be a color name or a Java ternary expression that results in a color.

Return type:

Table

Returns:

a new table

Raises:

DHError

format_columns(formulas)

Applies color formatting to the columns of the table.

Parameters:

formulas (Union[str, List[str]]) – formatting string(s) in the form of “column=color_expression” where color_expression can be a color name or a Java ternary expression that results in a color.

Return type:

Table

Returns:

a new table

Raises:

DHError

format_row_where(cond, formula)

Applies color formatting to rows of the table conditionally.

Parameters:
  • cond (str) – the condition expression

  • formula (str) – the formatting string in the form of assignment expression “column=color expression” where color_expression can be a color name or a Java ternary expression that results in a color.

Return type:

Table

Returns:

a new table

Raises:

DHError

group_by(by=None)

The group_by method creates a new table containing grouping columns and grouped data, column content is grouped into vectors.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

has_columns(cols)

Whether this table contains a column for each of the provided names, return False if any of the columns is not in the table.

Parameters:

cols (Union[str, Sequence[str]]) – the column name(s)

Returns:

bool

head(num_rows)

The head method creates a new table with a specific number of rows from the beginning of the table.

Parameters:

num_rows (int) – the number of rows at the head of table

Return type:

Table

Returns:

a new table

Raises:

DHError

head_by(num_rows, by=None)

The head_by method creates a new table containing the first number of rows for each group.

Parameters:
  • num_rows (int) – the number of rows at the beginning of each group

  • by (Union[str, Sequence[str]]) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

head_pct(pct)

The head_pct method creates a new table with a specific percentage of rows from the beginning of the table.

Parameters:

pct (float) – the percentage of rows to return as a value from 0 (0%) to 1 (100%).

Return type:

Table

Returns:

a new table

Raises:

DHError

Whether this table is a blink table.

property is_flat

Whether this table is guaranteed to be flat, i.e. its row set will be from 0 to number of rows - 1.

property is_refreshing

Whether this table is refreshing.

iter_chunk_dict(cols=None, chunk_size=2048)

Returns a generator that reads one chunk of rows at a time from the table into a dictionary. The dictionary is a map of column names to numpy arrays of the column data type.

If the table is refreshing and no update graph locks are currently being held, the generator will try to acquire the shared lock of the update graph before reading the table data. This provides a consistent view of the data. The side effect of this is that the table will not be able to refresh while the table is being iterated on. Additionally, the generator internally maintains a fill context. The auto acquired shared lock and the fill context will be released after the generator is destroyed. That can happen implicitly when the generator is used in a for-loop. When the generator is not used in a for-loop, to prevent resource leaks, it must be closed after use by either (1) setting it to None, (2) using the del statement, or (3) calling the close() method on it.

Parameters:
  • cols (Optional[Union[str, Sequence[str]]]) – The columns to read. If None, all columns are read.

  • chunk_size (int) – The number of rows to read at a time. Default is 2048.

Return type:

Generator[Dict[str, ndarray], None, None]

Returns:

A generator that yields a dictionary of column names to numpy arrays.

Raises

ValueError

iter_chunk_tuple(cols=None, tuple_name='Deephaven', chunk_size=2048)

Returns a generator that reads one chunk of rows at a time from the table into a named tuple. The named tuple is made up of fields with their names being the column names and their values being numpy arrays of the column data types.

If the table is refreshing and no update graph locks are currently being held, the generator will try to acquire the shared lock of the update graph before reading the table data. This provides a consistent view of the data. The side effect of this is that the table will not be able to refresh while the table is being iterated on. Additionally, the generator internally maintains a fill context. The auto acquired shared lock and the fill context will be released after the generator is destroyed. That can happen implicitly when the generator is used in a for-loop. When the generator is not used in a for-loop, to prevent resource leaks, it must be closed after use by either (1) setting it to None, (2) using the del statement, or (3) calling the close() method on it.

Parameters:
  • cols (Optional[Union[str, Sequence[str]]]) – The columns to read. If None, all columns are read.

  • tuple_name (str) – The name of the named tuple. Default is ‘Deephaven’.

  • chunk_size (int) – The number of rows to read at a time. Default is 2048.

Return type:

Generator[Tuple[ndarray, ...], None, None]

Returns:

A generator that yields a named tuple for each row in the table.

Raises:

ValueError

iter_dict(cols=None, *, chunk_size=2048)

Returns a generator that reads one row at a time from the table into a dictionary. The dictionary is a map of column names to scalar values of the column data type.

If the table is refreshing and no update graph locks are currently being held, the generator will try to acquire the shared lock of the update graph before reading the table data. This provides a consistent view of the data. The side effect of this is that the table will not be able to refresh while the table is being iterated on. Additionally, the generator internally maintains a fill context. The auto acquired shared lock and the fill context will be released after the generator is destroyed. That can happen implicitly when the generator is used in a for-loop. When the generator is not used in a for-loop, to prevent resource leaks, it must be closed after use by either (1) setting it to None, (2) using the del statement, or (3) calling the close() method on it.

Parameters:
  • cols (Optional[Union[str, Sequence[str]]]) – The columns to read. If None, all columns are read.

  • chunk_size (int) – The number of rows to read at a time internally to reduce the number of Java/Python boundary crossings. Default is 2048.

Return type:

Generator[Dict[str, Any], None, None]

Returns:

A generator that yields a dictionary of column names to scalar values.

Raises:

ValueError

iter_tuple(cols=None, *, tuple_name='Deephaven', chunk_size=2048)

Returns a generator that reads one row at a time from the table into a named tuple. The named tuple is made up of fields with their names being the column names and their values being of the column data types.

If the table is refreshing and no update graph locks are currently being held, the generator will try to acquire the shared lock of the update graph before reading the table data. This provides a consistent view of the data. The side effect of this is that the table will not be able to refresh while the table is being iterated on. Additionally, the generator internally maintains a fill context. The auto acquired shared lock and the fill context will be released after the generator is destroyed. That can happen implicitly when the generator is used in a for-loop. When the generator is not used in a for-loop, to prevent resource leaks, it must be closed after use by either (1) setting it to None, (2) using the del statement, or (3) calling the close() method on it.

Parameters:
  • cols (Optional[Union[str, Sequence[str]]]) – The columns to read. If None, all columns are read. Default is None.

  • tuple_name (str) – The name of the named tuple. Default is ‘Deephaven’.

  • chunk_size (int) – The number of rows to read at a time internally to reduce the number of Java/Python boundary crossings. Default is 2048.

Return type:

Generator[Tuple[Any, ...], None, None]

Returns:

A generator that yields a named tuple for each row in the table

Raises:

ValueError

j_object_type

alias of IcebergTable

join(table, on=None, joins=None)

The join method creates a new table containing rows that have matching values in both tables. Rows that do not have matching criteria will not be included in the result. If there are multiple matches between a row from the left table and rows from the right table, all matching combinations will be included. If no columns to match (on) are specified, every combination of left and right table rows is included.

Parameters:
  • table (Table) – the right-table of the join

  • on (Union[str, Sequence[str]]) – the column(s) to match, can be a common name or an equal expression, i.e. “col_a = col_b” for different column names; default is None

  • joins (Union[str, Sequence[str]], optional) – the column(s) to be added from the right table to the result table, can be renaming expressions, i.e. “new_col = col”; default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

last_by(by=None)

The last_by method creates a new table containing the last row for each group.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

layout_hints(front=None, back=None, freeze=None, hide=None, column_groups=None, search_display_mode=None)

Sets layout hints on the Table

Parameters:
  • front (Union[str, List[str]]) – the columns to show at the front.

  • back (Union[str, List[str]]) – the columns to show at the back.

  • freeze (Union[str, List[str]]) – the columns to freeze to the front. These will not be affected by horizontal scrolling.

  • hide (Union[str, List[str]]) – the columns to hide.

  • column_groups (List[Dict]) –

    A list of dicts specifying which columns should be grouped in the UI. The dicts can specify the following:

    • name (str): The group name

    • children (List[str]): The column names in the group

    • color (Optional[str]): The hex color string or Deephaven color name

  • search_display_mode (SearchDisplayMode) – set the search bar to explicitly be accessible or inaccessible, or use the system default. SearchDisplayMode.SHOW will show the search bar, SearchDisplayMode.HIDE will hide the search bar, and SearchDisplayMode.DEFAULT will use the default value configured by the user and system settings.

Return type:

Table

Returns:

a new table with the layout hints set

Raises:

DHError

lazy_update(formulas)

The lazy_update method creates a new table containing a new, cached, formula column for each formula.

Parameters:

formulas (Union[str, Sequence[str]]) – the column formula(s)

Return type:

Table

Returns:

a new table

Raises:

DHError

max_by(by=None)

The max_by method creates a new table containing the maximum value for each group.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

median_by(by=None)

The median_by method creates a new table containing the median for each group.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

property meta_table

The column definitions of the table in a Table form.

min_by(by=None)

The min_by method creates a new table containing the minimum value for each group.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

move_columns(idx, cols)

The move_columns method creates a new table with specified columns moved to a specific column index value. Columns may be renamed with the same semantics as rename_columns. The renames are simultaneous and unordered, enabling direct swaps between column names. Specifying a source or destination more than once is prohibited.

Parameters:
  • idx (int) – the column index where the specified columns will be moved in the new table.

  • cols (Union[str, Sequence[str]]) – the column name(s) or the column rename expr(s) as “X = Y”

Return type:

Table

Returns:

a new table

Raises:

DHError

move_columns_down(cols)

The move_columns_down method creates a new table with specified columns appearing last in order, to the far right. Columns may be renamed with the same semantics as rename_columns. The renames are simultaneous and unordered, enabling direct swaps between column names. Specifying a source or destination more than once is prohibited.

Parameters:

cols (Union[str, Sequence[str]]) – the column name(s) or the column rename expr(s) as “X = Y”

Return type:

Table

Returns:

a new table

Raises:

DHError

move_columns_up(cols)

The move_columns_up method creates a new table with specified columns appearing first in order, to the far left. Columns may be renamed with the same semantics as rename_columns. The renames are simultaneous and unordered, enabling direct swaps between column names. Specifying a source or destination more than once is prohibited.

Parameters:

cols (Union[str, Sequence[str]]) – the column name(s) or the column rename expr(s) as “X = Y”

Return type:

Table

Returns:

a new table

Raises:

DHError

natural_join(table, on, joins=None)

The natural_join method creates a new table containing all the rows and columns of this table, plus additional columns containing data from the right table. For columns appended to the left table (joins), row values equal the row values from the right table where the key values in the left and right tables are equal. If there is no matching key in the right table, appended row values are NULL.

Parameters:
  • table (Table) – the right-table of the join

  • on (Union[str, Sequence[str]]) – the column(s) to match, can be a common name or an equal expression, i.e. “col_a = col_b” for different column names

  • joins (Union[str, Sequence[str]], optional) – the column(s) to be added from the right table to the result table, can be renaming expressions, i.e. “new_col = col”; default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

partition_by(by, drop_keys=False)

Creates a PartitionedTable from this table, partitioned according to the specified key columns.

Parameters:
  • by (Union[str, Sequence[str]]) – the column(s) by which to group data

  • drop_keys (bool) – whether to drop key columns in the constituent tables, default is False

Return type:

PartitionedTable

Returns:

A PartitionedTable containing a sub-table for each group

Raises:

DHError

partitioned_agg_by(aggs, by=None, preserve_empty=False, initial_groups=None)

The partitioned_agg_by method is a convenience method that performs an agg_by operation on this table and wraps the result in a PartitionedTable. If the argument ‘aggs’ does not include a partition aggregation created by calling agg.partition(), one will be added automatically with the default constituent column name __CONSTITUENT__.

Parameters:
  • aggs (Union[Aggregation, Sequence[Aggregation]]) – the aggregation(s)

  • by (Union[str, Sequence[str]]) – the group-by column name(s), default is None

  • preserve_empty (bool) – whether to keep result rows for groups that are initially empty or become empty as a result of updates. Each aggregation operator defines its own value for empty groups. Default is False.

  • initial_groups (Table) – a table whose distinct combinations of values for the group-by column(s) should be used to create an initial set of aggregation groups. All other columns are ignored. This is useful in combination with preserve_empty=True to ensure that particular groups appear in the result table, or with preserve_empty=False to control the encounter order for a collection of groups and thus their relative order in the result. Changes to this table are not expected or handled; if this table is a refreshing table, only its contents at instantiation time will be used. Default is None, the result will be the same as if a table is provided but no rows were supplied. When it is provided, the ‘by’ argument must be provided to explicitly specify the grouping columns.

Return type:

PartitionedTable

Returns:

a PartitionedTable

Raises:

DHError

raj(table, on, joins=None)

The reverse-as-of join method creates a new table containing all the rows and columns of the left table, plus additional columns containing data from the right table. For columns appended to the left table (joins), row values equal the row values from the right table where the keys from the left table most closely match the keys from the right table without going under. If there is no matching key in the right table, appended row values are NULL.

Parameters:
  • table (Table) – the right-table of the join

  • on (Union[str, Sequence[str]]) – the column(s) to match, can be a common name or a match condition of two columns, e.g. ‘col_a = col_b’. The first ‘N-1’ matches are exact matches. The final match is an inexact match. The inexact match can use either ‘<’ or ‘<=’. If a common name is used for the inexact match, ‘<=’ is used for the comparison.

  • joins (Union[str, Sequence[str]], optional) – the column(s) to be added from the right table to the result table, can be renaming expressions, i.e. “new_col = col”; default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

range_join(table, on, aggs)

The range_join method creates a new table containing all the rows and columns of the left table, plus additional columns containing aggregated data from the right table. For columns appended to the left table (joins), cell values equal aggregations over vectors of values from the right table. These vectors are formed from all values in the right table where the right table keys fall within the ranges of keys defined by the left table (responsive ranges).

range_join is a join plus aggregation that (1) joins arrays of data from the right table onto the left table, and then (2) aggregates over the joined data. Oftentimes this is used to join data for a particular time range from the right table onto the left table.

Rows from the right table with null or NaN key values are discarded; that is, they are never included in the vectors used for aggregation. For all rows that are not discarded, the right table must be sorted according to the right range column for all rows within a group.

Join key ranges, specified by the ‘on’ argument, are defined by zero-or-more exact join matches and a single range join match. The range join match must be the last match in the list.

The exact match expressions are parsed as in other join operations. That is, they are either a column name common to both tables or a column name from the left table followed by an equals sign followed by a column name from the right table. .. rubric:: Examples

Match on the same column name in both tables:

“common_column”

Match on different column names in each table:

“left_column = right_column” or “left_column == right_column”

The range match expression is expressed as a ternary logical expression, expressing the relationship between the left start column, the right range column, and the left end column. Each column name pair is separated by a logical operator, either < or <=. Additionally, the entire expression may be preceded by a left arrow <- and/or followed by a right arrow ->. The arrows indicate that range match can ‘allow preceding’ or ‘allow following’ to match values outside the explicit range. ‘Allow preceding’ means that if no matching right range column value is equal to the left start column value, the immediately preceding matching right row should be included in the aggregation if such a row exists. ‘Allow following’ means that if no matching right range column value is equal to the left end column value, the immediately following matching right row should be included in the aggregation if such a row exists. .. rubric:: Examples

For less than paired with greater than:

“left_start_column < right_range_column < left_end_column”

For less than or equal paired with greater than or equal:

“left_start_column <= right_range_column <= left_end_column”

For less than or equal (allow preceding) paired with greater than or equal (allow following):

“<- left_start_column <= right_range_column <= left_end_column ->”

Special Cases

In order to produce aggregated output, range match expressions must define a range of values to aggregate over. There are a few noteworthy special cases of ranges.

Empty Range An empty range occurs for any left row with no matching right rows. That is, no non-null, non-NaN right rows were found using the exact join matches, or none were in range according to the range join match.

Single-value Ranges A single-value range is a range where the left row’s values for the left start column and left end column are equal and both relative matches are inclusive (<= and >=, respectively). For a single-value range, only rows within the bucket where the right range column matches the single value are included in the output aggregations.

Invalid Ranges An invalid range occurs in two scenarios:

  1. When the range is inverted, i.e., when the value of the left start column is greater than the value of the left end column.

  2. When either relative-match is exclusive (< or >) and the value in the left start column is equal to the value in the left end column.

For invalid ranges, the result row will be null for all aggregation output columns.

Undefined Ranges An undefined range occurs when either the left start column or the left end column is NaN. For rows with an undefined range, the corresponding output values will be null (as with invalid ranges).

Unbounded Ranges A partially or fully unbounded range occurs when either the left start column or the left end column is null. If the left start column value is null and the left end column value is non-null, the range is unbounded at the beginning, and only the left end column subexpression will be used for the match. If the left start column value is non-null and the left end column value is null, the range is unbounded at the end, and only the left start column subexpression will be used for the match. If the left start column and left end column values are null, the range is unbounded, and all rows will be included.

Note: At this time, implementations only support static tables. This operation remains under active development.

Parameters:
  • table (Table) – the right table of the join

  • on (Union[str, List[str]]) – the match expression(s) that must include zero-or-more exact match expression, and exactly one range match expression as described above

  • aggs (Union[Aggregation, List[Aggregation]]) – the aggregation(s) to perform over the responsive ranges from the right table for each row from this Table

Return type:

Table

Returns:

a new table

Raises:

DHError

Returns a non-blink child table, or this table if it is not a blink table.

Return type:

Table

rename_columns(cols)

The rename_columns method creates a new table with the specified columns renamed. The renames are simultaneous and unordered, enabling direct swaps between column names. Specifying a source or

destination more than once is prohibited.

Parameters:

cols (Union[str, Sequence[str]]) – the column rename expr(s) as “X = Y”

Return type:

Table

Returns:

a new table

Raises:

DHError

restrict_sort_to(cols)

The restrict_sort_to method adjusts the input table to produce an output table that only allows sorting on specified table columns. This can be useful to prevent users from accidentally performing expensive sort operations as they interact with tables in the UI.

Parameters:

cols (Union[str, Sequence[str]]) – the column name(s)

Returns:

a new table

Raises:

DHError

reverse()

The reverse method creates a new table with all of the rows from this table in reverse order.

Return type:

Table

Returns:

a new table

Raises:

DHError

rollup(aggs, by=None, include_constituents=False)

Creates a rollup table.

A rollup table aggregates by the specified columns, and then creates a hierarchical table which re-aggregates using one less by column on each level. The column that is no longer part of the aggregation key is replaced with null on each level.

Note some aggregations can not be used in creating a rollup tables, these include: group, partition, median, pct, weighted_avg

Parameters:
  • aggs (Union[Aggregation, Sequence[Aggregation]]) – the aggregation(s)

  • by (Union[str, Sequence[str]]) – the group-by column name(s), default is None

  • include_constituents (bool) – whether to include the constituent rows at the leaf level, default is False

Return type:

RollupTable

Returns:

a new RollupTable

Raises:

DHError

select(formulas=None)

The select method creates a new in-memory table that includes one column for each formula. If no formula is specified, all columns will be included.

Parameters:

formulas (Union[str, Sequence[str]], optional) – the column formula(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

select_distinct(formulas=None)

The select_distinct method creates a new table containing all the unique values for a set of key columns. When the selectDistinct method is used on multiple columns, it looks for distinct sets of values in the selected columns.

Parameters:

formulas (Union[str, Sequence[str]], optional) – the column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

property size

The current number of rows in the table.

slice(start, stop)

Extracts a subset of a table by row positions into a new Table.

If both the start and the stop are positive, then both are counted from the beginning of the table. The start is inclusive, and the stop is exclusive. slice(0, N) is equivalent to head() (N) The start must be less than or equal to the stop.

If the start is positive and the stop is negative, then the start is counted from the beginning of the table, inclusively. The stop is counted from the end of the table. For example, slice(1, -1) includes all rows but the first and last. If the stop is before the start, the result is an empty table.

If the start is negative, and the stop is zero, then the start is counted from the end of the table, and the end of the slice is the size of the table. slice(-N, 0) is equivalent to tail() (N).

If the start is negative and the stop is negative, they are both counted from the end of the table. For example, slice(-2, -1) returns the second to last row of the table.

Parameters:
  • start (int) – the first row position to include in the result

  • stop (int) – the last row position to include in the result

Return type:

Table

Returns:

a new Table

Raises:

DHError

Examples

>>> table.slice(0, 5)    # first 5 rows
>>> table.slice(-5, 0)   # last 5 rows
>>> table.slice(2, 6)    # rows from index 2 to 5
>>> table.slice(6, 2)    # ERROR: cannot slice start after end
>>> table.slice(-6, -2)  # rows from 6th last to 2nd last (exclusive)
>>> table.slice(-2, -6)  # ERROR: cannot slice start after end
>>> table.slice(2, -3)   # all rows except the first 2 and the last 3
>>> table.slice(-6, 8)   # rows from 6th last to index 8 (exclusive)
slice_pct(start_pct, end_pct)

Extracts a subset of a table by row percentages.

Returns a subset of table in the range [floor(start_pct * size_of_table), floor(end_pct * size_of_table)). For example, for a table of size 10, slice_pct(0.1, 0.7) will return a subset from the second row to the seventh row. Similarly, slice_pct(0, 1) would return the entire table (because row positions run from 0 to size - 1). The percentage arguments must be in range [0, 1], otherwise the function returns an error.

Parameters:
  • start_pct (float) – the starting percentage point (inclusive) for rows to include in the result, range [0, 1]

  • end_pct (float) – the ending percentage point (exclusive) for rows to include in the result, range [0, 1]

Return type:

Table

Returns:

a new table

Raises:

DHError

snapshot()

Returns a static snapshot table.

Return type:

Table

Returns:

a new table

Raises:

DHError

snapshot_when(trigger_table, stamp_cols=None, initial=False, incremental=False, history=False)

Returns a table that captures a snapshot of this table whenever trigger_table updates.

When trigger_table updates, a snapshot of this table and the “stamp key” from trigger_table form the resulting table. The “stamp key” is the last row of the trigger_table, limited by the stamp_cols. If trigger_table is empty, the “stamp key” will be represented by NULL values.

Note: the trigger_table must be append-only when the history flag is set to True. If the trigger_table is not append-only and has modified or removed rows in its updates, the result snapshot table will be put in a failure state and become unusable.

Parameters:
  • trigger_table (Table) – the trigger table

  • stamp_cols (Union[str, Sequence[str]) – The columns from trigger_table that form the “stamp key”, may be renames. None, or empty, means that all columns from trigger_table form the “stamp key”.

  • initial (bool) – Whether to take an initial snapshot upon construction, default is False. When False, the resulting table will remain empty until trigger_table first updates.

  • incremental (bool) – Whether the resulting table should be incremental, default is False. When False, all rows of this table will have the latest “stamp key”. When True, only the rows of this table that have been added or updated will have the latest “stamp key”.

  • history (bool) – Whether the resulting table should keep history, default is False. A history table appends a full snapshot of this table and the “stamp key” as opposed to updating existing rows. The history flag is currently incompatible with initial and incremental: when history is True, incremental and initial must be False.

Return type:

Table

Returns:

a new table

Raises:

DHError

sort(order_by, order=None)

The sort method creates a new table where the rows are ordered based on values in a specified set of columns.

Parameters:
  • order_by (Union[str, Sequence[str]]) – the column(s) to be sorted on

  • order (Union[SortDirection, Sequence[SortDirection], optional) – the corresponding sort directions for each sort column, default is None, meaning ascending order for all the sort columns.

Return type:

Table

Returns:

a new table

Raises:

DHError

sort_descending(order_by)

The sort_descending method creates a new table where rows in a table are sorted in descending order based on the order_by column(s).

Parameters:

order_by (Union[str, Sequence[str]], optional) – the column name(s)

Return type:

Table

Returns:

a new table

Raises:

DHError

std_by(by=None)

The std_by method creates a new table containing the sample standard deviation for each group.

Sample standard deviation is computed using Bessel’s correction, which ensures that the sample variance will be an unbiased estimator of population variance.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

sum_by(by=None)

The sum_by method creates a new table containing the sum for each group.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

tail(num_rows)

The tail method creates a new table with a specific number of rows from the end of the table.

Parameters:

num_rows (int) – the number of rows at the end of table

Return type:

Table

Returns:

a new table

Raises:

DHError

tail_by(num_rows, by=None)

The tail_by method creates a new table containing the last number of rows for each group.

Parameters:
  • num_rows (int) – the number of rows at the end of each group

  • by (Union[str, Sequence[str]]) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

tail_pct(pct)

The tail_pct method creates a new table with a specific percentage of rows from the end of the table.

Parameters:

pct (float) – the percentage of rows to return as a value from 0 (0%) to 1 (100%).

Return type:

Table

Returns:

a new table

Raises:

DHError

to_string(num_rows=10, cols=None)

Returns the first few rows of a table as a pipe-delimited string.

Parameters:
  • num_rows (int) – the number of rows at the beginning of the table

  • cols (Union[str, Sequence[str]]) – the column name(s), default is None

Return type:

str

Returns:

string

Raises:

DHError

tree(id_col, parent_col, promote_orphans=False)

Creates a hierarchical tree table.

The structure of the table is encoded by an “id” and a “parent” column. The id column should represent a unique identifier for a given row, and the parent column indicates which row is the parent for a given row. Rows that have a None parent are part of the “root” table.

It is possible for rows to be “orphaned” if their parent is non-None and does not exist in the table. These rows will not be present in the resulting tree. If this is not desirable, they could be promoted to become children of the root table by setting ‘promote_orphans’ argument to True.

Parameters:
  • id_col (str) – the name of a column containing a unique identifier for a particular row in the table

  • parent_col (str) – the name of a column containing the parent’s identifier, {@code null} for rows that are part of the root table

  • promote_orphans (bool) – whether to promote node tables whose parents don’t exist to be children of the root node, default is False

Return type:

TreeTable

Returns:

a new TreeTable organized according to the parent-child relationships expressed by id_col and parent_col

Raises:

DHError

ungroup(cols=None)

The ungroup method creates a new table in which array columns from the source table are unwrapped into separate rows.

Parameters:

cols (Union[str, Sequence[str]], optional) – the name(s) of the array column(s), if None, all array columns will be ungrouped, default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

update(snapshot_id=None)[source]

Updates the table to match the contents of the specified snapshot. This may result in row removes and additions that will be propagated asynchronously via this IcebergTable’s UpdateGraph. If no snapshot is provided, the most recent snapshot is used.

NOTE: this method is only valid when the table is in manual_refresh() mode. Iceberg tables in static() or auto_refresh() mode cannot be updated manually and will throw an exception if this method is called.

Parameters:

snapshot_id (Optional[int]) – the snapshot id to update to; if omitted the most recent snapshot will be used.

Raises:

DHError – If unable to update the Iceberg table.

update_by(ops, by=None)

Creates a table with additional columns calculated from window-based aggregations of columns in this table. The aggregations are defined by the provided operations, which support incremental aggregations over the corresponding rows in the table. The aggregations will apply position or time-based windowing and compute the results over the entire table or each row group as identified by the provided key columns.

Parameters:
  • ops (Union[UpdateByOperation, List[UpdateByOperation]]) – the update-by operation definition(s)

  • by (Union[str, List[str]]) – the key column name(s) to group the rows of the table

Return type:

Table

Returns:

a new Table

Raises:

DHError

property update_graph

The update graph of the table.

update_view(formulas)

The update_view method creates a new table containing a new, formula column for each formula.

Parameters:

formulas (Union[str, Sequence[str]]) – the column formula(s)

Return type:

Table

Returns:

a new table

Raises:

DHError

var_by(by=None)

The var_by method creates a new table containing the sample variance for each group.

Sample variance is computed using Bessel’s correction, which ensures that the sample variance will be an unbiased estimator of population variance.

Parameters:

by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

view(formulas)

The view method creates a new formula table that includes one column for each formula.

Parameters:

formulas (Union[str, Sequence[str]]) – the column formula(s)

Return type:

Table

Returns:

a new table

Raises:

DHError

weighted_avg_by(wcol, by=None)

The weighted_avg_by method creates a new table containing the weighted average for each group.

Parameters:
  • wcol (str) – the name of the weight column

  • by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

weighted_sum_by(wcol, by=None)

The weighted_sum_by method creates a new table containing the weighted sum for each group.

Parameters:
  • wcol (str) – the name of the weight column

  • by (Union[str, Sequence[str]], optional) – the group-by column name(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

where(filters=None)

The where method creates a new table with only the rows meeting the filter criteria in the column(s) of the table.

Parameters:

filters (Union[str, Filter, Sequence[str], Sequence[Filter]], optional) – the filter condition expression(s) or Filter object(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

where_in(filter_table, cols)

The where_in method creates a new table containing rows from the source table, where the rows match values in the filter table. The filter is updated whenever either table changes.

Parameters:
  • filter_table (Table) – the table containing the set of values to filter on

  • cols (Union[str, Sequence[str]]) – the column name(s)

Return type:

Table

Returns:

a new table

Raises:

DHError

where_not_in(filter_table, cols)

The where_not_in method creates a new table containing rows from the source table, where the rows do not match values in the filter table.

Parameters:
  • filter_table (Table) – the table containing the set of values to filter on

  • cols (Union[str, Sequence[str]]) – the column name(s)

Return type:

Table

Returns:

a new table

Raises:

DHError

where_one_of(filters=None)

The where_one_of method creates a new table containing rows from the source table, where the rows match at least one filter.

Parameters:

filters (Union[str, Filter, Sequence[str], Sequence[Filter]], optional) – the filter condition expression(s), default is None

Return type:

Table

Returns:

a new table

Raises:

DHError

with_attributes(attrs)

Returns a new Table that has the provided attributes defined on it and shares the underlying data and schema with this table.

Note, the table attributes are immutable once defined, and are mostly used internally by the Deephaven engine. For advanced users, certain predefined plug-in attributes provide a way to extend Deephaven with custom-built plug-ins.

Parameters:

attrs (Dict[str, Any]) – a dict of table attribute names and their values

Return type:

Table

Returns:

a new Table

Raises:

DHError

without_attributes(attrs)

Returns a new Table that shares the underlying data and schema with this table but with the specified attributes removed.

Parameters:

attrs (Union[str, Sequence[str]]) – the attribute name(s) to be removed

Return type:

Table

Returns:

a new Table

Raises:

DHError

class IcebergTableAdapter(j_object)[source]

Bases: JObjectWrapper

This class provides an interface for interacting with Iceberg tables. It allows the user to list snapshots, retrieve table definitions and reading Iceberg tables into Deephaven tables.

definition(instructions=None)[source]

Returns the Deephaven table definition as a Deephaven table.

Parameters:

instructions (Optional[IcebergReadInstructions]) – the instructions for reading the table. These instructions can include column renames, table definition, and specific data instructions for reading the data files from the provider. If omitted, the table will be read with default instructions.

Return type:

Table

Returns:

a table containing the table definition.

j_object_type

alias of IcebergTableAdapter

snapshots()[source]

Returns information on the snapshots of this table as a Deephaven table. The table contains the following columns: - Id: the snapshot identifier (can be used for updating the table or loading a specific snapshot). - TimestampMs: the timestamp of the snapshot. - Operation: the data operation that created this snapshot. - Summary: additional information about this snapshot from the Iceberg metadata. - SnapshotObject: a Java object containing the Iceberg API snapshot.

Return type:

Table

Returns:

a table containing the snapshot information.

table(instructions=None)[source]

Reads the table using the provided instructions. Optionally, a snapshot id can be provided to read a specific snapshot of the table.

Parameters:

instructions (Optional[IcebergReadInstructions]) – the instructions for reading the table. These instructions can include column renames, table definition, and specific data instructions for reading the data files from the provider. If omitted, the table will be read in static() mode without column renames or data instructions.

Returns:

the table read from the catalog.

Return type:

Table

class IcebergUpdateMode(mode)[source]

Bases: JObjectWrapper

This class specifies the update mode for an Iceberg table to be loaded into Deephaven. The modes are:

classmethod auto_refresh(auto_refresh_ms=None)[source]

Creates an IcebergUpdateMode with auto-refreshing enabled.

Parameters:

auto_refresh_ms (int) – the refresh interval in milliseconds; if omitted, the default of 60 seconds is used.

Return type:

IcebergUpdateMode

j_object_type

alias of IcebergUpdateMode

classmethod manual_refresh()[source]

Creates an IcebergUpdateMode with manual refreshing enabled.

Return type:

IcebergUpdateMode

classmethod static()[source]

Creates an IcebergUpdateMode with no refreshing supported.

Return type:

IcebergUpdateMode

adapter(name=None, properties=None, hadoop_config=None, s3_instructions=None)[source]

Create an Iceberg catalog adapter from configuration properties. These properties map to the Iceberg catalog Java API properties and are used to select the catalog and file IO implementations.

The minimal set of properties required to create an Iceberg catalog are the following: - catalog-impl or type - the Java catalog implementation to use. When providing catalog-impl, the

implementing Java class should be provided (e.g. org.apache.iceberg.rest.RESTCatalog or org.apache.iceberg.aws.glue.GlueCatalog). Choices for type include hive, hadoop, rest, glue, nessie, jdbc.

To ensure consistent behavior across Iceberg-managed and Deephaven-managed AWS clients, it’s recommended to use the s3_instructions parameter to specify AWS/S3 connectivity details. This approach offers a high degree of parity in construction logic.

For complex use cases, consider using S3 Instruction profiles, which provide extensive configuration options. When set, these instructions are automatically included in the IcebergInstructions.__init__() data_instructions.

If you prefer to use Iceberg’s native AWS properties, Deephaven will attempt to infer the necessary construction logic. However, in advanced scenarios, there might be discrepancies between the two clients, potentially leading to limitations like being able to browse catalog metadata but not retrieve table data.

Other common properties include: - uri - the URI of the catalog - warehouse - the root path of the data warehouse.

Example usage #1 - REST catalog with an S3 backend: ``` from deephaven.experimental import iceberg from deephaven.experimental.s3 import S3Instructions, Credentials

adapter = iceberg.adapter(

name=”MyCatalog”, properties={

“type”: “rest”, “uri”: “http://my-rest-catalog:8181/api”, # Note: Other properties may be needed depending on the REST Catalog implementation # “warehouse”: “catalog-id”, # “credential”: “username:password”

}, s3_instructions=S3Instructions(

region_name=”us-east-1”, credentials=Credentials.basic(“my_access_key_id”, “my_secret_access_key”),

),

)

Example usage #2 - AWS Glue catalog: ``` from deephaven.experimental import iceberg from deephaven.experimental.s3 import S3Instructions

# Note: region and credential information will be loaded from the specified profile adapter = iceberg.adapter(

name=”MyCatalog”, properties={

“type”: “glue”, “uri”: “s3://lab-warehouse/sales”,

}, s3_instructions=S3Instructions(

profile_name=”MyGlueProfile”,

),

)

type name:

Optional[str]

param name:

a descriptive name of the catalog; if omitted the catalog name is inferred from the catalog URI property.

type name:

Optional[str]

type properties:

Optional[Dict[str, str]]

param properties:

the properties of the catalog to load

type properties:

Optional[Dict[str, str]]

type hadoop_config:

Optional[Dict[str, str]]

param hadoop_config:

hadoop configuration properties for the catalog to load

type hadoop_config:

Optional[Dict[str, str]]

type s3_instructions:

Optional[S3Instructions]

param s3_instructions:

the S3 instructions if applicable

type s3_instructions:

Optional[s3.S3Instructions]

returns:

the catalog adapter created from the provided properties

rtype:

IcebergCatalogAdapter

raises DHError:

If unable to build the catalog adapter

adapter_aws_glue(catalog_uri, warehouse_location, name=None)[source]

Create a catalog adapter using an AWS Glue catalog.

Parameters:
  • catalog_uri (str) – the URI of the REST catalog.

  • warehouse_location (str) – the location of the warehouse.

  • name (Optional[str]) – a descriptive name of the catalog; if omitted the catalog name is inferred from the catalog URI.

Returns:

the catalog adapter for the provided AWS Glue catalog.

Return type:

IcebergCatalogAdapter

Raises:

DHError – If unable to build the catalog adapter.

adapter_s3_rest(catalog_uri, warehouse_location, name=None, region_name=None, access_key_id=None, secret_access_key=None, end_point_override=None)[source]

Create a catalog adapter using an S3-compatible provider and a REST catalog.

Parameters:
  • catalog_uri (str) – the URI of the REST catalog.

  • warehouse_location (str) – the location of the warehouse.

  • name (Optional[str]) – a descriptive name of the catalog; if omitted the catalog name is inferred from the catalog URI.

  • region_name (Optional[str]) – the S3 region name to use; If not provided, the default region will be picked by the AWS SDK from ‘aws.region’ system property, “AWS_REGION” environment variable, the {user.home}/.aws/credentials or {user.home}/.aws/config files, or from EC2 metadata service, if running in EC2.

  • access_key_id (Optional[str]) – the access key for reading files. Both access key and secret access key must be provided to use static credentials, else default credentials will be used.

  • secret_access_key (Optional[str]) – the secret access key for reading files. Both access key and secret key must be provided to use static credentials, else default credentials will be used.

  • end_point_override (Optional[str]) – the S3 endpoint to connect to. Callers connecting to AWS do not typically need to set this; it is most useful when connecting to non-AWS, S3-compatible APIs.

Returns:

the catalog adapter for the provided S3 REST catalog.

Return type:

IcebergCatalogAdapter

Raises:

DHError – If unable to build the catalog adapter.