deephaven.parquet

This module supports reading external Parquet files into Deephaven tables and writing Deephaven tables out as Parquet files.
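
A minimal round-trip sketch, assuming a running Deephaven session; the /tmp path is hypothetical:

    from deephaven import new_table
    from deephaven.column import int_col, string_col
    from deephaven.parquet import read, write

    # Build a small in-memory table, write it out, then read it back.
    prices = new_table([
        string_col("Sym", ["AAPL", "MSFT"]),
        int_col("Qty", [100, 200]),
    ])
    write(prices, "/tmp/prices.parquet")
    prices_from_disk = read("/tmp/prices.parquet")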

class ColumnInstruction(column_name=None, parquet_column_name=None, codec_name=None, codec_args=None, use_dictionary=False)

Bases: object

This class specifies the instructions for reading/writing a Parquet column.
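
For example, a sketch that maps a table column to a differently named parquet column and dictionary-encodes it on write; the column names are hypothetical:

    from deephaven.parquet import ColumnInstruction

    # "Symbol" in the Deephaven table is stored as "sym" in the parquet
    # file, using dictionary encoding.
    ci = ColumnInstruction(column_name="Symbol",
                           parquet_column_name="sym",
                           use_dictionary=True)

The instruction is then passed to read or write via the col_instructions parameter.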

class ParquetFileLayout(value)

Bases: Enum

The parquet file layout.

FLAT_PARTITIONED = 2

A single directory of parquet files.

KV_PARTITIONED = 3

A key-value directory partitioning of parquet files.

METADATA_PARTITIONED = 4

A directory containing a _metadata parquet file and an optional _common_metadata parquet file.

SINGLE_FILE = 1

A single parquet file.
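
The layout is normally inferred from the path, but it can be forced when calling read; a sketch, with a hypothetical directory:

    from deephaven.parquet import read, ParquetFileLayout

    # Treat /data/prices as a key-value partitioned directory rather
    # than letting read() infer the layout.
    t = read("/data/prices", file_layout=ParquetFileLayout.KV_PARTITIONED)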

batch_write(tables, paths, col_definitions, col_instructions=None, compression_codec_name=None, max_dictionary_keys=None, max_dictionary_size=None, target_page_size=None, grouping_cols=None)

Writes tables to disk in parquet format to a supplied set of paths.

If you specify grouping columns, there must already be grouping information for those columns in the sources. This can be accomplished with .group_by(&lt;grouping columns&gt;).ungroup() or .sort(&lt;grouping column&gt;).

Note that either all the tables are written out successfully or none are.

Parameters:
  • tables (List[Table]) – the source tables

  • paths (List[str]) – the destination paths. Any non-existing directories in the paths provided are created. If there is an error, any intermediate directories previously created are removed; note that this makes this method unsafe for concurrent use

  • col_definitions (List[Column]) – the column definitions to use

  • col_instructions (Optional[List[ColumnInstruction]]) – instructions for customizations while writing

  • compression_codec_name (Optional[str]) – the compression codec to use; if not specified, defaults to SNAPPY

  • max_dictionary_keys (Optional[int]) – the maximum number of dictionary keys allowed; if not specified, defaults to 2^20 (1,048,576)

  • max_dictionary_size (Optional[int]) – the maximum dictionary size (in bytes) allowed; if not specified, defaults to 2^20 (1,048,576)

  • target_page_size (Optional[int]) – the target page size in bytes; if not specified, defaults to 2^20 bytes (1 MiB)

  • grouping_cols (Optional[List[str]]) – the names of the grouping columns

Raises:

DHError
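
A sketch of an all-or-nothing batch write, assuming Table.columns yields the shared column definitions; paths are hypothetical:

    from deephaven import new_table
    from deephaven.column import int_col, string_col
    from deephaven.parquet import batch_write

    t1 = new_table([string_col("Sym", ["A"]), int_col("Qty", [1])])
    t2 = new_table([string_col("Sym", ["B"]), int_col("Qty", [2])])

    # Both tables share one set of column definitions; either both
    # files are written or neither is.
    batch_write(
        tables=[t1, t2],
        paths=["/tmp/batch/t1.parquet", "/tmp/batch/t2.parquet"],
        col_definitions=t1.columns,
    )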

delete(path)

Deletes a Parquet table on disk.

Parameters:

path (str) – path to delete

Raises:

DHError

Return type:

None
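
For example, removing a previously written table; the path is hypothetical:

    from deephaven.parquet import delete

    # Remove the parquet table written earlier.
    delete("/tmp/prices.parquet")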

read(path, col_instructions=None, is_legacy_parquet=False, is_refreshing=False, file_layout=None, table_definition=None, special_instructions=None)

Reads in a table from a single parquet file, a metadata file, or a directory with a recognized layout.

Parameters:
  • path (str) – the file or directory to examine

  • col_instructions (Optional[List[ColumnInstruction]]) – instructions for customizations while reading, None by default.

  • is_legacy_parquet (bool) – if the parquet data is legacy

  • is_refreshing (bool) – if the parquet data represents a refreshing source

  • file_layout (Optional[ParquetFileLayout]) – the parquet file layout, by default None. When None, the layout is inferred.

  • table_definition (Union[Dict[str, DType], List[Column], None]) – the table definition, by default None. When None, the definition is inferred from the parquet file(s). Setting a definition guarantees the returned table will have that definition. This is useful for bootstrapping purposes when the initially partitioned directory is empty and is_refreshing=True. It is also useful for specifying a subset of the parquet definition. When set, file_layout must also be set.

  • special_instructions (Optional[s3.S3Instructions]) – Special instructions for reading parquet files, useful when reading files from a non-local file system, like S3. By default, None.

Return type:

Table

Returns:

a table

Raises:

DHError
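
A sketch of pinning the definition while reading a partitioned directory, assuming the deephaven.dtypes module; column names and the path are hypothetical:

    from deephaven import dtypes
    from deephaven.parquet import read, ParquetFileLayout

    # Guarantee the returned table's definition even if the directory
    # is empty; file_layout must be set when table_definition is set.
    t = read(
        "/data/prices",
        table_definition={"Sym": dtypes.string, "Qty": dtypes.int32},
        file_layout=ParquetFileLayout.KV_PARTITIONED,
    )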

write(table, path, col_definitions=None, col_instructions=None, compression_codec_name=None, max_dictionary_keys=None, max_dictionary_size=None, target_page_size=None)

Writes a table to a Parquet file.

Parameters:
  • table (Table) – the source table

  • path (str) – the destination file path; the file name should end in a “.parquet” extension. If the path includes non-existing directories, they are created. If there is an error, any intermediate directories previously created are removed; note that this makes this method unsafe for concurrent use

  • col_definitions (Optional[List[Column]]) – the column definitions to use, default is None

  • col_instructions (Optional[List[ColumnInstruction]]) – instructions for customizations while writing, default is None

  • compression_codec_name (Optional[str]) – the compression codec to use; if not specified, defaults to SNAPPY

  • max_dictionary_keys (Optional[int]) – the maximum number of dictionary keys allowed; if not specified, defaults to 2^20 (1,048,576)

  • max_dictionary_size (Optional[int]) – the maximum dictionary size (in bytes) allowed; if not specified, defaults to 2^20 (1,048,576)

  • target_page_size (Optional[int]) – the target page size in bytes; if not specified, defaults to 2^20 bytes (1 MiB)

Raises:

DHError

Return type:

None
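
A self-contained sketch that overrides the compression codec; the table and path are hypothetical:

    from deephaven import new_table
    from deephaven.column import string_col
    from deephaven.parquet import write

    t = new_table([string_col("Sym", ["AAPL", "MSFT"])])

    # Use GZIP instead of the SNAPPY default.
    write(t, "/tmp/prices_gzip.parquet", compression_codec_name="GZIP")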