deephaven.parquet#
This module supports reading external Parquet files into Deephaven tables and writing Deephaven tables out as Parquet files.
- class ColumnInstruction(column_name=None, parquet_column_name=None, codec_name=None, codec_args=None, use_dictionary=False)[source]#
Bases:
object
This class specifies the instructions for reading/writing a Parquet column.
- class ParquetFileLayout(value)[source]#
Bases:
Enum
The parquet file layout.
- FLAT_PARTITIONED = 2#
A single directory of parquet files.
- KV_PARTITIONED = 3#
A key-value directory partitioning of parquet files.
- METADATA_PARTITIONED = 4#
A directory containing a _metadata parquet file and an optional _common_metadata parquet file.
- SINGLE_FILE = 1#
A single parquet file.
- batch_write(tables, paths, col_definitions, col_instructions=None, compression_codec_name=None, max_dictionary_keys=None, max_dictionary_size=None, target_page_size=None, grouping_cols=None)[source]#
Writes tables to disk in parquet format to a supplied set of paths.
If you specify grouping columns, there must already be grouping information for those columns in the sources. This can be accomplished with .groupBy(<grouping columns>).ungroup() or .sort(<grouping column>).
Note that either all the tables are written out successfully or none are.
- Parameters:
tables (List[Table]) – the source tables
paths (List[str]) – the destination paths. Any non-existent directories in the paths provided are created. If there is an error, any intermediate directories previously created are removed; note that this makes this method unsafe for concurrent use
col_definitions (List[Column]) – the column definitions to use
col_instructions (Optional[List[ColumnInstruction]]) – instructions for customizations while writing
compression_codec_name (Optional[str]) – the compression codec to use; if not specified, defaults to SNAPPY
max_dictionary_keys (Optional[int]) – the maximum number of dictionary keys allowed; if not specified, defaults to 2^20 (1,048,576)
max_dictionary_size (Optional[int]) – the maximum dictionary size (in bytes) allowed; if not specified, defaults to 2^20 (1,048,576)
target_page_size (Optional[int]) – the target page size in bytes; if not specified, defaults to 2^20 bytes (1 MiB)
grouping_cols (Optional[List[str]]) – the group column names
- Raises:
DHError –
- delete(path)[source]#
Deletes a Parquet table on disk.
- Parameters:
path (str) – path to delete
- Raises:
DHError –
- Return type:
None
- read(path, col_instructions=None, is_legacy_parquet=False, is_refreshing=False, file_layout=None, table_definition=None, special_instructions=None)[source]#
Reads in a table from a single parquet file, a metadata file, or a directory with a recognized layout.
- Parameters:
path (str) – the file or directory to examine
col_instructions (Optional[List[ColumnInstruction]]) – instructions for customizations while reading, None by default.
is_legacy_parquet (bool) – if the parquet data is legacy
is_refreshing (bool) – if the parquet data represents a refreshing source
file_layout (Optional[ParquetFileLayout]) – the parquet file layout, by default None. When None, the layout is inferred.
table_definition (Union[Dict[str, DType], List[Column], None]) – the table definition, by default None. When None, the definition is inferred from the parquet file(s). Setting a definition guarantees the returned table will have that definition. This is useful for bootstrapping purposes when the initially partitioned directory is empty and is_refreshing=True. It is also useful for specifying a subset of the parquet definition. When set, file_layout must also be set.
special_instructions (Optional[s3.S3Instructions]) – Special instructions for reading parquet files, useful when reading files from a non-local file system, like S3. By default, None.
- Return type:
Table
- Returns:
a table
- Raises:
DHError –
- write(table, path, col_definitions=None, col_instructions=None, compression_codec_name=None, max_dictionary_keys=None, max_dictionary_size=None, target_page_size=None)[source]#
Writes a table to a Parquet file.
- Parameters:
table (Table) – the source table
path (str) – the destination file path; the file name should end in a “.parquet” extension. If the path includes non-existent directories, they are created. If there is an error, any intermediate directories previously created are removed; note that this makes this method unsafe for concurrent use
col_definitions (Optional[List[Column]]) – the column definitions to use, default is None
col_instructions (Optional[List[ColumnInstruction]]) – instructions for customizations while writing, default is None
compression_codec_name (Optional[str]) – the default compression codec to use; if not specified, defaults to SNAPPY
max_dictionary_keys (Optional[int]) – the maximum number of dictionary keys allowed; if not specified, defaults to 2^20 (1,048,576)
max_dictionary_size (Optional[int]) – the maximum dictionary size (in bytes) allowed; if not specified, defaults to 2^20 (1,048,576)
target_page_size (Optional[int]) – the target page size in bytes; if not specified, defaults to 2^20 bytes (1 MiB)
- Raises:
DHError –
- Return type:
None