deephaven.parquet

This module supports reading external Parquet files into Deephaven tables and writing Deephaven tables out as Parquet files.

class ColumnInstruction(column_name=None, parquet_column_name=None, codec_name=None, codec_args=None, use_dictionary=False)[source]

Bases: object

This class specifies the instructions for reading/writing a Parquet column.

class ParquetFileLayout(value)[source]

Bases: Enum

The parquet file layout.

FLAT_PARTITIONED = 2

A single directory of parquet files.

KV_PARTITIONED = 3

A hierarchically partitioned directory layout of parquet files. Directory names are of the format “key=value” with keys derived from the partitioning columns.

METADATA_PARTITIONED = 4

This layout can describe any of the following:

  • A directory containing a METADATA_FILE_NAME parquet file and an optional COMMON_METADATA_FILE_NAME parquet file

  • A single parquet METADATA_FILE_NAME file

  • A single parquet COMMON_METADATA_FILE_NAME file

SINGLE_FILE = 1

A single parquet file.
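
The “key=value” directory convention used by KV_PARTITIONED can be illustrated with a small standalone sketch. The helper name partition_values and the sample path are hypothetical, and this is not how Deephaven discovers partitions internally; it only shows how partition values can be recovered from such a path.

```python
from pathlib import PurePosixPath


def partition_values(relative_path):
    """Return {key: value} parsed from 'key=value' directory names in a path."""
    values = {}
    # Every component except the last one (the file name) is a directory.
    for part in PurePosixPath(relative_path).parts[:-1]:
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


print(partition_values("date=2021-01-01/sym=AAPL/data.parquet"))
```

A path with no “key=value” directories, such as a file in a FLAT_PARTITIONED directory, yields an empty dictionary.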

batch_write(tables, paths, table_definition=None, col_instructions=None, compression_codec_name=None, max_dictionary_keys=None, max_dictionary_size=None, target_page_size=None, generate_metadata_files=None, index_columns=None)[source]

Writes tables to disk in parquet format to a supplied set of paths.

Note that either all the tables are written out successfully or none is.

Parameters:
  • tables (List[Table]) – the source tables

  • paths (List[str]) – the destination paths. Any non-existing directories in the paths provided are created. If there is an error, any intermediate directories previously created are removed; note this makes this method unsafe for concurrent use

  • table_definition (Optional[Union[Dict[str, DType], List[Column]]]) – the table definition to use for writing. This definition can be used to skip some columns or add additional columns with null values. Default is None, which means if all tables have the same definition, use the common table definition implied by the tables. Otherwise, this parameter must be specified.

  • col_instructions (Optional[List[ColumnInstruction]]) – instructions for customizations while writing

  • compression_codec_name (Optional[str]) – the compression codec to use. Allowed values include “UNCOMPRESSED”, “SNAPPY”, “GZIP”, “LZO”, “LZ4”, “LZ4_RAW”, “ZSTD”, etc. If not specified, defaults to “SNAPPY”.

  • max_dictionary_keys (Optional[int]) – the maximum number of unique keys the writer should add to a dictionary page before switching to non-dictionary encoding. Never evaluated for non-String columns. Defaults to 2^20 (1,048,576).

  • max_dictionary_size (Optional[int]) – the maximum number of bytes the writer should add to the dictionary before switching to non-dictionary encoding. Never evaluated for non-String columns. Defaults to 2^20 (1,048,576).

  • target_page_size (Optional[int]) – the target page size in bytes. If not specified, defaults to 2^20 bytes (1 MiB).

  • generate_metadata_files (Optional[bool]) – whether to generate parquet _metadata and _common_metadata files, defaults to False. Generating these files can help speed up reading of partitioned parquet data because these files contain metadata (including schema) about the entire dataset, which can be used to skip reading some files.

  • index_columns (Optional[Sequence[Sequence[str]]]) – a sequence of sequences containing the column names for indexes to persist. The write operation will store the index info for the provided columns as sidecar tables. For example, if the input is [[“Col1”], [“Col1”, “Col2”]], the write operation will store the index info for [“Col1”] and for [“Col1”, “Col2”]. By default, data indexes to write are determined by those present on the source table. This argument can be used to narrow the set of indexes to write, or to be explicit about the expected set of indexes present on all sources. Indexes that are specified but missing will be computed on demand.

Raises:

DHError
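
A minimal sketch of a batch_write call, assuming a running Deephaven session. The helpers destination_paths and write_all, and the choice of “ZSTD”, are illustrative assumptions. Only the batch_write parameters come from the signature above.

```python
def destination_paths(root_dir, names):
    """Build one '.parquet' destination path per table name (hypothetical helper)."""
    root = root_dir.rstrip("/")
    return [f"{root}/{name}.parquet" for name in names]


def write_all(tables_by_name, root_dir):
    """Write every table in a single all-or-nothing call."""
    # Imported lazily: deephaven.parquet is only importable inside a Deephaven session.
    from deephaven.parquet import batch_write

    names = list(tables_by_name)
    batch_write(
        tables=[tables_by_name[name] for name in names],
        paths=destination_paths(root_dir, names),
        compression_codec_name="ZSTD",
    )


print(destination_paths("/data/out/", ["trades", "quotes"]))
```

If the tables do not all share one definition, table_definition must also be passed, as described in the parameter list.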

delete(path)[source]

Deletes a Parquet table on disk.

Parameters:

path (str) – path to delete

Raises:

DHError

Return type:

None

read(path, col_instructions=None, is_legacy_parquet=False, is_refreshing=False, file_layout=None, table_definition=None, special_instructions=None)[source]

Reads in a table from a single parquet file, a metadata file, or a directory with a recognized layout.

Parameters:
  • path (str) – the file or directory to examine

  • col_instructions (Optional[List[ColumnInstruction]]) – instructions for customizations while reading particular columns, default is None, which means no specialization for any column

  • is_legacy_parquet (bool) – if the parquet data is legacy

  • is_refreshing (bool) – if the parquet data represents a refreshing source

  • file_layout (Optional[ParquetFileLayout]) – the parquet file layout, by default None. When None, the layout is inferred.

  • table_definition (Union[Dict[str, DType], List[Column], None]) – the table definition, by default None. When None, the definition is inferred from the parquet file(s). Setting a definition guarantees the returned table will have that definition. This is useful for bootstrapping purposes when the initially partitioned directory is empty and is_refreshing=True. It is also useful for specifying a subset of the parquet definition.

  • special_instructions (Optional[s3.S3Instructions]) – Special instructions for reading parquet files, useful when reading files from a non-local file system, like S3. By default, None.

Return type:

Table

Returns:

a table

Raises:

DHError
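
A minimal sketch of reading a subset of columns from a single file, assuming a running Deephaven session. The function name read_subset and the column names “sym” and “price” are hypothetical. The file is assumed to contain columns with those names and compatible types.

```python
def read_subset(path):
    """Read only two columns of a single parquet file by passing a table definition."""
    # Imported lazily: these modules are only importable inside a Deephaven session.
    from deephaven import dtypes
    from deephaven.parquet import ParquetFileLayout, read

    # Hypothetical subset of the file's columns.
    definition = {
        "sym": dtypes.string,
        "price": dtypes.double,
    }
    return read(
        path,
        file_layout=ParquetFileLayout.SINGLE_FILE,
        table_definition=definition,
    )
```

Passing file_layout explicitly skips layout inference, and the returned table has exactly the columns named in the definition.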

write(table, path, table_definition=None, col_instructions=None, compression_codec_name=None, max_dictionary_keys=None, max_dictionary_size=None, target_page_size=None, generate_metadata_files=None, index_columns=None)[source]

Write a table to a Parquet file.

Parameters:
  • table (Table) – the source table

  • path (str) – the destination file path; the file name should end in a “.parquet” extension. If the path includes any non-existing directories, they are created. If there is an error, any intermediate directories previously created are removed; note this makes this method unsafe for concurrent use

  • table_definition (Optional[Union[Dict[str, DType], List[Column]]]) – the table definition to use for writing, instead of the definitions implied by the table. Default is None, which means use the column definitions implied by the table. This definition can be used to skip some columns or add additional columns with null values.

  • col_instructions (Optional[List[ColumnInstruction]]) – instructions for customizations while writing particular columns, default is None, which means no specialization for any column

  • compression_codec_name (Optional[str]) – the compression codec to use. Allowed values include “UNCOMPRESSED”, “SNAPPY”, “GZIP”, “LZO”, “LZ4”, “LZ4_RAW”, “ZSTD”, etc. If not specified, defaults to “SNAPPY”.

  • max_dictionary_keys (Optional[int]) – the maximum number of unique keys the writer should add to a dictionary page before switching to non-dictionary encoding. Never evaluated for non-String columns. Defaults to 2^20 (1,048,576).

  • max_dictionary_size (Optional[int]) – the maximum number of bytes the writer should add to the dictionary before switching to non-dictionary encoding. Never evaluated for non-String columns. Defaults to 2^20 (1,048,576).

  • target_page_size (Optional[int]) – the target page size in bytes. If not specified, defaults to 2^20 bytes (1 MiB).

  • generate_metadata_files (Optional[bool]) – whether to generate parquet _metadata and _common_metadata files, defaults to False. Generating these files can help speed up reading of partitioned parquet data because these files contain metadata (including schema) about the entire dataset, which can be used to skip reading some files.

  • index_columns (Optional[Sequence[Sequence[str]]]) – a sequence of sequences containing the column names for indexes to persist. The write operation will store the index info for the provided columns as sidecar tables. For example, if the input is [[“Col1”], [“Col1”, “Col2”]], the write operation will store the index info for [“Col1”] and for [“Col1”, “Col2”]. By default, data indexes to write are determined by those present on the source table. This argument can be used to narrow the set of indexes to write, or to be explicit about the expected set of indexes present on all sources. Indexes that are specified but missing will be computed on demand.

Raises:

DHError

Return type:

None
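
A minimal sketch of a write call with a non-default codec and a per-column instruction, assuming a running Deephaven session. The helper names, the “GZIP” choice, and the reading of use_dictionary=True as a request for dictionary encoding on that column are assumptions. The reference above does not describe that flag in detail.

```python
def with_parquet_suffix(path):
    """Return the path unchanged if it ends in '.parquet', else append the extension."""
    return path if path.endswith(".parquet") else path + ".parquet"


def write_compressed(table, path, dictionary_column):
    """Write one table with GZIP compression and a column instruction."""
    # Imported lazily: deephaven.parquet is only importable inside a Deephaven session.
    from deephaven.parquet import ColumnInstruction, write

    write(
        table,
        with_parquet_suffix(path),
        col_instructions=[
            ColumnInstruction(column_name=dictionary_column, use_dictionary=True)
        ],
        compression_codec_name="GZIP",
        max_dictionary_keys=2**20,  # same value as the documented default
        target_page_size=2**20,  # 1 MiB, same value as the documented default
    )


print(with_parquet_suffix("/data/out/trades"))
```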

write_partitioned(table, destination_dir, table_definition=None, col_instructions=None, compression_codec_name=None, max_dictionary_keys=None, max_dictionary_size=None, target_page_size=None, base_name=None, generate_metadata_files=None, index_columns=None)[source]

Writes a table to disk in parquet format, with the partitioning columns written as “key=value” directory names in a nested directory structure. For example, for a partitioning column “date”, the directory structure looks like “date=2021-01-01/<base_name>.parquet”, “date=2021-01-02/<base_name>.parquet”, etc., where “2021-01-01” and “2021-01-02” are the partition values and “<base_name>” is passed as an optional parameter. All the necessary subdirectories are created if they do not exist.

Parameters:
  • table (Table) – the source table or partitioned table

  • destination_dir (str) – the path to the destination root directory in which the partitioned parquet data will be stored in a nested directory structure. Non-existing directories in the provided path will be created.

  • table_definition (Optional[Union[Dict[str, DType], List[Column]]]) – the table definition to use for writing, instead of the definitions implied by the table. Default is None, which means use the column definitions implied by the table. This definition can be used to skip some columns or add additional columns with null values.

  • col_instructions (Optional[List[ColumnInstruction]]) – instructions for customizations while writing particular columns, default is None, which means no specialization for any column

  • compression_codec_name (Optional[str]) – the compression codec to use. Allowed values include “UNCOMPRESSED”, “SNAPPY”, “GZIP”, “LZO”, “LZ4”, “LZ4_RAW”, “ZSTD”, etc. If not specified, defaults to “SNAPPY”.

  • max_dictionary_keys (Optional[int]) – the maximum number of unique keys the writer should add to a dictionary page before switching to non-dictionary encoding. Never evaluated for non-String columns. Defaults to 2^20 (1,048,576).

  • max_dictionary_size (Optional[int]) – the maximum number of bytes the writer should add to the dictionary before switching to non-dictionary encoding. Never evaluated for non-String columns. Defaults to 2^20 (1,048,576).

  • target_page_size (Optional[int]) – the target page size in bytes. If not specified, defaults to 2^20 bytes (1 MiB).

  • base_name (Optional[str]) – the base name for the individual partitioned tables. If not specified, defaults to {uuid}, so files will have names of the format <uuid>.parquet, where uuid is a randomly generated UUID. The following tokens can be used in base_name:

      - The token {uuid} will be replaced with a random UUID. For example, a base name of “table-{uuid}” will result in files named like “table-8e8ab6b2-62f2-40d1-8191-1c5b70c5f330.parquet”.

      - The token {partitions} will be replaced with an underscore-delimited, concatenated string of partition values. For example, for a base name of “{partitions}-table” and partitioning columns “PC1” and “PC2”, the file name is generated by concatenating the partition values “PC1=pc1” and “PC2=pc2” with an underscore followed by “-table.parquet”, like “PC1=pc1_PC2=pc2-table.parquet”.

      - The token {i} will be replaced with an automatically incremented integer for files in a directory. For example, a base name of “table-{i}” will result in files named like “PC=partition1/table-0.parquet”, “PC=partition1/table-1.parquet”, etc.

  • generate_metadata_files (Optional[bool]) – whether to generate parquet _metadata and _common_metadata files, defaults to False. Generating these files can help speed up reading of partitioned parquet data because these files contain metadata (including schema) about the entire dataset, which can be used to skip reading some files.

  • index_columns (Optional[Sequence[Sequence[str]]]) – a sequence of sequences containing the column names for indexes to persist. The write operation will store the index info for the provided columns as sidecar tables. For example, if the input is [[“Col1”], [“Col1”, “Col2”]], the write operation will store the index info for [“Col1”] and for [“Col1”, “Col2”]. By default, data indexes to write are determined by those present on the source table. This argument can be used to narrow the set of indexes to write, or to be explicit about the expected set of indexes present on all sources. Indexes that are specified but missing will be computed on demand.

Raises:

DHError

Return type:

None
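
The base_name tokens described above can be illustrated with a standalone sketch. This is not Deephaven's implementation. The helpers expand_base_name and partition_dir are hypothetical and only mimic the documented naming rules, so the resulting file names can be reasoned about without writing any data.

```python
import uuid


def expand_base_name(base_name, partitions, index):
    """Expand {uuid}, {partitions}, and {i} in base_name and append '.parquet'.

    partitions: ordered (column, value) pairs for the file's partition
    index: counter for files written into the same directory
    """
    partition_text = "_".join(f"{col}={val}" for col, val in partitions)
    name = (
        base_name.replace("{uuid}", str(uuid.uuid4()))
        .replace("{partitions}", partition_text)
        .replace("{i}", str(index))
    )
    return name + ".parquet"


def partition_dir(partitions):
    """Build the nested 'key=value' directory path for a partition."""
    return "/".join(f"{col}={val}" for col, val in partitions)


parts = [("PC1", "pc1"), ("PC2", "pc2")]
print(partition_dir(parts) + "/" + expand_base_name("{partitions}-table", parts, 0))
```

With base_name “table-{i}” and a single partitioning column “PC”, the same helpers produce paths like “PC=partition1/table-0.parquet”, matching the pattern in the parameter description.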