Class ParquetTools

java.lang.Object
io.deephaven.parquet.table.ParquetTools

public class ParquetTools extends Object
Tools for managing and manipulating tables on disk in parquet format.
  • Method Details

    • readTable

      public static Table readTable(@NotNull String source)
      Reads in a table from a single parquet file, metadata file, or directory with recognized layout. The source provided can be a local file path or a URI to be resolved.

      This method attempts to "do the right thing." It examines the source to determine if it's a single parquet file, a metadata file, or a directory. If it's a directory, it additionally tries to guess the layout to use. Unless a metadata file is supplied or discovered in the directory, the highest (by location key order) location found will be used to infer schema.

      Parameters:
      source - The path or URI of file or directory to examine
      Returns:
      The table read from source
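
      The source inspection described above can be pictured with a small standalone sketch. It is not the ParquetTools implementation: the class SourceKindSketch, its Kind enum, and the metadata file names "_metadata" and "_common_metadata" are assumptions made for illustration, and the real method additionally resolves URIs and guesses the directory layout.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration only; not part of ParquetTools. */
final class SourceKindSketch {
    enum Kind { SINGLE_FILE, METADATA_FILE, DIRECTORY, UNKNOWN }

    /** Classifies a local path the way the description above outlines. */
    static Kind classify(String source) {
        Path path = Paths.get(source);
        // A directory is reported as such; layout inference is out of scope here.
        if (Files.isDirectory(path)) {
            return Kind.DIRECTORY;
        }
        Path fileName = path.getFileName();
        if (fileName == null) {
            return Kind.UNKNOWN;
        }
        String name = fileName.toString();
        // Assumption: metadata files use the conventional Parquet names.
        if (name.equals("_metadata") || name.equals("_common_metadata")) {
            return Kind.METADATA_FILE;
        }
        // Assumption: single data files are recognized by their extension.
        if (name.endsWith(".parquet")) {
            return Kind.SINGLE_FILE;
        }
        return Kind.UNKNOWN;
    }
}
```

      The sketch only looks at names and the file system; it never opens the files, so it says nothing about how the schema is inferred.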
    • readTable

      public static Table readTable(@NotNull String source, @NotNull ParquetInstructions readInstructions)
      Reads in a table from a single parquet file, metadata file, or directory with recognized layout. The source provided can be a local file path or a URI to be resolved.

      If the ParquetInstructions.ParquetFileLayout is not provided in the instructions, this method attempts to "do the right thing." It examines the source to determine if it's a single parquet file, a metadata file, or a directory. If it's a directory, it additionally tries to guess the layout to use. Unless a metadata file is supplied or discovered in the directory, the highest (by location key order) location found will be used to infer schema.

      Parameters:
      source - The path or URI of file or directory to examine
      readInstructions - Instructions for customizations while reading
      Returns:
      The table read from source
    • writeTable

      public static void writeTable(@NotNull Table sourceTable, @NotNull String destination)
      Write a table to a file. Data indexes to write are determined by those present on sourceTable.
      Parameters:
      sourceTable - The table to write
      destination - The destination path or URI; the file name should end with the ".parquet" extension. If the path includes non-existing directories, they are created. If there is an error, any intermediate directories previously created are removed; note that this makes the method unsafe for concurrent use.
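
      The create-then-remove-on-error behavior noted for destination can be sketched with plain java.nio calls. This is an illustration under assumptions, not the ParquetTools code: CleanupOnFailureSketch, WriteAction, and the demo method are hypothetical names. It also shows why concurrent use is risky: a failed call removes directories it created, even if another writer was about to use them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; not part of ParquetTools. */
final class CleanupOnFailureSketch {
    interface WriteAction {
        void write(Path destination) throws IOException;
    }

    static void writeWithCleanup(Path destination, WriteAction action) throws IOException {
        // Walk upward from the parent, stacking missing directories (shallowest ends on top).
        Deque<Path> missing = new ArrayDeque<>();
        for (Path p = destination.toAbsolutePath().getParent();
                p != null && !Files.exists(p); p = p.getParent()) {
            missing.push(p);
        }
        Deque<Path> created = new ArrayDeque<>();
        try {
            while (!missing.isEmpty()) {
                Path dir = missing.pop(); // shallowest first
                Files.createDirectory(dir);
                created.push(dir); // deepest ends on top
            }
            action.write(destination);
        } catch (IOException | RuntimeException e) {
            // Remove the partial output, then only the directories this call created, deepest first.
            deleteQuietly(destination, e);
            while (!created.isEmpty()) {
                deleteQuietly(created.pop(), e);
            }
            throw e;
        }
    }

    private static void deleteQuietly(Path path, Exception cause) {
        try {
            Files.deleteIfExists(path);
        } catch (IOException cleanupFailure) {
            cause.addSuppressed(cleanupFailure);
        }
    }

    /** Returns true if a simulated failed write leaves none of the created directories behind. */
    static boolean demoFailureRemovesCreatedDirs() {
        try {
            Path base = Files.createTempDirectory("cleanup-sketch");
            Path dest = base.resolve("a").resolve("b").resolve("out.parquet");
            try {
                writeWithCleanup(dest, d -> {
                    throw new IOException("simulated write failure");
                });
                return false; // the write was expected to fail
            } catch (IOException expected) {
                return !Files.exists(base.resolve("a"));
            } finally {
                Files.deleteIfExists(base);
            }
        } catch (IOException e) {
            return false;
        }
    }
}
```

      Directories that existed before the call are never touched in this sketch; only the ones recorded as newly created are candidates for removal.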
    • writeTable

      public static void writeTable(@NotNull Table sourceTable, @NotNull String destination, @NotNull ParquetInstructions writeInstructions)
      Write a table to a file. Data indexes to write are determined by those present on sourceTable.
      Parameters:
      sourceTable - The table to write
      destination - The destination path or URI; the file name should end with the ".parquet" extension. If the path includes non-existing directories, they are created. If there is an error, any intermediate directories previously created are removed; note that this makes the method unsafe for concurrent use.
      writeInstructions - Instructions for customizations while writing
    • legacyGroupingFileName

      @VisibleForTesting public static String legacyGroupingFileName(@NotNull File tableDest, @NotNull String columnName)
      Legacy method for generating a grouping file name. We used to place grouping files right next to the original table destination.
      Parameters:
      tableDest - Destination path for the main table containing these grouping columns
      columnName - Name of the grouping column
      Returns:
      The relative grouping file path. For example, for a table with destination "table.parquet" and grouping column "GroupingColName", the method will return "table_GroupingColName_grouping.parquet".
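
      The naming rule in the example above can be reproduced with a few lines of string handling. The class and method below are hypothetical and only mirror the documented example; how the real method treats a destination without the ".parquet" extension is not stated here, so that branch is an assumption.

```java
import java.io.File;

/** Hypothetical re-creation of the documented naming rule; not the ParquetTools code. */
final class LegacyGroupingNameSketch {
    private static final String EXTENSION = ".parquet";

    static String groupingFileName(File tableDest, String columnName) {
        // Only the file name matters; the result is relative to the table's directory.
        String name = tableDest.getName();
        // Assumption: a name without the extension is used as the prefix unchanged.
        String prefix = name.endsWith(EXTENSION)
                ? name.substring(0, name.length() - EXTENSION.length())
                : name;
        return prefix + "_" + columnName + "_grouping" + EXTENSION;
    }
}
```

      Calling groupingFileName(new File("table.parquet"), "GroupingColName") in this sketch yields "table_GroupingColName_grouping.parquet", matching the documented example.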
    • writeKeyValuePartitionedTable

      public static void writeKeyValuePartitionedTable(@NotNull Table sourceTable, @NotNull String destinationDir, @NotNull ParquetInstructions writeInstructions)
      Write a table to disk in parquet format, with partitioning columns written in "key=value" format in a nested directory structure. To generate the individual partitions, this method calls partitionBy on all the partitioning columns of the provided table. The generated parquet files will have names of the format provided by ParquetInstructions.baseNameForPartitionedParquetData(). By default, any indexing columns present on the source table will be written as sidecar tables. To write only a subset of the indexes or add additional indexes while writing, use ParquetInstructions.Builder.addIndexColumns(java.lang.String...).
      Parameters:
      sourceTable - The table to partition and write
      destinationDir - The path or URI to destination root directory to store partitioned data in nested format. Non-existing directories are created.
      writeInstructions - Write instructions for customizations while writing
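
      To make the nested "key=value" layout concrete, here is a minimal sketch that builds the directory for one partition from its partitioning values. KeyValueLayoutSketch and the column names are made up for illustration; the actual writer may encode values differently, and the file names inside each directory come from ParquetInstructions.baseNameForPartitionedParquetData(), which this sketch does not model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of a nested key=value directory layout. */
final class KeyValueLayoutSketch {
    /**
     * Builds the directory for one partition; map iteration order determines
     * nesting order, so callers should pass an ordered map.
     */
    static String partitionDir(String rootDir, Map<String, String> partitionValues) {
        StringBuilder dir = new StringBuilder(rootDir);
        for (Map.Entry<String, String> entry : partitionValues.entrySet()) {
            // One directory level per partitioning column, named "key=value".
            dir.append('/').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return dir.toString();
    }

    /** Convenience example with two made-up partitioning columns. */
    static String example() {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("Date", "2024-01-02");
        values.put("Sym", "AAPL");
        return partitionDir("/data/trades", values);
    }
}
```

      In this sketch, example() produces "/data/trades/Date=2024-01-02/Sym=AAPL"; each distinct combination of partitioning values gets its own leaf directory.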
    • writeKeyValuePartitionedTable

      public static void writeKeyValuePartitionedTable(@NotNull PartitionedTable partitionedTable, @NotNull String destinationDir, @NotNull ParquetInstructions writeInstructions)
      Write a partitioned table to disk in parquet format, with all the key columns written in "key=value" format in a nested directory structure. To generate the partitioned table, users can call partitionBy on the required columns. The generated parquet files will have names of the format provided by ParquetInstructions.baseNameForPartitionedParquetData(). By default, this method does not write any indexes as sidecar tables to disk. To write such indexes, use ParquetInstructions.Builder.addIndexColumns(java.lang.String...).
      Parameters:
      partitionedTable - The partitioned table to write
      destinationDir - The path or URI to destination root directory to store partitioned data in nested format. Non-existing directories are created.
      writeInstructions - Write instructions for customizations while writing
    • writeTables

      public static void writeTables(@NotNull Table[] sources, @NotNull String[] destinations, @NotNull ParquetInstructions writeInstructions)
      Write out tables to disk. Data indexes to write are determined by those already present on the first source or those provided through ParquetInstructions.Builder.addIndexColumns(java.lang.String...). If all source tables have the same definition, this method will use the common definition for writing. Otherwise, a definition must be provided through the writeInstructions.
      Parameters:
      sources - The tables to write
      destinations - The destination paths or URIs. Any non-existing directories in the paths provided are created. If there is an error, any intermediate directories previously created are removed; note that this makes the method unsafe for concurrent use.
      writeInstructions - Write instructions for customizations while writing
    • deleteTable

      @VisibleForTesting public static void deleteTable(String path)
      Deletes a table on disk.
      Parameters:
      path - The path of the table to delete
    • readTable

      public static Table readTable(@NotNull TableLocationKeyFinder<ParquetTableLocationKey> locationKeyFinder, @NotNull ParquetInstructions readInstructions)
      Reads in a table from files discovered with locationKeyFinder using a definition either provided using ParquetInstructions or built from the highest (by location key order) location found, which must have non-null partition values for all partition keys.

      Callers may prefer the simpler method readTable(String, ParquetInstructions), with the layout provided using ParquetInstructions.Builder.setFileLayout(io.deephaven.parquet.table.ParquetInstructions.ParquetFileLayout).

      Parameters:
      locationKeyFinder - The source of location keys to include
      readInstructions - Instructions for customizations while reading
      Returns:
      The table
    • readParquetSchemaAndTable

      @VisibleForTesting public static Table readParquetSchemaAndTable(@NotNull File source, @NotNull ParquetInstructions readInstructionsIn, @Nullable org.apache.commons.lang3.mutable.MutableObject<ParquetInstructions> mutableInstructionsOut)