Parquet instructions
Detailed instructions for reading Parquet files into Deephaven tables are passed to readTable as an instance of the ParquetInstructions class. This class specifies the layout of the Parquet files, the table definition, and any special instructions for reading the files.
ParquetInstructions
A ParquetInstructions instance is created with the ParquetInstructions.builder() method, which returns a ParquetInstructions.Builder instance. Instructions are specified by calling the builder's methods, then calling build() to create the ParquetInstructions instance. For example, to declare that the Parquet data is laid out as a single file, use the following code:
import io.deephaven.parquet.table.ParquetInstructions
import io.deephaven.parquet.table.ParquetTools
import io.deephaven.parquet.table.ParquetInstructions.ParquetFileLayout
// create ParquetInstructions instance with a single-file layout
instructionsInstance = ParquetInstructions.builder().setFileLayout(ParquetFileLayout.valueOf("SINGLE_FILE")).build()
// pass instructionsInstance to readTable
taxi = ParquetTools.readTable("/data/examples/Taxi/parquet/taxi.parquet", instructionsInstance)
taxi
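Other layouts are declared the same way. The following sketch reads a key-value partitioned directory of Parquet files by passing the KV_PARTITIONED layout; the directory path is hypothetical:

import io.deephaven.parquet.table.ParquetInstructions
import io.deephaven.parquet.table.ParquetTools
import io.deephaven.parquet.table.ParquetInstructions.ParquetFileLayout

// create ParquetInstructions instance with a key-value partitioned layout
kvInstructions = ParquetInstructions.builder().setFileLayout(ParquetFileLayout.valueOf("KV_PARTITIONED")).build()

// read the partitioned directory into a table (the path is illustrative only)
partitionedTaxi = ParquetTools.readTable("/data/examples/Taxi/parquet/partitioned", kvInstructions)

partitionedTaxi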
ParquetInstructions methods
The ParquetInstructions
class has the following methods:
- baseNameForPartitionedParquetData(): Returns the base name for partitioned Parquet data. Can be set with Builder.setBaseNameForPartitionedParquetData.
- builder(): Returns a new ParquetInstructions.Builder instance.
- generateMetadataFiles(): Returns a boolean indicating whether the ParquetInstructions instance is set to generate "_metadata" and "_common_metadata" files while writing Parquet files.
- getCodecArgs(columnName): Returns the codec arguments for the specified column.
- getCodecName(columnName): Returns the codec name for the specified column.
- getColumnNameFromParquetColumnName(parquetColumnName): Returns the column name in the Deephaven table corresponding to the specified Parquet column name.
- getColumnNameFromParquetColumnNameOrDefault(parquetColumnName): Returns the column name in the Deephaven table corresponding to the specified Parquet column name, or the Parquet column name if no mapping exists.
- getCompressionCodecName(): Returns the compression codec name.
- getDefaultCompressionCodecName(): Returns the default compression codec name.
- getDefaultMaximumDictionaryKeys(): Returns the default maximum dictionary keys.
- getDefaultMaximumDictionarySize(): Returns the default maximum dictionary size.
- getDefaultTargetPageSize(): Returns the default target page size.
- getFileLayout(): Returns the Parquet file layout.
- getIndexColumns(): Returns the index columns.
- getMaximumDictionaryKeys(): Returns the maximum dictionary keys.
- getMaximumDictionarySize(): Returns the maximum dictionary size.
- getParquetColumnNameFromColumnNameOrDefault(columnName): Returns the Parquet column name corresponding to the specified column name, or the column name if no mapping exists.
- getSpecialInstructions(): Returns the special instructions set for this ParquetInstructions instance.
- getTableDefinition(): Returns the table definition.
- getTargetPageSize(): Returns the target page size.
- isLegacyParquet(): Returns a boolean indicating whether the Parquet data is in legacy format.
- isRefreshing(): Returns a boolean indicating whether the Parquet data represents a refreshing source.
- sameColumnNamesAndCodecMappings(i1, i2): Returns a boolean indicating whether the two ParquetInstructions instances have the same column names and codec mappings.
- setDefaultMaximumDictionaryKeys(maximumDictionaryKeys): Sets the default maximum dictionary keys.
- setDefaultMaximumDictionarySize(maximumDictionarySize): Sets the default maximum dictionary size.
- setDefaultTargetPageSize(newDefaultSizeBytes): Sets the default target page size.
- useDictionary(columnName): Returns a boolean indicating whether the specified column uses dictionary encoding.
- withLayout(fileLayout): Returns a new ParquetInstructions instance with the supplied ParquetFileLayout.
- withTableDefinition(tableDefinition): Returns a new ParquetInstructions instance with the supplied table definition.
- withTableDefinitionAndLayout(tableDefinition, fileLayout): Returns a new ParquetInstructions instance with the supplied table definition and ParquetFileLayout.
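The with* methods return new, modified copies, and the getters can be used to inspect an instance. The following is a minimal sketch, assuming the standard TableDefinition and ColumnDefinition classes from io.deephaven.engine.table; the column names are hypothetical:

import io.deephaven.parquet.table.ParquetInstructions
import io.deephaven.parquet.table.ParquetInstructions.ParquetFileLayout
import io.deephaven.engine.table.TableDefinition
import io.deephaven.engine.table.ColumnDefinition

// start from an empty instructions instance
baseInstructions = ParquetInstructions.builder().build()

// hypothetical table definition with two columns
taxiDef = TableDefinition.of(
    ColumnDefinition.ofString("VendorID"),
    ColumnDefinition.ofDouble("FareAmount")
)

// withTableDefinitionAndLayout returns a new instance; baseInstructions is unchanged
kvInstructions = baseInstructions.withTableDefinitionAndLayout(taxiDef, ParquetFileLayout.valueOf("KV_PARTITIONED"))

// inspect the result with the getters
println kvInstructions.getFileLayout()
println kvInstructions.getTableDefinition()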
ParquetInstructions.Builder methods
The ParquetInstructions.Builder
class has the following methods:
- addAllIndexColumns(indexColumns): Adds provided lists of columns to persist together as indexes. This method accepts an Iterable of lists, where each list represents a group of columns to be indexed together. The write operation stores the index info as sidecar tables. This argument is used to narrow the set of indexes to write, or to be explicit about the expected set of indexes present on all sources. Indexes that are specified but missing will be computed on demand. To prevent the generation of index files, provide an empty iterable.
- addColumnCodec(columnName, codecName): Adds a column codec mapping between the provided column name and codec name.
- addColumnNameMapping(parquetColumnName, columnName): Adds a column name mapping between the provided Parquet column name and Deephaven column name.
- addIndexColumns(indexColumns...): Adds a list of columns to persist together as indexes. The write operation stores the index info as sidecar tables. This argument is used to narrow the set of indexes to write, or to be explicit about the expected set of indexes present on all sources. Indexes that are specified but missing will be computed on demand.
- build(): Builds the ParquetInstructions instance.
- getTakenNames(): Returns the set of column names that have already been taken.
- setBaseNameForPartitionedParquetData(baseNameForPartitionedParquetData): Sets the base name for partitioned Parquet data.
- setCompressionCodecName(compressionCodecName): Sets the name of the compression codec to use. This defines the particular type of compression used for the given column and can have significant implications for the speed of the import (see the sketch after this list). The options are:
  - SNAPPY: (default) Aims for high speed and a reasonable amount of compression. Based on Google's Snappy compression format.
  - UNCOMPRESSED: The output will not be compressed.
  - LZ4_RAW: A codec based on the LZ4 block format. Should always be used instead of LZ4.
  - LZO: Compression codec based on or interoperable with the LZO compression library.
  - GZIP: Compression codec based on the GZIP format (not the closely-related "zlib" or "deflate" formats) defined by RFC 1952.
  - ZSTD: Compression codec with the highest compression ratio, based on the Zstandard format defined by RFC 8478.
  - LZ4: Deprecated. Compression codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp. Note that LZ4 is deprecated; use LZ4_RAW instead.
- setFileLayout(fileLayout): Sets the Parquet file layout. Use with ParquetFileLayout.valueOf(<Enum>). If this method is not called, the layout is inferred. The enums are:
  - "SINGLE_FILE": A single Parquet file.
  - "FLAT_PARTITIONED": A single directory of Parquet files with no nested subdirectories.
  - "KV_PARTITIONED": A directory of Parquet files partitioned by key-value pairs.
  - "METADATA_PARTITIONED": This layout can be used to describe either a single Parquet "_metadata" or "_common_metadata" file, or a directory containing a "_metadata" file and an optional "_common_metadata" file.
- setGenerateMetadataFiles(generateMetadataFiles): Sets whether to generate "_metadata" and "_common_metadata" files while writing Parquet files.
- setIsLegacyParquet(isLegacyParquet): Sets whether the Parquet data is in legacy format.
- setIsRefreshing(isRefreshing): Sets whether the Parquet data represents a refreshing source.
- setMaximumDictionaryKeys(maximumDictionaryKeys): Sets the maximum dictionary keys.
- setMaximumDictionarySize(maximumDictionarySize): Sets the maximum number of bytes the writer should add to the dictionary before switching to non-dictionary encoding. Never evaluated for non-String columns; ignored if useDictionary is set for the column.
- setSpecialInstructions(specialInstructions): Sets special instructions for reading Parquet files, useful when reading files from a non-local S3 server. These instructions are provided as an instance of S3Instructions (described below).
- setTableDefinition(tableDefinition): Sets the table definition.
- setTargetPageSize(targetPageSize): Sets the target page size.
- useDictionary(columnName, useDictionary): Sets a hint that the writer should use dictionary-based encoding for writing this column; never evaluated for non-String columns.
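Putting several builder methods together: the sketch below writes a table with ZSTD compression and an index on one column, then reads it back while renaming a column. The table contents, file path, and column names are hypothetical, and it assumes that ParquetTools.writeTable accepts a table, a destination path, and a ParquetInstructions instance, and that emptyTable is available in the Groovy console:

import io.deephaven.parquet.table.ParquetInstructions
import io.deephaven.parquet.table.ParquetTools

// hypothetical in-memory table to write
source = emptyTable(10).update("Symbol = (i % 2 == 0) ? `AAPL` : `MSFT`", "Price = i * 1.5")

// write instructions: ZSTD compression plus an index on the Symbol column
writeInstructions = ParquetInstructions.builder()
    .setCompressionCodecName("ZSTD")
    .addIndexColumns("Symbol")
    .build()

ParquetTools.writeTable(source, "/tmp/prices.parquet", writeInstructions)

// read instructions: map the Parquet column "Symbol" to the Deephaven column "Ticker"
readInstructions = ParquetInstructions.builder()
    .addColumnNameMapping("Symbol", "Ticker")
    .build()

prices = ParquetTools.readTable("/tmp/prices.parquet", readInstructions)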
S3Instructions methods
The S3Instructions
class has the following methods:
- append(LogOutput): Appends a representation of this instance to the supplied LogOutput.
- builder(): Returns a new S3Instructions.Builder instance.
- connectionTimeout(): A Duration representing the amount of time to wait for a successful S3 connection before timing out. The default is 2 seconds.
- credentials(): The Credentials to use for reading files. Options are:
  - Credentials.anonymous(): Use anonymous credentials.
  - Credentials.basic(accessKeyId, secretAccessKey): Use basic credentials with the specified access key ID and secret access key.
  - Credentials.defaultCredentials(): Use the default credentials.
- endpointOverride(): The endpoint to connect to. Callers connecting to AWS do not typically need to set this; it is most useful when connecting to non-AWS, S3-compatible APIs. The default is None.
- fragmentSize(): The maximum byte size of each fragment to read from S3. Defaults to 65536; must be larger than 8192.
- maxConcurrentRequests(): The maximum number of concurrent requests to make to S3. Defaults to 256.
- numConcurrentWriterParts(): The maximum number of parts that can be uploaded concurrently when writing to S3 without blocking.
- readAheadCount(): The number of fragments asynchronously read ahead of the current fragment as the current fragment is being read. The default is 1.
- readTimeout(): The amount of time to wait before timing out when reading a fragment. The default is 2 seconds.
- regionName(): The region name of the AWS S3 bucket where the Parquet data exists.
- writePartSize(): The size of each part (in bytes) to upload when writing to S3. The default is 10485760.
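To use these settings when reading Parquet data from S3, pass an S3Instructions instance to the builder's setSpecialInstructions method. The following is a sketch, assuming the S3Instructions and Credentials classes live in io.deephaven.extensions.s3 and that the builder's property methods share names with the getters listed above; the bucket, region, and URI are hypothetical:

import io.deephaven.extensions.s3.S3Instructions
import io.deephaven.extensions.s3.Credentials
import io.deephaven.parquet.table.ParquetInstructions
import io.deephaven.parquet.table.ParquetTools

// S3 connection details (hypothetical bucket and region, anonymous access)
s3Instructions = S3Instructions.builder()
    .regionName("us-east-1")
    .credentials(Credentials.anonymous())
    .build()

// attach the S3Instructions as the special instructions for the Parquet read
readInstructions = ParquetInstructions.builder()
    .setSpecialInstructions(s3Instructions)
    .build()

// read a Parquet file directly from S3 (the URI is illustrative only)
taxiFromS3 = ParquetTools.readTable("s3://example-bucket/taxi.parquet", readInstructions)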