Export Deephaven Tables to Parquet Files

The Deephaven Parquet module integrates Deephaven with the Parquet file format, making it easy to write Deephaven tables to Parquet files and directories. This document covers writing Deephaven tables to single Parquet files, flat partitioned Parquet directories, and key-value partitioned Parquet directories.

By default, Deephaven tables are written to Parquet files using SNAPPY compression. This default can be changed by building a ParquetInstructions with the ParquetInstructions.Builder.setCompressionCodecName method and passing it to any of the writing functions discussed here, or with the addColumnCodec method for column-specific compression. See the Parquet instructions document for more information.

Note

Much of this document covers writing Parquet files to S3. For the best performance, the Deephaven instance should be running in the same AWS region as the S3 bucket. Additional performance improvements can be made by using S3 directory buckets, which localize all data to a single Availability Zone, and running the Deephaven instance in that same zone. See this article for more information on S3 directory buckets. Take care to replace the S3 authentication details in the examples with the correct values for your S3 instance.

First, create some tables that will be used for the examples in this guide.
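For example, a small grades table can be built with TableTools on a running Deephaven instance. The column names and values below are hypothetical placeholders; any Deephaven table can be written the same way.

```java
import io.deephaven.engine.table.Table;
import static io.deephaven.engine.util.TableTools.*;

// Hypothetical example data used to illustrate the writing functions below.
Table grades = newTable(
    stringCol("Name", "Ada", "Ben", "Cara", "Dan"),
    stringCol("Class", "Math", "Math", "Science", "Science"),
    intCol("Test1", 92, 78, 87, 74),
    intCol("Test2", 94, 88, 81, 70)
);
```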

Write to a single Parquet file

To local storage

Write a Deephaven table to a single Parquet file with ParquetTools.writeTable. Supply the sourceTable argument with the Deephaven table to be written, and the destination argument with the destination file path for the resulting Parquet file. This file path should end with the .parquet file extension. An optional instructions argument can be provided to specify compression and other settings.
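As a minimal sketch (the table contents and destination path are illustrative), a single-file write looks like:

```java
import io.deephaven.engine.table.Table;
import io.deephaven.engine.util.TableTools;
import io.deephaven.parquet.table.ParquetTools;

Table source = TableTools.emptyTable(10).update("X = i", "Y = 2 * i");

// The destination must end with the .parquet extension.
ParquetTools.writeTable(source, "/data/source.parquet");
```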

Write _metadata and _common_metadata files by calling Builder.setGenerateMetadataFiles(true). Parquet metadata files can significantly speed up reads of very large datasets. If the data might be read again in the future, consider writing metadata files.
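A sketch of a write that also produces metadata files (table contents and paths are illustrative):

```java
import io.deephaven.engine.table.Table;
import io.deephaven.engine.util.TableTools;
import io.deephaven.parquet.table.ParquetInstructions;
import io.deephaven.parquet.table.ParquetTools;

Table source = TableTools.emptyTable(10).update("X = i");

ParquetInstructions instructions = ParquetInstructions.builder()
    .setGenerateMetadataFiles(true) // also write _metadata and _common_metadata
    .build();

ParquetTools.writeTable(source, "/data/source.parquet", instructions);
```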

To S3

Similarly, use ParquetTools.writeTable to write Deephaven tables to Parquet files on S3. The destination should be the URI of the destination file in S3. Supply an instance of the S3Instructions class via ParquetInstructions.Builder.setSpecialInstructions to specify the details of the connection to the S3 instance.
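A hedged sketch of an S3 write follows; the region, credentials, bucket, and table contents are all placeholders to replace with your own values:

```java
import io.deephaven.engine.table.Table;
import io.deephaven.engine.util.TableTools;
import io.deephaven.extensions.s3.Credentials;
import io.deephaven.extensions.s3.S3Instructions;
import io.deephaven.parquet.table.ParquetInstructions;
import io.deephaven.parquet.table.ParquetTools;

Table source = TableTools.emptyTable(10).update("X = i");

// Placeholder region and credentials.
S3Instructions s3 = S3Instructions.builder()
    .regionName("us-east-1")
    .credentials(Credentials.basic("my-access-key-id", "my-secret-access-key"))
    .build();

ParquetInstructions instructions = ParquetInstructions.builder()
    .setSpecialInstructions(s3)
    .build();

ParquetTools.writeTable(source, "s3://my-bucket/source.parquet", instructions);
```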

Partitioned Parquet directories

Deephaven supports writing tables to partitioned Parquet directories. A partitioned Parquet directory organizes data into subdirectories based on one or more partitioning columns. This structure allows for more efficient data querying by pruning irrelevant partitions, leading to faster read times than a single Parquet file. Deephaven tables can be written to flat partitioned directories or key-value partitioned directories.

Data can be written to partitioned directories from Deephaven tables or from Deephaven's partitioned tables. Partitioned tables have partitioning columns built into the API, so Deephaven can use those partitioning columns to create partitioned directories. Regular Deephaven tables do not have partitioning columns, so the user must provide that information with a table definition, set via ParquetInstructions.Builder.setTableDefinition, when calling any of the writing functions.

Table definitions represent a table's schema. They are constructed from lists of Deephaven ColumnDefinition objects that specify a column's name and type. A ColumnDefinition also specifies whether a column is a partitioning column via the withPartitioning method.

Create a table definition for the grades table defined above.
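A sketch of such a definition, assuming a grades table with hypothetical Name, Class, Test1, and Test2 columns, where Class is the partitioning column:

```java
import io.deephaven.engine.table.ColumnDefinition;
import io.deephaven.engine.table.TableDefinition;

// Column names here are illustrative; adjust them to match your table.
TableDefinition gradesDef = TableDefinition.of(
    ColumnDefinition.ofString("Class").withPartitioning(),
    ColumnDefinition.ofString("Name"),
    ColumnDefinition.ofInt("Test1"),
    ColumnDefinition.ofInt("Test2")
);
```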

Write to a key-value partitioned Parquet directory

Key-value partitioned Parquet directories extend partitioning by organizing data based on key-value pairs in the directory structure. This allows for highly granular and flexible data access patterns, providing efficient querying for complex datasets. The downside is the added complexity of managing and maintaining the key-value directory structure compared with other partitioning methods.

To local storage

Use ParquetTools.writeKeyValuePartitionedTable to write Deephaven tables to key-value partitioned Parquet directories. Supply a Deephaven table or a partitioned table to the partitionedTable argument, and set the destinationDir argument to the destination root directory where the partitioned Parquet data will be stored. Non-existing directories in the provided path will be created.
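A sketch of writing a regular table to a key-value partitioned directory; the table, columns, and destination are illustrative, and the partitioning column is declared through a table definition:

```java
import io.deephaven.engine.table.ColumnDefinition;
import io.deephaven.engine.table.Table;
import io.deephaven.engine.table.TableDefinition;
import io.deephaven.engine.util.TableTools;
import io.deephaven.parquet.table.ParquetInstructions;
import io.deephaven.parquet.table.ParquetTools;

Table source = TableTools.emptyTable(100)
    .update("Region = i % 2 == 0 ? `East` : `West`", "X = i");

// A regular table needs its partitioning columns declared via a table definition.
TableDefinition def = TableDefinition.of(
    ColumnDefinition.ofString("Region").withPartitioning(),
    ColumnDefinition.ofInt("X")
);

ParquetInstructions instructions = ParquetInstructions.builder()
    .setTableDefinition(def)
    .build();

// Produces subdirectories such as /data/partitioned/Region=East/...
ParquetTools.writeKeyValuePartitionedTable(source, "/data/partitioned", instructions);
```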

Call setGenerateMetadataFiles(true) to write metadata files.

To S3

Use ParquetTools.writeKeyValuePartitionedTable to write key-value partitioned Parquet directories to S3. The destinationDir should be the URI of the destination directory in S3. Supply an instance of the S3Instructions class via ParquetInstructions.Builder.setSpecialInstructions to specify the details of the connection to the S3 instance.
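A sketch using a partitioned table as the source; the region, credentials, bucket, and data are placeholders:

```java
import io.deephaven.engine.table.PartitionedTable;
import io.deephaven.engine.table.Table;
import io.deephaven.engine.util.TableTools;
import io.deephaven.extensions.s3.Credentials;
import io.deephaven.extensions.s3.S3Instructions;
import io.deephaven.parquet.table.ParquetInstructions;
import io.deephaven.parquet.table.ParquetTools;

Table source = TableTools.emptyTable(100)
    .update("Region = i % 2 == 0 ? `East` : `West`", "X = i");

// Partitioned tables carry their partitioning columns, so no definition is needed.
PartitionedTable partitioned = source.partitionBy("Region");

S3Instructions s3 = S3Instructions.builder()
    .regionName("us-east-1")
    .credentials(Credentials.basic("my-access-key-id", "my-secret-access-key"))
    .build();

ParquetInstructions instructions = ParquetInstructions.builder()
    .setSpecialInstructions(s3)
    .build();

ParquetTools.writeKeyValuePartitionedTable(partitioned, "s3://my-bucket/partitioned", instructions);
```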

Write to a flat partitioned Parquet directory

A flat partitioned Parquet directory stores data without nested subdirectories. Each file contains partition information within its filename or as metadata. This approach simplifies directory management compared to hierarchical partitioning but can lead to larger directory listings, which might affect performance with many partitions.

To local storage

Use ParquetTools.writeTable or ParquetTools.writeTables to write Deephaven tables to Parquet files in flat partitioned directories. ParquetTools.writeTable requires multiple calls to write multiple tables to the destination, while ParquetTools.writeTables can write multiple tables to multiple paths in a single call.

Supply ParquetTools.writeTable with the Deephaven table to be written and the destination file path via the sourceTable and destination arguments. The destination must end with the .parquet file extension.
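For example (tables and paths illustrative), two tables can be written into the same flat directory with two calls:

```java
import io.deephaven.engine.table.Table;
import io.deephaven.engine.util.TableTools;
import io.deephaven.parquet.table.ParquetTools;

Table east = TableTools.emptyTable(10).update("X = i");
Table west = TableTools.emptyTable(10).update("X = 10 + i");

// One call per file; each file lands directly in the flat directory.
ParquetTools.writeTable(east, "/data/flat/east.parquet");
ParquetTools.writeTable(west, "/data/flat/west.parquet");
```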

Use ParquetTools.writeTables to accomplish the same thing in a single call by passing multiple tables to the sources argument and multiple destination paths to the destinations argument. This requires a table definition, set via ParquetInstructions.Builder.setTableDefinition.
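A sketch of the single-call form, with illustrative tables, columns, and paths:

```java
import io.deephaven.engine.table.ColumnDefinition;
import io.deephaven.engine.table.Table;
import io.deephaven.engine.table.TableDefinition;
import io.deephaven.engine.util.TableTools;
import io.deephaven.parquet.table.ParquetInstructions;
import io.deephaven.parquet.table.ParquetTools;

Table east = TableTools.emptyTable(10).update("X = i");
Table west = TableTools.emptyTable(10).update("X = 10 + i");

// All destination tables share this schema.
TableDefinition def = TableDefinition.of(ColumnDefinition.ofInt("X"));

ParquetInstructions instructions = ParquetInstructions.builder()
    .setTableDefinition(def)
    .build();

ParquetTools.writeTables(
    new Table[] {east, west},
    new String[] {"/data/flat/east.parquet", "/data/flat/west.parquet"},
    instructions
);
```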

To write a Deephaven partitioned table to a flat partitioned Parquet directory, the table must first be broken into a list of constituent tables, such as by calling PartitionedTable.constituents(). Then ParquetTools.writeTables can be used to write all of the resulting constituent tables to Parquet. Again, a table definition must be supplied via ParquetInstructions.Builder.setTableDefinition.
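The steps above can be sketched as follows; the source data and destination paths are illustrative:

```java
import io.deephaven.engine.table.PartitionedTable;
import io.deephaven.engine.table.Table;
import io.deephaven.engine.util.TableTools;
import io.deephaven.parquet.table.ParquetInstructions;
import io.deephaven.parquet.table.ParquetTools;

Table source = TableTools.emptyTable(100)
    .update("Region = i % 2 == 0 ? `East` : `West`", "X = i");
PartitionedTable partitioned = source.partitionBy("Region");

// Break the partitioned table into its constituent tables.
Table[] constituents = partitioned.constituents();

// One illustrative destination path per constituent.
String[] destinations = new String[constituents.length];
for (int ci = 0; ci < constituents.length; ci++) {
    destinations[ci] = "/data/flat/part" + ci + ".parquet";
}

ParquetInstructions instructions = ParquetInstructions.builder()
    .setTableDefinition(partitioned.constituentDefinition())
    .build();

ParquetTools.writeTables(constituents, destinations, instructions);
```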

To S3

Use ParquetTools.writeTables to write a list of Deephaven tables to a flat partitioned Parquet directory in S3. The destinations should be the URIs of the destination files in S3. Supply an instance of the S3Instructions class via ParquetInstructions.Builder.setSpecialInstructions to specify the details of the connection to the S3 instance.
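A sketch combining the two; the region, credentials, bucket, tables, and paths are placeholders:

```java
import io.deephaven.engine.table.ColumnDefinition;
import io.deephaven.engine.table.Table;
import io.deephaven.engine.table.TableDefinition;
import io.deephaven.engine.util.TableTools;
import io.deephaven.extensions.s3.Credentials;
import io.deephaven.extensions.s3.S3Instructions;
import io.deephaven.parquet.table.ParquetInstructions;
import io.deephaven.parquet.table.ParquetTools;

Table east = TableTools.emptyTable(10).update("X = i");
Table west = TableTools.emptyTable(10).update("X = 10 + i");

S3Instructions s3 = S3Instructions.builder()
    .regionName("us-east-1")
    .credentials(Credentials.basic("my-access-key-id", "my-secret-access-key"))
    .build();

ParquetInstructions instructions = ParquetInstructions.builder()
    .setTableDefinition(TableDefinition.of(ColumnDefinition.ofInt("X")))
    .setSpecialInstructions(s3)
    .build();

ParquetTools.writeTables(
    new Table[] {east, west},
    new String[] {"s3://my-bucket/flat/east.parquet", "s3://my-bucket/flat/west.parquet"},
    instructions
);
```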