Export Deephaven Tables to Parquet Files
The Deephaven Parquet module provides tools to integrate Deephaven with the Parquet file format. This module makes it easy to write Deephaven tables to Parquet files and directories. This document covers writing Deephaven tables to single Parquet files, flat partitioned Parquet directories, and key-value partitioned Parquet directories.
By default, Deephaven tables are written to Parquet files using SNAPPY compression. This default can be changed with the ParquetInstructions.Builder.setCompressionCodecName method in any of the writing functions discussed here.
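For example, a minimal sketch of write instructions that swap the default codec for GZIP might look like the following (the codec name string is an illustrative assumption; any codec name supported by the underlying Parquet library can be used):
import io.deephaven.parquet.table.ParquetInstructions
// write instructions that replace the default SNAPPY codec with GZIP
gzipInstructions = ParquetInstructions.builder().setCompressionCodecName("GZIP").build()
// these instructions can be passed as the final argument to any of the write functions shown below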
First, create some tables that will be used for the examples in this guide.
mathGrades = newTable(
    stringCol("Name", "Ashley", "Jeff", "Rita", "Zach"),
    stringCol("Class", "Math", "Math", "Math", "Math"),
    intCol("Test1", 92, 78, 87, 74),
    intCol("Test2", 94, 88, 81, 70)
)
scienceGrades = newTable(
    stringCol("Name", "Ashley", "Jeff", "Rita", "Zach"),
    stringCol("Class", "Science", "Science", "Science", "Science"),
    intCol("Test1", 87, 90, 99, 80),
    intCol("Test2", 91, 83, 95, 78)
)
historyGrades = newTable(
    stringCol("Name", "Ashley", "Jeff", "Rita", "Zach"),
    stringCol("Class", "History", "History", "History", "History"),
    intCol("Test1", 82, 87, 84, 76),
    intCol("Test2", 88, 92, 85, 78)
)
grades = merge(mathGrades, scienceGrades, historyGrades)
gradesPartitioned = grades.partitionBy("Class")
Write to a single Parquet file
Write a Deephaven table to a single Parquet file with ParquetTools.writeTable. Supply the sourceTable argument with the Deephaven table to be written, and the destination argument with the file path for the resulting Parquet file. The file path should end with the .parquet extension. Predefined write instructions, such as ParquetTools.GZIP, can be passed as a third argument to compress the output file.
import io.deephaven.parquet.table.ParquetTools
// write to a Parquet file with the default (SNAPPY) compression
ParquetTools.writeTable(grades, "/data/grades/grades.parquet")
// write to a GZIP-compressed Parquet file
ParquetTools.writeTable(grades, "/data/grades/grades_gzip.parquet", ParquetTools.GZIP)
Write _metadata and _common_metadata files by calling ParquetInstructions.Builder.setGenerateMetadataFiles(true). Parquet metadata files are useful for reading very large datasets, as they can significantly improve read performance. If the data might be read in the future, consider writing metadata files.
import io.deephaven.parquet.table.ParquetInstructions
ParquetTools.writeTable(
    grades,
    "/data/grades_meta/grades.parquet",
    ParquetInstructions.builder().setGenerateMetadataFiles(true).build()
)
Partitioned Parquet directories
Deephaven supports writing tables to partitioned Parquet directories. A partitioned Parquet directory organizes data into subdirectories based on one or more partitioning columns. This structure allows for more efficient data querying by pruning irrelevant partitions, leading to faster read times than a single Parquet file. Deephaven tables can be written to flat partitioned directories or key-value partitioned directories.
Data can be written to partitioned directories from Deephaven tables or from Deephaven's partitioned tables. Partitioned tables have partitioning columns built into the API, so Deephaven can use those partitioning columns to create partitioned directories. Regular Deephaven tables do not have partitioning columns, so the user must provide that information by supplying a table definition to the writing functions via ParquetInstructions.Builder.setTableDefinition.
Table definitions represent a table's schema. They are constructed from lists of Deephaven ColumnDefinition
objects that specify a column's name and type. Additionally, ColumnDefinition
objects are used to specify whether a particular column is a partitioning column by calling the withPartitioning()
method.
Create a table definition for the grades
table defined above.
import io.deephaven.engine.table.TableDefinition
import io.deephaven.engine.table.ColumnDefinition
gradesDef = TableDefinition.of(
    ColumnDefinition.ofString("Name"),
    // Class is declared to be a partitioning column
    ColumnDefinition.ofString("Class").withPartitioning(),
    ColumnDefinition.ofInt("Test1"),
    ColumnDefinition.ofInt("Test2")
)
Write to a key-value partitioned Parquet directory
Key-value partitioned Parquet directories extend partitioning by organizing data based on key-value pairs in the directory structure. This allows for highly granular and flexible data access patterns, providing efficient querying for complex datasets. The downside is the added complexity in managing and maintaining the key-value pairs, which can be more intricate than other partitioning methods.
Use ParquetTools.writeKeyValuePartitionedTable
to write Deephaven tables to key-value partitioned Parquet directories. Supply a Deephaven table or a partitioned table to the partitionedTable
argument, and set the destinationDir
argument to the destination root directory where the partitioned Parquet data will be stored. Non-existing directories in the provided path will be created.
// write a standard Deephaven table; the table definition with its partitioning column must be supplied via the write instructions
ParquetTools.writeKeyValuePartitionedTable(
    grades, "/data/grades_kv_1.parquet", ParquetInstructions.builder().setTableDefinition(gradesDef).build()
)
// or write a partitioned table
ParquetTools.writeKeyValuePartitionedTable(
    gradesPartitioned, "/data/grades_kv_2.parquet", ParquetInstructions.builder().setTableDefinition(gradesDef).build()
)
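The resulting directory is organized by the values of the partitioning column, using key=value subdirectory names. Roughly (the file names inside each partition are generated by Deephaven and shown here as placeholders):
/data/grades_kv_2.parquet/
    Class=History/
        <generated>.parquet
    Class=Math/
        <generated>.parquet
    Class=Science/
        <generated>.parquet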
Call the setGenerateMetadataFiles
method to write metadata files.
ParquetTools.writeKeyValuePartitionedTable(
    gradesPartitioned,
    "/data/grades_kv_2_md.parquet",
    ParquetInstructions.builder().setGenerateMetadataFiles(true).build()
)
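As a quick check, a key-value partitioned directory can be read back into a single table with ParquetTools.readTable. This is a minimal sketch; it assumes the directory written above exists and lets readTable infer the partitioned layout.
// read the key-value partitioned directory written above back into a table
gradesFromDisk = ParquetTools.readTable("/data/grades_kv_2_md.parquet")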
Write to a flat partitioned Parquet directory
A flat partitioned Parquet directory stores data without nested subdirectories. Each file contains partition information within its filename or as metadata. This approach simplifies directory management compared to hierarchical partitioning but can lead to larger directory listings, which might affect performance with many partitions.
Use ParquetTools.writeTable
or ParquetTools.writeTables
to write Deephaven tables to Parquet files in flat partitioned directories. ParquetTools.writeTable
requires multiple calls to write multiple tables to the destination, while ParquetTools.writeTables
can write multiple tables to multiple paths in a single call.
Supply ParquetTools.writeTable with the Deephaven table to be written and the destination file path, just as when writing a single file. The destination path must end with the .parquet file extension.
ParquetTools.writeTable(grades, "/data/grades_flat_1/math.parquet")
ParquetTools.writeTable(grades, "/data/grades_flat_1/science.parquet")
ParquetTools.writeTable(grades, "/data/grades_flat_1/history.parquet")
Use ParquetTools.writeTables to accomplish the same thing in a single call by passing an array of tables and an array of destination paths. A table definition must also be supplied via ParquetInstructions.Builder.setTableDefinition.
ParquetTools.writeTables(
    new Table[] {mathGrades, scienceGrades, historyGrades},
    new String[] {
        "/data/grades_flat_2/math.parquet",
        "/data/grades_flat_2/science.parquet",
        "/data/grades_flat_2/history.parquet"
    },
    ParquetInstructions.builder().setTableDefinition(gradesDef).build()
)
To write a Deephaven partitioned table to a flat partitioned Parquet directory, the table must first be broken into its constituent tables, such as by calling PartitionedTable.constituents(). Then, ParquetTools.writeTables can be used to write all of the resulting constituent tables to Parquet. Again, a table definition must be supplied via ParquetInstructions.Builder.setTableDefinition.
ParquetTools.writeTables(
    gradesPartitioned.constituents(),
    new String[] {
        "/data/grades_flat_3/math.parquet",
        "/data/grades_flat_3/science.parquet",
        "/data/grades_flat_3/history.parquet"
    },
    ParquetInstructions.builder().setTableDefinition(gradesDef).build()
)