
Export Deephaven Tables to Parquet Files

The Deephaven Parquet module provides tools to integrate Deephaven with the Parquet file format. This module makes it easy to write Deephaven tables to Parquet files and directories. This document covers writing Deephaven tables to single Parquet files, flat partitioned Parquet directories, and key-value partitioned Parquet directories.

By default, Deephaven writes Parquet files with SNAPPY compression. This default can be changed with the ParquetInstructions.Builder.setCompressionCodecName method in any of the write operations discussed here.

First, create some tables that will be used for the examples in this guide.

mathGrades = newTable(
    stringCol("Name", "Ashley", "Jeff", "Rita", "Zach"),
    stringCol("Class", "Math", "Math", "Math", "Math"),
    intCol("Test1", 92, 78, 87, 74),
    intCol("Test2", 94, 88, 81, 70)
)

scienceGrades = newTable(
    stringCol("Name", "Ashley", "Jeff", "Rita", "Zach"),
    stringCol("Class", "Science", "Science", "Science", "Science"),
    intCol("Test1", 87, 90, 99, 80),
    intCol("Test2", 91, 83, 95, 78)
)

historyGrades = newTable(
    stringCol("Name", "Ashley", "Jeff", "Rita", "Zach"),
    stringCol("Class", "History", "History", "History", "History"),
    intCol("Test1", 82, 87, 84, 76),
    intCol("Test2", 88, 92, 85, 78)
)

grades = merge(mathGrades, scienceGrades, historyGrades)

gradesPartitioned = grades.partitionBy("Class")

Write to a single Parquet file

Write a Deephaven table to a single Parquet file with ParquetTools.writeTable. Supply the sourceTable argument with the Deephaven table to be written, and the destination argument with the file path for the resulting Parquet file. This file path should end with the .parquet extension. Optionally, supply ParquetInstructions (for example, one of the predefined compression presets on ParquetTools) to control how the output file is compressed.

import io.deephaven.parquet.table.ParquetTools

// write to a Parquet file with the default (SNAPPY) compression
ParquetTools.writeTable(grades, "/data/grades/grades.parquet")

// write to a GZIP-compressed Parquet file
ParquetTools.writeTable(grades, "/data/grades/grades_gzip.parquet", ParquetTools.GZIP)
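
The same GZIP output can also be produced by naming the codec on a ParquetInstructions builder, using the setCompressionCodecName method mentioned above. This is a minimal sketch; the destination path is illustrative.

import io.deephaven.parquet.table.ParquetInstructions

// equivalent GZIP write, naming the codec explicitly on the builder
ParquetTools.writeTable(
    grades,
    "/data/grades/grades_gzip_builder.parquet",
    ParquetInstructions.builder().setCompressionCodecName("GZIP").build()
)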

Write _metadata and _common_metadata files by calling ParquetInstructions.Builder.setGenerateMetadataFiles(true). Parquet metadata files are useful when reading very large datasets, as they can significantly speed up the read operation. If the data might be read in the future, consider writing metadata files.

import io.deephaven.parquet.table.ParquetInstructions

ParquetTools.writeTable(
    grades,
    "/data/grades_meta/grades.parquet",
    ParquetInstructions.builder().setGenerateMetadataFiles(true).build()
)
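
Assuming the call above succeeds, the destination directory holds the data file alongside the two metadata files, roughly as follows (a sketch, not verbatim output):

/data/grades_meta/
├── grades.parquet
├── _metadata
└── _common_metadata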

Partitioned Parquet directories

Deephaven supports writing tables to partitioned Parquet directories. A partitioned Parquet directory organizes data into subdirectories based on one or more partitioning columns. This structure allows for more efficient data querying by pruning irrelevant partitions, leading to faster read times than a single Parquet file. Deephaven tables can be written to flat partitioned directories or key-value partitioned directories.

Data can be written to partitioned directories from Deephaven tables or from Deephaven's partitioned tables. Partitioned tables have partitioning columns built into the API, so Deephaven can use those columns to create partitioned directories. Regular Deephaven tables do not have partitioning columns, so the user must provide that information by supplying a table definition via ParquetInstructions.Builder.setTableDefinition in any of the write methods.

Table definitions represent a table's schema. They are constructed from Deephaven ColumnDefinition objects, each of which specifies a column's name and type. A ColumnDefinition can also mark a column as a partitioning column via its withPartitioning() method.

Create a table definition for the grades table defined above.

import io.deephaven.engine.table.TableDefinition
import io.deephaven.engine.table.ColumnDefinition

gradesDef = TableDefinition.of(
    ColumnDefinition.ofString("Name"),
    // Class is declared to be a partitioning column
    ColumnDefinition.ofString("Class").withPartitioning(),
    ColumnDefinition.ofInt("Test1"),
    ColumnDefinition.ofInt("Test2")
)

Write to a key-value partitioned Parquet directory

Key-value partitioned Parquet directories extend partitioning by organizing data based on key-value pairs in the directory structure. This allows for highly granular and flexible data access patterns, providing efficient querying for complex datasets. The downside is the added complexity in managing and maintaining the key-value pairs, which can be more intricate than other partitioning methods.

Use ParquetTools.writeKeyValuePartitionedTable to write Deephaven tables to key-value partitioned Parquet directories. Supply a Deephaven table or a partitioned table as the first argument, and set the destinationDir argument to the root directory where the partitioned Parquet data will be stored. Non-existent directories in the provided path are created.

// write a standard Deephaven table; a table definition with the partitioning column must be supplied
ParquetTools.writeKeyValuePartitionedTable(
    grades, "/data/grades_kv_1", ParquetInstructions.builder().setTableDefinition(gradesDef).build()
)

// or write a partitioned table
ParquetTools.writeKeyValuePartitionedTable(
    gradesPartitioned, "/data/grades_kv_2", ParquetInstructions.builder().setTableDefinition(gradesDef).build()
)
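
For reference, key-value partitioning lays the data out in Hive-style key=value subdirectories, one per partition value. With Class as the partitioning column, the destination from the example above ends up roughly like the sketch below; the data file names inside each partition are generated by the writer.

/data/grades_kv_2/
├── Class=History/
│   └── <generated>.parquet
├── Class=Math/
│   └── <generated>.parquet
└── Class=Science/
    └── <generated>.parquet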

Call the setGenerateMetadataFiles method to write metadata files.

ParquetTools.writeKeyValuePartitionedTable(
    gradesPartitioned,
    "/data/grades_kv_2_md",
    ParquetInstructions.builder().setGenerateMetadataFiles(true).build()
)

Write to a flat partitioned Parquet directory

A flat partitioned Parquet directory stores data without nested subdirectories. Each file contains partition information within its filename or as metadata. This approach simplifies directory management compared to hierarchical partitioning but can lead to larger directory listings, which might affect performance with many partitions.

Use ParquetTools.writeTable or ParquetTools.writeTables to write Deephaven tables to Parquet files in flat partitioned directories. ParquetTools.writeTable requires multiple calls to write multiple tables to the destination, while ParquetTools.writeTables can write multiple tables to multiple paths in a single call.

Supply ParquetTools.writeTable with the Deephaven table to be written and the destination file path, just as in the single-file case. Each destination path must end with the .parquet file extension.

ParquetTools.writeTable(mathGrades, "/data/grades_flat_1/math.parquet")
ParquetTools.writeTable(scienceGrades, "/data/grades_flat_1/science.parquet")
ParquetTools.writeTable(historyGrades, "/data/grades_flat_1/history.parquet")

Use ParquetTools.writeTables to accomplish the same thing in a single call by passing an array of tables and an array of destination paths. A table definition must be supplied via ParquetInstructions.

ParquetTools.writeTables(
    new Table[] {mathGrades, scienceGrades, historyGrades},
    new String[] {
        "/data/grades_flat_2/math.parquet",
        "/data/grades_flat_2/science.parquet",
        "/data/grades_flat_2/history.parquet"
    },
    ParquetInstructions.builder().setTableDefinition(gradesDef).build()
)

To write a Deephaven partitioned table to a flat partitioned Parquet directory, the table must first be broken into its constituent tables, such as by calling PartitionedTable.constituents(). Then, ParquetTools.writeTables can write all of the resulting constituent tables to Parquet. Again, a table definition must be supplied via ParquetInstructions.

ParquetTools.writeTables(
    gradesPartitioned.constituents(),
    new String[] {
        "/data/grades_flat_2/math.parquet",
        "/data/grades_flat_2/science.parquet",
        "/data/grades_flat_2/history.parquet"
    },
    ParquetInstructions.builder().setTableDefinition(gradesDef).build()
)