How to write and read single Parquet files
This guide shows you how to write a Deephaven table to a single Parquet file, and how to read a single Parquet file back into a Deephaven table, with both Python and Groovy, using the writeTable and readTable methods.
The basic syntax follows:
writeTable(source, "/data/output.parquet")
writeTable(source, "/data/output_GZIP.parquet", "GZIP")
readTable("/data/output.parquet")
readTable("/data/output_GZIP.parquet", "GZIP")
Write a table to a Parquet file
The Deephaven Query Language makes importing and manipulating data easy and efficient. In this example, we will export an in-memory Deephaven table to a Parquet file.
Start by creating the grades table, containing student names, test scores, and GPAs.
grades = newTable(
stringCol("Name", "Ashley", "Jeff", "Rita", "Zach"),
intCol("Test1", 92, 78, 87, 74),
intCol("Test2", 94, 88, 81, 70),
intCol("Average", 93, 83, 84, 72),
doubleCol("GPA", 3.9, 2.9, 3.0, 1.8)
)
Now, use the writeTable method to export the table to a Parquet file. writeTable takes the following arguments:
- The table to be written. In this case, grades.
- The Parquet file to write to. In this case, /data/grades_GZIP.parquet.
- (Optional) parquetInstructions for writing files using compression codecs. Accepted values are:
  - LZ4: Compression codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp.
  - LZO: Compression codec based on or interoperable with the LZO compression library.
  - GZIP: Compression codec based on the GZIP format (not the closely-related "zlib" or "deflate" formats) defined by RFC 1952.
  - ZSTD: Compression codec with the highest compression ratio, based on the Zstandard format defined by RFC 8478.
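The distinction the GZIP entry draws between the gzip, zlib, and raw deflate formats can be seen directly in their framing bytes. The following is a minimal sketch using only the Python standard library; it does not call Deephaven, and the payload and the helper name is_gzip are made up for illustration. It compresses the same bytes three ways and checks which result carries the gzip magic number (0x1f 0x8b) from RFC 1952.

```python
import gzip
import zlib

# Hypothetical payload, standing in for column data to be compressed.
payload = b"Name,Test1,Test2\nAshley,92,94\n" * 50

# gzip framing (RFC 1952): header with magic bytes, deflate body, CRC32 trailer.
gzip_bytes = gzip.compress(payload)

# zlib framing (RFC 1950): a different 2-byte header and an Adler-32 trailer.
zlib_bytes = zlib.compress(payload)

# Raw deflate (RFC 1951): negative wbits asks zlib for no header or trailer.
raw = zlib.compressobj(wbits=-15)
deflate_bytes = raw.compress(payload) + raw.flush()


def is_gzip(data: bytes) -> bool:
    """Return True if data begins with the gzip magic number 0x1f 0x8b."""
    return data[:2] == b"\x1f\x8b"


for label, data in [("gzip", gzip_bytes), ("zlib", zlib_bytes), ("deflate", deflate_bytes)]:
    print(label, is_gzip(data), len(data))
```

All three variants wrap the same deflate algorithm, so their sizes are close; only the framing differs, which is why a reader expecting one format will generally reject the others.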
In this guide, we write data to locations relative to the base of the Deephaven Docker container. See Docker data volumes to learn more about how locations in the container relate to the local file system.
import io.deephaven.parquet.table.ParquetTools
ParquetTools.writeTable(grades, new File("/data/grades_GZIP.parquet"), ParquetTools.GZIP)
Read a Parquet file into a table
Now, use the readTable method to import the Parquet file as a table. readTable takes the following arguments:
- The Parquet file to read. In this case, /data/grades_GZIP.parquet.
- (Optional) parquetInstructions for codecs when the file type cannot be successfully inferred. Accepted values are:
  - LZ4: Compression codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp.
  - LZO: Compression codec based on or interoperable with the LZO compression library.
  - GZIP: Compression codec based on the GZIP format (not the closely-related "zlib" or "deflate" formats) defined by RFC 1952.
  - ZSTD: Compression codec with the highest compression ratio, based on the Zstandard format defined by RFC 8478.
  - LEGACY: Load any binary fields as strings. Helpful for loading files written by older versions of Parquet that lacked a distinction between binary and string.
For more information on the file path, see Docker data volumes.
import io.deephaven.parquet.table.ParquetTools
result = ParquetTools.readTable(new File("/data/grades_GZIP.parquet"), ParquetTools.GZIP)
Read large Parquet files
When we load a Parquet file into a table, we do not load the whole file into RAM. This means that files much larger than the available RAM can be loaded as tables.
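One property of the Parquet format that makes this kind of partial loading possible is that a file's metadata lives in a footer at the end: the last 8 bytes are a 4-byte little-endian footer length followed by the magic bytes PAR1. The sketch below is not Deephaven's implementation and makes no claim about how Deephaven reads files. It uses only the Python standard library, and the helper names read_footer_length and make_synthetic_file are hypothetical. It shows that the footer can be located by reading 8 bytes, regardless of file size, using a stand-in file with Parquet-style framing around dummy bytes.

```python
import os
import struct
import tempfile

MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes


def read_footer_length(path: str) -> int:
    """Return the footer metadata length by reading only the final 8 bytes."""
    if os.path.getsize(path) < 12:  # leading magic + length field + trailing magic
        raise ValueError("file too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)  # jump to the tail without reading the body
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (length,) = struct.unpack("<I", tail[:4])  # 4-byte little-endian unsigned int
    return length


def make_synthetic_file(body_size: int, footer: bytes) -> str:
    """Write a stand-in file with Parquet-style framing.

    The body and footer are dummy bytes, not real Parquet data; this exists
    only so the example can run without a Parquet writer.
    """
    fd, path = tempfile.mkstemp(suffix=".parquet")
    with os.fdopen(fd, "wb") as f:
        f.write(MAGIC)
        f.write(b"\x00" * body_size)
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))
        f.write(MAGIC)
    return path


if __name__ == "__main__":
    demo_footer = b"dummy-footer-bytes"
    demo_path = make_synthetic_file(1_000_000, demo_footer)
    try:
        # Only 8 bytes of the roughly 1 MB file are read here.
        print(read_footer_length(demo_path) == len(demo_footer))
    finally:
        os.remove(demo_path)
```

In a real Parquet file, the footer describes where each column chunk sits, so a reader can seek to just the columns and row groups a query touches.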