Parquet Cheat Sheet
Optional instructions for customizations while writing. Valid values are:
SNAPPY: Aims for high speed and a reasonable amount of compression. Based on Google's Snappy compression format. If ParquetInstructions is not specified, it defaults to SNAPPY.
UNCOMPRESSED: The output will not be compressed.
LZ4_RAW: A codec based on the LZ4 block format. Should always be used instead of LZ4.
LZ4: Deprecated. Compression codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp. Use LZ4_RAW instead.
LZO: Compression codec based on or interoperable with the LZO compression library.
GZIP: Compression codec based on the GZIP format (not the closely related "zlib" or "deflate" formats) defined by RFC 1952.
ZSTD: Compression codec with the highest compression ratio, based on the Zstandard format defined by RFC 8478.
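A codec from the list above can also be requested by name rather than through a predefined constant. The following is a minimal sketch, not a definitive recipe: it assumes the `ParquetInstructions` builder in `io.deephaven.parquet.table` exposes `setCompressionCodecName`, and that `writeTable` accepts the resulting instructions as its third argument, the same way `ParquetTools.GZIP` is passed in the examples in this sheet. The table `t` and the output path are hypothetical.

```groovy
import io.deephaven.parquet.table.ParquetInstructions
import static io.deephaven.parquet.table.ParquetTools.writeTable

// Hypothetical example table; any Deephaven table can be written this way
t = emptyTable(10).update("X = i")

// Assumption: the builder accepts the codec names listed above as strings
lz4Raw = ParquetInstructions.builder().setCompressionCodecName("LZ4_RAW").build()

// Hypothetical output path
writeTable(t, new File("/data/output_LZ4_RAW.parquet"), lz4Raw)
```

Building the instructions explicitly is mainly useful for codecs that may not have a ready-made constant in your version.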
Reading instructions accept all of the above, plus LEGACY:
LEGACY: Load any binary fields as strings. Helpful for loading files written by older versions of Parquet that lacked a distinction between binary and string.
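As a hedged illustration of the LEGACY option, the sketch below assumes `LEGACY` is exposed as a constant on `ParquetTools`, in the same way as the `GZIP` constant used in the examples in this sheet. The file path is hypothetical and stands in for any file produced by an older Parquet writer.

```groovy
import io.deephaven.parquet.table.ParquetTools
import static io.deephaven.parquet.table.ParquetTools.readTable

// Hypothetical path to a file written by an older Parquet writer.
// Assumption: ParquetTools.LEGACY is available as a predefined instruction set.
legacy = readTable("/data/legacy_output.parquet", ParquetTools.LEGACY)
```

With these instructions, binary columns in the file should arrive as string columns in the resulting table.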
// Create a table
source = newTable(
    stringCol("X", "A", "B", "B", "C", "B", "A", "B", "B", "C"),
    intCol("Y", 2, 4, 2, 1, 2, 3, 4, 2, 3),
    intCol("Z", 55, 76, 20, 4, 230, 50, 73, 137, 214)
)
// Write to a local file
import static io.deephaven.parquet.table.ParquetTools.writeTable
writeTable(source, new File("/data/output.parquet"))
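// Write to a local file with compression
// The class import makes ParquetTools.GZIP resolvable if ParquetTools is not already imported in the session
import io.deephaven.parquet.table.ParquetTools
writeTable(source, new File("/data/output_GZIP.parquet"), ParquetTools.GZIP)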
// Read from a local file
import static io.deephaven.parquet.table.ParquetTools.readTable
source = readTable("/data/output.parquet")
// Read from a local compressed file
source = readTable("/data/output_GZIP.parquet", ParquetTools.GZIP)
// Read an entire directory of Parquet files
// Only files with a `.parquet` extension or `_common_metadata` and `_metadata` files should be located in these directories.
// All files ending with `.parquet` need the same schema.
source = readTable("/data/examples/Pems/parquet/pems")