
Parquet Cheat Sheet

Optional instructions can be supplied to customize how files are written. Valid compression codec values are:

  • LZ4: Compression codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp.
  • LZO: Compression codec based on or interoperable with the LZO compression library.
  • GZIP: Compression codec based on the GZIP format (not the closely-related "zlib" or "deflate" formats) defined by RFC 1952.
  • ZSTD: Compression codec based on the Zstandard format defined by RFC 8478; typically achieves among the highest compression ratios of the codecs listed here.

Reading instructions accept all of the above values, plus LEGACY:

  • LEGACY: Load any binary fields as strings. Helpful when loading files written by older versions of Parquet that did not distinguish between binary and string fields.

# Create a table
from deephaven import new_table
from deephaven.column import string_col, int_col
from deephaven import parquet

source = new_table([
    string_col("X", ["A", "B", "B", "C", "B", "A", "B", "B", "C"]),
    int_col("Y", [2, 4, 2, 1, 2, 3, 4, 2, 3]),
    int_col("Z", [55, 76, 20, 4, 230, 50, 73, 137, 214]),
])

# Write to a local file
parquet.write(source, "/data/output.parquet")

# Read from a local file
source_read = parquet.read("/data/output.parquet")