
How to write and read single Parquet files

This guide shows you how to write a Deephaven table to a single Parquet file and read a single Parquet file back into a Deephaven table using the write and read methods.

The basic syntax follows:

  • write(source, "/data/output.parquet")
  • write(source, "/data/output_GZIP.parquet", compression_codec_name="GZIP")
  • read("/data/output.parquet")
  • read("/data/output_GZIP.parquet")

Write a table to a Parquet file

Let's create a table containing student names, test scores, and GPAs to write to a Parquet file.

from deephaven import new_table
from deephaven.column import int_col, double_col, string_col

grades = new_table(
    [
        string_col("Name", ["Ashley", "Jeff", "Rita", "Zach"]),
        int_col("Test1", [92, 78, 87, 74]),
        int_col("Test2", [94, 88, 81, 70]),
        int_col("Average", [93, 83, 84, 72]),
        double_col("GPA", [3.9, 2.9, 3.0, 1.8]),
    ]
)

Now, use the write method to export the table to a Parquet file. write takes the following arguments:

  1. The table to be written. In this case, grades.
  2. The Parquet file to write to. In this case, /data/grades_GZIP.parquet.
  3. (Optional) compression_codec_name for writing files using a compression codec. An example with a non-default codec follows the GZIP example below. Accepted values are:
    • SNAPPY: Aims for high speed and a reasonable amount of compression. Based on Google's Snappy compression format. If compression_codec_name is not specified, it defaults to SNAPPY.
    • UNCOMPRESSED: The output will not be compressed.
    • LZ4_RAW: A codec based on the LZ4 block format. Should always be used instead of LZ4.
    • LZ4: Deprecated. A compression codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp. Use LZ4_RAW instead.
    • LZO: Compression codec based on or interoperable with the LZO compression library.
    • GZIP: Compression codec based on the GZIP format (not the closely-related "zlib" or "deflate" formats) defined by RFC 1952.
    • ZSTD: Compression codec with the highest compression ratio based on the Zstandard format defined by RFC 8478.
note

In this guide, we write data to locations relative to the base of the Deephaven Docker container. See Docker data volumes to learn more about the relationship between locations in the container and the local file system.

from deephaven.parquet import write

write(grades, "/data/grades_GZIP.parquet", compression_codec_name="GZIP")
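
Any other codec from the list above can be used the same way. The sketch below assumes the grades table from earlier; the output paths are arbitrary names chosen for illustration.

from deephaven.parquet import write

# Hypothetical paths chosen for this sketch; any writable location under /data works the same way.
# ZSTD typically yields a smaller file than the default codec.
write(grades, "/data/grades_ZSTD.parquet", compression_codec_name="ZSTD")

# Omitting compression_codec_name falls back to the default SNAPPY codec.
write(grades, "/data/grades_SNAPPY.parquet")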

Read a Parquet file into a table

deephaven.parquet.read reads a Parquet file into Deephaven as a table. A number of input parameters can be used; however, only one is required if the file is stored locally:

  • path: The Parquet file to read. In this case, it's /data/grades_GZIP.parquet.
note

For more information on the file path, see Docker data volumes.

from deephaven.parquet import read

result = read("/data/grades_GZIP.parquet")

There are a number of optional input parameters that can be used when reading from Parquet. For more information, see read.
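
Because the compression codec is recorded inside the Parquet file itself, read never needs to be told which codec was used when the file was written. As a short sketch, assuming the ZSTD file from the earlier write example exists:

from deephaven.parquet import read

# No codec argument is needed; the codec is read from the file's metadata.
result_zstd = read("/data/grades_ZSTD.parquet")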

Read a file from AWS S3

Deephaven supports reading Parquet files from two places: your local filesystem and AWS S3. The following code block reads a public Parquet dataset from an S3 bucket. To do so, deephaven.experimental.s3 is used to specify how the read is done.

from deephaven import parquet
from deephaven.experimental import s3
from datetime import timedelta

drivestats = parquet.read(
    "s3://drivestats-parquet/drivestats/year=2023/month=02/2023-02-1.parquet",
    special_instructions=s3.S3Instructions(
        region_name="us-west-004",
        endpoint_override="https://s3.us-west-004.backblazeb2.com",
        anonymous_access=True,
        read_ahead_count=8,
        fragment_size=65536,
        read_timeout=timedelta(seconds=10),
    ),
)

The following input parameters are specified in the example. Only the first (region_name) is required:

  • region_name: This mandatory parameter defines the region name of the AWS S3 bucket where the Parquet data exists.
  • endpoint_override: The endpoint to connect to. The default is None.
  • anonymous_access: A boolean indicating to use anonymous credentials. The default is False.
  • read_ahead_count: The number of fragments that are asynchronously read ahead of the current fragment as the current fragment is being read. The default is 1.
  • fragment_size: The maximum size of each fragment to read in bytes. The default is 5 MB.
  • read_timeout: The amount of time to wait before a fragment read times out. The default is 2 seconds.

The following optional input parameters are not specified in the example, but can be used:

  • max_concurrent_requests: The maximum number of concurrent requests to make to S3. The default is 50.
  • max_cache_size: The maximum number of fragments to cache in memory while reading. The default is 32.
  • connection_timeout: The amount of time to wait for a successful S3 connection before timing out. The default is 2 seconds.
  • access_key_id: The access key for reading files. If set, secret_access_key must also be set.
  • secret_access_key: The secret access key for reading files.
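
The sketch below shows how these optional parameters fit together when reading from a private bucket with explicit credentials. Everything here is a placeholder: the bucket, path, region, and credential values are not a real dataset, and connection_timeout is assumed to accept a timedelta just like read_timeout.

from deephaven import parquet
from deephaven.experimental import s3
from datetime import timedelta

# Hypothetical example: all values below are placeholders for illustration only.
private_data = parquet.read(
    "s3://example-private-bucket/data/2023-02-01.parquet",
    special_instructions=s3.S3Instructions(
        region_name="us-east-1",
        access_key_id="YOUR_ACCESS_KEY_ID",  # if set, secret_access_key must also be set
        secret_access_key="YOUR_SECRET_ACCESS_KEY",
        max_concurrent_requests=50,  # maximum concurrent S3 requests (the default)
        max_cache_size=32,  # fragments cached in memory while reading (the default)
        connection_timeout=timedelta(seconds=2),  # wait this long for a successful connection
    ),
)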