read
The read method reads a single Parquet file, metadata file, or directory with a recognized layout into an in-memory table.
Syntax
read(
    path: str,
    col_instructions: list[ColumnInstruction] = None,
    is_legacy_parquet: bool = False,
    is_refreshing: bool = False
) -> Table
Parameters
Parameter | Type | Description |
---|---|---|
path | str | The file to load into a table. The file should exist and end with the .parquet extension. |
col_instructions (optional) | list[ColumnInstruction] | Optional instructions for customizations while reading; see the sketch below this table. |
is_legacy_parquet (optional) | bool | Whether the Parquet data is in legacy format. |
is_refreshing (optional) | bool | Whether the Parquet data represents a refreshing source. |
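As a minimal sketch of col_instructions, the example below assumes the ColumnInstruction class exported by deephaven.parquet, its column_name and parquet_column_name keyword arguments, and a hypothetical file /data/example.parquet; adjust these to match your data.
from deephaven.parquet import read, ColumnInstruction

# Assumption: ColumnInstruction maps a column name as stored in the Parquet
# file (parquet_column_name) onto a Deephaven column name (column_name).
instructions = [
    ColumnInstruction(column_name="Price", parquet_column_name="price_usd")
]

# Hypothetical file path, used purely for illustration.
source = read("/data/example.parquet", col_instructions=instructions)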
Returns
A new in-memory table from a Parquet file, metadata file, or directory with a recognized layout.
Examples
All examples in this document use data mounted in /data in Deephaven. For more information on the relation between this location in Deephaven and your local file system, see Docker data volumes.
Single Parquet file
The following examples use the data found in Deephaven's examples repository. Follow the instructions in Launch Deephaven from pre-built images to download and manage the example data.
In this example, read is used to load the file /data/examples/Taxi/parquet/taxi.parquet into a Deephaven table.
from deephaven.parquet import read
source = read("/data/examples/Taxi/parquet/taxi.parquet")
Compression codec
In this example, read is used to load the file /data/output_GZIP.parquet, written with GZIP compression, into a Deephaven table.
This file needs to exist for this example to work. To generate this file, see write.
from deephaven.parquet import read, write
from deephaven import new_table
from deephaven.column import string_col, int_col

# Create a table with one string column and two integer columns.
source = new_table([
    string_col("X", ["A", "B", "B", "C", "B", "A", "B", "B", "C"]),
    int_col("Y", [2, 4, 2, 1, 2, 3, 4, 2, 3]),
    int_col("Z", [55, 76, 20, 4, 230, 50, 73, 137, 214]),
])

# Write the table to disk with GZIP compression, then read it back.
write(source, "/data/output_GZIP.parquet", compression_codec_name="GZIP")
source = read("/data/output_GZIP.parquet")
Partitioned datasets
_metadata and/or _common_metadata files are occasionally present in partitioned datasets. These files can be used to load Parquet data sets more quickly. These files are specific to only certain frameworks and are not required to read the data into a Deephaven table.
_common_metadata: File containing schema information needed to load the whole dataset faster.
_metadata: File containing (1) complete relative pathnames to individual data files, and (2) column statistics, such as min and max, for the individual data files.
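Because read accepts a metadata file directly, a dataset that ships with these files can be loaded by pointing read at the metadata file itself. The sketch below uses a hypothetical dataset path.
from deephaven.parquet import read

# Hypothetical partitioned dataset that includes a _common_metadata file.
# Reading the metadata file lets the whole dataset load from its stored
# schema rather than by inspecting every data file individually.
source = read("/data/partitioned_dataset/_common_metadata")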
For a directory of Parquet files, all sub-directories are also searched. Only files with a .parquet extension, or _common_metadata and _metadata files, should be located in these directories. All files ending with .parquet need the same schema.
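The sketch below illustrates such a directory layout. It assumes a hypothetical directory /data/dataset that already exists: two files with an identical schema are written into it, and read then loads the directory as a single table.
from deephaven.parquet import read, write
from deephaven import new_table
from deephaven.column import int_col

# Two tables with the same schema, written into one directory.
write(new_table([int_col("X", [1, 2, 3])]), "/data/dataset/part1.parquet")
write(new_table([int_col("X", [4, 5, 6])]), "/data/dataset/part2.parquet")

# read searches the directory (and any sub-directories) and loads the
# files into a single in-memory table.
combined = read("/data/dataset")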
The following examples use data in Deephaven's examples repository. Follow the instructions in Launch Deephaven from pre-built images to download and manage the example data.
In this example, read is used to load the directory /data/examples/Pems/parquet/pems into a Deephaven table.
from deephaven.parquet import read
source = read("/data/examples/Pems/parquet/pems")