Iceberg and Deephaven
Apache Iceberg is a high-performance format for tabular data. Deephaven's Iceberg integration enables users to interact with Iceberg catalogs, namespaces, tables, and snapshots by ingesting them as tables. This guide creates an Iceberg catalog with a single table and snapshot. It then walks through how to interact with the catalog in the Deephaven IDE through:
- A REST API and MinIO instance
- S3 storage providers
Deephaven's Iceberg module
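Deephaven's Iceberg integration is provided by the deephaven.experimental.iceberg module. The module contains two classes and two functions:
- IcebergCatalogAdapter (class)
- IcebergInstructions (class)
- adapter_s3_rest (function)
- adapter_aws_glue (function)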
Querying Iceberg tables in Deephaven uses the deephaven.experimental.s3 module to pull data from S3-compatible providers.
A Deephaven deployment for Iceberg
The examples presented in this guide pull Iceberg data from a REST catalog. This section closely follows Iceberg's Spark quickstart. It extends the docker-compose.yml file in that guide to include Deephaven as part of the Iceberg Docker network. The Deephaven server is started alongside a Spark server, Iceberg REST API, and MinIO object store.
docker-compose.yml
version: '3'
services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    build: spark/
    networks:
      iceberg_net:
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8081:8080
      - 11000:10000
      - 11001:10001
  rest:
    image: tabulario/iceberg-rest
    container_name: iceberg-rest
    networks:
      iceberg_net:
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ['server', '/data', '--console-address', ':9001']
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "
  deephaven:
    image: ghcr.io/deephaven/server:latest
    networks:
      iceberg_net:
    ports:
      - '${DEEPHAVEN_PORT:-10000}:10000'
    environment:
      - START_OPTS=-Dauthentication.psk=YOUR_PASSWORD_HERE
      - USER
    volumes:
      - ./data:/data
      - /home/${USER}/.aws:/home/${USER}/.aws
networks:
  iceberg_net:
A full explanation of the docker-compose.yml file is outside the scope of this guide.
The docker-compose.yml file above sets the pre-shared key to YOUR_PASSWORD_HERE. This doesn't meet security best practices and should be changed in a production environment. For more, see pre-shared key authentication.
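One way to avoid a guessable key is to generate a random one. The sketch below uses Python's standard secrets module. The helper name make_psk and the 32-byte length are illustrative choices, not Deephaven requirements.

```python
import secrets


def make_psk(num_bytes: int = 32) -> str:
    """Return a random URL-safe string that could serve as a pre-shared key."""
    # token_urlsafe only emits letters, digits, '-' and '_'.
    return secrets.token_urlsafe(num_bytes)


# Example: print(make_psk()) and paste the value in place of
# YOUR_PASSWORD_HERE in docker-compose.yml.
```

Keeping the generated value out of version control (for example, in an environment file that is not committed) is a reasonable next step.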
Run docker compose up from the directory that contains the docker-compose.yml file. This starts the Deephaven server, Spark server, Iceberg REST API, and MinIO object store. When you're done, press Ctrl+C or run docker compose down to stop the containers.
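The containers can take a little while to start accepting connections. As a rough sketch, the following standard-library script polls the host ports published by the docker-compose.yml file above. The function names and the retry values are illustrative, and the Deephaven port assumes DEEPHAVEN_PORT was left at its default of 10000.

```python
import socket
import time


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(services: dict, host: str = "localhost",
                      retries: int = 30, delay: float = 1.0) -> dict:
    """Poll each named port until it accepts connections or retries run out."""
    status = {name: False for name in services}
    for _ in range(retries):
        for name, port in services.items():
            if not status[name]:
                status[name] = port_open(host, port)
        if all(status.values()):
            break
        time.sleep(delay)
    return status


# Host ports published by the compose file in this guide.
SERVICES = {"jupyter": 8888, "iceberg-rest": 8181, "minio": 9000, "deephaven": 10000}

# Example: print(wait_for_services(SERVICES))
```

An open port only shows that the container is listening. It does not confirm that the service behind it has finished initializing.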
Create an Iceberg catalog
This section follows the Iceberg Spark quickstart by creating an Iceberg catalog with a single table and snapshot using the Iceberg REST API in Jupyter. The docker-compose.yml file extends the one given in the Spark quickstart guide by adding Deephaven as a service in the Iceberg Docker network. As such, the file starts up the following services:
- MinIO object store
- MinIO client
- Iceberg REST catalog
- Iceberg Spark server, reachable by Jupyter
- Deephaven server
Once the Docker containers are up and running, head to http://localhost:8888 to access the Iceberg Spark server in Jupyter. Open the Iceberg - Getting Started notebook, which creates the Iceberg catalog using the Iceberg REST API. The first four code blocks create an Iceberg table called nyc.taxis. Run this code to follow along with this guide, which uses the table in the sections below. All code blocks afterward are optional for our purposes.
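To confirm from the host that the notebook created the table, the REST catalog can be queried directly. The sketch below assumes the catalog is served without a path prefix and returns the namespaces and identifiers fields described in the Iceberg REST catalog specification. The helper names are illustrative.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def namespace_names(payload: dict) -> list:
    """Flatten the 'namespaces' field (lists of name parts) into dotted names."""
    return [".".join(parts) for parts in payload.get("namespaces", [])]


def table_names(payload: dict) -> list:
    """Turn the 'identifiers' field into namespace.table strings."""
    return [
        ".".join(ident["namespace"] + [ident["name"]])
        for ident in payload.get("identifiers", [])
    ]


# Example, run on the host after the notebook cells have finished:
#   base = "http://localhost:8181/v1"
#   print(namespace_names(fetch_json(f"{base}/namespaces")))
#   print(table_names(fetch_json(f"{base}/namespaces/nyc/tables")))
```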
If you already have an Iceberg catalog to query, skip to the next section, and replace the catalog URI, warehouse location, namespace, and table with your own.
Interact with the Iceberg catalog
After creating the Iceberg catalog and table, go to the Deephaven IDE at http://localhost:10000/ide in your preferred browser.
To interact with an Iceberg catalog, you must first create an instance of the IcebergCatalogAdapter class. An instance of the class is created with either of the following two methods:
- adapter_s3_rest: an IcebergCatalogAdapter created from an S3-compatible provider and a REST catalog.
- adapter_aws_glue: an IcebergCatalogAdapter created from AWS Glue.
Each subsection below focuses on one of the above methods.
With a REST catalog and MinIO
The following code block creates an instance of the IcebergCatalogAdapter class with adapter_s3_rest. The method requires the catalog URI, warehouse location, region name, access key ID, secret access key, and endpoint override.
from deephaven.experimental import iceberg

local_adapter = iceberg.adapter_s3_rest(
    name="minio-iceberg",
    catalog_uri="http://rest:8181",
    warehouse_location="s3a://warehouse/wh",
    region_name="us-east-1",
    access_key_id="admin",
    secret_access_key="password",
    end_point_override="http://minio:9000",
)
Once an IcebergCatalogAdapter has been created, it can query the namespaces, tables, and snapshots in a catalog. The following code block gets the available top-level namespaces, tables in the nyc namespace, and snapshots in the nyc.taxis table.
namespaces = local_adapter.namespaces()
tables = local_adapter.tables("nyc")
snapshots = local_adapter.snapshots("nyc.taxis")
To load the nyc.taxis Iceberg table, you must create an S3Instructions object with information about the region, keys, and endpoint of the Iceberg table. This object is then used to create an instance of the IcebergInstructions class, which is passed as an input to read_table.
from deephaven.experimental import s3

s3_instructions = s3.S3Instructions(
    region_name="us-east-1",
    access_key_id="admin",
    secret_access_key="password",
    endpoint_override="http://minio:9000",
)

iceberg_instructions = iceberg.IcebergInstructions(data_instructions=s3_instructions)

taxis = local_adapter.read_table("nyc.taxis", instructions=iceberg_instructions)
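The adapter and the S3 instructions above take the same region, keys, and endpoint, but spell the endpoint argument differently (end_point_override versus endpoint_override). As a sketch, a small settings object can hold the values once and produce the right keyword arguments for each call. The MinioSettings class and the S3_ENDPOINT variable are hypothetical names introduced for illustration. The defaults mirror the values used in this guide.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MinioSettings:
    """Hypothetical holder for the connection values used in this guide."""
    region_name: str
    access_key_id: str
    secret_access_key: str
    endpoint: str

    @classmethod
    def from_env(cls, env=os.environ) -> "MinioSettings":
        # Fall back to the guide's demo values when a variable is unset.
        return cls(
            region_name=env.get("AWS_REGION", "us-east-1"),
            access_key_id=env.get("AWS_ACCESS_KEY_ID", "admin"),
            secret_access_key=env.get("AWS_SECRET_ACCESS_KEY", "password"),
            endpoint=env.get("S3_ENDPOINT", "http://minio:9000"),  # assumed name
        )

    def _common(self) -> dict:
        return {
            "region_name": self.region_name,
            "access_key_id": self.access_key_id,
            "secret_access_key": self.secret_access_key,
        }

    def adapter_kwargs(self) -> dict:
        """Keyword arguments spelled as in the adapter_s3_rest call above."""
        return {**self._common(), "end_point_override": self.endpoint}

    def s3_kwargs(self) -> dict:
        """Keyword arguments spelled as in the S3Instructions call above."""
        return {**self._common(), "endpoint_override": self.endpoint}


# Example, inside the Deephaven IDE:
#   cfg = MinioSettings.from_env()
#   s3_instructions = s3.S3Instructions(**cfg.s3_kwargs())
```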
With AWS Glue
The following code block creates an instance of the IcebergCatalogAdapter class with adapter_aws_glue. The method requires a catalog URI and an S3 warehouse location. A name can optionally be provided. If not given, it is inferred from the catalog URI.
AWS region and credential information must be made visible to Deephaven if running from Docker. The Docker Compose deployment used in this guide makes the default location visible to the Deephaven container via a volume mount. The default location could also be changed. See here for more information.
from deephaven.experimental import s3, iceberg

cloud_adapter = iceberg.adapter_aws_glue(
    name="aws-iceberg",
    catalog_uri="s3://lab-warehouse/sales",
    warehouse_location="s3://lab-warehouse/sales",
)
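If the Glue adapter cannot authenticate, a quick check is whether the mounted credentials file is actually visible. The sketch below looks for the shared credentials file at its default location (~/.aws/credentials) or at the path named by the AWS_SHARED_CREDENTIALS_FILE environment variable, and checks that a profile defines both keys. The helper names are illustrative. Credentials supplied another way, such as environment variables, are not detected by this check.

```python
import configparser
import os
from pathlib import Path


def credentials_path(env=os.environ) -> Path:
    """Return the shared credentials file location, honoring the override variable."""
    override = env.get("AWS_SHARED_CREDENTIALS_FILE")
    return Path(override) if override else Path.home() / ".aws" / "credentials"


def profile_has_keys(path: Path, profile: str = "default") -> bool:
    """Return True if the profile defines an access key ID and a secret key."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is skipped without raising
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


# Example: print(profile_has_keys(credentials_path()))
```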
Once an IcebergCatalogAdapter has been created, it can query the namespaces, tables, and snapshots in a catalog. The following code block gets the available top-level namespaces, tables in the nyc namespace, snapshots in the nyc.taxis table, and the nyc.taxis table itself.
namespaces = cloud_adapter.namespaces()
tables = cloud_adapter.tables("nyc")
snapshots = cloud_adapter.snapshots("nyc.taxis")
taxis = cloud_adapter.read_table("nyc.taxis")
Custom Iceberg instructions
Custom instructions can be specified when creating an IcebergInstructions instance. A common example is renaming columns. Deephaven recommends columns follow PascalCase naming conventions. The following code block renames certain columns in the nyc.taxis table to follow this convention using MinIO and a REST catalog.
iceberg_instructions_renames = iceberg.IcebergInstructions(
    data_instructions=s3_instructions,
    column_renames={
        "tpep_pickup_datetime": "PickupTime",
        "tpep_dropoff_datetime": "DropoffTime",
        "passenger_count": "NumPassengers",
        "trip_distance": "Distance",
    },
)

taxis = local_adapter.read_table("nyc.taxis", instructions=iceberg_instructions_renames)
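For tables with many snake_case columns, the rename mapping can be generated rather than typed by hand. The sketch below is one mechanical way to build a dictionary suitable for column_renames. The helper names are illustrative, and the generated names (for example, TpepPickupDatetime) differ from the hand-picked ones above.

```python
def to_pascal_case(name: str) -> str:
    """Convert a snake_case name to PascalCase, keeping existing capitals."""
    return "".join(part[:1].upper() + part[1:] for part in name.split("_") if part)


def pascal_renames(columns) -> dict:
    """Map each column whose name would change to its PascalCase form."""
    renames = {}
    for col in columns:
        new_name = to_pascal_case(col)
        if new_name != col:
            renames[col] = new_name
    return renames


# Example: pass the result as column_renames when building IcebergInstructions.
#   pascal_renames(["tpep_pickup_datetime", "trip_distance"])
```

Columns that already start with a capital letter and contain no underscores are left out of the mapping, so they keep their original names.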
Custom Iceberg instructions can also set the table definition. However, Deephaven automatically infers the correct data types for the nyc.taxis table, so this is not needed. See IcebergInstructions for more information.
Next steps
This guide presented a basic example of interacting with an Iceberg catalog in Deephaven. These examples can be extended to include more complex queries, catalogs with multiple namespaces, snapshots, custom instructions, and more.