
Iceberg and Deephaven

Apache Iceberg is a high-performance format for tabular data. Deephaven's Iceberg integration enables users to interact with Iceberg catalogs, namespaces, tables, and snapshots by ingesting them as tables. This guide creates an Iceberg catalog with a single table and snapshot. It then walks through how to interact with the catalog in the Deephaven IDE through:

  • A REST API and MinIO instance
  • AWS Glue

Deephaven's Iceberg module

Deephaven's Iceberg integration is provided by the deephaven.experimental.iceberg module. The module contains two classes and two functions:

  • IcebergCatalogAdapter: connects to an Iceberg catalog to list its namespaces, tables, and snapshots, and to read tables.
  • IcebergInstructions: holds the instructions used when reading an Iceberg table, such as data instructions, column renames, and a table definition.
  • adapter_s3_rest: creates an IcebergCatalogAdapter for a REST catalog backed by S3-compatible storage.
  • adapter_aws_glue: creates an IcebergCatalogAdapter for an AWS Glue catalog.

Querying Iceberg tables in Deephaven uses the deephaven.experimental.s3 module to pull data from S3-compatible providers.
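The examples in this guide import the two modules as follows:

from deephaven.experimental import iceberg, s3

# iceberg provides IcebergCatalogAdapter, IcebergInstructions, adapter_s3_rest,
# and adapter_aws_glue; s3 provides S3Instructions for S3-compatible storage.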

A Deephaven deployment for Iceberg

The examples presented in this guide pull Iceberg data from a REST catalog. This section closely follows Iceberg's Spark quickstart. It extends the docker-compose.yml file in that guide to include Deephaven as part of the Iceberg Docker network. The Deephaven server is started alongside a Spark server, Iceberg REST API, and MinIO object store.

docker-compose.yml
version: '3'

services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    build: spark/
    networks:
      iceberg_net:
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8081:8080
      - 11000:10000
      - 11001:10001
  rest:
    image: tabulario/iceberg-rest
    container_name: iceberg-rest
    networks:
      iceberg_net:
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ['server', '/data', '--console-address', ':9001']
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "
  deephaven:
    image: ghcr.io/deephaven/server:latest
    networks:
      iceberg_net:
    ports:
      - '${DEEPHAVEN_PORT:-10000}:10000'
    environment:
      - START_OPTS=-Dauthentication.psk=YOUR_PASSWORD_HERE
      - USER
    volumes:
      - ./data:/data
      - /home/${USER}/.aws:/home/${USER}/.aws

networks:
  iceberg_net:
note

A full explanation of the docker-compose.yml file is outside the scope of this guide.

info

The docker-compose.yml file above sets the pre-shared key to YOUR_PASSWORD_HERE. This doesn't meet security best practices, and should be changed in a production environment. For more, see pre-shared key authentication.

Run docker compose up from the directory containing the docker-compose.yml file. This starts the Deephaven server, Spark server, Iceberg REST API, and MinIO object store. When you're done, stop the containers with ctrl+C or docker compose down.

Create an Iceberg catalog

This section follows the Iceberg Spark quickstart by creating an Iceberg catalog with a single table and snapshot using the Iceberg REST API in Jupyter. The docker-compose.yml extends that given in the Spark quickstart guide to include Deephaven as a service in the Iceberg Docker network. As such, the file starts up the following services:

  • MinIO object store
  • MinIO client
  • Iceberg Spark server, reachable by Jupyter
  • Deephaven server

Once the Docker containers are up and running, head to http://localhost:8888 to access the Iceberg Spark server in Jupyter. Open the Iceberg - Getting Started notebook, which creates the Iceberg catalog using the Iceberg REST API. The first four code blocks create an Iceberg table called nyc.taxis. Run this code to follow along with this guide, which uses the table in the sections below. All code blocks afterward are optional for our purposes.
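For reference, those first cells look roughly like the following PySpark sketch. The parquet path and exact calls are illustrative approximations; the notebook itself is the source of truth.

# Run inside the Jupyter notebook, where the `spark` session is already defined.
# The path below is illustrative; use the path given in the notebook's cells.
df = spark.read.parquet("/home/iceberg/data/yellow_tripdata_2021-04.parquet")

# Write the DataFrame as the Iceberg table nyc.taxis in the REST catalog.
df.write.saveAsTable("nyc.taxis")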

note

If you already have an Iceberg catalog to query, skip to the next section, and replace the catalog URI, warehouse location, namespace, and table with your own.

Interact with the Iceberg catalog

After creating the Iceberg catalog and table, go to the Deephaven IDE at http://localhost:10000/ide in your preferred browser.

To interact with an Iceberg catalog, you must first create an instance of the IcebergCatalogAdapter class. An instance of the class is created with either of the following two methods:

  • adapter_s3_rest: for REST catalogs backed by S3-compatible storage, such as MinIO.
  • adapter_aws_glue: for AWS Glue catalogs.

Each subsection below will focus on one of the above methods.

With a REST catalog and MinIO

The following code block creates an instance of the IcebergCatalogAdapter class with adapter_s3_rest. The method requires the catalog URI, warehouse location, region name, access key ID, secret access key, and endpoint override.

from deephaven.experimental import iceberg

local_adapter = iceberg.adapter_s3_rest(
    name="minio-iceberg",
    catalog_uri="http://rest:8181",
    warehouse_location="s3a://warehouse/wh",
    region_name="us-east-1",
    access_key_id="admin",
    secret_access_key="password",
    end_point_override="http://minio:9000",
)

Once an IcebergCatalogAdapter has been created, it can query the namespaces, tables, and snapshots in a catalog. The following code block gets the available top-level namespaces, tables in the nyc namespace, and snapshots in the nyc.taxis table.

namespaces = local_adapter.namespaces()
tables = local_adapter.tables("nyc")
snapshots = local_adapter.snapshots("nyc.taxis")


To load the nyc.taxis Iceberg table, you must create an S3Instructions object with information about the region, keys, and endpoint of the Iceberg table. This object is then used to create an instance of the IcebergInstructions class, which is passed as an input to read_table.

from deephaven.experimental import s3

s3_instructions = s3.S3Instructions(
    region_name="us-east-1",
    access_key_id="admin",
    secret_access_key="password",
    endpoint_override="http://minio:9000",
)

iceberg_instructions = iceberg.IcebergInstructions(data_instructions=s3_instructions)

taxis = local_adapter.read_table("nyc.taxis", instructions=iceberg_instructions)

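The snapshots table above lists the snapshot IDs available for nyc.taxis. To read the table as of a specific snapshot rather than the latest one, a snapshot ID can be passed back to the adapter. The sketch below assumes your version of read_table accepts a snapshot_id argument; check the IcebergCatalogAdapter reference for your release.

# Hypothetical snapshot ID; replace it with a real value from the snapshots table.
some_snapshot_id = 1234567890123456789

# Assumes read_table accepts snapshot_id in addition to instructions.
taxis_at_snapshot = local_adapter.read_table(
    "nyc.taxis",
    instructions=iceberg_instructions,
    snapshot_id=some_snapshot_id,
)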

With AWS Glue

The following code block creates an instance of the IcebergCatalogAdapter class with adapter_aws_glue. The method requires a catalog URI and an S3 warehouse location. A name can optionally be provided; if not given, it is inferred from the catalog URI.

info

AWS region and credential information must be made visible to Deephaven if running from Docker. The Docker Compose deployment used in this guide makes the default location visible to the Deephaven container via a volume mount. The default location could also be changed. See here for more information.

from deephaven.experimental import s3, iceberg

cloud_adapter = iceberg.adapter_aws_glue(
    name="aws-iceberg",
    catalog_uri="s3://lab-warehouse/sales",
    warehouse_location="s3://lab-warehouse/sales",
)

Once an IcebergCatalogAdapter has been created, it can query the namespaces, tables, and snapshots in a catalog. The following code block gets the available top-level namespaces, tables in the nyc namespace, snapshots in the nyc.taxis table, and the nyc.taxis table itself.

namespaces = cloud_adapter.namespaces()
tables = cloud_adapter.tables("nyc")
snapshots = cloud_adapter.snapshots("nyc.taxis")
taxis = cloud_adapter.read_table("nyc.taxis")


Custom Iceberg instructions

Custom instructions can be specified when creating an IcebergInstructions instance. A common example is renaming columns. Deephaven recommends columns follow PascalCase naming conventions. The following code block renames certain columns in the nyc.taxis table to follow this convention using MinIO and a REST catalog.

iceberg_instructions_renames = iceberg.IcebergInstructions(
    data_instructions=s3_instructions,
    column_renames={
        "tpep_pickup_datetime": "PickupTime",
        "tpep_dropoff_datetime": "DropoffTime",
        "passenger_count": "NumPassengers",
        "trip_distance": "Distance",
    },
)

taxis = local_adapter.read_table("nyc.taxis", instructions=iceberg_instructions_renames)


Custom Iceberg instructions can also set the table definition. However, Deephaven automatically infers the correct data types for the nyc.taxis table, so this is not needed. See IcebergInstructions for more information.
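As a sketch of what that looks like, the block below assumes IcebergInstructions accepts a table_definition mapping of column names to Deephaven dtypes, applied alongside the renames from the previous example:

from deephaven import dtypes as dht

# Sketch only: explicitly declare column types instead of relying on inference.
# Assumes table_definition takes a dict of column name -> Deephaven dtype and
# that the names refer to the renamed (Deephaven-side) columns.
iceberg_instructions_defined = iceberg.IcebergInstructions(
    data_instructions=s3_instructions,
    table_definition={
        "PickupTime": dht.Instant,
        "DropoffTime": dht.Instant,
        "NumPassengers": dht.double,
        "Distance": dht.double,
    },
    column_renames={
        "tpep_pickup_datetime": "PickupTime",
        "tpep_dropoff_datetime": "DropoffTime",
        "passenger_count": "NumPassengers",
        "trip_distance": "Distance",
    },
)

taxis_defined = local_adapter.read_table(
    "nyc.taxis", instructions=iceberg_instructions_defined
)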

Next steps

This guide presented a basic example of interacting with an Iceberg catalog in Deephaven. These examples can be extended to include more complex queries, catalogs with multiple namespaces, snapshots, custom instructions, and more.