Deephaven is a powerful analytics engine that makes processing large datasets more intuitive than ever. Iceberg is a table format that provides fast, efficient, and scalable data storage. Combining the two is like bringing Holmes and Watson together to solve a mystery. In this blog, we'll explore Deephaven's new Iceberg integration, why it matters, how to use it, and what's to come.
Deephaven already has integrations with SQL, Parquet, Kafka, and CSV, all of which can serve as storage backends for a Deephaven-powered application. Now Iceberg is part of that list as well. If you're looking for a scalable, efficient, reliable, and cloud-native way to store your data and fetch it into Deephaven, look no further.
To follow along with this blog, you'll need Docker. The setup here uses Docker Compose to manage the services, so check out the links if you're unfamiliar with either tool.
Iceberg is now available as a storage backend for Deephaven, providing a scalable and efficient cloud-native storage mechanism for powerful applications.
A configuration for Deephaven and Iceberg
To use Deephaven in tandem with Iceberg, you'll need a configuration that allows the two to work together. Docker Compose is perfect for this. Below is an extended version of the YAML file found in Iceberg's Spark Quickstart: it adds Deephaven as a service and makes it part of the `iceberg_net` Docker network so the services can communicate.
docker-compose.yml

```yaml
services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    build: spark/
    networks:
      iceberg_net:
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8081:8080
      - 11000:10000
      - 11001:10001
  rest:
    image: tabulario/iceberg-rest
    container_name: iceberg-rest
    networks:
      iceberg_net:
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ['server', '/data', '--console-address', ':9001']
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "
  deephaven:
    image: ghcr.io/deephaven/server:latest
    networks:
      iceberg_net:
    ports:
      - '${DEEPHAVEN_PORT:-10000}:10000'
    volumes:
      - ./data:/data
    environment:
      - START_OPTS=-Xmx16g -DAuthHandlers=io.deephaven.auth.AnonymousAuthenticationHandler
networks:
  iceberg_net:
```
You can start these services with a single command:
```sh
docker compose up
```
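Before creating any tables, it's worth confirming that the services came up. Below is a minimal sketch that pings each service from the host using only the Python standard library; the URLs assume the default port mappings from the compose file above, and the health-check paths are standard MinIO and Iceberg REST catalog endpoints:

```python
# Sanity-check that the stack is up, assuming the default ports mapped in
# the docker-compose.yml above.
import urllib.request

endpoints = {
    "Iceberg REST catalog": "http://localhost:8181/v1/config",
    "MinIO": "http://localhost:9000/minio/health/live",
    "Deephaven IDE": "http://localhost:10000/ide/",
}

for name, url in endpoints.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except Exception as err:
        print(f"{name}: not reachable ({err})")
```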
Create an Iceberg catalog
Iceberg stores data in tables, just like Deephaven does. Iceberg tables live inside catalogs, similar to how directories store files in a filesystem. Populating an Iceberg catalog is pretty easy: the Docker Compose configuration above gives you access to an Iceberg Spark server through Jupyter at http://localhost:8888. Head there and open the `Iceberg - Getting Started` notebook. The first four code blocks in that notebook create an Iceberg table called `nyc.taxis`. You'll read this table into Deephaven in the next section.
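If you'd rather not click through the notebook, those first cells amount to a little Spark SQL run against the notebook's preconfigured `spark` session. Here is a rough sketch; the schema and sample rows follow the quickstart's `nyc.taxis` example, so treat the notebook itself as authoritative:

```python
# Approximate equivalent of the quickstart notebook's first cells, run inside
# the Jupyter environment where a Spark session named `spark` already exists.
# The schema and rows mirror the quickstart's nyc.taxis example.
spark.sql("CREATE DATABASE IF NOT EXISTS nyc")

spark.sql("""
    CREATE TABLE IF NOT EXISTS nyc.taxis (
        vendor_id bigint,
        trip_id bigint,
        trip_distance float,
        fare_amount double,
        store_and_fwd_flag string
    )
    PARTITIONED BY (vendor_id)
""")

spark.sql("""
    INSERT INTO nyc.taxis VALUES
        (1, 1000371, 1.8, 15.32, 'N'),
        (2, 1000372, 2.5, 22.15, 'N')
""")
```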
Interact with the Iceberg catalog
After creating the Iceberg catalog, head over to the Deephaven IDE at http://localhost:10000/ide in your preferred browser. To interact with an Iceberg catalog from Deephaven, you'll first need to create an `IcebergCatalogAdapter`. You can create one in two different ways:
- `adapter_s3_rest`: creates an `IcebergCatalogAdapter` from an S3-compatible provider and a REST catalog.
- `adapter_aws_glue`: creates an `IcebergCatalogAdapter` from an AWS Glue catalog.
Each method will be used below.
With a REST catalog and MinIO
The following code block uses `adapter_s3_rest` to create an `IcebergCatalogAdapter`. It requires the catalog URI, warehouse location, region name, access key ID, secret access key, and an endpoint override.
```python
from deephaven.experimental import iceberg

local_adapter = iceberg.adapter_s3_rest(
    name="minio-iceberg",
    catalog_uri="http://rest:8181",
    warehouse_location="s3a://warehouse/wh",
    region_name="us-east-1",
    access_key_id="admin",
    secret_access_key="password",
    end_point_override="http://minio:9000",
)
```
With the catalog adapter in hand, you can now query namespaces, tables, and snapshots in an Iceberg catalog:
```python
namespaces = local_adapter.namespaces()
tables = local_adapter.tables("nyc")
snapshots = local_adapter.snapshots("nyc.taxis")
```
To read an Iceberg table via a REST catalog and an S3-compatible provider, you also need to define custom `IcebergInstructions`. In this case, the instructions supply the region name, access key ID, secret access key, and endpoint override that Deephaven needs to read the table's underlying data files from MinIO:
```python
from deephaven.experimental import s3

s3_instructions = s3.S3Instructions(
    region_name="us-east-1",
    access_key_id="admin",
    secret_access_key="password",
    endpoint_override="http://minio:9000",
)

iceberg_instructions = iceberg.IcebergInstructions(data_instructions=s3_instructions)

taxis = local_adapter.read_table("nyc.taxis", instructions=iceberg_instructions)
```
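The snapshots table above is more than a curiosity: each snapshot id identifies a point-in-time version of the table. The sketch below reads a specific snapshot by passing a `snapshot_id` argument to `read_table`; the parameter name reflects our reading of the experimental API and may differ in your version, and the id itself is a placeholder you'd replace with a real value from the snapshots table:

```python
# Read nyc.taxis as of a specific snapshot. The snapshot id below is a
# placeholder; copy a real Id value from the snapshots table. The snapshot_id
# parameter is our reading of the experimental API and may change.
taxis_snapshot = local_adapter.read_table(
    "nyc.taxis",
    instructions=iceberg_instructions,
    snapshot_id=1234567890123456789,
)
```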
With an AWS Glue catalog
To use the AWS Glue catalog adapter, you need an AWS region and credentials. If you are running Deephaven from Docker, mount your AWS credentials as a volume in the Deephaven container. For more information, see here.
`adapter_aws_glue` creates an `IcebergCatalogAdapter` from an AWS Glue catalog. It requires a name, catalog URI, and S3 warehouse location:
```python
from deephaven.experimental import s3, iceberg

cloud_adapter = iceberg.adapter_aws_glue(
    name="aws-iceberg",
    catalog_uri="s3://lab-warehouse/nyc",
    warehouse_location="s3://lab-warehouse/nyc",
)
```
When using an AWS Glue catalog, custom Iceberg instructions are not required, so you can read the table directly:
```python
namespaces = cloud_adapter.namespaces()
tables = cloud_adapter.tables("nyc")
snapshots = cloud_adapter.snapshots("nyc.taxis")

taxis = cloud_adapter.read_table("nyc.taxis")
```
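Once an Iceberg table is in Deephaven, it behaves like any other Deephaven table. As a quick sketch of what comes next, assuming the `nyc.taxis` schema from the quickstart notebook (adjust the column names to match your data):

```python
# Ordinary Deephaven table operations on the Iceberg-sourced table. Column
# names assume the quickstart's nyc.taxis schema.
fares = taxis.update_view(["fare_per_mile = fare_amount / trip_distance"])

avg_by_vendor = fares.view(
    ["vendor_id", "trip_distance", "fare_amount", "fare_per_mile"]
).avg_by("vendor_id")
```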
Want to see Iceberg in action? Check out this developer demo from Larry:
What's to come
This blog demonstrated how to import Iceberg tables into Deephaven. The examples are straightforward, but they illustrate the complete workflow: connect to a catalog, create an adapter, and read tables. Our Iceberg integration is under active development, so you can anticipate additional features, improvements, and examples in the future. As a preview, you can look forward to:
- Support for refreshing (ticking) Iceberg tables
- Generic adapters to make it even easier to interact with Iceberg catalogs
- Writing to Iceberg tables
Reach out
Our Slack community continues to grow! Reach out to us with any questions, comments, or feedback. We'd love to hear from you!