Iceberg

Apache Iceberg is an open table format for large analytic datasets. Deephaven Enterprise supports reading Iceberg tables through the iceberg extended storage type, which maps Iceberg tables to Deephaven schemas. This enables querying Iceberg data alongside other Deephaven tables, with full support for Iceberg schema evolution and nested structures.

Overview

To use an Iceberg table in Deephaven:

Configure catalog access — Specify how to connect to your Iceberg catalog (Glue, REST, Hive, etc.).
Identify the table — Provide the Iceberg namespace and table name.
Deploy the schema — Register the table with Deephaven's schema service.

Deephaven uses the iceberg extended storage type in schemas, which references the catalog configuration and table identifier. You can create these schemas programmatically (recommended) or manually via XML.

Configuration

Deephaven works with any Iceberg catalog that implements the standard Catalog interface. Use the type property to specify built-in catalog types (e.g., glue, rest, hive, hadoop, nessie, jdbc), or use catalog-impl to specify a custom implementation class.

The easiest way to configure Iceberg tables for Deephaven is to use the built-in inference provided by LoadTableOptions. Advanced users can customize the mapping with a Resolver, or with custom inference options.

Required properties

Every Iceberg catalog configuration requires:

type or catalog-impl — The catalog type or fully-qualified implementation class.

Most catalog types also require:

warehouse — The URI where table data is stored (e.g., s3://bucket/warehouse).

Additional properties depend on your catalog type. See Iceberg Catalog properties for the full reference.

import io.deephaven.enterprise.iceberg.IcebergTableOptions
import io.deephaven.iceberg.util.BuildCatalogOptions
import io.deephaven.iceberg.util.LoadTableOptions

// The options to read an Iceberg Catalog.
catalogOptions = BuildCatalogOptions.builder()
    .name("MyCatalog")
    .putAllProperties([
        "type": "<type>",
        // ... additional catalog properties here ...
    ])
    .build()

// The options to load an Iceberg Table from the catalog, using the default
// inference options.
tableOptions = LoadTableOptions.builder()
    .id("IcebergNamespace.IcebergTableName")
    .build()

// Load the Iceberg Catalog and Table and materialize the results into
// an explicit Resolver.
options = IcebergTableOptions.builder()
    .tableKey("DHNamespace", "DHTableName")
    .catalogOptions(catalogOptions)
    .tableOptions(tableOptions)
    .build()
    .materialize()

// Verify the resulting Table data looks correct
myTable = options.table()
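
As a rough illustration of the required properties listed above, a property map can be checked before it is passed to BuildCatalogOptions. The helper name checkCatalogProperties is hypothetical and is not part of the Deephaven or Iceberg APIs; it only encodes the two rules stated in this section.

```groovy
// Hypothetical pre-flight check; not part of the Deephaven or Iceberg APIs.
// Returns a list of human-readable notes; an empty list means both rules are met.
List<String> checkCatalogProperties(Map<String, String> props) {
    List<String> problems = []
    // Every configuration needs either a built-in type or an implementation class.
    if (!props.containsKey("type") && !props.containsKey("catalog-impl")) {
        problems << "missing 'type' or 'catalog-impl'"
    }
    // Most, but not all, catalog types also need a warehouse location.
    if (!props.containsKey("warehouse")) {
        problems << "no 'warehouse' set (required by most catalog types)"
    }
    return problems
}

problems = checkCatalogProperties(["type": "rest", "uri": "http://localhost:8080"])
problems.each { println it }
```

The warehouse note is advisory only, since some catalog types do not need that property.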

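If you have a working Spark configuration, you can typically translate it into the necessary catalog properties by removing the spark.sql.catalog.<catalog-name>. prefix from each property key. The catalog name becomes the name passed to BuildCatalogOptions, and the entry that selects the Spark catalog implementation is dropped.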

For example, the following Spark properties:

spark.sql.catalog.rest_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest_prod.type = rest
spark.sql.catalog.rest_prod.uri = http://localhost:8080
spark.sql.catalog.rest_prod.warehouse = s3://my-bucket/warehouse

translate into the following BuildCatalogOptions:

catalogOptions = BuildCatalogOptions.builder()
    .name("rest_prod")
    .putAllProperties([
        "type": "rest",
        "uri": "http://localhost:8080",
        "warehouse": "s3://my-bucket/warehouse"
    ])
    .build()

See BuildCatalogOptions and LoadTableOptions for more details on these structures.
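
The Spark translation described above can also be done mechanically. The following is a minimal sketch in plain Groovy, assuming the Spark settings are available as a map; sparkToCatalogProperties is a hypothetical helper name, not a Deephaven or Iceberg function.

```groovy
// Hypothetical helper; not part of the Deephaven or Iceberg APIs.
// Keeps only the keys under spark.sql.catalog.<catalogName>. and strips that prefix.
Map<String, String> sparkToCatalogProperties(Map<String, String> sparkConf, String catalogName) {
    String prefix = "spark.sql.catalog.${catalogName}."
    return sparkConf
        .findAll { key, value -> key.startsWith(prefix) }
        .collectEntries { key, value -> [(key.substring(prefix.length())): value] }
}

sparkConf = [
    "spark.sql.catalog.rest_prod"          : "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.rest_prod.type"     : "rest",
    "spark.sql.catalog.rest_prod.uri"      : "http://localhost:8080",
    "spark.sql.catalog.rest_prod.warehouse": "s3://my-bucket/warehouse"
]

// The bare spark.sql.catalog.rest_prod entry has no trailing dot, so it is skipped.
catalogProperties = sparkToCatalogProperties(sparkConf, "rest_prod")
println catalogProperties
```

The resulting map can then be handed to putAllProperties, with the Spark catalog name used as the name.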

Deployment

Deploying the schema requires a user in the iris-schemamanagers group.

import com.illumon.iris.db.schema.SchemaServiceFactory

schemaService = SchemaServiceFactory.getDefault()

// Admins are encouraged to explicitly manage the deployment logic of their Schema with
// the SchemaService.
mySchema = options.schema().get()
schemaService.addSchema(mySchema)

// Alternatively, admins may deploy the schema directly using the IcebergTableOptions.
// The following creates the schema namespace if it doesn't exist, then creates the schema
// if it doesn't exist or updates it if it does.
// options.deploy(schemaService)

// Verify that the table can be fetched
myTableViaDb = db.historicalTable("DHNamespace", "DHTableName")

Serialization format

Caution

The creation and deployment of a Deephaven Iceberg schema is typically performed programmatically, as shown in the previous sections. Exercise caution when manually creating or editing a schema.

An Iceberg table is referenced in a Deephaven table's schema using an ExtendedStorage element with the attribute type set to iceberg.

<Table namespace="DHNamespace" name="DHTableName" namespaceSet="System" storageType="Extended">
  <!-- Column elements omitted for brevity -->
  <ExtendedStorage type="iceberg">
    <Catalog><!-- see `Catalog` section below --></Catalog>
    <Table><!-- see `Table` section below --></Table>
  </ExtendedStorage>
</Table>

`Catalog` element

The Catalog element is a serialization of the core BuildCatalogOptions. It is composed of a Name element, a Properties element, and an optional HadoopConfig element. The Properties element is a map of string keys to string values. The HadoopConfig element is an additional map of string keys to string values for Hadoop catalogs. For example:

<Catalog>
  <Name>MyCatalog</Name>
  <Properties injection="enabled">
    <Entry key="type" value="<type>" />
    <!--
    <Entry key="warehouse" value="..." />
    -->
  </Properties>
  <!--
  <HadoopConfig>
    <Entry key="..." value="..." />
  </HadoopConfig>
  -->
</Catalog>

The injection attribute on the Properties element controls whether Deephaven may automatically add properties that work around known upstream issues and/or supply defaults needed for Deephaven's Iceberg usage. The valid values are enabled and disabled. It is recommended to set this to enabled.

`Table` element

The Table element is a serialization of the core LoadTableOptions. It is composed of a TableIdentifier, Resolver, and NameMapping element.

<Table>
  <TableIdentifier>IcebergNamespace.IcebergTableName</TableIdentifier>
  <Resolver><!-- see `Resolver` section below --></Resolver>
  <NameMapping><!-- see `NameMapping` section below --></NameMapping>
</Table>

The Resolver element contains a ColumnInstructions element, a Schema element, and an optional PartitionSpec element. The ColumnInstructions element maps each Deephaven column name to an Iceberg fieldId or an Iceberg partitionFieldId, or marks the column as unmapped with type set to unmapped. The Schema element contains the Iceberg Schema JSON. The PartitionSpec element, when present, contains the Iceberg Partition Spec JSON.

<Resolver type="direct">
  <ColumnInstructions>
    <Column name="Foo" partitionFieldId="1000" />
    <Column name="Bar" fieldId="2" />
    <Column name="Baz" type="unmapped" />
  </ColumnInstructions>
  <Schema type="json"><![CDATA[{
  "type" : "struct",
  "schema-id" : 0,
  "fields" : [ {
    "id" : 1,
    "name" : "foo",
    "required" : false,
    "type" : "int"
  }, {
    "id" : 2,
    "name" : "bar",
    "required" : false,
    "type" : "long"
  } ]
}]]></Schema>
  <PartitionSpec type="json"><![CDATA[{
  "spec-id" : 0,
  "fields" : [ {
    "name" : "foo",
    "transform" : "identity",
    "source-id" : 1,
    "field-id" : 1000
  } ]
}]]></PartitionSpec>
</Resolver>
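
To make the relationship between ColumnInstructions, Schema, and PartitionSpec more concrete, the sketch below cross-references the three pieces from the example above in plain Groovy. It is illustrative only: describeColumn is a hypothetical helper, and it assumes that a partitionFieldId refers to a field-id in the PartitionSpec, whose source-id in turn points at a schema field, as the example suggests.

```groovy
import groovy.json.JsonSlurper

// Hypothetical helper for illustration only; Deephaven performs this resolution internally.
// `column` mirrors the attributes of one <Column> element inside ColumnInstructions.
String describeColumn(Map<String, String> column, Map schema, Map partitionSpec) {
    if (column["type"] == "unmapped") {
        return "${column.name} is not mapped to an Iceberg field".toString()
    }
    Integer fieldId
    if (column["partitionFieldId"] != null) {
        // Assumption: partitionFieldId matches a field-id in the PartitionSpec,
        // and that entry's source-id names the schema field it is derived from.
        int partitionFieldId = Integer.parseInt(column["partitionFieldId"])
        def partitionField = partitionSpec["fields"].find { it["field-id"] == partitionFieldId }
        fieldId = partitionField["source-id"]
    } else {
        fieldId = Integer.parseInt(column["fieldId"])
    }
    def field = schema["fields"].find { it["id"] == fieldId }
    return "${column.name} -> Iceberg field '${field.name}' (${field.type})".toString()
}

def slurper = new JsonSlurper()
def schema = slurper.parseText('''{
  "type": "struct", "schema-id": 0,
  "fields": [
    {"id": 1, "name": "foo", "required": false, "type": "int"},
    {"id": 2, "name": "bar", "required": false, "type": "long"}
  ]
}''')
def partitionSpec = slurper.parseText('''{
  "spec-id": 0,
  "fields": [{"name": "foo", "transform": "identity", "source-id": 1, "field-id": 1000}]
}''')

println describeColumn([name: "Foo", partitionFieldId: "1000"], schema, partitionSpec)
println describeColumn([name: "Bar", fieldId: "2"], schema, partitionSpec)
println describeColumn([name: "Baz", type: "unmapped"], schema, partitionSpec)
```

A check of this kind can help when reviewing a hand-edited schema, since a fieldId that does not appear in the Schema JSON would surface immediately.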

The NameMapping element provides fallback field ids to be used when a data file does not contain field id information. It has three different types, specified via the type attribute.

The table type reads the name mapping from the Iceberg table property schema.name-mapping.default (see https://iceberg.apache.org/spec/#column-projection).

<NameMapping type="table" />

The empty type disables name mapping.

<NameMapping type="empty" />

The json type uses Iceberg Name Mapping JSON.

<NameMapping type="json"><![CDATA[[ {
  "field-id" : 1,
  "names" : [ "Foo" ]
}, {
  "field-id" : 2,
  "names" : [ "Bar" ]
} ]]]></NameMapping>
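
The fallback that a name mapping provides can be pictured as a lookup from a column name found in a data file to a field id. The following is a simplified sketch in plain Groovy, not the actual reader logic: fallbackFieldId is a hypothetical helper, and it only considers top-level entries of the mapping.

```groovy
import groovy.json.JsonSlurper

// Hypothetical helper for illustration only; the real lookup happens inside the Iceberg reader.
// Only top-level entries are considered; nested mappings are ignored in this sketch.
Integer fallbackFieldId(String nameMappingJson, String fileColumnName) {
    def mappings = new JsonSlurper().parseText(nameMappingJson)
    def match = mappings.find { it["names"].contains(fileColumnName) }
    return match ? (match["field-id"] as Integer) : null
}

// The same JSON as in the NameMapping element above.
def nameMapping = '''[ {
  "field-id" : 1,
  "names" : [ "Foo" ]
}, {
  "field-id" : 2,
  "names" : [ "Bar" ]
} ]'''

println fallbackFieldId(nameMapping, "Foo")
println fallbackFieldId(nameMapping, "Unknown")
```

A name with no entry yields null here, which corresponds to a column that cannot be matched to any field id.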

Full example

This example uses a Glue catalog, but the same pattern applies to any Iceberg catalog — substitute your catalog's type and properties as needed.

Assume an existing Iceberg table mycatalog.cities with the following schema:

{
  "type": "struct",
  "schema-id": 1,
  "fields": [
    {
      "id": 1,
      "name": "city",
      "required": false,
      "type": "string"
    },
    {
      "id": 2,
      "name": "latitude",
      "required": false,
      "type": "double"
    },
    {
      "id": 3,
      "name": "longitude",
      "required": false,
      "type": "double"
    }
  ]
}

To create a new Deephaven Schema with namespace DhExample and table name Cities that references this Iceberg Table, we would execute the following once:

import io.deephaven.enterprise.iceberg.IcebergTableOptions
import io.deephaven.iceberg.util.BuildCatalogOptions
import io.deephaven.iceberg.util.LoadTableOptions
import com.illumon.iris.db.schema.SchemaServiceFactory

catalogOptions = BuildCatalogOptions.builder()
    .name("GlueCatalog")
    .putAllProperties(["type": "glue"])
    .build()

tableOptions = LoadTableOptions.builder()
    .id("mycatalog.cities")
    .build()

options = IcebergTableOptions.builder()
    .tableKey("DhExample", "Cities")
    .catalogOptions(catalogOptions)
    .tableOptions(tableOptions)
    .build()
    .materialize()

SchemaServiceFactory.getDefault().addSchema(options.schema().get())

This would result in the following Deephaven Schema:

<Table namespace="DhExample" name="Cities" namespaceSet="System" storageType="Extended">
  <Column name="city" dataType="String" columnType="Normal" />
  <Column name="latitude" dataType="double" columnType="Normal" />
  <Column name="longitude" dataType="double" columnType="Normal" />
  <ExtendedStorage type="iceberg">
    <Catalog>
      <Name>GlueCatalog</Name>
      <Properties injection="enabled">
        <Entry key="type" value="glue" />
      </Properties>
    </Catalog>
    <Table>
      <TableIdentifier>mycatalog.cities</TableIdentifier>
      <Resolver type="direct">
        <ColumnInstructions>
          <Column name="city" fieldId="1" />
          <Column name="latitude" fieldId="2" />
          <Column name="longitude" fieldId="3" />
        </ColumnInstructions>
        <Schema type="json"><![CDATA[{
  "type" : "struct",
  "schema-id" : 1,
  "fields" : [ {
    "id" : 1,
    "name" : "city",
    "required" : false,
    "type" : "string"
  }, {
    "id" : 2,
    "name" : "latitude",
    "required" : false,
    "type" : "double"
  }, {
    "id" : 3,
    "name" : "longitude",
    "required" : false,
    "type" : "double"
  } ]
}]]></Schema>
      </Resolver>
      <NameMapping type="empty" />
    </Table>
  </ExtendedStorage>
</Table>
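
Once the schema is deployed, the table can be fetched the same way as in the Deployment section. The snippet below is a short usage sketch that assumes a Deephaven console where db is available; the where filter is an ordinary Deephaven table operation, and the variable names are illustrative.

```groovy
// Fetch the deployed table through the Deephaven database handle.
cities = db.historicalTable("DhExample", "Cities")

// Illustrative filter on a column defined in the schema above.
northernCities = cities.where("latitude > 0")
```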