Avro & Protobuf Schema Inference

Avro and Protobuf are widely used serialization formats for structured data, especially in streaming and Kafka-based workflows.

When to use Avro/Protobuf schema inference

Use Avro or Protobuf schema inference when:

  • Ingesting data from Kafka or other streaming platforms.
  • Working with complex, nested, or evolving data structures.
  • Automating schema creation for event-driven architectures.

Input requirements

  • Avro: Provide a valid Avro schema file (.avsc).
  • Protobuf: Provide a valid compiled Protobuf descriptor file (for example, .desc or .pb).
  • Ensure your files are accessible to the worker. Avro schema files are JSON text and should be UTF-8 encoded; Protobuf descriptor files are binary.
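
If you do not yet have a descriptor file, one can typically be generated from a .proto definition with the protoc compiler. For example (the file names here are placeholders):

protoc --include_imports --descriptor_set_out=trade.desc trade.proto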

Kafka

Deephaven can generate schemas from Avro schemas and Protobuf descriptors. See the examples below.

Discover a Deephaven schema from an Avro schema

You can discover and generate a Deephaven schema from an Avro schema file programmatically using the Groovy API. This is useful for advanced workflows, such as customizing the namespace or table name, or handling nested Avro schemas.
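
For reference, an Avro schema file is a JSON document. A minimal, hypothetical pageviews.avsc might look like this:

{
  "type": "record",
  "name": "PageViews",
  "namespace": "Kafka",
  "fields": [
    {"name": "userid", "type": "string"},
    {"name": "pageid", "type": "string"},
    {"name": "viewtime", "type": "long"}
  ]
}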

import com.illumon.iris.db.schema.SchemaServiceFactory
import io.deephaven.kafka.ingest.SchemaDiscovery

// Replace "pageviews.avsc" with the path to your Avro schema file
ad = SchemaDiscovery.avroFactory(new File("pageviews.avsc"))
      .columnPartition("Date")
      .namespace("Kafka")
      .tableName("PageViews")

schema = ad.generateDeephavenSchema()
schemaService = SchemaServiceFactory.getDefault()
// Create the namespace if it doesn't already exist
schemaService.createNamespace("System", schema.getNamespace())
schemaService.addSchema(schema)

  • .columnPartition("Date") specifies the partition column (required for in-worker DIS ingestion).
  • .namespace("Kafka") and .tableName("PageViews") let you override the namespace and table name defined in the Avro schema if desired.

For more advanced usage, such as handling nested Avro schemas, see the Deephaven Javadoc.

Discover a Deephaven schema from a Protobuf descriptor

You can discover and generate a Deephaven schema from a Protobuf descriptor file programmatically using the Groovy API. This is helpful for advanced use cases, such as customizing the namespace or table name, or handling complex Protobuf messages.
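
As with Avro, it helps to see the input. A minimal, hypothetical trade.proto that would produce the Trade message used below might look like this (compile it to a descriptor file with protoc, as shown under input requirements above):

syntax = "proto3";

message Trade {
  string symbol = 1;
  double price = 2;
  int64 quantity = 3;
}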

import com.illumon.iris.db.schema.SchemaServiceFactory
import io.deephaven.kafka.ingest.SchemaDiscovery

// Replace "trade.desc" with the path to your compiled Protobuf descriptor file
pd = SchemaDiscovery.protobufFactory(new File("trade.desc"))
      .messageName("Trade")
      .columnPartition("Date")
      .namespace("Kafka")
      .tableName("Trades")

schema = pd.generateDeephavenSchema()
schemaService = SchemaServiceFactory.getDefault()
// Create the namespace if it doesn't already exist
schemaService.createNamespace("System", schema.getNamespace())
schemaService.addSchema(schema)

  • .messageName("Trade") specifies the Protobuf message type to use from the descriptor.
  • .columnPartition("Date") specifies the partition column (required for in-worker DIS ingestion).
  • .namespace("Kafka") and .tableName("Trades") let you override the namespace and table name.

Troubleshooting

  • Invalid schema/descriptor: Ensure your Avro or Protobuf file is valid and accessible. You can sanity-check both programmatically, as shown in the sketch after this list.
  • Missing or unsupported types: Review the generated schema and manually adjust for any unsupported or custom types.
  • Kafka integration issues: See the Kafka streaming guide.
  • Encoding issues: Ensure Avro schema files are UTF-8 encoded JSON. Protobuf descriptor files are binary and should not be edited by hand.
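
As a quick sanity check before attempting schema discovery, you can parse both files directly. This is a minimal sketch, assuming the Avro (org.apache.avro) and Protobuf (com.google.protobuf) Java libraries are available on your classpath; the file names are placeholders:

import org.apache.avro.Schema
import com.google.protobuf.DescriptorProtos

// Parsing fails with SchemaParseException if the file is not a valid Avro schema
avroSchema = new Schema.Parser().parse(new File("pageviews.avsc"))
println("Avro schema OK: " + avroSchema.getFullName())

// parseFrom fails with InvalidProtocolBufferException if the descriptor is malformed
descriptorSet = new File("trade.desc").withInputStream { stream ->
    DescriptorProtos.FileDescriptorSet.parseFrom(stream)
}
println("Descriptor OK, defined in: " + descriptorSet.getFileList().collect { it.getName() })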