JDBC Schema Inference
Schema inference for JDBC sources allows Deephaven to automatically generate table schemas from relational databases, streamlining data integration and reducing manual effort. This approach is especially useful when onboarding new databases or when the schema may evolve over time.
When connecting to a JDBC source, Deephaven inspects the metadata provided by the database to infer column names, types, and constraints. The quality and detail of the inferred schema depend on the database's metadata support. It is recommended to review the generated schema for accuracy and completeness, especially for complex or non-standard database types.
Refer to the JDBC batch data guide for more details on connecting and importing data from JDBC sources.
By default, the JDBC schema generator requests the data types of the result set columns from the data source. These data types are mapped to Deephaven data types to provide the schema details.
For cases where the data source cannot provide column data type information, or where the source database stores data in suboptimal types (e.g., numbers or date/times in String fields), the JDBC schema generator can instead scan the data, as the CSV schema generator does, or combine data scanning with metadata, as sketched below.
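For instance, if a source stores timestamps in String columns, you might ask the schema creator to look at the data as well as the metadata. The sketch below uses placeholder connection details; the point of interest is the `--assessmentMethod` argument, described in the command reference further down:

```bash
# A sketch: choose how column types are assessed.
# META_ONLY matches the default behavior described above (trust the driver's
# metadata); DATA_ONLY scans the returned rows; BOTH combines the two.
iris_exec jdbc_schema_creator -- \
  --namespace ExNamespace \
  --tableName Trade \
  --connectionUrl "<jdbc-url>" \
  --jdbcDriver "<driver-class>" \
  --jdbcQuery "SELECT TOP 1000 * FROM trade" \
  --assessmentMethod BOTH
```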
When to use JDBC schema inference
Use JDBC schema inference when:
- Integrating new relational databases into Deephaven.
- The database schema may evolve over time.
- You want to avoid manual mapping of columns and types.
- The database provides reliable metadata for tables and queries.
Note that the query should limit the size of the result set so the schema creator does not pull the entire source table into memory.
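The limit lives in the SQL you pass via `--jdbcQuery`, so use your database's row-limiting syntax. A sketch of two common dialects (the `TOP` form matches the SQL Server example below; check your vendor's documentation for the right form):

```bash
# SQL Server: TOP
--jdbcQuery "SELECT TOP 1000 * FROM trade"

# PostgreSQL / MySQL: LIMIT
--jdbcQuery "SELECT * FROM trade LIMIT 1000"
```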
Input requirements
- JDBC connection details (URL, driver, credentials) must be available.
- The database should provide column metadata for best results; otherwise, configure data scanning via `--assessmentMethod`.
- Ensure you have permission to access the database and run the target query.
Example
This command creates `ExNamespace.Trade.schema` in the `/tmp` directory:
```bash
iris_exec jdbc_schema_creator -- \
--namespace ExNamespace \
--tableName Trade \
--connectionUrl "jdbc:sqlserver://<host>;database=<db>;user=<user>;password=<pwd>" \
--jdbcDriver "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
--jdbcQuery "SELECT TOP 1000 * FROM trade" \
--schemaPath /tmp
```
Where:
- `--namespace`: The namespace for the new schema.
- `--tableName`: The name of the table to create.
- `--connectionUrl`: The JDBC connection string for your database.
- `--jdbcDriver`: The fully qualified class name of the JDBC driver.
- `--jdbcQuery`: The SQL query to use for inferring the schema.
- `--schemaPath`: The output directory for the generated schema file.
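The generated file is a Deephaven schema in XML form. The exact columns depend on your query's result set; as a rough, hypothetical sketch (the Sym and Price columns and the storageType value are illustrative, not guaranteed output):

```xml
<Table name="Trade" namespace="ExNamespace" storageType="NestedPartitionedOnDisk">
  <!-- Partitioning column added by the schema creator (defaults to Date). -->
  <Column name="Date" dataType="String" columnType="Partitioning" />
  <!-- Hypothetical columns inferred from the query's result set. -->
  <Column name="Sym" dataType="String" />
  <Column name="Price" dataType="double" />
  <!-- ImportSource block; the default name is "IrisJDBC". -->
  <ImportSource name="IrisJDBC" type="JDBC" />
</Table>
```

Review the generated file before deploying it, particularly the inferred column types.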
Command reference
```bash
iris_exec jdbc_schema_creator <launch args> -- <schema creator args>
```
The following arguments are available when running the JDBC schema creator:
Argument | Description |
---|---|
-ns or --namespace <namespace> | (Required) The namespace to use for the new schema. |
-tn or --tableName <name> | (Required) The table name to use for the new schema. |
-sp or --schemaPath <path> | An optional path to which the schema file will be written. If not specified, this defaults to the current working directory and will create or use a subdirectory that matches the namespace. |
-jd or --driverClass <fully qualified class name of the JDBC driver> | The fully-qualified class name of the driver to load to connect to the data source. In most cases, the corresponding jar file will need to be available in the classpath of the schema creator process. |
-cu or --connectionUrl <data source specific connection URL> | Information needed to find and connect to the data source. See below for more details. |
-jq or --jdbcQuery <SQL query to execute against the data source> | The full query string to execute against the data source. In most cases, this will need to be delimited with double-quotes. |
-am or --assessmentMethod <META_ONLY, DATA_ONLY, or BOTH> | How to evaluate column types from the JDBC data source: `META_ONLY` (the default) uses the result set metadata reported by the driver, `DATA_ONLY` infers types by scanning the returned data (as the CSV schema creator does), and `BOTH` combines metadata with data scanning. |
-pc or --partitionColumn | Optional name for the partitioning column if schema is being generated. If this is not provided, the importer will default to Date for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to "source_[original column name]". |
-gc or --groupingColumn | Optional column name that should be marked as columnType="Grouping" in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns. |
-spc or --sourcePartitionColumn | Optional column to use for multi-partition imports. For example, if the partitioning column is "Date" and you want to enable multi-partition imports based on the column source_date, specify "source_date" with this option (Note: This is the column name in the data source, not the Deephaven column name). |
-sn or --sourceName <name for ImportSource block> | Optional name to use for the generated ImportSource block in the schema. If not provided, the default of "IrisJDBC" will be used. |
-lp or --logProgress | If present, additional informational logging will be provided with progress updates during the parsing process. |
-bf or --bestFit | If present, the schema creator will attempt to use the smallest numeric types that fit the data. For example, for integer values, `short` will be used if all values are in the range of -32768 to 32767. If larger values are seen, it will then move to `int` and eventually `long`. The default behavior, without `-bf`, is to use `long` for all integer columns and `double` for all floating point columns. |
-om or --outputMode | Either `SAFE` (default) or `REPLACE`. When `SAFE`, the schema creator will exit with an error if an existing file would be overwritten by the schema being generated. When set to `REPLACE`, a pre-existing file will be overwritten. |
-is or --importStyle | Either `BATCH` (default) or `CDC_REPLICATION`. `CDC_REPLICATION` indicates the schema should be configured to use a SQLReplicationLogger as the logger interface and take a SQLReplicationRow object as input data. Reasonable defaults for converting SQL types to database types will also be generated for each column. |
-op or --outPackage | The output package name (e.g., "com.yourcompany.iris.db.gen") when generating loggers and listeners. Required if generating loggers and listeners. |
-lf or --logFormat | (Optional) Default unspecified. Logger format number. Used to version log files if the table schema is changed. |
-un or --useNanos | (Optional) Defaults to `true`. If `true`, each DBDateTime column will indicate `timePrecision="Nanos"`, and the logger will expect a `long` value with nanoseconds. If `false`, the loggers and listeners will expect millisecond values. |
-sc or --sourceCasing | Optional unless destination casing is given. Specifies the casing to expect for source column names. If both this and destination casing are specified, column names will be converted accordingly. Legal values are: `UPPER_UNDERSCORE`, `LOWER_UNDERSCORE`, `UPPER_CAMEL`, `LOWER_CAMEL`, and `LOWER_HYPHEN`. |
-dc or --destCasing | Optional unless source casing is given. Determines the Deephaven column name format. If both this and source casing are specified, column names will be converted accordingly. Legal values are: `UPPER_UNDERSCORE`, `LOWER_UNDERSCORE`, `UPPER_CAMEL`, `LOWER_CAMEL`, and `LOWER_HYPHEN`. |
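Combining several of these options: the sketch below (connection details are placeholders, and the Sym grouping column is hypothetical) converts lower_underscore source names to UpperCamel Deephaven names, marks a grouping column, and overwrites any existing schema file:

```bash
# Hedged sketch: placeholder URL/driver; Sym is a hypothetical column name.
iris_exec jdbc_schema_creator -- \
  --namespace ExNamespace \
  --tableName Trade \
  --connectionUrl "<jdbc-url>" \
  --jdbcDriver "<driver-class>" \
  --jdbcQuery "SELECT TOP 1000 * FROM trade" \
  --sourceCasing LOWER_UNDERSCORE \
  --destCasing UPPER_CAMEL \
  --groupingColumn Sym \
  --outputMode REPLACE \
  --schemaPath /tmp
```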
Troubleshooting
- JDBC driver errors: Ensure the correct JDBC driver jar is available on the schema creator's classpath and that the driver class name is spelled correctly (see the sketch after this list).
- Permission issues: Verify your database user account has permission to access table metadata and execute queries.
- Non-standard or ambiguous types: Review the generated schema for columns with generic or unexpected types, and manually adjust if needed.
- Connection problems: Double-check the connection URL, credentials, and network access to the database server.
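For the first point, the driver jar must be on the classpath of the schema creator process (see `--driverClass` above). A hedged sketch; the destination directory is a placeholder for your deployment's extra-classpath location:

```bash
# Hypothetical path: consult your deployment's documentation for the actual
# directory that the schema creator's classpath includes.
cp mssql-jdbc.jar /path/to/deephaven/java_lib/
```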