Schema Inference

Deephaven provides schema generation tools to make schema creation easier. This guide shows you how to generate schemas from various data sources. Schemas can be inferred from CSV, JDBC, JSON, and XML sources, as well as from Kafka (Avro schemas and Protobuf descriptors), as described in the sections below.

CSV

CSV refers to a variety of text file data formats. Although "comma-separated" is in the name, the delimiter could, in fact, be some other character (such as a tab or semicolon). Other details of formatting are similarly flexible. There is no single official standard for CSV formatting, so Deephaven provides flexibility when defining how to interpret and transform text data from CSV files into structured storage in Deephaven tables.

Because CSV is a text file format that does not provide metadata, the schema generation tool must analyze the data it finds in a sample data file. Column names are typically derived from column headers, but there are options for CSV files with and without headers. The schema generation process can be simplified and accelerated by using a sample of data, as long as there are enough rows to clearly demonstrate the formats and ranges of data values that will be included in normal imports.
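For large source files, one way to produce such a sample (assuming a Unix-like shell; the file names here are illustrative) is to keep the header row plus the first several thousand data rows:

head -n 10001 /data/full_export.csv > /data/sample.csv

Note that a leading sample is only useful if the early rows are representative; a value pattern that first appears late in the full file would not influence the inferred types.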

For non-String columns, the schema generator will scan all the values in the supplied CSV file to ensure the inferred data type remains valid for every value in the column. As an example, the first value for a column might be 4. This would cause the schema generator to tentatively categorize the column as an integer type (short or long, depending on settings). If a 5.5 were encountered in a subsequent row, the generator would "promote" the column to a floating point data type. If another value in the column included non-numeric characters, such as XYZ, the generator would then mark the column as a String column. Once a column has been designated as a String column, no further checking is done, since a String column can accept any value from a CSV.
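The same promotion sequence, shown as a single-column sample file (hypothetical data):

Qty
4
5.5
XYZ

After the first data row, the column is tentatively integral; the 5.5 promotes it to floating point; the XYZ promotes it to String, after which no further values in that column are checked.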

If a column is empty for all rows, it is marked as a String, and a warning message is written to the log. In this case, the administrator should examine and validate the generated schema before adding it to Deephaven.

Date/time values are also a special case, where the schema generator will attempt to match the format of the date/time string to a known format for which it can provide a conversion formula. If a date/time column has multiple formats, it will be marked as a String column, since no single formula could be found that matched all the column's data.
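For example, a column containing only timestamps in one consistent format (hypothetical values) could be matched to a single conversion formula, but appending the third value below, which uses a different format, would force the column to String:

Timestamp
2020-01-15T08:30:00
2020-01-15T09:45:00
01/15/2020 10:15:00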

This example creates CSVExampleNamespace.CSVExampleTableName.schema in the /tmp directory:

iris_exec csv_schema_creator -- --namespace CSVExampleNamespace --tableName CSVExampleTableName --sourceFile /data/sample.csv --schemaPath /tmp

Command Reference

iris_exec csv_schema_creator <launch args> -- <schema creator args>

The following arguments are available when running the CSV schema creator:

-ns or --namespace <namespace>
  (Required) The namespace to use for the new schema.
-tn or --tableName <name>
  (Required) The table name to use for the new schema.
-sp or --schemaPath <path>
  An optional path to which the schema file will be written. If not specified, this defaults to the current working directory and will create or use a subdirectory that matches the namespace.
-sf or --sourceFile <file name or file path and name>
  (Required) The name of the CSV file to read. This file must have a header row with column names.
-fd or --delimiter <delimiter character>
  (Optional) Field delimiter. Allows specification of a character other than the file format default as the field delimiter. If a delimiter is specified, fileFormat is ignored. This must be a single character.
-ff or --fileFormat <format name>
  (Optional) The Apache Commons CSV parser is used to parse the file itself. The following formats are supported:
  • DEFAULT – the format used if none is specified; comma-separated field values, newline row terminators, double-quotes around field values that contain embedded commas, newline characters, or double-quotes.
  • TRIM – similar to DEFAULT, but trims all whitespace around values.
  • EXCEL – Microsoft Excel CSV format.
  • MYSQL – MySQL CSV format.
  • RFC4180 – IETF RFC 4180 MIME text/csv format.
  • TDF – tab-delimited format.
-pc or --partitionColumn
  Optional name for the partitioning column if the schema is being generated. If not provided, the importer will default to "Date" for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to "source_[original column name]".
-gc or --groupingColumn
  Optional column name that should be marked as columnType="Grouping" in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns.
-spc or --sourcePartitionColumn
  Optional column to use for multi-partition imports. For example, if the partitioning column is "Date" and you want to enable multi-partition imports based on the column "source_date", specify "source_date" with this option (this is the column name in the data source, not the Deephaven column name).
-sn or --sourceName <name for ImportSource block>
  Optional name to use for the generated ImportSource block in the schema. If not provided, the default of "IrisCSV" will be used.
-sl or --skipHeaderLines <integer value>
  Optional number of lines to skip from the beginning of the file before expecting the header row. If not provided, the first line is used as the header row.
-fl or --setSkipFooterLines <integer value>
  Optional number of footer lines to skip from the end of the file.
-lp or --logProgress
  If present, additional informational logging will be provided with progress updates during the parsing process.
-bf or --bestFit
  If present, the schema creator will attempt to use the smallest numeric types that fit the data in the CSV. For example, for integer values, short will be used if all values are in the range of -32768 to 32767. If larger values are seen, the schema creator will move to int and eventually long. The default behavior, without -bf, is to use long for all integer columns and double for all floating point columns.
-tr or --trim
  Similar to the TRIM file format, but adds leading/trailing whitespace trimming to any format. For a comma-delimited file with extra whitespace, -ff TRIM would be sufficient, but for a file using a delimiter other than a comma, the -tr option would be used in addition to -ff or -fd.
-om or --outputMode
  Either SAFE (default) or REPLACE. When SAFE, the schema creator will exit with an error if an existing file would be overwritten by the schema being generated. When set to REPLACE, a pre-existing file will be overwritten.
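Combining several of these options, a hypothetical invocation for a semicolon-delimited file with two extra lines before the header, using best-fit numeric types and overwriting any existing schema file, might look like:

iris_exec csv_schema_creator -- --namespace CSVExampleNamespace --tableName CSVExampleTableName --sourceFile /data/sample.csv --delimiter ";" --skipHeaderLines 2 --trim --bestFit --outputMode REPLACE --schemaPath /tmp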

JDBC

By default, the JDBC schema generator requests the data types of the result set columns from the data source. These data types are mapped to Deephaven data types to provide the schema details.

For cases where the data source cannot provide column data type information, or where the source database is not ideally configured to store the data (e.g., numbers or date/times in String fields), the JDBC schema generator provides options to scan data as is done in the CSV schema generator, or to use a combination of data scanning and metadata.

This example creates ExNamespace.Trade.schema in the /tmp directory:

iris_exec jdbc_schema_creator -- \
--namespace ExNamespace \
--tableName Trade \
--connectionUrl "jdbc:sqlserver://<host>;database=<db>;user=<user>;password=<pwd>" \
--jdbcDriver "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
--jdbcQuery "SELECT TOP 1000 * FROM trade" \
--schemaPath /tmp

Note that we limit the size of the result to avoid pulling the entire table into memory.
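If the result set's metadata is unreliable, or values are stored non-optimally (e.g., date/times in string columns), a hypothetical variant of the same command can combine metadata with a data scan using the assessment method option described below:

iris_exec jdbc_schema_creator -- \
--namespace ExNamespace \
--tableName Trade \
--connectionUrl "jdbc:sqlserver://<host>;database=<db>;user=<user>;password=<pwd>" \
--jdbcDriver "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
--jdbcQuery "SELECT TOP 1000 * FROM trade" \
--assessmentMethod BOTH \
--schemaPath /tmp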

Command Reference

iris_exec jdbc_schema_creator <launch args> -- <schema creator args>

The following arguments are available when running the JDBC schema creator:

-ns or --namespace <namespace>
  (Required) The namespace to use for the new schema.
-tn or --tableName <name>
  (Required) The table name to use for the new schema.
-sp or --schemaPath <path>
  An optional path to which the schema file will be written. If not specified, this defaults to the current working directory and will create or use a subdirectory that matches the namespace.
-jd or --driverClass <fully qualified class name of the JDBC driver>
  The fully qualified class name of the driver to load to connect to the data source. In most cases, the corresponding jar file will need to be available in the classpath of the schema creator process.
-cu or --connectionUrl <data source specific connection URL>
  Information needed to find and connect to the data source. See below for more details.
-jq or --jdbcQuery <SQL query to execute against the data source>
  The full query string to execute against the data source. In most cases, this will need to be delimited with double-quotes.
-am or --assessmentMethod <META_ONLY, DATA_ONLY, or BOTH>
  How to evaluate column types from the JDBC data source.
  • META_ONLY (default) uses the metadata of the result set that is provided by the JDBC driver, mapping JDBC data types to Deephaven data types.
  • DATA_ONLY scans the data from the result set similarly to how the CSV schema creator scans data. This is useful for data sources that do not support many data types, those that do not provide metadata, or situations where data is stored non-optimally in the source (e.g., date/time values in string columns).
  • BOTH will use metadata first, and then validate and refine the results by scanning the actual data.
-pc or --partitionColumn
  Optional name for the partitioning column if the schema is being generated. If not provided, the importer will default to "Date" for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to "source_[original column name]".
-gc or --groupingColumn
  Optional column name that should be marked as columnType="Grouping" in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns.
-spc or --sourcePartitionColumn
  Optional column to use for multi-partition imports. For example, if the partitioning column is "Date" and you want to enable multi-partition imports based on the column "source_date", specify "source_date" with this option (this is the column name in the data source, not the Deephaven column name).
-sn or --sourceName <name for ImportSource block>
  Optional name to use for the generated ImportSource block in the schema. If not provided, the default of "IrisJDBC" will be used.
-lp or --logProgress
  If present, additional informational logging will be provided with progress updates during the parsing process.
-bf or --bestFit
  If present, the schema creator will attempt to use the smallest numeric types that fit the data. For example, for integer values, short will be used if all values are in the range of -32768 to 32767. If larger values are seen, the schema creator will move to int and eventually long. The default behavior, without -bf, is to use long for all integer columns and double for all floating point columns.
-om or --outputMode
  Either SAFE (default) or REPLACE. When SAFE, the schema creator will exit with an error if an existing file would be overwritten by the schema being generated. When set to REPLACE, a pre-existing file will be overwritten.
-is or --importStyle
  Either BATCH (default) or CDC_REPLICATION. CDC_REPLICATION indicates the schema should be configured to use a SQLReplicationLogger as the logger interface and take a SQLReplicationRow object as input data. Reasonable defaults for converting SQL types to database types will also be generated for each column.
-op or --outPackage
  The output package name (e.g., "com.yourcompany.iris.db.gen") used when generating loggers and listeners. Required if generating loggers and listeners.
-lf or --logFormat
  (Optional) Logger format number; unspecified by default. Used to version log files if the table schema is changed.
-un or --useNanos
  (Optional) Defaults to true. If true, each DBDateTime column will indicate timePrecision="Nanos", and the logger will expect a long value with nanos. If false, the loggers and listeners will expect millisecond values.
-sc or --sourceCasing
  Optional unless destination casing is given. Specifies the casing to expect for source column names. If both this and destination casing are specified, column names will be converted accordingly. Legal values are: UPPER_UNDERSCORE, LOWER_UNDERSCORE, UPPER_CAMEL, LOWER_CAMEL, and LOWER_HYPHEN.
-dc or --destCasing
  Optional unless source casing is given. Determines the Deephaven column name format. If both this and source casing are specified, column names will be converted accordingly. Legal values are: UPPER_UNDERSCORE, LOWER_UNDERSCORE, UPPER_CAMEL, LOWER_CAMEL, and LOWER_HYPHEN.
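As an illustration of the casing options, a source whose columns are named in lower_underscore style (e.g., trade_date) could be mapped to UpperCamel Deephaven column names (TradeDate) by appending the following to an invocation like the one above:

--sourceCasing LOWER_UNDERSCORE --destCasing UPPER_CAMEL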

JSON

JSON is a hierarchical format representing data objects and arrays. To convert this format to a Deephaven table, JSON data is recursively converted into a stream of lines equivalent to a CSV file (with one "line" per top-level JSON object). Type inference then operates in the same way as it does on CSV data.

The input file should contain plain JSON objects, not comma-separated objects or a JSON array. For example, the following is a legal file with three objects, which will result in three rows:

{"a":1} {"a":2} {"a":3}

You may have multiple JSON objects on a single line, and/or the JSON objects may span multiple lines.

Column names for a table inferred from JSON are derived from the field names contained in the JSON objects. Since there is no "header" in JSON, all or a large portion of the data is explored to enumerate all columns (since any one JSON object may be missing a given field). JSON array elements are converted into column names by combining the field name with a numeric suffix reflecting the position in the array.

Nested objects are recursively expanded, with the column name for each field derived from the full "path" to that field. The path separator can be specified with the --fieldPathSeparator argument. Array elements are numbered with a zero-based index.

For example, the following maps to a schema with the columns "a", "b_c", "b_d", "e0", "e1", "e2" and a row with values 1, 2, 3, 4, 5, 6:

{ "a":1, "b":{ "c":2, "d":3 }, "e":[4,5,6] }
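Because the generator enumerates columns across all of the objects in the file, individual objects may omit fields. For example, the following two objects (hypothetical data) would together yield columns "a" and "b", with each row lacking a value for the field its object omits:

{ "a": 1 }
{ "b": 2 }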

This example creates JSONExampleNamespace.JSONExampleTableName.schema in the /tmp directory:

iris_exec json_schema_creator -- --namespace JSONExampleNamespace --tableName JSONExampleTableName --sourceFile /data/sample.json --schemaPath /tmp

Command Reference

iris_exec json_schema_creator <launch args> -- <schema creator args>

The JSON schema creator takes the following arguments:

-ns or --namespace <namespace>
  (Required) The namespace in which to find the target table.
-tn or --tableName <name>
  (Required) The name of the target table.
-sp or --schemaPath <path>
  An optional path to which the schema file will be written. If not specified, this defaults to the current working directory and will create or use a subdirectory that matches the namespace.
-sf or --sourceFile <exact file name>
  (Required) The name of the JSON file to read.
-pc or --partitionColumn
  Optional name for the partitioning column if the schema is being generated. If not provided, the importer will default to "Date" for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to "source_[original column name]".
-gc or --groupingColumn
  Optional column name that should be marked as columnType="Grouping" in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns.
-spc or --sourcePartitionColumn
  Optional column to use for multi-partition imports. For example, if the partitioning column is "Date" and you want to enable multi-partition imports based on the column "source_date", specify "source_date" with this option (this is the column name in the data source, not the Deephaven column name).
-sn or --sourceName <name for ImportSource block>
  Optional name to use for the generated ImportSource block in the schema. If not provided, the default of "IrisJSON" will be used.
-sl or --skipLines <integer value>
  Optional number of lines to skip from the beginning of the file before parsing begins.
-lp or --logProgress
  If present, additional informational logging will be provided with progress updates during the parsing process.
-bf or --bestFit
  If present, the schema creator will attempt to use the smallest numeric types that fit the data. For example, for integer values, short will be used if all values are in the range of -32768 to 32767. If larger values are seen, the schema creator will move to int and eventually long. The default behavior, without -bf, is to use long for all integer columns and double for all floating point columns.
-om or --outputMode
  Either SAFE (default) or REPLACE. When SAFE, the schema creator will exit with an error if an existing file would be overwritten by the schema being generated. When set to REPLACE, a pre-existing file will be overwritten.
-fps or --fieldPathSeparator
  If present, specifies a character or string to use when generating column names from nested JSON data. Defaults to underscore ("_").
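For example, rerunning the nested-object example above with a hypothetical double-underscore separator:

iris_exec json_schema_creator -- --namespace JSONExampleNamespace --tableName JSONExampleTableName --sourceFile /data/sample.json --fieldPathSeparator "__" --schemaPath /tmp

would generate the nested column names b__c and b__d rather than b_c and b_d.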

XML

XML by itself does not provide or guarantee a particular layout for the data. Simpler data sets that use XML typically have one element per record, are not hierarchical, and use either attributes or element values to store record values. More complex XML feeds may include metadata, common data updates, and hierarchically related records. Deephaven provides "generic" XML data import capabilities that can accommodate most of the simpler forms. More complex XML data feeds, including those whose data must be imported into multiple tables in a single operation, require custom importers specific to the data format encapsulated within the XML (e.g., FIXML).

Although XML documents can include metadata, there is no standard as to how such column data types and other such information should be formatted. Therefore, for generic XML imports, Deephaven treats XML files similarly to how it handles CSV files. To find column data types, the schema generation tool analyzes all data that it finds in a sample data file. In addition, if named (rather than positional) values are used, the schema generator will also scan all importable elements to find column names that are needed for the table.
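For instance, a minimal attribute-based file (hypothetical contents) suited to the command below would hold one record per element, with one attribute per column:

<Records>
  <Record ID="XYZ" Price="10.25" />
  <Record ID="ABC" Price="11.50" />
</Records>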

This example creates XMLExampleNamespace.XMLExampleTableName.schema in the /tmp directory:

iris_exec xml_schema_creator -- --namespace XMLExampleNamespace --tableName XMLExampleTableName --sourceFile /data/sample.xml --elementType Record --useAttributeValues --schemaPath /tmp

Command Reference

iris_exec xml_schema_creator <launch args> -- <schema creator args>

The following arguments are available when running the XML schema creator:

-ns or --namespace <namespace>
  (Required) The namespace to use for the new schema.
-tn or --tableName <name>
  (Required) The table name to use for the new schema.
-sp or --schemaPath <path>
  An optional path to which the schema file will be written. If not specified, this defaults to the current working directory and will create or use a subdirectory that matches the namespace.
-sf or --sourceFile <file name or file path and name>
  (Required) The name of the XML file to read.
-xi or --startIndex
  Starting from the root of the document, the index (1 being the first top-level element in the document after the root) of the element under which data can be found.
-xd or --startDepth
  Under the element indicated by Start Index, how many levels of first children to traverse to find an element that contains data to import.
-xm or --maxDepth
  Starting from Start Depth, how many levels of element paths to traverse and concatenate to provide a list that can be selected under Element Name.
-xt or --elementType
  The name or path of the element that will contain data elements. This is case-sensitive.
-ev or --useElementValues
  Indicates that field values will be taken from element values; e.g., <Price>10.25</Price>.
-av or --useAttributeValues
  Indicates that field values will be taken from attribute values; e.g., <Record ID="XYZ" Price="10.25" />.
-pv or --namedValues
  Positional values: when omitted, field values within the document will be named; e.g., a value called Price might be contained in an element named Price, or an attribute named Price. When this option is included, field names (column names) will be taken from the table schema, and the data values will be parsed into them by matching the position of the value with the position of the column in the schema.
-pc or --partitionColumn
  Optional name for the partitioning column if the schema is being generated. If not provided, the importer will default to "Date" for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to "source_[original column name]".
-gc or --groupingColumn
  Optional column name that should be marked as columnType="Grouping" in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns.
-spc or --sourcePartitionColumn
  Optional column to use for multi-partition imports. For example, if the partitioning column is "Date" and you want to enable multi-partition imports based on the column "source_date", specify "source_date" with this option (this is the column name in the data source, not the Deephaven column name).
-sn or --sourceName <name for ImportSource block>
  Optional name to use for the generated ImportSource block in the schema. If not provided, the default of "IrisXML" will be used.
-sl or --skipHeaderLines <integer value>
  Optional number of lines to skip from the beginning of the file before parsing begins.
-fl or --setSkipFooterLines <integer value>
  Optional number of footer lines to skip from the end of the file.
-lp or --logProgress
  If present, additional informational logging will be provided with progress updates during the parsing process.
-bf or --bestFit
  If present, the schema creator will attempt to use the smallest numeric types that fit the data in the XML. For example, for integer values, short will be used if all values are in the range of -32768 to 32767. If larger values are seen, the schema creator will move to int and eventually long. The default behavior, without -bf, is to use long for all integer columns and double for all floating point columns.
-om or --outputMode
  Either SAFE (default) or REPLACE. When SAFE, the schema creator will exit with an error if an existing file would be overwritten by the schema being generated. When set to REPLACE, a pre-existing file will be overwritten.
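For a file that stores values in element bodies rather than attributes (again, hypothetical contents), --useElementValues would be used in place of --useAttributeValues:

<Records>
  <Record>
    <ID>XYZ</ID>
    <Price>10.25</Price>
  </Record>
</Records>

iris_exec xml_schema_creator -- --namespace XMLExampleNamespace --tableName XMLExampleTableName --sourceFile /data/sample.xml --elementType Record --useElementValues --schemaPath /tmp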

Kafka

Deephaven can generate schemas from Avro schemas and from Protobuf descriptors. For examples and an explanation of how the schema inference works, see Discovering a Deephaven Schema from an Avro Schema and Discovering a Deephaven Schema from a Protocol Buffer Descriptor.