Schema ImportSource

The ImportSource component of a schema provides additional metadata to control and modify the import process.

Multiple ImportSource blocks can be added to allow various import instructions for different sources or types of import. The type attribute is required; currently supported values are "CSV", "JDBC", "JSON", and "XML". The name attribute is optional. When the import runs, if no ImportSource name is specified, the importer will use the first ImportSource block that matches its type. If no matching block is found (for type, and, if specified, for name), then the import will attempt to auto-match columns as it would if no ImportSource blocks existed.

To support multi-partition imports, you must have an ImportColumn for the partition column indicating how to partition each row (e.g., <ImportColumn name="Date" sourceName="trade_date"/> where trade_date is the CSV/JDBC column name).

An ImportSource block can contain multiple types of entries, including:

  • ImportColumn
  • Imports
  • maxError
  • arrayDelimiter
  • ImportTransform

These options are described below.

ImportColumn

Each entry describes how to map values for one column from the source data to the target table's schema.

  • name: The name of the target column in the Deephaven table, which must exactly match the name attribute of one of this schema's Column elements. This attribute is required; all others are optional. Note: an ImportColumn element with no attributes besides name provides no special handling for that column and is effectively ignored.

  • sourceName: The name of the column in the source data whose values should be mapped into the target column specified by name. Use this when mapping from a different column name in the source table, such as when a source name is not a valid Deephaven column name. Also use this attribute on the partition column when configuring multi-partition imports.

  • sourceType: The data type of the source data, used when data types must be translated during the import process. If sourceType is not specified, the source data is expected to have the type specified in the Column element that corresponds to the ImportColumn. For example, sourceType="long" may be required when importing a timestamp stored as a long value into a DBDateTime column; such a conversion cannot be done implicitly and requires a sourceType to tell the importer to read the source value as a long. Note: sourceType is case-sensitive: long is a primitive long value, while Long is interpreted as java.lang.Long, an object type.

  • nullable: Indicates whether null values are accepted for this column. If present, the nullable attribute must be "true" or "false"; false is the default. A column marked as nullable can be skipped during import (i.e., if it doesn't exist in the source) and set to null in all rows. A nullable column also receives a null value when a null value, or no value at all, is read from the source. Note: for primitive types, which do not natively support null values, the database uses reserved values to represent null; the null default values from QueryConstants are used. If a default is defined, it is used instead of a null value.

  • formula: Allows specification of Java code to perform simple manipulations of data based on the source value itself. In the example noted earlier, the FirstNameLength ImportColumn uses a formula to calculate its value. Note that the value being read during the import is passed to the formula using the ImportColumn's name. Formulas can use built-in Java methods, like replace() for strings, functions that are built into the importer classes, or functions imported from other packages (see: Imports). When compiling the field writer, the importer encloses the formula in a try/catch block, in case methods being called within the formula require exception handling. A caught exception is rethrown as a RuntimeException with a message indicating that the exception was caught in a formula, with the original message appended.

  • transform: Similar to formula, but transform creates a new function during the import using the details of the ImportColumn element. Transforms create Java lambda functions that are then applied to the input value. An example of a built-in transform function is dbDateTimeFromLong. This function creates a lambda that converts a long to a DBDateTime, using another attribute (transformPrecision) to select whether the resultant function expects milliseconds, microseconds, or nanoseconds from the Unix epoch.

  • default: The value to use when import or parsing of the source value fails, or the source is null. The default value must be type-compatible with the target column type; for boolean columns, this would be lowercase "true" or "false". The literal "null" may be used as a default, and results in an appropriately typed null value (from com.illumon.iris.db.tables.utils.QueryConstants for primitive types, or an actual null value for object types). A specified default takes precedence over null handling when a column is defined as nullable. In the special case where the literal string "null", lowercase, is desired as the default value for a String column, this can be accomplished with a default that sets some known value (or null) and a formula that replaces that value with the literal String "null".

  • class: The full name of a custom class to use as the field writer for the column. Custom field writer classes must derive from the CsvFieldWriter or JdbcFieldWriter class, as appropriate to the import type. If class is specified, formula, transform, sourceName, and sourceType are ignored and not used by the importer itself, though they may be used within the custom field writer.

  • cacheable: (CSV only) If this attribute is set to "false", caching is disabled for this column. If this attribute is not present, or is set to "true", caching is enabled, assuming the column meets one of the other criteria for caching (complex non-String column type, has a formula, or has a transform). Caching is useful for values that require conversion or calculation work during the import, but it may not be helpful if values in the column are rarely duplicated.

  • cacheSize: (CSV only) Overrides the cache size for columns that will be cached. The default size is 1024 entries.

  • csvParserCreatorClass: (CSV only) Allows the use of a custom deephaven-csv parser class to parse specific columns from a CSV file.

  • Other attributes: Additional attributes as needed to control transform functions (e.g., dbDateTimeFromTimestamp uses transformPrecision to control its behavior). See details below.
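
For example, an ImportColumn that renames a source column, reads it as a long, and applies the built-in dbDateTimeFromLong transform might look like this (the column and source names are illustrative):

  <ImportColumn name="Timestamp"
                sourceName="ts_millis"
                sourceType="long"
                transform="dbDateTimeFromLong"
                transformPrecision="millis" />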

Imports

The packages to import. A default set of packages is imported to support built-in functions and import tasks, but additional packages can be specified to provide other functions or support custom transforms. An import can be specified as just the qualified package or class name, or as a full import statement (import qualified.package.name.ClassName;).
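
As a sketch, assuming Imports entries are listed inside an Imports element (the exact element layout may vary by version; the class names are illustrative):

  <Imports>
    com.example.custom.MyFunctions
    import java.math.BigDecimal;
  </Imports>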

Besides the internal imports used by the field writers, the following imports are automatically added in the field writer context:

  • java.util.Properties
  • com.illumon.iris.db.tables.utils.DBDateTime
  • com.illumon.iris.db.tables.utils.DBTimeUtils
  • com.fishlib.util.DateUtil
  • java.util.regex.Pattern
  • java.time.format.DateTimeFormatter
  • java.time.*

maxError

The maximum number of allowed parsing errors before the import fails. The default is 0. Errors that count toward maxError include failed numeric conversions, overflows, and mismatched values (e.g., a Boolean value being set from a String column that doesn't match any known representation of true or false).

arrayDelimiter

A string to be used as a delimiter when parsing array data. The default delimiter is a comma.
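
As a sketch, these entries might appear inside the ImportSource block as follows; the element syntax shown here is an assumption, so check the schema reference for your version:

  <ImportSource name="CSV1" type="CSV">
    <maxError>10</maxError>
    <arrayDelimiter>;</arrayDelimiter>
    <!-- ImportColumn entries follow -->
  </ImportSource>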

Import Transformers

Import transformers define custom-written Java functions to transform input data into the Deephaven format.

Custom transforms

Custom transforms can be added to the ImportSource block as ImportTransform elements. These elements must:

  • Have a Body element that contains the text of the transform function. This is the Java code that will be compiled as part of the setter for an ImportColumn that uses the custom transform.
  • Have a name attribute that matches the name used in the transform attribute of ImportColumns that use the transform.
  • Implement a getFunction function that returns a function matching the specification of the interface.
    • The ImportColumn's sourceType indicates the data type of the lambda parameter, in this case value.
    • The column's dataType indicates the data type which the returned lambda (or function) must return.

At runtime, getFunction is passed the ImportColumnDefinition of the column that is being set, and a boolean value indicating whether strict (true) or permissive (false) data checking has been selected. The ImportColumnDefinition allows getFunction to access details such as the target column name, the target column type, and any ImportColumn details through getImportColumn(), which provides access to functions such as getAttributeValue().

The following shows an example of a custom transform element that could be specified in the schema (Note: this is similar to the built-in dbDateTimeFromLong transform):
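
The sketch below reconstructs such a transform from the descriptions in this section; the getFunction signature follows the one described above, and the DBTimeUtils conversion calls are the standard ones, but treat this as illustrative rather than the exact built-in source:

  <ImportTransform name="DateTimeFromLong">
    <Body>
      getFunction(ImportColumnDefinition column, boolean strict) {
        // transformPrecision is read from the ImportColumn's attributes
        String precision = column.getImportColumn().getAttributeValue("transformPrecision");
        switch (precision) {
          case "millis": return (long value) -> new DBDateTime(DBTimeUtils.millisToNanos(value));
          case "micros": return (long value) -> new DBDateTime(DBTimeUtils.microsToNanos(value));
          case "nanos":  return (long value) -> new DBDateTime(value);
          default: throw new RuntimeException("Unknown transformPrecision: " + precision);
        }
      }
    </Body>
  </ImportTransform>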

When compiling the transform function, the target data type and, if defined, the source data type are used to find a suitable Java functional interface with which to declare getFunction. In the example above, the target column type is DBDateTime, and the sourceType="long" attribute is added to the ImportColumn. This combination of source and target types is then used to create and execute the resultant function as a LongFunction<DBDateTime>.

The ImportColumn definition in the ImportSource can now refer to the DateTimeFromLong transformer defined above, specifying the sourceName, sourceType, and the required transformPrecision field.
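
For example (the column and source names are illustrative):

  <ImportColumn name="Timestamp"
                sourceName="source_timestamp"
                sourceType="long"
                transform="DateTimeFromLong"
                transformPrecision="millis" />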

A much simpler example follows:
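
A minimal sketch (the transform name and scaling logic are illustrative):

  <ImportTransform name="SecondsToMillis">
    <Body>
      getFunction(ImportColumnDefinition column, boolean strict) {
        // A long-to-long lambda; the importer maps this to a LongUnaryOperator
        return (long value) -> value * 1000;
      }
    </Body>
  </ImportTransform>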

If this is used for a column whose type is long, this maps to a LongUnaryOperator, which takes a long and returns a long. If the sourceType is not specified, the importer attempts to use the target column type when matching a function interface.

If the function being returned from getFunction has a declared return type, this return type must exactly match the target column type for which the transform will be used. For example, a transform that will provide values for a Long column and that has a declared return type, must be declared Long, not long. However, the simpler, and recommended, implementation is to include only the transform body in the return from getFunction, so the system can perform implicit type conversion when compiling the transform.

Custom transforms can be added from the Schema Editor File menu using the View/Edit Schema XML option.

Primitive types and boxed types in transforms

Not every combination of primitive types is supported by the Lambda interfaces used to pass transform functions into field writers. The supported combinations for unboxed primitive types are any combination of source and target of int, double, and long.

Therefore, a transform that reads a long value from the source and transforms it to a double value in the target table would use the primitive types of long and double.

However, a transform that reads a byte value and writes a short value would not be able to use primitive types. The importer will map this transform as a Function<Byte, Short>. Note the capitalization. Byte and Short are the boxed (class-based) versions of the primitive types byte and short. This means there is extra overhead in the transform process to use the larger class-based versions of the data types.

Note

This concern with boxing of primitive types applies only to transforms. If the data can be implicitly converted (e.g., mapping a short to an int), or converted with a formula, then no boxing occurs. If the translation is too complex for a formula, another option for such primitive types is to write a custom FieldWriter. Since the FieldWriter classes are not interfaced through lambda functions, they have no limitations on which primitive data types they can use for source or target.

Also be aware that primitive types with a "null" default will be set to values from com.illumon.iris.db.tables.utils.QueryConstants when their sources are null. While these may be boxed when they are passed into a transform, they will not be automatically converted to true null values. If special handling is needed for null values in the transform, either check for equality with the QueryConstants values, or use a boxed type for the sourceType property of the column so these values are preserved as null.

Array types

The importers support long[] and double[] array types. From the source, these are expected to be strings (either a string field from JDBC or a column value in CSV). The string is formatted as a bracketed list of values. Brackets can be square brackets [ ], parentheses ( ), or curly braces { }. The delimiter can be specified with arrayDelimiter, but the default delimiter is a comma. Empty arrays [] are also supported.

An example array of double strings follows:

[1.5, 6.4, 8.0, 2.2322, 88, 16.02]

Transformation examples

These examples walk through data transformation for a JDBC import. They've been tested against Microsoft SQL Server, using a column called date with type date.

It’s important to test transformations to ensure the expected results are achieved before deploying schemas and Persistent Queries to production servers.

Built-in transformations

The simplest way to transform the source's date column is to define a column using a standard supported type. For this example's remote date column, you can create java.time.Instant or java.time.LocalDate columns by defining the Deephaven columns with those standard types in the schema.
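
For example, using the column names from this walkthrough (assuming the schema accepts these fully qualified types as dataType values):

  <Column name="source_date" dataType="java.time.Instant" />
  <Column name="date_localdate" dataType="java.time.LocalDate" />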

Then the ImportSource definitions for those columns specify the JDBC table's column to use.
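
A sketch of those ImportColumn entries, mapping both Deephaven columns to the remote column named date:

  <ImportColumn name="source_date" sourceName="date" />
  <ImportColumn name="date_localdate" sourceName="date" />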

For the source_date column, each value in the remote server's date column is transformed into an Instant, which is a point-in-time. This transformation uses midnight at the server's configured time zone to construct that point-in-time.

For the date_localdate column the conversion is trivial, because LocalDate contains the same information as the remote server's date column. Each value is simply converted into its equivalent LocalDate.

Complex formula

The Instant values can be converted into a String column. First define a standard String column in the schema.
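
For example (the column name date_formula_string is illustrative):

  <Column name="date_formula_string" dataType="String" />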

This shows a working but intentionally inefficient transformation to point out some pitfalls. Define the formula in the ImportSource.
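
A sketch of that ImportColumn; the zone choice (ZoneId.systemDefault()) is an assumption made here so the Instant can be formatted with date fields:

  <ImportColumn name="date_formula_string"
                sourceName="date"
                sourceType="Instant"
                formula="DateTimeFormatter.ofPattern(&quot;yyyy-MM-dd&quot;).withZone(ZoneId.systemDefault()).format(date_formula_string)" />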

  • sourceName specifies the remote database's column name.
  • sourceType indicates the type as determined during the transformation from the remote source - we know Instant works because we used it in the built-in example.
  • formula uses a standard Java DateTimeFormatter to convert that Instant into a String of the format yyyy-MM-dd.
    • The usual quotes around the formatting string (in this case "yyyy-MM-dd") have been changed to &quot;, because schema files are XML and must conform to XML syntax rules. XML requires &quot; instead of the double-quote character within attributes.
    • The code within the formula is run for every value in the date column. This means that every parsed value constructs a new DateTimeFormatter instance. While this technically works, it is incredibly inefficient and not recommended.

The result of this formula is a string such as 2025-12-31.

ImportTransform

For a complex transformation like the conversion from an Instant above, using an ImportTransform gives more control, allowing the transformation to be defined precisely in Java code. As before, define a standard String column in the schema.
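
For example (using the column name referenced in the partitioning example below):

  <Column name="date_transform_string" dataType="String" />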

The schema's ImportSource section will include two pieces that work together to perform the transformation. First, define a function.
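
A sketch of that transform; the getFunction signature follows the description earlier in this section, and the zone choice is an assumption that mirrors the formula example:

  <ImportTransform name="DateFromInstant">
    <Body>
      getFunction(ImportColumnDefinition column, boolean strict) {
        // Constructed once, when the column's setter is compiled, not per row
        final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneId.systemDefault());
        return (Instant value) -> formatter.format(value);
      }
    </Body>
  </ImportTransform>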

  • This transformer is named DateFromInstant.
  • Body must start with the getFunction method declaration as shown above.
    • It's not explicitly declared in the <Body> block, but getFunction returns a Java Function that's compiled into the importer's setter code. In this case, the Function takes an Instant and returns a String, based on the column types of the source and destination columns.
    • The ImportColumn definition (below) ensures that the Function's lambda operates on the Instant datatype. This is why the return statement uses (Instant value) -> for the lambda.
    • The Column definition (above) specifies that it's returning a String since that's the column's dataType.
  • The function is called once for the column and added to the column's setter logic. This means that the DateTimeFormatter is constructed only once (when getFunction is called during the column's definition logic), not for every row in the source table.

Now the ImportColumn definition in the ImportSource can simply refer to the DateFromInstant transformer defined above, specifying the sourceName and sourceType.
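
A sketch of that ImportColumn (date_transform_string matches the partitioning example below):

  <ImportColumn name="date_transform_string"
                sourceName="date"
                sourceType="Instant"
                transform="DateFromInstant" />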

  • sourceName specifies the remote server's column name.
  • sourceType indicates the original type as determined during the transformation from the remote source - we know Instant works because we used it in the built-in example.

Partitioning from data

String columns can be used to determine the intraday partitions into which data is imported. Since partitioning by date is common, this column definition frequently appears in a schema and applies to our example.
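
A typical partitioning column definition looks like this (the Partitioning columnType marks the column used to select the target partition):

  <Column name="Date" dataType="String" columnType="Partitioning" />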

Continuing the ImportTransform example, when defining the import Persistent Query's options, choose "Multi Partition" for the import target, and "date_transform_string" for the import partition column. The Persistent Query will automatically partition the data based on the date values determined for that column.

Custom CSV parsers

The deephaven-csv library allows the use of custom parser classes when importing a file, and the Core+ CSV import supports this.

  • Create a class that implements the io.deephaven.importers.csv.parsers.CsvParserCreator interface, which specifies a method that is called to create an io.deephaven.importers.csv.Parser.
  • Specify the parser creation class in an ImportColumn definition with the csvParserCreatorClass attribute.
  • This parser-creator's create method is called by the importer to create the custom parser for the column.

Here's an example of an ImportColumn definition that defines a custom CSV parser.
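
For example (the column, source, and creator class names are illustrative):

  <ImportColumn name="EventTime"
                sourceName="event_time"
                csvParserCreatorClass="com.example.csv.MyTimestampParserCreator" />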

Note

Custom parsers don't support the maxError logic described above and should provide their own error handling if required.

DateTimeFormatterPatternParserCreator

io.deephaven.importers.csv.DateTimeFormatterParser uses standard Java DateTimeFormatter patterns to parse data. It's created by specifying the io.deephaven.importers.csv.DateTimeFormatterPatternParserCreator creation class in the schema's ImportColumn element. Two attributes define its behavior:

  • timeFormat provides the java.time.format.DateTimeFormatter pattern used to parse the data. For example, timeFormat="yyyy-MM-dd'T'HH:mm:ss.SSS" parses dates-times such as 2024-04-01T00:05:12.123.
  • An optional timeZone defines the time zone to be used for the time conversions (e.g., timeZone="America/Denver"). If it's not included, the format pattern must indicate how to derive the time zone.
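
A sketch combining these attributes (the column and source names are illustrative):

  <ImportColumn name="EventTime"
                sourceName="event_time"
                csvParserCreatorClass="io.deephaven.importers.csv.DateTimeFormatterPatternParserCreator"
                timeFormat="yyyy-MM-dd'T'HH:mm:ss.SSS"
                timeZone="America/Denver" />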

TimeWithPartitionParserCreator

io.deephaven.importers.csv.TimeWithPartitionParserCreator creates an io.deephaven.importers.csv.DateTimeFormatterParser to parse timestamps, creating the resulting values relative to the column partition's date value. This date must be in standard yyyy-MM-dd format, such as 2024-12-20.

  • The time's format must be specified with the timeFormat attribute.
  • The timezone is specified by the timeZone attribute in the column's import source. If a time zone isn't specified, it uses the system's default time zone.

For instance, if an ImportColumn includes csvParserCreatorClass="io.deephaven.importers.csv.TimeWithPartitionParserCreator" timeZone="America/New_York" timeFormat="HHmmssnnnnnnnnn", and the partition is the date 2026-02-01, then the value 081918938069760 parses to the date-time 2026-02-01 08:19:18.938069760 New York time.