# Schema ImportSource

The `ImportSource` component of a schema provides additional metadata to control and modify the import process.
<Table name="JDBCNames" namespace="Import" storageType="NestedPartitionedOnDisk">
<ImportSource name="JDBC1" type="JDBC">
<ImportColumn name="FirstName" sourceName="first_name" />
<ImportColumn name="LastName" sourceName="last_name" />
<ImportColumn name="Gender" sourceName="gender" />
<ImportColumn name="Race" sourceName="race" />
<ImportColumn name="FirstNameLength" sourceName="first_name" formula="FirstNameLength.length()" sourceType="String" />
</ImportSource>
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
<Column name="Date" intradayType="none" dataType="String" columnType="Partitioning" />
<Column name="FirstName" dataType="String" columnType="Normal" />
<Column name="LastName" dataType="String" columnType="Grouping" />
<Column name="Gender" dataType="String" columnType="Normal" />
<Column name="Race" dataType="String" columnType="Normal" />
<Column name="FirstNameLength" dataType="int" columnType="Normal" />
</Table>
Multiple `ImportSource` blocks can be added to provide different import instructions for different sources or types of import. The `type` attribute is required; currently supported values are "CSV", "JDBC", "JSON", and "XML". The `name` attribute is optional. When the import runs, if no `ImportSource` name is specified, the importer will use the first `ImportSource` block that matches its type. If no matching block is found (for `type` and, if specified, for `name`), then the import will attempt to auto-match columns as it would if no `ImportSource` blocks existed.
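For instance, a table populated from both CSV files and a JDBC source could declare one block of each type. The following is an illustrative sketch; the block names and column mappings are hypothetical:

```xml
<ImportSource name="DailyCSV" type="CSV">
  <!-- The CSV files use lowercase, underscore-separated headers. -->
  <ImportColumn name="FirstName" sourceName="first_name" />
</ImportSource>
<ImportSource name="JDBC1" type="JDBC">
  <!-- The JDBC source uses uppercase column names. -->
  <ImportColumn name="FirstName" sourceName="FIRST_NAME" />
</ImportSource>
```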
To support multi-partition imports, you must have an `ImportColumn` for the partition column indicating how to partition each row (e.g., `<ImportColumn name="Date" sourceName="trade_date"/>`, where `trade_date` is the CSV/JDBC column name).
An `ImportSource` block can contain multiple types of entries, including:

- `ImportColumn`
- `Imports`
- `maxError`
- `arrayDelimiter`
- `ImportTransform`

These options are described below.
## ImportColumn

Each entry describes how to map values for one column from the source data to the target table's schema.
| Attribute | Description |
|---|---|
| `name` | The name of the target column in the Deephaven table, which must exactly match the `name` attribute of one of this schema's `Column` elements. This attribute is required; all others are optional. Note: an `ImportColumn` element with no attributes besides `name` provides no special handling for that column and is effectively ignored. |
| `sourceName` | The name of the column in the source data whose values should be mapped into the target column specified by `name`. Use this to map from a different column name in the source table, such as when a source name is not a valid Deephaven column name. Also use this attribute on the partition column when configuring multi-partition imports. |
| `sourceType` | The data type of the source data, used when data types must be translated during the import process. If `sourceType` is not specified, the source data is expected to have the type specified in the `Column` element that corresponds to the `ImportColumn`. For example, `sourceType="long"` may be required when importing a timestamp stored as a long value into a `DBDateTime` column; such a conversion cannot be done implicitly and requires a `sourceType` to tell the importer to read the source value as a long. Note: `sourceType` is case-sensitive: `long` is a primitive long value, while `Long` is interpreted as `java.lang.Long`, an object type. |
| `nullable` | Indicates whether null values will be accepted for this column. If present, the `nullable` attribute must be "true" or "false"; false is the default. A column marked as nullable can be skipped during the import (i.e., if it doesn't exist in the source) and be set to null in all rows. A nullable column will also receive a null value when a null value, or no value, is read from the source. Note: for primitive types, which do not natively support null values, the database uses reserved values to represent null; the null default values from `QueryConstants` will be used. If a default is defined, the default will be used instead of a null value. |
| `formula` | Java code that performs simple manipulations of data based on the source value itself. In the example noted earlier, the `FirstNameLength` `ImportColumn` uses a formula to calculate its value. Note that the value being read during the import is passed to the formula using the `ImportColumn`'s `name`. Formulas can use built-in Java methods, like `replace()` for strings, functions that are built into the importer classes, or functions imported from other packages (see: Imports). When compiling the field writer, the importer encloses the formula in a try/catch block, in case methods called within the formula require exception handling. A caught exception is rethrown as a `RuntimeException` with a message indicating that the exception was caught in a formula, with the original message appended. |
| `transform` | Similar to `formula`, but `transform` creates a new function during the import using the details of the `ImportColumn` element. Transforms create Java lambda functions that are then applied to the input value. An example of a built-in transform function is `dbDateTimeFromLong`. This function creates a lambda that converts a `long` to a `DBDateTime`, and uses another attribute (`transformPrecision`) to select whether the resultant function expects milliseconds, microseconds, or nanoseconds from Unix epoch. |
| `default` | The value to use when import or parsing of the source value fails or the source is null. The default value must be type-compatible with the target column type; for boolean columns, this would be lowercase "true" or "false". The literal "null" may be used as a default, and results in an appropriately typed null value (from `com.illumon.iris.db.tables.utils.QueryConstants` for primitive types, or an actual null value for object types). A specified default takes precedence over null handling when a column is defined as nullable. In the special case where the lowercase string "null" is desired as the default value for a String column, this can be accomplished by setting the default to some known value (or null) and using a formula to replace that value with the literal string "null". |
| `class` | The full name of a custom class to use as the field writer for the column. Custom field writer classes must derive from the `CsvFieldWriter` or `JdbcFieldWriter` class, as appropriate to the import type. If `class` is specified, `formula`, `transform`, `sourceName`, and `sourceType` are ignored and not used by the importer itself, though they may be used within the custom field writer. |
| `cacheable` | (CSV only) If this attribute is set to "false", caching is disabled for this column. If this attribute is not present, or is set to "true", caching is enabled, assuming the column meets one of the other criteria for caching (a complex non-String column type, a formula, or a transform). Caching is useful for values that require conversion or calculation work during the import, but may not be helpful if values in the column are rarely duplicated. |
| `cacheSize` | (CSV only) Overrides the cache size for columns that will be cached. The default size is 1024 entries. |
| ____ | Other attributes as needed to control transform functions (e.g., `dbDateTimeFromTimestamp` uses `transformPrecision` to control its behavior). See details below. |
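As an illustration of several of these attributes used together, the following hypothetical column maps a differently named source field and normalizes it with a formula (all names here are illustrative):

```xml
<!-- The source value is available inside the formula under the
     column's own name ("FullName"), per the formula attribute above. -->
<ImportColumn name="FullName"
              sourceName="full_name"
              sourceType="String"
              formula="FullName.trim().toUpperCase()" />
```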
## Imports

The packages to import. A default set of packages is imported to support built-in functions and import tasks, but additional packages can be specified to provide other functions or support custom transforms. Imports can be just the qualified package/class name, or the full `import qualified.package.name.class;` line.
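A minimal sketch of what this might look like, assuming each extra import is supplied as the text of an `Imports` entry in the `ImportSource` block (the element name here follows this section's title, and the class names are hypothetical; verify the exact syntax against your schema tooling):

```xml
<Imports>com.example.util.NameUtils</Imports>
<Imports>import java.math.BigDecimal;</Imports>
```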
Besides the internal imports used by the field writers, these imports are automatically added in the field writer context:

- `java.util.Properties`
- `com.illumon.iris.db.tables.utils.DBDateTime`
- `com.illumon.iris.db.tables.utils.DBTimeUtils`
- `com.fishlib.util.DateUtil`
- `java.util.regex.Pattern`
- `java.time.format.DateTimeFormatter`
- `java.time.*`
## maxError

The maximum number of allowed parsing errors before the import fails. The default is 0. Errors that count against `maxError` include things like failed numeric conversions, overflows, and mismatched values (e.g., a Boolean value being set from a String column whose value doesn't match any known representation of true or false).
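For example, assuming the limit is given as the text of a `maxError` entry in the `ImportSource` block, like the other options in this section (a hypothetical sketch; verify the exact syntax against your schema tooling), allowing up to ten parsing errors might look like:

```xml
<maxError>10</maxError>
```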
## arrayDelimiter

A string to be used as the delimiter when parsing array data. The default delimiter is a comma.
## ImportTransform

See Custom Transforms below.

- `name` - The name of a custom transform function; this is the name sought when an `ImportColumn` specifies a custom transform function.
- `body` - The Java code that will be compiled as part of the setter for an `ImportColumn` that uses the custom transform.
## Custom transforms

Custom transforms can be added to the `ImportSource` block as `ImportTransform` elements. These elements must:

- Have a `Body` element that contains the text of the transform function.
- Have a `name` attribute that matches the name used in the `transform` attribute of the `ImportColumn` elements that use the transform.
- Implement a `getFunction` function that returns a function matching the specification of the interface.
At runtime, `getFunction` is passed the `ImportColumnDefinition` of the column that is being set, and a boolean value indicating whether strict (`true`) or permissive (`false`) data checking has been selected. The `ImportColumnDefinition` allows `getFunction` to access details like the target column name, the target column type, and any `ImportColumn` details via `getImportColumn()`, which provides access to functions such as `getAttributeValue()`.
The following shows an example of a custom transform element that could be specified in the schema (note: this is actually the built-in `dbDateTimeFromLong` transform):
```xml
<ImportTransform name="Sample">
  <Body>
    getFunction(ImporterColumnDefinition col, boolean strict) {
        String precision = col.getImportColumn().getAttributeValue("transformPrecision");
        switch (precision) {
            case "millis":
                return (long value) -> new DBDateTime(DBTimeUtils.millisToNanos(value));
            case "micros":
                return (long value) -> new DBDateTime(DBTimeUtils.microsToNanos(value));
            case "nanos":
                return DBDateTime::new;
            default:
                return null;
        }
    }
  </Body>
</ImportTransform>
```
When compiling the transform function, the target and, if defined, source data types are used to find a suitable Java functional interface with which to declare `getFunction`. In the case of the example above, the target column type is `DBDateTime`, and the `sourceType="long"` attribute is added to the `ImportColumn`. This combination of source and target types is then used to create and execute the resultant function as a `LongFunction<DBDateTime>`.
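For context, an `ImportColumn` that routes its source value through this transform might look like the following sketch (the column and source names are illustrative):

```xml
<!-- sourceType="long" selects the long-based functional interface;
     transformPrecision is read by the transform via getAttributeValue(). -->
<ImportColumn name="Timestamp"
              sourceName="ts_millis"
              sourceType="long"
              transform="Sample"
              transformPrecision="millis" />
```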
A much simpler transform example follows:
```xml
<ImportTransform name="Sample">
  <Body>
    getFunction(ImporterColumnDefinition column, boolean strict) {
        return (long value) -> (2 * value);
    }
  </Body>
</ImportTransform>
```
If this is used for a column whose type is `long`, this maps to a `LongUnaryOperator`, which takes a long and returns a long. If the `sourceType` is not specified, the importer attempts to use the target column type when matching a functional interface.
If the function being returned from `getFunction` has a declared return type, that return type must exactly match the target column type for which the transform will be used. For example, a transform that provides values for a `Long` column and has a declared return type must be declared to return `Long`, not `long`. However, the simpler, and recommended, implementation is to include only the transform body in the return from `getFunction`, so the system can perform implicit type conversion when compiling the transform.
Custom transforms can be added from the Schema Editor File menu using the View/Edit Schema XML option.
## Primitive types and boxed types in transforms

Not every combination of primitive types is supported by the lambda interfaces used to pass transform functions into field writers. The supported combinations of unboxed primitive types are any combination of source and target among `int`, `double`, and `long`.
Therefore, a transform that reads a `long` value from the source and transforms it to a `double` value in the target table would use the primitive types `long` and `double`.
However, a transform that reads a `byte` value and writes a `short` value would not be able to use primitive types. The importer will map this transform as a `Function<Byte, Short>`. Note the capitalization: `Byte` and `Short` are the boxed (class-based) versions of the primitive types `byte` and `short`. This means there is extra overhead in the transform process to use the larger, class-based versions of the data types.
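This mirrors the standard JDK functional interfaces, where primitive specializations exist only for `int`, `long`, and `double`; any other pairing falls back to the generic, boxed `Function`. A small, self-contained Java illustration:

```java
import java.util.function.Function;
import java.util.function.LongToDoubleFunction;

public class BoxingDemo {
    public static void main(String[] args) {
        // long -> double has a primitive specialization: no boxing occurs.
        LongToDoubleFunction centsToDollars = (long cents) -> cents / 100.0;

        // byte -> short has no primitive specialization, so the generic
        // Function interface is used and every value is boxed/unboxed.
        Function<Byte, Short> widenAndDouble = (Byte b) -> (short) (b * 2);

        System.out.println(centsToDollars.applyAsDouble(250L)); // 2.5
        System.out.println(widenAndDouble.apply((byte) 21));    // 42
    }
}
```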
Note

This concern with boxing of primitive types applies only to transforms. If the data can be implicitly converted (e.g., mapping a `short` to an `int`), or converted with a formula, then no boxing occurs. If the translation is too complex for a formula, another option for such primitive types is to write a custom `FieldWriter`. Since the `FieldWriter` classes are not interfaced through lambda functions, they have no limitations on which primitive data types they can use for source or target.
Another thing to be aware of is that primitive types with a "null" default will be set to values from `com.illumon.iris.db.tables.utils.QueryConstants` when their sources are null. While these may be boxed when they are passed into a transform, they will not be automatically converted to true null values. If special handling is needed for null values in the transform, either check for equality with the `QueryConstants` values, or use a boxed type for the `sourceType` attribute of the column so these values are preserved as null.
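For example, here is a minimal sketch of a null-aware transform body for a `long`-typed column, assuming the standard null sentinel constant `QueryConstants.NULL_LONG` and that `QueryConstants` is in scope (add it via Imports if it is not):

```xml
<ImportTransform name="NullAwareDouble">
  <Body>
    getFunction(ImporterColumnDefinition col, boolean strict) {
        // Propagate the reserved null sentinel instead of doubling it.
        return (long value) -> value == QueryConstants.NULL_LONG
                ? QueryConstants.NULL_LONG
                : 2 * value;
    }
  </Body>
</ImportTransform>
```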
## Array types

The importers support `long[]` and `double[]` array types. From the source, these are expected to be strings (either a string field from JDBC or a column value in CSV). The string is formatted as a bracketed list of values. Brackets can be square brackets `[ ]`, parentheses `( )`, or curly braces `{ }`. The delimiter can be specified with `arrayDelimiter`; the default delimiter is a comma. Empty arrays `[]` are also supported.
An example string representing an array of doubles follows:

```
[1.5, 6.4, 8.0, 2.2322, 88, 16.02]
```