# Schema ImportSource

The `ImportSource` component of a schema provides additional metadata to control and modify the import process.
<Table name="JDBCNames" namespace="Import" storageType="NestedPartitionedOnDisk">
<ImportSource name="JDBC1" type="JDBC">
<ImportColumn name="FirstName" sourceName="first_name" />
<ImportColumn name="LastName" sourceName="last_name" />
<ImportColumn name="Gender" sourceName="gender" />
<ImportColumn name="Race" sourceName="race" />
<ImportColumn name="FirstNameLength" sourceName="first_name" formula="FirstNameLength.length()" sourceType="String" />
</ImportSource>
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
<Column name="Date" intradayType="none" dataType="String" columnType="Partitioning" />
<Column name="FirstName" dataType="String" columnType="Normal" />
<Column name="LastName" dataType="String" columnType="Grouping" />
<Column name="Gender" dataType="String" columnType="Normal" />
<Column name="Race" dataType="String" columnType="Normal" />
<Column name="FirstNameLength" dataType="int" columnType="Normal" />
</Table>
Multiple `ImportSource` blocks can be added to provide different import instructions for different sources or types of import. The `type` attribute is required; currently supported values are "CSV", "JDBC", "JSON", and "XML". The `name` attribute is optional. When the import runs, if no `ImportSource` name is specified, the importer will use the first `ImportSource` block that matches its type. If no matching block is found (for `type` and, if specified, for `name`), then the import will attempt to auto-match columns as it would if no `ImportSource` blocks existed.
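For instance, a table populated from both CSV files and a JDBC source could declare one block of each type. The following is an illustrative sketch; the block names and column mappings are hypothetical:

```xml
<ImportSource name="DailyCSV" type="CSV">
  <!-- The CSV files use lowercase, underscore-separated headers. -->
  <ImportColumn name="FirstName" sourceName="first_name" />
</ImportSource>
<ImportSource name="JDBC1" type="JDBC">
  <!-- The JDBC source uses uppercase column names. -->
  <ImportColumn name="FirstName" sourceName="FIRST_NAME" />
</ImportSource>
```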
To support multi-partition imports, you must have an `ImportColumn` for the partition column indicating how to partition each row (e.g., `<ImportColumn name="Date" sourceName="trade_date"/>`, where `trade_date` is the CSV/JDBC column name).
An `ImportSource` block can contain multiple types of entries, including:

- `ImportColumn`
- `Imports`
- `maxError`
- `arrayDelimiter`
- `ImportTransform`

These options are described below.
## ImportColumn

Each entry describes how to map values for one column from the source data to the target table's schema.
| Attribute | Description |
|---|---|
| `name` | The name of the target column in the Deephaven table, which must exactly match the `name` attribute of one of this schema's `Column` elements. This attribute is required; all others are optional. Note: an `ImportColumn` element with no attributes besides `name` provides no special handling for that column and is effectively ignored. |
| `sourceName` | The name of the column in the source data whose values should be mapped into the target column specified by `name`. Use this to map from a different column name in the source table, such as when a source name is not a valid Deephaven column name. Also use this attribute on the partition column when configuring multi-partition imports. |
| `sourceType` | The data type of the source data, used when data types must be translated during the import process. If `sourceType` is not specified, the source data is expected to have the type specified in the `Column` element that corresponds to the `ImportColumn`. For example, `sourceType="long"` may be required when importing a timestamp stored as a long value into a `DBDateTime` column; such a conversion cannot be done implicitly and requires a `sourceType` to tell the importer to read the source value as a long. Note: `sourceType` is case-sensitive: `long` is a primitive long value, while `Long` is interpreted as `java.lang.Long`, an object type. |
| `nullable` | Indicates whether null values will be accepted for this column. If present, the `nullable` attribute must be "true" or "false"; false is the default. A column marked as nullable can be skipped during the import (i.e., if it doesn't exist in the source) and be set to null in all rows. A nullable column will also receive a null value when a null value, or no value, is read from the source. Note: for primitive types, which do not natively support null values, the database uses reserved values to represent null; the null default values from `QueryConstants` will be used. If a default is defined, the default will be used instead of a null value. |
| `formula` | Java code that performs simple manipulations of data based on the source value itself. In the example noted earlier, the `FirstNameLength` `ImportColumn` uses a formula to calculate its value. Note that the value being read during the import is passed to the formula using the `ImportColumn`'s `name`. Formulas can use built-in Java methods, like `replace()` for strings, functions that are built into the importer classes, or functions imported from other packages (see: Imports). When compiling the field writer, the importer encloses the formula in a try/catch block, in case methods called within the formula require exception handling. A caught exception is rethrown as a `RuntimeException` with a message indicating that the exception was caught in a formula, with the original message appended. |
| `transform` | Similar to `formula`, but `transform` creates a new function during the import using the details of the `ImportColumn` element. Transforms create Java lambda functions that are then applied to the input value. An example of a built-in transform function is `dbDateTimeFromLong`. This function creates a lambda that converts a `long` to a `DBDateTime`, and uses another attribute (`transformPrecision`) to select whether the resultant function expects milliseconds, microseconds, or nanoseconds from Unix epoch. |
| `default` | The value to use when import or parsing of the source value fails or the source is null. The default value must be type-compatible with the target column type; for boolean columns, this would be lowercase "true" or "false". The literal "null" may be used as a default, and results in an appropriately typed null value (from `com.illumon.iris.db.tables.utils.QueryConstants` for primitive types, or an actual null value for object types). A specified default takes precedence over null handling when a column is defined as nullable. In the special case where the lowercase string "null" is desired as the default value for a String column, this can be accomplished by setting the default to some known value (or null) and using a formula to replace that value with the literal string "null". |
| `class` | The full name of a custom class to use as the field writer for the column. Custom field writer classes must derive from the `CsvFieldWriter` or `JdbcFieldWriter` class, as appropriate to the import type. If `class` is specified, `formula`, `transform`, `sourceName`, and `sourceType` are ignored and not used by the importer itself, though they may be used within the custom field writer. |
| `cacheable` | (CSV only) If this attribute is set to "false", caching is disabled for this column. If this attribute is not present, or is set to "true", caching is enabled, assuming the column meets one of the other criteria for caching (a complex non-String column type, a formula, or a transform). Caching is useful for values that require conversion or calculation work during the import, but may not be helpful if values in the column are rarely duplicated. |
| `cacheSize` | (CSV only) Overrides the cache size for columns that will be cached. The default size is 1024 entries. |
| ____ | Other attributes as needed to control transform functions (e.g., `dbDateTimeFromTimestamp` uses `transformPrecision` to control its behavior). See details below. |
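As an illustration of several of these attributes used together, the following hypothetical column maps a differently named source field and normalizes it with a formula (all names here are illustrative):

```xml
<!-- The source value is available inside the formula under the
     column's own name ("FullName"), per the formula attribute above. -->
<ImportColumn name="FullName"
              sourceName="full_name"
              sourceType="String"
              formula="FullName.trim().toUpperCase()" />
```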
## Imports

The packages to import. A default set of packages is imported to support built-in functions and import tasks, but additional packages can be specified to provide other functions or support custom transforms. Imports can be just the qualified package/class name, or the full `import qualified.package.name.class;` line.
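A minimal sketch of what this might look like, assuming each extra import is supplied as the text of an `Imports` entry in the `ImportSource` block (the element name here follows this section's title, and the class names are hypothetical; verify the exact syntax against your schema tooling):

```xml
<Imports>com.example.util.NameUtils</Imports>
<Imports>import java.math.BigDecimal;</Imports>
```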
Besides the internal imports used by the field writers, these imports are automatically added in the field writer context:

- `java.util.Properties`
- `com.illumon.iris.db.tables.utils.DBDateTime`
- `com.illumon.iris.db.tables.utils.DBTimeUtils`
- `com.fishlib.util.DateUtil`
- `java.util.regex.Pattern`
- `java.time.format.DateTimeFormatter`
- `java.time.*`
## maxError

The maximum number of allowed parsing errors before the import fails. The default is 0. Errors that count against `maxError` include things like failed numeric conversions, overflows, and mismatched values (e.g., a Boolean value being set from a String column whose value doesn't match any known representation of true or false).
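For example, assuming the limit is given as the text of a `maxError` entry in the `ImportSource` block, like the other options in this section (a hypothetical sketch; verify the exact syntax against your schema tooling), allowing up to ten parsing errors might look like:

```xml
<maxError>10</maxError>
```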
## arrayDelimiter

A string to be used as the delimiter when parsing array data. The default delimiter is a comma.
## ImportTransform

See Custom Transforms below.

- `name` - The name of a custom transform function; this is the name sought when an `ImportColumn` specifies a custom transform function.
- `body` - The Java code that will be compiled as part of the setter for an `ImportColumn` that uses the custom transform.
## Custom transforms

Custom transforms can be added to the `ImportSource` block as `ImportTransform` elements. These elements must:

- Have a `Body` element that contains the text of the transform function.
- Have a `name` attribute that matches the name used in the `transform` attribute of the `ImportColumn` elements that use the transform.
- Implement a `getFunction` function that returns a function matching the specification of the interface.
At runtime, `getFunction` is passed the `ImportColumnDefinition` of the column that is being set, and a boolean value indicating whether strict (`true`) or permissive (`false`) data checking has been selected. The `ImportColumnDefinition` allows `getFunction` to access details like the target column name, the target column type, and any `ImportColumn` details via `getImportColumn()`, which provides access to functions such as `getAttributeValue()`.
The following shows an example of a custom transform element that could be specified in the schema (note: this is actually the built-in `dbDateTimeFromLong` transform):
```xml
<ImportTransform name="Sample">
  <Body>
    getFunction(ImporterColumnDefinition col, boolean strict) {
        String precision = col.getImportColumn().getAttributeValue("transformPrecision");
        switch (precision) {
            case "millis":
                return (long value) -> new DBDateTime(DBTimeUtils.millisToNanos(value));
            case "micros":
                return (long value) -> new DBDateTime(DBTimeUtils.microsToNanos(value));
            case "nanos":
                return DBDateTime::new;
            default:
                return null;
        }
    }
  </Body>
</ImportTransform>
```
When compiling the transform function, the target and, if defined, source data types are used to find a suitable Java functional interface with which to declare `getFunction`. In the case of the example above, the target column type is `DBDateTime`, and the `sourceType="long"` attribute is added to the `ImportColumn`. This combination of source and target types is then used to create and execute the resultant function as a `LongFunction<DBDateTime>`.
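For context, an `ImportColumn` that routes its source value through this transform might look like the following sketch (the column and source names are illustrative):

```xml
<!-- sourceType="long" selects the long-based functional interface;
     transformPrecision is read by the transform via getAttributeValue(). -->
<ImportColumn name="Timestamp"
              sourceName="ts_millis"
              sourceType="long"
              transform="Sample"
              transformPrecision="millis" />
```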
A much simpler transform example follows:
```xml
<ImportTransform name="Sample">
  <Body>
    getFunction(ImporterColumnDefinition column, boolean strict) {
        return (long value) -> (2 * value);
    }
  </Body>
</ImportTransform>
```
If this is used for a column whose type is `long`, this maps to a `LongUnaryOperator`, which takes a long and returns a long. If the `sourceType` is not specified, the importer attempts to use the target column type when matching a functional interface.
If the function being returned from `getFunction` has a declared return type, that return type must exactly match the target column type for which the transform will be used. For example, a transform that provides values for a `Long` column and has a declared return type must be declared to return `Long`, not `long`. However, the simpler, and recommended, implementation is to include only the transform body in the return from `getFunction`, so the system can perform implicit type conversion when compiling the transform.
Custom transforms can be added from the Schema Editor File menu using the View/Edit Schema XML option.
## Primitive types and boxed types in transforms

Not every combination of primitive types is supported by the lambda interfaces used to pass transform functions into field writers. The supported combinations of unboxed primitive types are any combination of source and target among `int`, `double`, and `long`.
Therefore, a transform that reads a `long` value from the source and transforms it to a `double` value in the target table would use the primitive types `long` and `double`.
However, a transform that reads a `byte` value and writes a `short` value would not be able to use primitive types. The importer will map this transform as a `Function<Byte, Short>`. Note the capitalization: `Byte` and `Short` are the boxed (class-based) versions of the primitive types `byte` and `short`. This means there is extra overhead in the transform process to use the larger, class-based versions of the data types.
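This mirrors the standard JDK functional interfaces, where primitive specializations exist only for `int`, `long`, and `double`; any other pairing falls back to the generic, boxed `Function`. A small, self-contained Java illustration:

```java
import java.util.function.Function;
import java.util.function.LongToDoubleFunction;

public class BoxingDemo {
    public static void main(String[] args) {
        // long -> double has a primitive specialization: no boxing occurs.
        LongToDoubleFunction centsToDollars = (long cents) -> cents / 100.0;

        // byte -> short has no primitive specialization, so the generic
        // Function interface is used and every value is boxed/unboxed.
        Function<Byte, Short> widenAndDouble = (Byte b) -> (short) (b * 2);

        System.out.println(centsToDollars.applyAsDouble(250L)); // 2.5
        System.out.println(widenAndDouble.apply((byte) 21));    // 42
    }
}
```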
Note

This concern with boxing of primitive types applies only to transforms. If the data can be implicitly converted (e.g., mapping a `short` to an `int`), or converted with a formula, then no boxing occurs. If the translation is too complex for a formula, another option for such primitive types is to write a custom `FieldWriter`. Since the `FieldWriter` classes are not interfaced through lambda functions, they have no limitations on which primitive data types they can use for source or target.
Another thing to be aware of is that primitive types with a "null" default will be set to values from `com.illumon.iris.db.tables.utils.QueryConstants` when their sources are null. While these may be boxed when they are passed into a transform, they will not be automatically converted to true null values. If special handling is needed for null values in the transform, either check for equality with the `QueryConstants` values, or use a boxed type for the `sourceType` attribute of the column so these values are preserved as null.
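For example, here is a minimal sketch of a null-aware transform body for a `long`-typed column, assuming the standard null sentinel constant `QueryConstants.NULL_LONG` and that `QueryConstants` is in scope (add it via Imports if it is not):

```xml
<ImportTransform name="NullAwareDouble">
  <Body>
    getFunction(ImporterColumnDefinition col, boolean strict) {
        // Propagate the reserved null sentinel instead of doubling it.
        return (long value) -> value == QueryConstants.NULL_LONG
                ? QueryConstants.NULL_LONG
                : 2 * value;
    }
  </Body>
</ImportTransform>
```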
## Array types

The importers support `long[]` and `double[]` array types. From the source, these are expected to be strings (either a string field from JDBC or a column value in CSV). The string is formatted as a bracketed list of values. Brackets can be square brackets `[ ]`, parentheses `( )`, or curly braces `{ }`. The delimiter can be specified with `arrayDelimiter`; the default delimiter is a comma. Empty arrays `[]` are also supported.
An example string representing an array of doubles follows:

```
[1.5, 6.4, 8.0, 2.2322, 88, 16.02]
```