Tables & schemas
A Deephaven table schema is an XML document that describes a table. A schema must include a table's namespace, name, column names and data types. In addition to specifying the structure of the data, a schema can include directives controlling how data is imported and stored.
Schema files can be created by directly constructing XML files (e.g., using a text editor), or can be generated using built-in tools. In most cases, the easiest way to create or edit a schema file is by using the Deephaven Schema Editor tool. The Schema Editor itself can be used to create a new schema from scratch, edit an existing schema, or derive a new schema from example data.
If a schema is created manually, this is typically done by picking an existing schema file to use as a starting point, and then copying, pasting, and modifying it to create a suitable layout for the new table. This does not require any special tools, but it can be tedious for complex tables, or when many tables are needed. To simplify this process, Deephaven includes tools that can analyze data sets and infer column details and import instructions, and that can also generate and validate sections and entries within schema files. The schema creation tools make schema creation faster and easier, and help avoid issues related to formatting and syntax. These tools are described briefly in the section on Schema Inference, and in detail in the chapter covering each specific data type (CSV, XML, etc.).
Once a schema is defined, it must be deployed to take effect. If the table is intended for use by streaming data, an additional step is required to generate loggers and listeners. This is discussed further in Deploying Schemas to Deephaven below.
Note
This part of the documentation covers general schema requirements and features. Additional, optional sections of a schema are detailed in Importing Batch Data and Importing Streaming Data.
Basic Table Attributes
Metadata is imported to the database using XML-based schema files that contain definitions like the following examples:
<Table namespace="ExampleNamespace" name="ExampleTable" storageType="NestedPartitionedOnDisk">
<Partitions keyFormula="${autobalance_single}"/>
<Column name="Date" dataType="String" columnType="Partitioning" />
<Column name="Alpha" dataType="String" columnType="Grouping"/>
<Column name="Bravo" dataType="byte" />
<Column name="Charlie" dataType="int" />
<Column name="Delta" dataType="double" />
<Column name="Echo" dataType="java.util.List" />
</Table>
<Table namespace="Ise" name="Quotes" storageType="NestedPartitionedOnDisk" defaultMergeFormat="DeephavenV1">
<Partitions keyFormula="${autobalance_by_first_grouping_column}"/>
<Column name="Date" dataType="String" columnType="Partitioning" />
<Column name="Ticker" dataType="String" columnType="Grouping" />
<Column name="Size" dataType="int" columnType="Normal" />
<Column name="Price" dataType="double" columnType="Normal" />
</Table>
<Table namespace="Korea" name="Securities" storageType="SplayedOnDisk">
<Column name="UnderlyingId" dataType="int" columnType="Grouping" />
<Column name="Id" dataType="int" columnType="Normal" />
<Column name="Ticker" dataType="String" columnType="Normal" />
<Column name="SecurityType" dataType="String" columnType="Normal" />
<Column name="Expiration" dataType="DateTime" columnType="Normal" />
</Table>
Each schema file contains definitions for exactly one table, and must follow the naming convention <namespace>.<table>.schema
. For example, the above examples would be named "Ise.Quotes.schema" and "Korea.Securities.schema".
The following attributes are necessary for all Table elements:
Attribute | Description |
---|---|
namespace | The system namespace (e.g., Ise or Korea) to which this table belongs. A namespace is analogous to a folder on the filesystem: namespaces are used to organize the data, and one namespace typically contains many tables (just as a folder typically contains many files). |
name | The name of this table (e.g., Quotes or Securities). |
storageType | Specifies the way data for this table will be organized on disk. Options are NestedPartitionedOnDisk (used for partitioned tables; see Partitions below) and SplayedOnDisk (used for non-partitioned tables, such as Korea.Securities above). |
defaultMergeFormat | Specifies the format in which merged (historical) data will be written: DeephavenV1 (the traditional Deephaven format) or Parquet. |
Columns
The <Column>
elements contain the detailed information to define each table's columns, including the column's name, data type, and role as a partitioning or grouping column.
These elements can contain the following attributes; at minimum, name and dataType are required:
Attribute | Optional/Required | Default | Description |
---|---|---|---|
name | Required | N/A | The name of the column as it will be stored in the database. |
dataType | Required | N/A | The type of the data stored in the column. This can be a Java primitive (e.g., byte, int, long, double), String, DateTime, StringSet, or a fully qualified Java class name (e.g., java.util.List). Integer will be automatically treated as int. This treatment does not apply, however, to array types; an array of 32-bit values must be declared as int[]. |
columnType | Optional | Normal | The role of the column in the table: Normal, Partitioning (see Partitions below), or Grouping. |
objectCodec | Optional | None | A Java class that controls how the column is stored on disk. See Column Codecs below for details on the provided codecs. |
objectCodecArguments | Optional | None | Argument(s) provided to the objectCodec. How this value is interpreted depends on which codec is specified. |
symbolTable | Optional | ColumnLocation | For String columns only, this specifies whether or not a symbol table will be used to store the data. Supported values are ColumnLocation and None. If not specified, ColumnLocation is used. See: Symbol Tables |
encoding | Optional | ISO_8859_1 | For String columns only, this determines the encoding used to store the value. Supported values include ISO_8859_1, UTF_8, US_ASCII, UTF_16, UTF_16BE, and UTF_16LE. If not specified, ISO_8859_1 is used. |
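For illustration, the following column definitions show the optional symbolTable and encoding attributes in use. This is only a sketch: the column names are hypothetical, and the attribute values are simply examples of the supported settings described above.

<Column name="Ticker" dataType="String" columnType="Grouping" symbolTable="ColumnLocation" />
<Column name="Description" dataType="String" symbolTable="None" encoding="UTF_8" />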
Column Codecs
Column codecs are used to provide custom deserialization logic for complex data types.
Codecs convert between raw bytes (i.e., the Java type byte[]
) and the column's data type (as specified by
the dataType
attribute). The output type of the codec must match the column's type. Additional codecs can be provided
by adding custom implementations of the com.illumon.util.codec.ObjectCodec<TYPE>
interface to the classpath (for
example, in /etc/sysconfig/illumon.d/java_lib
).
The following column codecs are provided with Deephaven. For columns of these types, it is much more space efficient to use the corresponding codec than to rely on the default java serialization.
Class Name | Column Type | Codec Description | Arguments | Argument Descriptions |
---|---|---|---|---|
com.illumon.util.codec.BigDecimalCodec | java.math.BigDecimal | Encodes java.math.BigDecimal values. | [<precision> [, <scale>]] | The precision argument specifies the maximum number of decimal digits that may be stored. The scale argument specifies the maximum number of decimal digits after the decimal point. If precision is specified but scale is not, a scale of 0 is inferred. If precision is specified, a space-efficient fixed-size encoding is used. Otherwise, a variable-size encoding is used, with a maximum precision of 1000. Null values are permitted. |
com.illumon.util.codec.BigIntegerCodec | java.math.BigInteger | Encodes java.math.BigInteger values. | [<precision>] | If specified, precision is used to specify the maximum number of decimal digits storable by this column. This codec uses the same encoding as BigDecimalCodec with a scale of zero. Null values are permitted. |
com.illumon.util.codec.ByteArrayCodec | byte[] | Encodes an array of bytes. | [<size> [, nullable \| notnull]] | If size is specified, values are encoded as fixed-size values. If not, a variable-size encoding is used. A flag indicating nullability may also be specified. The default is nullable. Specifying notnull saves one byte per value. |
com.illumon.util.codec.StringBooleanMapCodec | java.util.Map<String, Boolean> | Encodes a map with String keys and Boolean values. | None | N/A |
com.illumon.util.codec.StringDoubleMapCodec | java.util.Map<String, Double> | Encodes a map with String keys and Double values. | None | N/A |
com.illumon.util.codec.StringFloatMapCodec | java.util.Map<String, Float> | Encodes a map with String keys and Float values. | None | N/A |
com.illumon.util.codec.StringLongMapCodec | java.util.Map<String, Long> | Encodes a map with String keys and Long values. | None | N/A |
com.illumon.util.codec.StringIntMapCodec | java.util.Map<String, Integer> | Encodes a map with String keys and Integer values. | None | N/A |
com.illumon.util.codec.StringStringMapCodec | java.util.Map<String, String> | Encodes a map with String keys and String values. | None | N/A |
The following example schema includes columns using the BigDecimalCodec
, BigIntegerCodec
and ByteArrayCodec
codecs described above.
<Table name="CodecsExampleTable" namespace="ExampleNamespace" storageType="NestedPartitionedOnDisk">
<Partitions keyFormula="${autobalance_single}"/>
<Column name="Date" dataType="String" columnType="Partitioning" />
<Column name="Alpha" dataType="String" columnType="Grouping"/>
<Column name="Bravo" dataType="int" />
<Column name="Charlie" dataType="double" />
<Column name="Delta" dataType="double" />
<Column name="Echo" dataType="StringSet" />
<Column name="Foxtrot" dataType="java.math.BigDecimal" objectCodec="com.illumon.util.codec.BigDecimalCodec" objectCodecArguments="13,7"/>
<Column name="Golf" dataType="java.math.BigInteger" objectCodec="com.illumon.util.codec.BigIntegerCodec" objectCodecArguments="13"/>
<Column name="Hotel" dataType="byte[]" objectCodec="com.illumon.util.codec.ByteArrayCodec" objectCodecArguments="32,notnull"/>
</Table>
For streaming data ingestion, the schema must also be updated so that the loggers and listeners use the codec. Please see the section on Column codecs in loggers and listeners for more details.
Partitions
Deephaven has two distinct uses of the term "partition" as it applies to schemas. Both are required for partitioned tables (i.e., those using the NestedPartitionedOnDisk
storage type).
- A partitioning column is a String column that must contain an identifiable value for each set of data appended to the table. This is required for all partitioned tables.
- The historical storage partitions, along with the partitioning key formula, are used to distribute table data across multiple storage locations as the data is merged into the historical database. This is required for all partitioned tables.
Typically, date strings (such as 2023-01-31
) are used for the partitioning column value, but any string that
meaningfully partitions the data can be used. Table data is stored on disk in separate directories based on the
partitioning column value, which allows queries to filter data very efficiently based on partition. For example, if a
table is partitioned by date, then a filter such as .where("Date=`2023-01-31`", "Sym=`AAPL`")
would only need to
read data from directories for the date 2023-01-31
. Since most queries must filter by date when initially retrieving
data, it is generally recommended to use dates as partitioning columns. Further,
the ISO 8601 date format of yyyy-MM-dd
(e.g., 2023-01-31
) is recommended.
Unlike other date formats, the yyyy-MM-dd
format can easily be sorted because its lexicographical order matches its
chronological order.
Partitioning columns are specified by using the columnType="Partitioning"
attribute in a column definition. Currently, partitioned tables must have exactly one such column. Further information on column definitions is provided below.
When writing merged historical table data to disk, Deephaven will place partitions of the input data in a directory identified by the namespace, storage partition, column partition, and table name.
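For example, on a typical installation, merged historical data for the Ise/Quotes table might be written under directories of the following form. This path is a sketch for illustration only; the storage partition name (0 here) and the mount points behind it depend on how the cluster's historical partitions are configured.

/db/Systems/Ise/Partitions/0/2023-01-31/Quotes/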
Partition Key Formulas
Historical storage partitions correspond to directories available for historical storage on the disk; for scaling and performance benefits, these are normally mapped to different physical storage devices, whether local or network-connected.
Historical storage partitions are specified using the Partitions element, which uses the keyFormula
attribute to map table rows to the available storage partitions. Specifically, merge operations apply this formula in an updateView operation, the result of which is used to assign rows to a storage partition.
The following QueryScope
parameters are available as input to the formula, in addition to the columns of the table:
Parameter | Description |
---|---|
__WRITABLE_PARTITIONS__ | An array of partition name strings corresponding to directories found at /db/Systems/Namespace/WritablePartitions . |
__NUM_WRITABLE_PARTITIONS__ | The number of available partitions (i.e., __WRITABLE_PARTITIONS__.length ) |
__NAMESPACE__ | A string representing the namespace. |
__TABLE_NAME__ | A string representing the table name. |
__NS_TN__ | A string representing the namespace and table name, joined with slash as a separator (e.g., Ise/Quotes or Korea/Securities ) |
__PARTITIONING_COLUMN_VALUE__ | The partitioning column value for this merge operation. |
__DAY_OF_YEAR__ | The day of the year, only available when __PARTITIONING_COLUMN_VALUE__ appears to be a date in YYYY-MM-DD form. |
__PARTITION_AUTOBALANCE_SINGLE__ | A complete formula that automatically chooses a single partition for all data based on __NS_TN__ and __PARTITIONING_COLUMN_VALUE__, attempting to achieve balanced partition usage over long time horizons. Deprecated in favor of ${autobalance_single}. |
In addition, the following substitutions will be applied, in order to make writing appropriate partitioning formulas easier:
Token | Substitution Description |
---|---|
${autobalance_single} | A complete formula that chooses a single partition for all input data, attempting to achieve balanced partition usage over long time horizons. The preferred way to write a formula that only refers to the __PARTITION_AUTOBALANCE_SINGLE__ query scope parameter. |
${autobalance_by_first_grouping_column} | A complete formula that uses the first grouping column (if there is one) in order to spread data across all available partitions, attempting to achieve balanced partition usage over long time horizons. Assuming the first grouping column is the most significant, keeps all grouped data together. Falls back to the same behavior as ${autobalance_single} if there are no grouping columns. |
${autobalance_by_all_grouping_columns} | A complete formula that uses all grouping columns (if there are any) in order to spread data across all available partitions, attempting to achieve balanced partition usage over long time horizons. Effectively spreads grouped data across multiple partitions, which may be useful for queries. Falls back to the same behavior as ${autobalance_single} if there are no grouping columns. |
${first_grouping_column} | A string representing the name of the first grouping column (i.e., the first column definition with the attribute columnType="Grouping" ). Only available if there are grouping columns. |
${all_grouping_columns} | An array of strings representing the names of the grouping columns (i.e., column definitions with the attribute columnType="Grouping" ). Only available if there are grouping columns. |
Conventions and Recommendations
Some common conventions follow:
- Large tables: Tables with large amounts of data per column partition typically try to balance across all available storage partitions while keeping related data together, often using a grouping column as part of the formula to achieve this. The easiest way to do so is to rely on token substitution and use the formula ${autobalance_by_first_grouping_column}.
- For an expanded example, the Ise/Quotes table above might use the following formula (a full Partitions element using this formula is sketched after this list): __WRITABLE_PARTITIONS__[abs(Objects.hash(__NS_TN__, Ticker) + __DAY_OF_YEAR__) % __NUM_WRITABLE_PARTITIONS__] This serves to keep data with the same value for the Ticker column together, while spreading data across all available partitions and taking into account day of year, namespace, and table name in order to approximate long-term load balancing.
- Small tables: Tables with small amounts of data per column partition often simply use ${autobalance_single}, keeping data in a relatively compact format while utilizing all historical partitions over longer time horizons in order to approximate load balancing.
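Putting the expanded example above into a schema, the Partitions element for the Ise/Quotes table could look like the following sketch. The formula is the one shown in the expanded example; any expression over the available query scope parameters and table columns could be used instead.

<Partitions keyFormula="__WRITABLE_PARTITIONS__[abs(Objects.hash(__NS_TN__, Ticker) + __DAY_OF_YEAR__) % __NUM_WRITABLE_PARTITIONS__]"/>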
Merge Attributes
When merging data to historical partitions, Deephaven supports writing to traditional Deephaven files as well as writing to Parquet files. When writing to Parquet, a default compression codec can be chosen by adding a MergeAttributes element with an appropriate Parquet-supported codec. This codec can be changed by specifying a different one in the merge operation. If no codec is chosen, then the SNAPPY codec is used. This element does not apply to merges in Deephaven format.
Note
See: Parquet Codecs
For example, to use the uncompressed codec:
<MergeAttributes format="Parquet" codec="UNCOMPRESSED" />
Data Indexes
In addition to Grouping, Deephaven supports the traditional concept of indexing historical data. A Data Index is a table on disk that maps each unique key (defined by the set of index columns) to the row indices within the table at which that key appears. These tables allow the Deephaven Query Engine to optimize certain operations such as filtering, aggregating, and joining data.
Data Indexes are defined using the <DataIndex columns='...'>
XML tag in the schema. Each <DataIndex...>
entry directs the Merge process to group the data being merged by the set of columns (called keys) in the columns
attribute. The Data Index is then written to disk, sorted in the same order as the columns are specified in the <DataIndex...>
tag. Choosing an ordering of these
columns based on how they are expected to be used in queries can improve the performance of operations that can take advantage of Data Indexes.
For example, the following snippet defines two separate Data Indexes, one on USym, and one on USym and Exchange:
<DataIndex columns="USym">
<DataIndex columns="USym,Exchange">
Schema inference
Schema inference from source data can be performed from the command line or through the Schema Editor interface. The available schema generator tools, and the command lines used to invoke them, are listed below. You may use iris_exec to invoke any of the schema creators without arguments to display the available options. The arguments are also described in detail in the section covering the specific data type (CSV/JDBC/XML/JSON).
Note
See: To learn more about iris_exec commands, please refer to Running Local Scripts.
Command line interface
- iris_exec csv_schema_creator <launch args> -- <schema creator args>
- iris_exec jdbc_schema_creator <launch args> -- <schema creator args>
- iris_exec xml_schema_creator <launch args> -- <schema creator args>
- iris_exec json_schema_creator <launch args> -- <schema creator args>
The table schema files created from these processes also support merge operations, and allow specification of a grouping column and/or the partitioning column name. After a schema file is generated, it is highly recommended that the user review the schema and the log file, and update details as needed. You must then deploy the schema and generate loggers/listeners. See below for more details on the contents of a generated schema file and the types of changes that might be needed.
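For example, a CSV schema creator run might look like the following sketch. The flag names shown here (-ns for namespace, -tn for table name, -sf for the source file) are typical schema creator arguments, but they are assumptions for illustration; run the command without arguments to list the exact options for your version.

iris_exec csv_schema_creator -- -ns ExampleNamespace -tn ExampleTable -sf /path/to/sample_data.csv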
CSV
CSV refers to a variety of text file data formats. Although they are called "comma-separated", the delimiter could in fact be some other character (such as a tab or semicolon). Other details of formatting are similarly flexible. There is no single official standard for CSV formatting, so Deephaven provides flexibility when defining how to interpret and transform text data from CSV files into structured storage in Deephaven tables.
Because CSV is a text file format that provides no metadata, the schema generation tool must analyze the data that it finds in a sample data file. Typically column names are derived from column headers, but there are options for CSV files with and without headers. The schema generation process can be simplified and accelerated somewhat by using a sample of data, as long as there are enough rows to clearly demonstrate the formats and ranges of data values that will be included in normal imports.
For non-String columns, the schema generator will scan all the values in the supplied CSV file to ensure the inferred data type remains valid for all values in the column. As an example, the first value for a column might be 4. This would cause the schema generator to tentatively categorize this column as an integer type (short or long, depending on settings). If a 5.5 was encountered in a subsequent row, the generator would "promote" the column to a floating point datatype. If another value in the column included non-numeric characters, such as XYZ, the generator would then mark the column as a String column. Once a column has been designated as a String column, no other checking is done, as Strings are always valid to accept any data from a CSV.
If a column is empty for all rows, it is marked as a String and a warning message is written to the log. In this case, the administrator should examine the generated schema and validate the generated XML before adding it to Deephaven. Date/time values are also a special case, where the schema generator will attempt to match the format of the date/time string to a known format for which it can provide a conversion formula. Like the earlier example with numbers and Strings, if a date/time column has multiple different formats in it, it will end up being marked as a String column, since no single formula could be found that matched all of the column's data.
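The promotion behavior described above can be pictured with the following small Java sketch. This is purely illustrative and is not the Deephaven implementation; it only mirrors the integer-to-floating-point-to-String widening described in this section.

public class TypePromotionSketch {
    // Return the narrowest type (current numeric type, double, or String) that can hold
    // both the previously inferred type and the new value.
    static String promote(String currentType, String value) {
        if ("String".equals(currentType)) {
            return "String";              // Strings accept anything; no further checking is done
        }
        if (value.matches("-?\\d+")) {
            return currentType;           // integral value: keep the current (numeric) type
        }
        if (value.matches("-?\\d*\\.\\d+")) {
            return "double";              // decimal value: promote an integer type to floating point
        }
        return "String";                  // non-numeric value (e.g., "XYZ"): fall back to String
    }

    public static void main(String[] args) {
        String inferred = "int";
        for (String value : new String[] {"4", "5.5", "XYZ"}) {
            inferred = promote(inferred, value);
            System.out.println(value + " -> " + inferred);  // prints int, then double, then String
        }
    }
}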
Note
See: Importing CSV Files
JDBC
Java Database Connectivity is a standard that describes how data store drivers provide common Java interfaces to a wide variety of databases and other data storage mechanisms. Unlike CSV, JDBC provides data type information for columns, and the abilities to filter and manipulate data using a SQL query.
In most cases it is easier to generate schema files from JDBC than from CSV because of the result set metadata that JDBC data sources can provide. The default behavior of the JDBC schema generator is to request data types of the result set columns from the data source. These data types are then mapped to Deephaven data types to provide the schema details.
For cases where the data source cannot provide column data type information, or where the source database is not ideally configured to store the data (e.g., numbers or date/times in String fields), the JDBC schema generator provides options to scan data as is done in the CSV schema generator, or to use a combination of data scanning and metadata.
Note
See: Importing from JDBC
XML
XML (eXtensible Markup Language) is a container format that can contain sequential and/or hierarchical data. Like CSV, XML by itself does not provide or guarantee a particular layout to the data. Simpler data sets that use XML typically have one element per record, are not hierarchical, and use either attributes or element values to store record values. More complex XML feeds may include metadata, common data updates, and hierarchically related records. Deephaven provides "generic" XML data import capabilities that can accommodate most of the simpler forms. The more complex XML data feeds, including those that include data that must be imported to multiple tables in a single operation, require custom importers that are specific to the data format encapsulated within the XML (e.g., FIXML).
Although XML documents can include metadata, there is no standard as to how such column data types and other such information should be formatted. Therefore, for generic XML imports, Deephaven treats XML files similarly to how it handles CSV files. To find column data types, the schema generation tool analyzes all data that it finds in a sample data file. In addition, if named (rather than positional) values are used, the schema generator will also scan all importable elements to find column names that will be needed for the table.
Note
See: Importing XML Files
JSON
JSON is a hierarchical format representing data objects and arrays. In order to convert this format to a Deephaven table, JSON data is recursively converted into a stream of lines equivalent to a CSV file (with one "line" per top-level JSON object). The type inference then operates in the same way as it does on CSV data.
Column names for a table inferred from JSON are derived from the field names contained in the JSON objects. Since there is no "header" in JSON, all or a large portion of the data is explored to enumerate all columns (since any one JSON object may be missing a given field). JSON array elements are converted into column names by combining the field name with a numeric suffix reflecting the position in the array.
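As an illustration of that flattening, a hypothetical JSON record like the one below would produce one row, with each array position expanded into its own column. The exact naming of the array-derived columns may differ; the point is that the field name is combined with the position.

{ "sym": "AAPL", "price": 150.25, "sizes": [100, 200] }

This would yield columns for sym and price, plus two columns derived from the sizes array (for example, sizes0 and sizes1).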
Note
See: Importing JSON Files
From a Table
A schema can be inferred from the column names and types of a Deephaven table using the SchemaService
.
The following commands create and deploy a schema for the system table ExampleNamespace.TableName based on the definition of table
.
import com.illumon.iris.db.schema.SchemaServiceFactory
import com.illumon.iris.db.schema.NamespaceSet
import com.illumon.iris.db.tables.TableDefinition
namespace = "ExampleNamespace"
tableName = "TableName"
table = newTable(stringCol("Letter", "A", "C", "F", "B", "E", "D", "A"),
intCol("Number", NULL_INT, 2, 1, NULL_INT, 4, 5, 3),
stringCol("Color", "red", "blue", "orange", "purple", "yellow", "pink", "blue"),
intCol("Code", 12, 13, 11, NULL_INT, 16, 14, NULL_INT))
schemaService = SchemaServiceFactory.getDefault()
if(!db.getSystemNamespaces().contains(namespace)) {
schemaService.createNamespace(NamespaceSet.SYSTEM, namespace)
}
schema = schemaService.fromDefinition(table.getDefinition(), namespace, tableName,
TableDefinition.STORAGETYPE_NESTEDPARTITIONEDONDISK, NamespaceSet.SYSTEM)
schemaService.addSchema(schema)
Converters and adding custom converters
The schema creators have a number of built-in converters to recognize various date/time formats. Unrecognized formats will result in a date/time field being treated as String. If ambiguous data is included, such that multiple converters match all the data values, the column will also be treated as String, and a warning message will be written to the log.
Built-in Converter | Matching Data |
---|---|
ISO format with date only | 2017-08-30 |
ISO UTC date/time with optional sub-second precision (no T) | 2017-08-30 11:59:59.000Z |
ISO UTC date/time with optional sub-second precision (with T) | 2017-08-30T11:59:59.000Z |
ISO UTC two digit offset date/time with millisecond precision | 2017-08-30T11:59:59.000+04 |
ISO UTC four digit offset date/time with millisecond precision | 2017-08-30T11:59:59.000+0400 |
ISO UTC two digit offset date/time with microsecond precision | 2017-08-30T11:59:59.000123+04 |
ISO UTC four digit offset date/time with microsecond precision | 2017-08-30T11:59:59.000123+0400 |
ISO time zone date/time with optional sub-second precision | 2017-08-30T11:59:59.000 NY |
Exchange timestamp (will be converted to NY time) | 2017-08-30T11:59:59.000 |
Timestamp UTC with optional sub-second precision | 2017-08-30 11:59:59 or 2017-08-30 11:59:59.123456 |
SQL Server default format | AUG 30 2017 11:59AM |
Converters for other formats can be included by adding their details to a property file used by the schema creator process.
Here is an example converter that could be added in a property file:
SchemaCreator.converter.name.1=USDateTimeSeconds1
SchemaCreator.converter.pattern.1=^((0?\\d)|(1[0-2]))\\/(([0-2]?\\d)|(3[0-1]))\\/\\d{4} (([0-1]?\\d)|(2[0-3])):[0-5]\\d:[0-5]\\d$
SchemaCreator.converter.formula.1=new DBDateTime(new java.text.SimpleDateFormat("MM/dd/yyyy HH:mm:ss").parse($TARGETNAME$).getTime()*1000000)
SchemaCreator.converter.sourceType.1=String
SchemaCreator.converter.targetType.1=DateTime
There are five components that comprise the definition for the custom converter. All five are required. Note the format of the property names: SchemaCreator.converter.<attribute name>.<unique index of the converter>. The index value must match across the five properties. The values do not have to start at 1, or be sequential, but it is important to ensure index numbers across multiple property files do not collide.
These five pieces can appear in any order and do not have to be contiguous, but all five pieces must be present if any one of them is present. If an incomplete converter entry is detected when a schema creator process starts, the process will fail and will log details about which components of the definition are missing.
- name - The name attribute is mainly for convenience in identifying a particular converter.
- pattern - The pattern is a regular expression string used to match data the converter can process. It must be very specific so this converter does not overlap any other converters, and so this formula is not given any data that it cannot process. The example regex limits the month portion of strings it will match to numbers less than 13, which helps avoid matches to non-US date strings where the day would be written before the month. When trying to find a converter to use for complex data, the schema generator will try to find one and only one converter whose pattern matches all the non-empty values from the sample data. Note that all backslashes in any property values must be escaped by doubling them. In this example, the regex will match date/time strings like this one: 01/11/1975 11:10:23
- formula - The formula contains a Java expression that will be used to convert the source data to the target data type. It must include the $TARGETNAME$ token, which will be replaced with the column name by the schema generator. In this case, a Java SimpleDateFormat is being used to parse the date/time string into a Java Date. Then the .getTime() method is used to get an epoch milliseconds value for the Date. Deephaven uses epoch nanoseconds, so this value is then multiplied by one million and passed to the DBDateTime creator method to return a DBDateTime.
- sourceType - The sourceType controls how the importer will interpret a value before passing it to the formula. It will usually be String, but if the source value was already an epoch value, for instance, then the source type might be long instead of String.
- targetType - The targetType must match the type produced by the formula. This is used by the schema generator to set the appropriate column type for columns that match the regex pattern.
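As another illustration, the following sketch shows what a converter for 13-digit epoch-millisecond strings could look like. The name, pattern, and index here are assumptions for the example; the formula relies on DBTimeUtils.millisToTime, and DBTimeUtils is among the classes imported by default.

SchemaCreator.converter.name.2=EpochMillisString
SchemaCreator.converter.pattern.2=^\\d{13}$
SchemaCreator.converter.formula.2=DBTimeUtils.millisToTime(Long.parseLong($TARGETNAME$))
SchemaCreator.converter.sourceType.2=String
SchemaCreator.converter.targetType.2=DateTime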
ImportSource XML in a schema file
Additional metadata to control and modify the import process is added in an ImportSource XML block of the schema. This is demonstrated below:
<Table name="CsvNames5" namespace="Import" storageType="NestedPartitionedOnDisk">
<ImportSource name="JDBC1" type="JDBC">
<ImportColumn name="FirstName" sourceName="first_name" />
<ImportColumn name="LastName" sourceName="last_name" />
<ImportColumn name="Gender" sourceName="gender" />
<ImportColumn name="Race" sourceName="race" />
<ImportColumn name="FirstNameLength" sourceName="first_name" formula="FirstNameLength.length()" sourceType="String" />
</ImportSource>
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
<Column name="Date" intradayType="none" dataType="String" columnType="Partitioning" />
<Column name="FirstName" dataType="String" columnType="Normal" />
<Column name="LastName" dataType="String"columnType="Grouping" />
<Column name="Gender" dataType="String" columnType="Normal" />
<Column name="Race" dataType="String" columnType="Normal" />
<Column name="FirstNameLength" dataType="int" columnType="Normal" />
</Table>
Multiple ImportSource blocks can be added to allow various import instructions for different sources or types of import. The type attribute is required; currently supported values are "CSV", "JDBC", "JSON", and "XML". The name attribute is optional. When the import runs, if no ImportSource name is specified, the importer will use the first ImportSource block that matches its type. If no matching block is found (for type, and, if specified, for name), then the import will attempt to auto-match columns as it would if no ImportSource blocks existed.
To support multi-partition imports, you must have an ImportColumn for the partition column indicating how to partition each row (e.g., <ImportColumn name="Date" sourceName="trade_date"/>
where trade_date
is the CSV/JDBC column name).
An ImportSource block can contain multiple types of entries, including ImportColumn, Imports, maxError, arrayDelimiter, and ImportTransform. These options are described below:
ImportColumn
Each entry describes how to map values for one column from the source data to the target table's schema.
Attribute | Description |
---|---|
name | The name of the target column in the Deephaven table, which must exactly match the name attribute of one of this schema's Column elements. This attribute is required; others are optional. Note: An ImportColumn element with no attributes beside name provides no special handling for that column and is effectively ignored. |
sourceName | The name of the column in the source data whose values should be mapped into the target column specified by name. This is used for mapping from a different column name in the source table, such as when a source name is not a valid Deephaven column name. Also use this attribute on the partition column when configuring multi-partition imports. |
sourceType | The data type of the source data. This is used when data types must be translated during the import process. If sourceType is not specified, then the source data is expected to have the type specified in the Column element that corresponds to the ImportColumn . As an example, sourceType="long" may be required when importing a timestamp stored as a long value to a DBDateTime column. Such a conversion cannot be done implicitly and requires a sourceType to tell the importer to read the source value as a long. Note: sourceType is case-sensitive: long is a primitive long value, while Long is interpreted as java.lang.Long , an object type. |
nullable | Indicates whether null values will be accepted for this column. If the nullable attribute is present, it must be "true" or "false". False is the default. A column marked as nullable can be skipped during the import (i.e., if it doesn't exist in the source), and be set to null in all rows. A nullable column will also receive a null value when a null value is read from the source or no value is read from the source. Note: for primitive types, which do not natively support null values, the database uses reserved values to represent null. Null default values from QueryConstants will be used. If a default is defined, the default will be used instead of a null value. |
formula | This allows specification of Java code to do simple manipulations of data based on the source value itself. In the example noted earlier, the FirstNameLength ImportColumn uses a formula to calculate its value. Note that the value being read during the import will be passed to the formula using the ImportColumn 's name. Formulas can use built-in Java methods, like replace() for strings, functions that are built into the importer classes, or functions imported from other packages (see: Imports). When compiling the field writer, the importer will enclose the formula in a try/catch block, in case methods being called within the formula require exception handling. A caught exception will be rethrown as a RuntimeException with a message indicating that the exception was caught in a formula, and with the original message appended. |
transform | Similar to formula , transform creates a new function during the import using the details of the ImportColumn element. Transforms create Java Lambda functions that are then applied to the input value. An example of a built-in transform function is dbDateTimeFromLong . This function will create a Lambda function that converts a long to a DBDateTime, and will use another attribute (transformPrecision ) to select whether the resultant function expects milliseconds, microseconds, or nanoseconds from Unix epoch. |
default | The value to use when import or parsing of the source value fails or the source is null. The default value must be type compatible with the target column type; for boolean columns, this would be lowercase "true" or "false." The literal "null" may be used for a default, and will result in an appropriately typed null value (from com.illumon.iris.db.tables.utils.QueryConstants ) for primitive types, or an actual null value for object types. A specified default takes precedence over null handling when a column is defined as nullable. In the special case where the literal null, lowercase, is desired as the default value for a String, this could be accomplished with a default that sets some known value, or null, and a formula to replace the value with the literal String null. |
class | The full name of a custom class to use as the field writer for the column. Custom field writer classes must derive from the CsvFieldWriter or JdbcFieldWriter class, as appropriate to the import type. If class is specified, formula , transform , sourceName , and sourceType are ignored and not used by the importer itself, though they may be used within the custom field writer. See: details below. |
cacheable | (CSV only) If this attribute is set to "false", caching will be disabled for this column. If this attribute is not present, or is set to "true", caching will be enabled, assuming the column meets one of the other criteria for caching (complex non-String column type, has a formula, or has a transform.) Caching is useful for values that require conversion or calculation work during the import, but may not be helpful if values in the column are rarely duplicated. |
cacheSize | (CSV only) Allows overriding the cache size for columns that will be cached. The default size is 1024 entries. |
____ | Other attributes as needed to control transform functions (e.g., dbDateTimeFromLong uses transformPrecision to control its behavior). See details below. |
Imports
The packages to import. A default set of packages is imported to support built-in functions and import tasks, but additional packages can be specified to provide other functions or support custom transforms. Imports can be just the qualified package/class name, or the full import qualified.package.name.class;
line.
Besides the internal imports used by the field writers, these imports are automatically added in the field writer context:

- java.util.Properties
- com.illumon.iris.db.tables.utils.DBDateTime
- com.illumon.iris.db.tables.utils.DBTimeUtils
- com.fishlib.util.DateUtil
- java.util.regex.Pattern
- java.time.format.DateTimeFormatter
- java.time.*
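For example, an ImportSource block might add an import so that a formula can reference a class outside the default set. The column name and imported class here are hypothetical; any fully qualified class needed by a formula or transform can be listed this way.

<ImportSource name="CSV1" type="CSV">
  <Imports>java.util.Locale</Imports>
  <ImportColumn name="Description" sourceType="String" formula="Description.toUpperCase(Locale.US)" />
</ImportSource>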
maxError
The maximum number of allowed parsing errors before failing the import. The default is 0. Allowed errors that will increment maxError
are things like numeric conversions, overflows, and mismatched values (e.g., a Boolean value being set from a String column that doesn't match any known version of True or False.)
arrayDelimiter
A string to be used as a delimiter when parsing array data. The default delimiter is a comma.
ImportTransform
See Custom Transforms below.
- name - The name of a custom transform function; this will be the name sought when an ImportColumn specifies a custom transform function.
- body - The Java code that will be compiled as part of the setter for an ImportColumn that uses the custom transform.
Custom transforms
Custom transforms can be added to the ImportSource
block as ImportTransform
elements. These elements must:
- Have a Body element that contains the text of the transform function.
- Have a name attribute that matches the name used in the transform attribute of ImportColumns that use the transform.
- Implement a getFunction function that returns a function matching the specification of the interface.
At runtime, getFunction is passed the ImportColumnDefinition of the column that is being set, and a boolean value indicating whether strict (true) or permissive (false) data checking has been selected. The ImportColumnDefinition allows getFunction to access details like the target column name, the target column type, and any ImportColumn details by using getImportColumn(), which provides access to functions such as getAttributeValue().
The following shows an example of a custom transform element that could be specified in the schema (Note: this is actually the built-in dbDateTimeFromLong
transform):
<ImportTransform name="Sample">
<Body>
getFunction(ImporterColumnDefinition col, boolean strict)
{
String precision = col.getImportColumn().getAttributeValue("transformPrecision");
switch (precision) {
case "millis":
return (long value) -> new DBDateTime(DBTimeUtils.millisToNanos(value));
case "micros":
return (long value) -> new DBDateTime(DBTimeUtils.microsToNanos(value));
case "nanos": return DBDateTime::new;
default: return null;
}
</Body>
</ImportTransform>
When compiling the transform function, the target, and, if defined, source, data types are used to find a suitable Java functional interface with which to declare the getFunction
. In the case of the example above, the target column type is DBDateTime, and the sourceType="long"
attribute is added to the ImportColumn
. This combination of source and target types is then used to create and execute the resultant function as a LongFunction<DBDateTime>
.
A much simpler example follows:
<ImportTransform name="Sample">
<Body>
getFunction(ImporterColumnDefinition column, Boolean strict) {
return (long value) -> (2 * value);
}
</Body>
</ImportTransform>
If this is used for a column whose type is long, this will be mapped to a LongUnaryOperator
, which takes a long and returns a long. If the sourceType
is not specified, the importer will attempt to use the target column type for matching a function interface.
If the function being returned from getFunction
has a declared return type, this return type must exactly match the target column type for which the transform will be used. For example, a transform that will provide values for a Long column and that has a declared return type, must be declared Long, not long. However, the simpler, and recommended, implementation is to include only the transform body in the return from getFunction
, so the system can perform implicit type conversion when compiling the transform.
Custom transforms can be added from the Schema Editor File menu using the View/Edit Schema XML option.
Primitive types and boxed types in transforms
Not every combination of primitive types is supported by the Lambda interfaces used to pass transform functions into field writers. The supported combinations for unboxed primitive types are any combination of source and target of int, double, and long.
Therefore, a transform that reads a long value from the source and transforms it to a double value in the target table would use the primitive types of long and double.
However, a transform that reads a byte value and writes a short value would not be able to use primitive types. The importer will map this transform as a Function<Byte, Short>
. Note the capitalization. Byte and Short are the boxed (class-based) versions of the primitive types byte and short. This means there is extra overhead in the transform process to use the larger class-based versions of the data types.
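As a sketch of the boxed case, a transform like the following could be used for an ImportColumn with sourceType="byte" whose target column is a short. The transform name and the doubling logic are hypothetical; the point is only that the lambda receives a boxed Byte and its result is boxed to Short by the Function<Byte, Short> interface.

<ImportTransform name="ByteToShortExample">
  <Body>
    getFunction(ImporterColumnDefinition column, boolean strict) {
        // value arrives boxed as a Byte; the result is boxed to a Short
        return (Byte value) -> (short) (value * 2);
    }
  </Body>
</ImportTransform>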
Note: This concern with boxing of primitive types is only applicable to transforms. If the data can be implicitly converted (e.g., map a short to an int), or converted with a formula, then no boxing occurs. If the translation is too complex for a formula, another option for such primitive types is to write a custom FieldWriter
. Since the FieldWriter
classes are not interfaced through Lamdba functions, they have no limitations on which primitive data types they can use for source or target.
Another thing to be aware of is that primitive types that have a "null" default will be set to values from com.illumon.iris.db.tables.utils.QueryConstants
when their sources are null. While these may be boxed when they are passed into a transform, they will not be automatically converted to true null values. If special handling is needed for null values in the transform, either check for equality to the QueryConstants values, or use a boxed type for the sourceType
property of the column so these values will be preserved as null.
Array types
The importers support long[] and double[] array types. From the source, these are expected to be strings (either a string field from JDBC or a column value in CSV). The string is formatted as a bracketed list of values. Brackets can be square brackets [ ], parentheses ( ), or curly braces { }. The delimiter can be specified with arrayDelimiter; the default delimiter is a comma. Empty arrays [] are also supported.
An example array of double strings follows:
[1.5, 6.4, 8.0, 2.2322, 88, 16.02]
Custom FieldWriter classes
Note
Custom Column Setters are currently supported only with the legacy CSV importer. To use the legacy parser, set the configuration property Importer.useLegacyCsvForImport
to true
(e.g., by adding -DImporter.useLegacyCsvForImport=true
to extra JVM arguments).
Custom FieldWriter
classes can be used where a field writer needs to use values from multiple source columns, or where it is more desirable to package transform logic into a FieldWriter class rather than in an in-line transform or imported code. Custom field writers must extend the source-specific FieldWriter
class (i.e., CsvFieldWriter
for CSV imports or JdbcFieldWriter
for JDBC imports.) They must implement a specific constructor, and must also implement the processField
method.
A custom field writer class project will need the following Deephaven jars added as dependencies:
- IO-<version_number>.jar
- Base-<version_number>.jar
- Ingesters-<version_number>.jar
- DataObjects-<version_number>.jar
- BinaryStore-<version_number>.jar
- IrisDataObjects-<version_number>.jar
An example of a custom field writer for JDBC follows:
package io.deephaven.fieldwriters;

import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.TimeZone;

import com.fishlib.io.logger.Logger;
import com.illumon.iris.binarystore.RowSetter;
import com.illumon.iris.importers.ImporterColumnDefinition;
import com.illumon.iris.importers.JdbcFieldWriter;

/**
 * Sample custom field writer class that takes FirstName and LastName and writes out FirstName + ' ' + LastName
 */
public class FullNameWriter extends JdbcFieldWriter {

    private final RowSetter setter;
    private final ResultSet rs;
    private final ImporterColumnDefinition column;

    // This is the constructor format that is required for custom JdbcFieldWriters
    public FullNameWriter(Logger log, boolean strict, ImporterColumnDefinition column,
                          final TimeZone sourceTimeZone, RowSetter setter, ResultSet rs, String delimiter) {
        super(log, column.getName(), delimiter, sourceTimeZone);
        this.setter = setter;
        this.column = column;
        this.rs = rs;
    }

    @Override
    public void processField() throws IOException {
        try {
            setter.set(rs.getString("FirstName") + " " + rs.getString("LastName"));
        } catch (SQLException e) {
            throw new RuntimeException("Failed to concatenate FirstName and LastName for target field: " + column.getName());
        }
    }
}
A custom field writer for CSV would be similar, but would not include ResultSet
or TimeZone
in its constructor arguments, and would require CsvRecord
as an argument for processField
.
import com.fishlib.io.logger.Logger;
import com.illumon.iris.binarystore.RowSetter;
// CsvFieldWriter and ImporterColumnDefinition are assumed to live in the same package as JdbcFieldWriter
import com.illumon.iris.importers.CsvFieldWriter;
import com.illumon.iris.importers.ImporterColumnDefinition;
import org.apache.commons.csv.CSVRecord;

import java.io.IOException;

public class CsvMultiColumnExample extends CsvFieldWriter {

    private final RowSetter setter;
    private final ImporterColumnDefinition column;

    // This is the constructor format that is required for custom CsvFieldWriters
    public CsvMultiColumnExample(Logger log, boolean strict, ImporterColumnDefinition column, RowSetter setter, String delimiter) {
        super(log, column.getName(), delimiter);
        this.setter = setter;
        this.column = column;
    }

    @Override
    public void processField(CSVRecord record) throws IOException {
        setter.set(record.get("FirstName") + " " + record.get("LastName"));
    }
}
Internally, XML uses CSV import processing, so the CSV custom field writer structure would also be usable for XML imports.
To use the custom field writer in the table, the class attribute of the ImportColumn
is set, and an import entry will be needed to make the custom field writer class available to the import process:
<Imports>io.deephaven.fieldwriters.FullNameWriter</Imports>
...
<ImportColumn name="FullName" class="io.deephaven.fieldwriters.FullNameWriter" sourceName="FirstName" />
Caution
Note: Even though it is not used, sourceName
is required due to a known issue in ImportColumn
processing. Currently, this attribute must exist, and must indicate some other column that does exist in the data source as its value.
This can then be used to provide data for the FullName
column in the table:
<Column name="FullName" dataType="String" columnType="Normal" />
Typically, the custom field writer class will be compiled and packaged in a jar that should be placed under /etc/sysconfig/illumon.d/java_lib
on servers where data imports that need it will be run.
JdbcFieldWriters must extend JdbcFieldWriter and implement a constructor that accepts:

- Logger.class - logger instance passed down from main
- ImporterColumnDefinition.class - column definition of the column to which this field writer will be attached (including ImportColumn and target information)
- RowSetter.class - row setter for the target column
- ResultSet.class - JDBC result set with the current row. The field writer should only read values from result set metadata and current row data, not change the position of the result set or close it.
- String.class - delimiter value used for array parsing functions
- Boolean.class - strict value to indicate whether data checking is strict (true) or relaxed (false)
CsvFieldWriters must extend CsvFieldWriter and implement a constructor that accepts:

- Logger.class - logger instance passed down from main
- ImporterColumnDefinition.class - column definition of the column to which this field writer will be attached (including ImportColumn and target information)
- RowSetter.class - row setter for the target column
- String.class - delimiter value used for array parsing functions
- Boolean.class - strict value to indicate whether data checking is strict (true) or relaxed (false)
Built-in functions and transforms
Some functions and transforms are built into the importer framework and can be used when designing ImportSource
entries. Transforms often require other attributes to control their behavior. Functions can be used in formula attributes; transforms are used in transform attributes.
Transforms
Name | Description |
---|---|
dbDateTimeFromLong | Takes a long input (offset from Unix epoch) and returns a DBDateTime. The attribute transformPrecision should be set to "millis", "micros", or "nanos" to indicate the units of the long offset value. For example, transformPrecision="millis" means that the long value from the source is the number of milliseconds from Unix epoch. Requires sourceType=long to match the expected input of the transform. |
enumFormatter | Takes an int input and returns a String, based on lookup from an enum defined in the system columns file. The attribute transformColumnSet is the String column set name to match in the columns file. The attribute transformEnum is the String name of the enum under the column set to match in the columns file. Requires sourceType=int to match the expected input of the transform. |
Functions
Name | Input Type | Output Type | Description |
---|---|---|---|
DBTimeUtils.nanosToTime() | long | DBDateTime | Creates DBDateTime objects from a long offset of nanoseconds from Unix epoch. Additionally, DBTimeUtils.millisToTime() and DBTimeUtils.microsToTime() are available for parsing millisecond/microsecond offsets into timestamps. Standard Java date and time parsing and conversion functions can be used to create a long Unix epoch offset from other formats, the result of which can then be passed to one of these functions to store as a DBDateTime value. java.time.* and java.time.format.DateTimeFormatter are included in the standard imports. ZonedDateTime.parse can be used with various formats to parse strings into ZonedDateTimes, and the ZonedDateTime.toInstant().toEpochMilli() call can then be used, for example, to convert to long milliseconds to pass to DBTimeUtils.millisToTime(). |
dbDateTimeFromTimestamp | java.sql.Timestamp | DBDateTime | Takes a Java Timestamp and returns a DBDateTime. Used internally, automatically, when importing JDBC datetime types to DBDateTime destination columns. |
DBTimeUtils.convertDateTime | String | DBDateTime | Takes a String and returns a DBDateTime. Expects a string of the form: YYYY-MM-DDThh:mm:ss.nnnnnnnnn TZ. One use of this function would be to convert dates that had originally been exported to CSV from a Deephaven system. |
accountingInt | String | int | Takes a String and converts to an int. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType=String to match the expected input of the function. |
accountingLong | String | long | Takes a String and converts it to a long. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType=String to match the expected input of the function. |
accountingDouble | String | double | Takes a String and converts it to a double. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType=String to match the expected input of the function. |
stripFirstCharacter | String | String | Takes a String and returns a String. Strips the first character, assuming the String has at least one character. No overrides of sourceType needed because input and output types match. |
parseBoolean | String | Boolean | Takes a String and returns a Boolean. Single character T /F , t /f , or 1 /0 is converted to Boolean true or false. Strings (case-insensitive) true or false converted to Boolean true or false. Used internally, automatically, when reading a CSV to a destination column that is Boolean. Requires a sourceType=String for explicit use in custom formulas. |
getLongArray | String | long[] | Takes a String and returns an array of long values. Two arguments - first is the String of array values, second is a String delimiter to parse the values. Input string is expected to be enclosed - normally by square brackets. Only single-dimensional lists (arrays) are supported; no matrices. Used internally, automatically, when importing to a long[] column. Requires sourceType=String for explicit use in custom formulas. |
getDoubleArray | String | double[] | Takes a String and returns an array of double values. Two arguments: first is the String of array values, second is a String delimiter to parse the values. Input string is expected to be enclosed normally by square brackets. Only single-dimensional lists (arrays) are supported; no matrices. Used internally, automatically, when importing to a double[] column. Requires sourceType=String for explicit use in custom formulas. |
CopyTable schemas
Note
The CopyTable
feature applies only to System schemas. CopyTable
schemas are not supported for user schemas.
One table layout may be used for multiple tables. When this is required, it is not necessary to replicate the entire source schema definition for each new table. Instead, a schema definition can use the CopyTable
element instead of the Table
element, referring to another source table for its definition. The CopyTable
element then uses the source table's schema for all definitions, including columns, imports, loggers and listeners.
The schema file naming convention of <namespace>.<table>.schema
also applies to CopyTable
schema files. The namespace and table in the CopyTable
schema filename should be those of the duplicate table, not the source table.
A CopyTable
definition contains the following attributes:
- namespace - the duplicate table's namespace
- sourceNamespace - the source table's namespace
- name - the duplicate table's name
- sourceName - the source table's name
Note
The previous CopyTable
syntax of namespaceSource
and nameSource
has been deprecated.
For example, the following would use the definition of the table "ExampleTable" from the namespace "ExampleNamespace" for a table called "ExampleTableCopy" in the namespace "CopyTableNamespace":
<CopyTable namespace="CopyTableNamespace" sourceNamespace="ExampleNamespace" name="ExampleTableCopy" sourceName="ExampleTable" />
A duplicate table functions in the same manner as any other Deephaven table. Data can be imported through the importer classes or real-time ingestion, partitions will be generated in the same ways, intraday data can be merged to the historical database, and queries will work as expected.
Loggers and listeners are not generated for CopyTable definitions, as the loggers and listeners from the source table will be used for the duplicate table.
Warning
Chained CopyTable definitions are not supported at this time; a CopyTable must refer to a fully defined Table schema.
Managing schema files
Once schemas have been defined or changed, they must be made visible to Deephaven (deployed), and if any custom loggers and listeners are defined, these must be generated.
The logger/listener generation can be done from deployed schemas, or from another location prior to schema deployment so the generated loggers can be integrated with the customer application before being deployed to Deephaven.
Two scripts are provided for managing schema files, and are described below:
- dhconfig - both deploys schemas to a Deephaven database, making their described tables available for use, and exports schemas from etcd to a specified directory. See: Configuration Tools.
- generate_loggers - generates the logger and listener classes from schemas. For more information on loggers and listeners, see Streaming Data.
These scripts can be used during the initial setup of a new Deephaven system, as well as to update configuration when schemas are added or modified. Rerunning the scripts is safe, even if nothing has changed since the last time they were run.
Schema service types
Deephaven clusters use etcd to centralize storage of schema files, alongside additional configuration such as property files.
Schema locations and properties
On a Deephaven server, editing a schema file directly has no effect on the system. Edited schema files must be deployed before the system will use them. See below.
The following property controls aspects of schema behavior:
- SchemaConfig.defaultListenerPackagePrefix - defines the default listener package prefix used if one is not defined in a schema. Any generated listeners will use this value followed by the namespace converted to lowercase if they don't have their own package defined in their schema file. The default value is:
SchemaConfig.defaultListenerPackagePrefix=com.illumon.iris.intraday.gen
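For example, with this default prefix, listeners generated for a table in the namespace ExampleNamespace (and with no package defined in its schema) would be placed in the package com.illumon.iris.intraday.gen.examplenamespace.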
Deploying schemas to Deephaven
Once schemas are defined, or after they are modified, they must be imported into Deephaven's centralized configuration database (backed by etcd). This is done using the dhconfig utility, which uses the parameters defined at the command line to find the schema files. The Schema Editor also has an option to deploy schemas.
Until this step is done, schema changes are not visible to Deephaven.
Use the following command to perform schema deployment. Note that this command must be run as a user that has access to the Deephaven configuration server (e.g., sudo -u irisadmin).
/usr/illumon/latest/bin/dhconfig schemas import <options>
See dhconfig schemas for more details about the command options.
For example, on most Deephaven installations, the following command will redeploy the LearnDeephaven schemas:
/usr/illumon/latest/bin/dhconfig schemas import --directory /etc/sysconfig/illumon.d/schema/LearnDeephaven --force <auth options>
This will result in the following output:
Updating System.LearnDeephaven.EODTrades
Updating System.LearnDeephaven.StockQuotes
Updating System.LearnDeephaven.StockTrades
Migration Mode
When running the legacy deploy_schema tool in migration mode, if directories or files are not specified, the tool will instead use several properties to locate and process schema files. This is intended to be run one time, to import schema files from their legacy locations into etcd. Once this migration is complete, these properties will no longer be used by Deephaven to find schema files.
These legacy schema location properties include a list of locations (directories or JAR files) in which schemas can be found, delimited by semicolons or spaces. When looking for schema files, Deephaven (including the deployment and logger/listener generation scripts) will search all subdirectories of the specified locations for files with the .schema extension, as well as looking within JAR files in those directories and subdirectories for .schema files.
The schema location properties allow the following substitutions (see the example below):
- <devroot> - refers to the installation directory of the Deephaven software, usually /usr/illumon/latest.
- <workspace> - refers to the workspace of the process.
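As a minimal sketch, a hypothetical additional schema location using the <devroot> substitution might be configured as follows; the .exampleExtra suffix and path are placeholders, not standard settings:
SchemaConfig.resourcePath.exampleExtra=<devroot>/etc/example-schema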
The following properties specify the locations of schema files during migration mode:
- SchemaConfig.allowedNamespaces - If this property is set when deploying schemas, then schema files will be imported only for the namespaces specified, and all specified namespaces will be created even if no schema files are found. The value is a comma-delimited list of namespaces.
- SchemaConfig.resourcePath - a multi-valued property prefix which tells the system where to find schema files. You may add any suffix to the property name to add new paths without overwriting any other configured SchemaConfig.resourcePath values. You may add more than one directory by using a semicolon separator; you may also use wildcard matching with an asterisk * (note that when using wildcards, your paths must end with a trailing slash). An example of adding a custom suffix appears after this list. For example:
  - SchemaConfig.resourcePath.irisInternal - the configured schema directory for internal Deephaven schemas.
  - SchemaConfig.resourcePath.customer - by default, this path is the customer's schema directory: /etc/sysconfig/illumon.d/schema. If schemas will be placed in additional locations, you may freely add to this property, or create some other suffix with the additional locations.
  - SchemaConfig.resourcePath.plugins - by default, this is a list of schemas for Deephaven plugins. If new plugins are added which have schema files, you may freely add to this property, or create some other suffix for the new plugin schema locations. If this property does not exist (on older installations), you should add it with a value of /etc/sysconfig/illumon.d/plugins/*/schema/. Note the trailing slash.
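For instance, a site that keeps extra schema files outside the standard locations might add its own suffix along the lines of the following sketch; the .siteExtra suffix and paths are hypothetical, shown only to illustrate the semicolon separator and the wildcard form with its trailing slash:
SchemaConfig.resourcePath.siteExtra=/opt/example/schema;/opt/example/modules/*/schema/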
Deploying schemas during Intraday Data ingestion
Schema changes made while intraday data is being written to a table (for instance, through the tailer and data import server) require special attention, as Deephaven does not automatically handle schema changes while data is being ingested. Use the following procedure to deploy a schema change for a table that is actively receiving intraday data.
1. First, ensure the tailer is not sending data for the changed table by examining its log file (usually /var/log/deephaven/tailer/LogtailerMain1.log.current), and stop the tailer.
2. Move or rename any existing binary log files for that table for the day so they won't be found by the tailer. For example, if the UpdatePerformanceLog schema was being changed:
sudo mkdir /db/TempFiles/save
sudo mv /var/log/deephaven/binlogs/perflogs/DbInternal.UpdatePerformanceLog*.bin.* /db/TempFiles/save
3. Assuming the table's internal partition is determined by the tailer configuration, update the tailer's XML configuration file and change (or add) the internalPartitionSuffix attribute for the changed table so that new intraday data will be written to a different location. If the internal partition is instead determined by the internalPartitionRegex attribute, update the logger to ensure it's logging with a different filename. For example:
internalPartitionSuffix="TableVersion2"
4. If the new schema includes listener changes, the Data Import Server will need to be restarted so it can use the new listener classes.
5. Finally, once the new schema is deployed, the tailer can be restarted.
Note
Queries may not find the previously written intraday data for the day after this procedure is complete, but merges will find and write it correctly.
Editing schemas of existing tables
Once a table has data, there are multiple caveats regarding editing its schema:
- Adding a new column to schema will show values as null when viewing the table for partitions that already existed at the time the column was added.
- Removing a column from schema does not by itself remove data from disk, but the column will no longer be visible in queries.
- Schema updates should normally be deployed after hours. If a schema update is applied mid-day, there are two options to maintain compatibility (see the mid-day deployment details above):
  - Create a new internal partition to separate the data written with the "old" schema from new data being written with the "new" schema.
  - Add logFormat attributes to the Listener sections of the schema, as well as mappings, so "old" version log files can populate the "new" version schema columns.
- Changing the datatype of a column requires rewriting the data in that column for all partitions.
- Loggers that stream data to a table must match the schema in terms of which columns they provide.
Editing schemas with ColumnTools
Note
This feature is available starting in release 20210401.165
ColumnTools is a set of classes that automate and simplify schema editing tasks for tables that already have data. The ColumnTools class provides methods to modify or remove historical column data. Currently it supports only Deephaven format historical data, as Parquet files are not as conducive to replacing or modifying individual columns.
The main activities that are possible using ColumnTools are:
- Update the contents of existing historical columns
- Populate a new column for existing historical partitions
- Change the data type of an existing column
- Remove a column and its data files
- Rename a column and its data files
Like other data writing operations (mergeData, etc.), ColumnTools methods must be executed from the context of a merge server worker, so that the underlying process has write access to historical data files.
General considerations
When processing updates, ColumnTools methods will check several prerequisites. These are principally:
- All partitions to be updated are using Deephaven format
- All internal partitions to be updated are writable (see the Table Storage topic for details on writable vs read-only internal partitions)
- The column being updated or deleted is not a grouping or partitioning column
- The data type of new values matches the data type of the column being written
For multi-partition updates, the ColumnTools class first writes all of the new column files to hidden temporary directories and then, when all the writing is complete, moves the files to their permanent locations. The files' locations and their destinations are persisted to a state file on disk once all the writing is complete. This state file is also used to prevent multiple concurrent column changes to a table. See the troubleshooting section for steps to recover or cancel an update that was interrupted or failed after creating its state file.
Caution
No other write operations should be occurring on a historical table concurrently with ColumnTools modifications. This mainly means merge operations. ColumnTools changes should only be run at a time when they will not risk interacting with scheduled or ad-hoc merges of the table in question.
ColumnTools updates historical data and table schema, but does not update intraday data. If there is also unmerged intraday data, this data will be unreadable after changing a column data type. ColumnTools changes to tables that have intraday data should be done while related ingesters/tailers/loggers are offline. In some cases it will be necessary to delete and reimport intraday data after modifying a table with ColumnTools.
After modifying a table with ColumnTools, workers may have out-of-date details about the table. A refresh using db.reloadData()
may be sufficient, but, in some cases, workers may need to be restarted to pick up new data and schema details.
Initializing the ColumnTools class
To use ColumnTools, the implementation class must be imported and an instance created. In Groovy:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db);
This is a basic instantiation. It creates a ColumnTools instance that uses the logger, fatal error reporter, and database from the worker environment.
A more complex instantiation would be something like:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10);
This example passes two extra arguments.
The first (null) is an instance of com.illumon.util.progress.StatusCallback, which allows the ColumnTools methods to send progress updates as they write columns. Since null is passed here, a non-reporting status callback is created internally.
The second (10) defines how many threads to allocate for concurrent column write operations. For maximum throughput, this should equal the number of internal partitions that exist for the table being updated. Other operations, like deleting column files, happen on a single thread, because they do not take much time.
Use from Python requires a few extra steps: the logger that is automatically available in Groovy must be created explicitly, jpy.get_type() is used to import classes, and None is used instead of null:
ColumnToolsImpl = jpy.get_type("com.illumon.iris.importers.ColumnToolsImpl")
Level = jpy.get_type("com.fishlib.io.log.LogLevel").INFO
Logger = jpy.get_type("com.fishlib.io.logger.NullLoggerImpl")
log = Logger(Level)
fatalErrorReporter = \
jpy.get_type("com.fishlib.util.process.ProcessEnvironment").getGlobalFatalErrorReporter()
columnTools = ColumnToolsImpl(log, fatalErrorReporter, db, None, 10)
Updating the contents of existing columns
Updating the contents of an existing column and populating a new column are the same process in ColumnTools; only the scenarios differ. When a new column is added to a table that already has historical data, that column will be null for partitions that have already been written, and the updateColumn method can be used to fill the column for those historical partitions. Similarly, if a column has data that needs to be updated or corrected, the updateColumn method can do that as well.
updateColumn has three user-facing forms:
public void updateColumn(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String partitioningColumnValue,
@NotNull final String columnName,
@NotNull final String updateStatement)
public void updateColumn(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String columnName,
@NotNull final String updateStatement)
public void updateColumn(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String partitioningColumnValue,
@NotNull final String columnName,
@NotNull final Table sourceTable)
The first form allows specifying a namespace, table name, column name, and the partitioning column value (typically Date) for which an update should be run. The last argument is a Deephaven updateView expression that will be executed to provide the new values for the column.
Example:
columnTools.updateColumn("NS1","Sample","Value2","Value2=Value1*2");
This example would update the Sample table in the NS1 namespace to set the Value2 column to the result of Value1 times 2.
The second form omits the partitioningColumnValue argument. This form will instead update all historical partitions that exist for this table.
The third form takes a sourceTable instead of an updateStatement. This form is for use when the logic to produce values for the column being updated is more complex than what can be done in a simple updateView. To use this form, the user first calls getSourceTable to retrieve a version of the table with internal partitioning information. A table can then be created from this one, using any capabilities within the Deephaven query language (joins, function calls, etc.), and is passed back to updateColumn to provide the source for the new values.
public Table getSourceTable(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String partitioningColumnValue)
The sourceTable passed into updateColumn must meet several requirements relative to the original provided by getSourceTable:
- The internal partition column (named X_Internal_Partition_X by default) must be preserved
- The row count must be the same
- The number of internal partitions must be the same
- The number of rows per internal partition must be the same
- The data cannot have been re-sorted
- It must contain the column being updated
- The data type of the column being updated must be unchanged
Most of these requirements are checked by the updateColumn method call, and an exception will be thrown when any of them is not met.
A simple example of getting a source table:
import com.illumon.iris.importers.ColumnToolsImpl
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10)
st = columnTools.getSourceTable("LearnDeephaven","StockQuotes","2017-08-25")
This example returns a source table which can then be updated with complex table operations such as joins to populate columns with data to be used in subsequent updateColumn calls.
To update an entire historical table with logic that requires the use of getSourceTable, it will be necessary to write a loop that iterates over selectDistinct(<partitioningColumn>) values from the table, gets a source table, operates on it, and then calls updateColumn.
Outline of a looping source table updateColumn call:
import com.illumon.iris.importers.ColumnToolsImpl
// This example is partial, and assumes an AdjPrice column has been added to the schema
String namespace = "LearnDeephaven"
String tableName = "StockTrades"
String columnToUpdate = "AdjPrice"
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10)
dates = db.t(namespace, tableName).selectDistinct("Date").getColumn(0).getDirect()
for (String date in dates) {
Table st = columnTools.getSourceTable(namespace, tableName, date)
// Do some table operations on st to calculate AdjPrice values
// (see the sketch after this loop for one possibility)
// Update the column in the current date partition from the modified version in st
columnTools.updateColumn(
namespace,
tableName,
date,
columnToUpdate,
st)
}
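As one possibility for the placeholder step inside the loop, the table operations might use a join to bring in per-symbol adjustment factors. The following Groovy sketch is hypothetical: the AdjustmentFactors table and its Sym and Factor columns, as well as the Price column, are assumed for illustration only. naturalJoin and update preserve row count and ordering, which keeps the modified table valid for updateColumn.
// Hypothetical sketch of the "table operations" step above.
adj = db.t(namespace, "AdjustmentFactors").where("Date=`" + date + "`")
st = st.naturalJoin(adj, "Sym", "Factor")
       .update("AdjPrice = Price * Factor")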
Changing the data type of a column
This method is meant to be used for cases where existing data can be converted into a new type using a simple conversion expression (e.g., int to long, or long to int where all values are small enough for int). If more complex table operations are needed, it may be necessary to add a new column, populate that using the sourceTable form of updateColumn, and then remove the old column with deleteColumn.
Internally, when changing a column data type, ColumnTools iterates through the existing historical partitions and writes temporary column files for the new format. When all files have been written, the files are then moved to replace the previous ones that used the old data type, and, lastly, the schema is updated to reflect the new data type.
public void changeColumnType(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String columnName,
@NotNull final Class<?> newDataType,
@NotNull final String updateStatement)
Example:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10);
columnTools.changeColumnType("NS1","Sample","Value2",int.class,"Value2=(int)Value2");
This example changes the Value2 column of the Sample table in the NS1 namespace to the int primitive type. The fourth argument (newDataType) is a Java class of the new type. The fifth argument (updateStatement) is similar to the updateStatement argument used in updateColumn. This expression must provide a value that matches the data type to which the column is being changed. In some cases a simple cast will work, as in this example; in others, more complex expressions may be needed, for instance to convert a String value to a numeric or DateTime one.
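As a further illustration, a minimal sketch of widening a column without changing its values might look like the following; the Value3 int column is hypothetical and assumed to exist on the same NS1 Sample table:
// Hypothetical sketch: widen an assumed int column named Value3 to long.
columnTools.changeColumnType("NS1","Sample","Value3",long.class,"Value3=(long)Value3");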
Deleting a column
The deleteColumn method removes column data files from existing historical partitions and removes the column from the schema. This is a simpler and faster process than changing a column's data type, and it runs steps in a different order. Since the column will be removed regardless, the method first removes the column from the schema, and then iterates through the existing historical partitions to delete the column files. This method call will be rejected if there is already a columnUpdate running for the table.
Note that if this method is executed against a column that does not exist in the schema, it will still run and try to remove any data files on disk associated with the column. This allows this method to be easily rerun in case it is interrupted, and also allows it to be used to clean up column files left over after a column has been previously removed from schema using other methods.
public void deleteColumn(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String columnName)
Example:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db);
columnTools.deleteColumn("NS1","Sample","Value2");
This example deletes the Value2 column from the Sample table in the NS1 namespace, and also deletes all existing historical column files for this column.
Renaming a column
The renameColumn method changes a column name in the schema and for existing historical column files on disk. Like deleteColumn, this method updates the schema first and then modifies column files, since this should be a much faster process than writing new files. If data is accessed from the table while not all files have yet been renamed, partitions that still have the old names for column files will return null values. As with other ColumnTools operations, though, it is strongly recommended to have no other processes attempting to read data from the table while renaming a column.
The new column name must be unique to the table and a valid Deephaven column name. If it is not, an IllegalArgumentException will be thrown with details of the failure in validating the new column name.
renameColumn will also rename Logger/Listener columns and system inputs, and ImportSource columns. Note, however, that if an ImportSource column being renamed includes a formula, that formula will need to be updated if it includes any references to the old column name.
Note
Because renaming a column also renames corresponding Logger/Listener entries, log files created with the pre-rename version of the schema will not be replayable or loadable (e.g., with readBin) with the post-rename schema. To allow support for older log files, manually add a versioned Listener section that includes mappings from pre-rename to post-rename columns.
public void renameColumn(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String columnName,
@NotNull final String newColumnName)
Example:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db);
columnTools.renameColumn("NS1","Sample","Value2","SomeNewColumnName");
This example renames the Value2 column, and its corresponding historical column data files, to SomeNewColumnName.
Troubleshooting and recovery steps
Most operations through ColumnTools will validate state and data before making changes. Problems such as invalid column names and most types of malformed input in a sourceTable will throw exceptions that provide clear details about why the request could not be processed.
An IllegalStateException, with little additional detail about what caused the exception, may occur if ColumnTools operations are attempted from a query worker instead of a merge worker, because query workers do not have write access to column files.
One erroneous scenario that cannot be detected automatically is a sourceTable whose rows have been reordered (e.g., by a sort or aggregation) before being passed to updateColumn. The updateColumn call will still run, but the updated column will contain incorrect values: the system ignores the other columns in the sourceTable, so it cannot detect that they no longer match the table on disk, and the newly written values will be misaligned with their rows.
When processing an updateColumn for all partitions, or changeColumnType, the execution proceeds in these phases:
1. Write a state file <table_name>_ColumnUpdate.map to the root of the historical namespace (typically /db/Systems/<namespace>/)
2. Write new column files for all partitions to hidden temporary directories
3. Update the state file with a map of temporary and permanent directories for the table
4. Iterate through all partitions from the map in the state file and replace original column files with the new ones from the hidden temporary directories
5. In the case of changing column type: update the schema with the new type
6. Delete the state file
All columnUpdate forms use the state file on disk to prevent multiple concurrent ColumnTools updates to a table. renameColumn and deleteColumn do not create a state file, but will reject a request if a state file is found to exist when they start.
If the process is interrupted in step 2, nothing has yet changed for the permanent table, and a fresh execution of the update/change statement can start over from the beginning. If the process is interrupted in step 4, some files may have already been replaced, and the overall change should be resumed at step 4 to complete the update.
The restartFinalization method is used to restart steps 4-6, or, when the state file is empty, to remove it so a new update can be started.
public void restartFinalization(@NotNull final String namespace, @NotNull final String tableName)
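For example, resuming an interrupted update of the NS1 Sample table used in the earlier examples would look like the following, using a previously initialized ColumnTools instance:
columnTools.restartFinalization("NS1","Sample");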
Other ColumnTools change methods, like renameColumn and deleteColumn, are designed to be idempotent in case they are interrupted while making file system changes. When they are re-run after interruption, renameColumn will attempt to find and rename any remaining column files with the old name, and deleteColumn will attempt to find and delete any remaining column files for the column that has been deleted.
In addition to failure events, which result in thrown exceptions, the ColumnTools methods log a fair amount of step detail as Info messages. These can be viewed in the ProcessEventLog, or by enabling Info logging in the Log panel of a console.
In the most severe failure scenarios, a fatal error may occur in the column writing code; this will kill the worker. More detail can be obtained about such a failure by querying the ProcessEventLog for the worker in question and sorting by Timestamp (since the fatal error reporter events will likely be out of order relative to regular worker events in the log). Events of interest in this case will typically be Errors, and will be very near the max Timestamp values.
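A minimal sketch of such a query is shown below. It assumes the affected worker's name is known and that a Process column in the ProcessEventLog identifies the worker; both the worker name and the filter column are assumptions to adjust for your installation.
// Hypothetical sketch: pull today's ProcessEventLog rows for one worker, ordered by Timestamp.
pel = db.i("DbInternal", "ProcessEventLog")
        .where("Date=currentDateNy()", "Process=`worker_42`")
        .sort("Timestamp")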
Exporting schemas from Deephaven
The dhconfig schemas export command exports schemas from etcd to a specified directory as .schema files, named <namespace>.<table name>.schema. These files can then be examined and edited, and re-imported with dhconfig schemas import.
For example, the following command will export the LearnDeephaven schema files to the directory /tmp/export, overwriting any files that are already there.
/usr/illumon/latest/bin/dhconfig schemas export --directory /tmp/export --namespace LearnDeephaven --force
Listing schemas from Deephaven
The dhconfig schemas list command lists schemas rather than exporting them.
For example, the following command will list the LearnDeephaven schema files.
/usr/illumon/latest/bin/dhconfig schemas list --namespace LearnDeephaven