Tables & schemas
A Deephaven table schema is an XML document that describes a table. A schema must include a table's namespace, name, column names and data types. In addition to specifying the structure of the data, a schema can include directives controlling how data is imported and stored.
Schema files can be created by directly constructing XML files (e.g., using a text editor), or can be generated using built-in tools. In most cases, the easiest way to create or edit a schema file is by using the Deephaven Schema Editor tool. The Schema Editor itself can be used to create a new schema from scratch, edit an existing schema, or derive a new schema from example data.
If a schema is created manually, this is typically done by picking an existing schema file to use as a starting point, and then copying, pasting, and modifying it to create a suitable layout for the new table. This does not require any special tools, but it can be tedious for complex tables, or when many tables are needed. To simplify this process, Deephaven includes tools that can analyze data sets and infer column details and import instructions, and that can also generate and validate sections and entries within schema files. The schema creation tools make schema creation faster and easier, and help avoid issues related to formatting and syntax. These tools are described briefly in the section on Schema Inference, and in detail in the chapter covering each specific data type (CSV, XML, etc.).
Once a schema is defined, it must be deployed to take effect. If the table is intended for use by streaming data, an additional step is required to generate loggers and listeners. This is discussed further in Deploying Schemas to Deephaven below.
Note
This part of the documentation covers general schema requirements and features. Additional, optional sections of a schema are detailed in Importing Batch Data and Importing Streaming Data.
Basic Table Attributes
Metadata is imported to the database using XML-based schema files that contain definitions like the following examples:
<Table namespace="ExampleNamespace" name="ExampleTable" storageType="NestedPartitionedOnDisk">
<Partitions keyFormula="${autobalance_single}"/>
<Column name="Date" dataType="String" columnType="Partitioning" />
<Column name="Alpha" dataType="String" columnType="Grouping"/>
<Column name="Bravo" dataType="byte" />
<Column name="Charlie" dataType="int" />
<Column name="Delta" dataType="double" />
<Column name="Echo" dataType="java.util.List" />
</Table>
<Table namespace="Ise" name="Quotes" storageType="NestedPartitionedOnDisk" defaultMergeFormat="DeephavenV1">
<Partitions keyFormula="${autobalance_by_first_grouping_column}"/>
<Column name="Date" dataType="String" columnType="Partitioning" />
<Column name="Ticker" dataType="String" columnType="Grouping" />
<Column name="Size" dataType="int" columnType="Normal" />
<Column name="Price" dataType="double" columnType="Normal" />
</Table>
<Table namespace="Korea" name="Securities" storageType="SplayedOnDisk">
<Column name="UnderlyingId" dataType="int" columnType="Grouping" />
<Column name="Id" dataType="int" columnType="Normal" />
<Column name="Ticker" dataType="String" columnType="Normal" />
<Column name="SecurityType" dataType="String" columnType="Normal" />
<Column name="Expiration" dataType="DateTime" columnType="Normal" />
</Table>
Each schema file contains definitions for exactly one table, and must follow the naming convention <namespace>.<table>.schema
. For example, the above examples would be named "Ise.Quotes.schema" and "Korea.Securities.schema".
The following attributes are necessary for all Table elements:
Attribute | Description |
---|---|
namespace | The system namespace (e.g., Ise or Korea) to which this table belongs. A namespace is analogous to a folder on the filesystem: namespaces are used to organize the data, and one namespace typically contains many tables (just as a folder typically contains many files). |
name | The name of this table (e.g., Quotes or Securities). |
storageType | Specifies the way data for this table will be organized on disk. Options are NestedPartitionedOnDisk (used for partitioned tables; see Partitions below) and SplayedOnDisk (used for non-partitioned tables, such as Korea.Securities above). |
defaultMergeFormat | Specifies the format in which merged (historical) data will be written: DeephavenV1 (the traditional Deephaven format) or Parquet. |
Columns
The <Column>
elements contain the detailed information to define each table's columns, including the column's name, data type, and role as a partitioning or grouping column.
These elements can contain the following attributes; at minimum, name and dataType are required:
Attribute | Optional/Required | Default | Description |
---|---|---|---|
name | Required | N/A | The name of the column as it will be stored in the database. |
dataType | Required | N/A | The type of the data stored in the column. This can be a Java primitive (e.g., byte, int, long, double), String, DateTime, StringSet, or a fully qualified Java class name (e.g., java.util.List). Integer will be automatically treated as int. This treatment does not apply, however, to array types; an array of 32-bit values must be declared as int[]. |
columnType | Optional | Normal | The role of the column in the table: Normal, Partitioning (see Partitions below), or Grouping. |
objectCodec | Optional | None | A Java class that controls how the column is stored on disk. See Column Codecs below for details on the provided codecs. |
objectCodecArguments | Optional | None | Argument(s) provided to the objectCodec. How this value is interpreted depends on which codec is specified. |
symbolTable | Optional | ColumnLocation | For String columns only, this specifies whether or not a symbol table will be used to store the data. Supported values are ColumnLocation and None. If not specified, ColumnLocation is used. See: Symbol Tables |
encoding | Optional | ISO_8859_1 | For String columns only, this determines the encoding used to store the value. Supported values include ISO_8859_1, UTF_8, US_ASCII, UTF_16, UTF_16BE, and UTF_16LE. If not specified, ISO_8859_1 is used. |
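For illustration, the following column definitions show the optional symbolTable and encoding attributes in use. This is only a sketch: the column names are hypothetical, and the attribute values are simply examples of the supported settings described above.

<Column name="Ticker" dataType="String" columnType="Grouping" symbolTable="ColumnLocation" />
<Column name="Description" dataType="String" symbolTable="None" encoding="UTF_8" />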
Column Codecs
Column codecs are used to provide custom deserialization logic for complex data types.
Codecs convert between raw bytes (i.e., the Java type byte[]
) and the column's data type (as specified by
the dataType
attribute). The output type of the codec must match the column's type. Additional codecs can be provided
by adding custom implementations of the com.illumon.util.codec.ObjectCodec<TYPE>
interface to the classpath (for
example, in /etc/sysconfig/illumon.d/java_lib
).
The following column codecs are provided with Deephaven. For columns of these types, it is much more space efficient to use the corresponding codec than to rely on the default java serialization.
Class Name | Column Type | Codec Description | Arguments | Argument Descriptions |
---|---|---|---|---|
com.illumon.util.codec.BigDecimalCodec | java.math.BigDecimal | Encodes java.math.BigDecimal values. | [<precision> [, <scale>]] | The precision argument specifies the maximum number of decimal digits that may be stored. The scale argument specifies the maximum number of decimal digits after the decimal point. If precision is specified but scale is not, a scale of 0 is inferred. If precision is specified, a space-efficient fixed-size encoding is used. Otherwise, a variable-size encoding is used, with a maximum precision of 1000. Null values are permitted. |
com.illumon.util.codec.BigIntegerCodec | java.math.BigInteger | Encodes java.math.BigInteger values. | [<precision>] | If specified, precision is used to specify the maximum number of decimal digits storable by this column. This codec uses the same encoding as BigDecimalCodec with a scale of zero. Null values are permitted. |
com.illumon.util.codec.ByteArrayCodec | byte[] | Encodes an array of bytes. | [<size> [, nullable \| notnull]] | If size is specified, values are encoded as fixed-size values. If not, a variable-size encoding is used. A flag indicating nullability may also be specified. The default is nullable. Specifying notnull saves one byte per value. |
com.illumon.util.codec.StringBooleanMapCodec | java.util.Map<String, Boolean> | Encodes a map with String keys and Boolean values. | None | N/A |
com.illumon.util.codec.StringDoubleMapCodec | java.util.Map<String, Double> | Encodes a map with String keys and Double values. | None | N/A |
com.illumon.util.codec.StringFloatMapCodec | java.util.Map<String, Float> | Encodes a map with String keys and Float values. | None | N/A |
com.illumon.util.codec.StringLongMapCodec | java.util.Map<String, Long> | Encodes a map with String keys and Long values. | None | N/A |
com.illumon.util.codec.StringIntMapCodec | java.util.Map<String, Integer> | Encodes a map with String keys and Integer values. | None | N/A |
com.illumon.util.codec.StringStringMapCodec | java.util.Map<String, String> | Encodes a map with String keys and String values. | None | N/A |
The following example schema includes columns using the BigDecimalCodec
, BigIntegerCodec
and ByteArrayCodec
codecs described above.
<Table name="CodecsExampleTable" namespace="ExampleNamespace" storageType="NestedPartitionedOnDisk">
<Partitions keyFormula="${autobalance_single}"/>
<Column name="Date" dataType="String" columnType="Partitioning" />
<Column name="Alpha" dataType="String" columnType="Grouping"/>
<Column name="Bravo" dataType="int" />
<Column name="Charlie" dataType="double" />
<Column name="Delta" dataType="double" />
<Column name="Echo" dataType="StringSet" />
<Column name="Foxtrot" dataType="java.math.BigDecimal" objectCodec="com.illumon.util.codec.BigDecimalCodec" objectCodecArguments="13,7"/>
<Column name="Golf" dataType="java.math.BigInteger" objectCodec="com.illumon.util.codec.BigIntegerCodec" objectCodecArguments="13"/>
<Column name="Hotel" dataType="byte[]" objectCodec="com.illumon.util.codec.ByteArrayCodec" objectCodecArguments="32,notnull"/>
</Table>
For streaming data ingestion, the schema must also be updated so that the loggers and listeners use the codec. Please see the section on Column codecs in loggers and listeners for more details.
Partitions
Deephaven has two distinct uses of the term "partition" as it applies to schemas. Both are required for partitioned tables (i.e., those using the NestedPartitionedOnDisk
storage type).
- A partitioning column is a String column that must contain an identifiable value for each set of data appended to the table. This is required for all partitioned tables.
- The historical storage partitions, along with the partitioning key formula, are used to distribute table data across multiple storage locations as the data is merged into the historical database. This is required for all partitioned tables.
Typically, date strings (such as 2023-01-31
) are used for the partitioning column value, but any string that
meaningfully partitions the data can be used. Table data is stored on disk in separate directories based on the
partitioning column value, which allows queries to filter data very efficiently based on partition. For example, if a
table is partitioned by date, then a filter such as .where("Date=`2023-01-31`", "Sym=`AAPL`")
would only need to
read data from directories for the date 2023-01-31
. Since most queries must filter by date when initially retrieving
data, it is generally recommended to use dates as partitioning columns. Further,
the ISO 8601 date format of yyyy-MM-dd
(e.g., 2023-01-31
) is recommended.
Unlike other date formats, the yyyy-MM-dd
format can easily be sorted because its lexicographical order matches its
chronological order.
Partitioning columns are specified by using the columnType="Partitioning"
attribute in a column definition. Currently, partitioned tables must have exactly one such column. Further information on column definitions is provided below.
When writing merged historical table data to disk, Deephaven will place partitions of the input data in a directory identified by the namespace, storage partition, column partition, and table name.
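For example, on a typical installation, merged historical data for the Ise/Quotes table might be written under directories of the following form. This path is a sketch for illustration only; the storage partition name (0 here) and the mount points behind it depend on how the cluster's historical partitions are configured.

/db/Systems/Ise/Partitions/0/2023-01-31/Quotes/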
Partition Key Formulas
Historical storage partitions correspond to directories available for historical storage on the disk; for scaling and performance benefits, these are normally mapped to different physical storage devices, whether local or network-connected.
Historical storage partitions are specified using the Partitions element, which uses the keyFormula
attribute to map table rows to the available storage partitions. Specifically, merge operations apply this formula in an updateView operation, the result of which is used to assign rows to a storage partition.
The following QueryScope
parameters are available as input to the formula, in addition to the columns of the table:
Parameter | Description |
---|---|
__WRITABLE_PARTITIONS__ | An array of partition name strings corresponding to directories found at /db/Systems/Namespace/WritablePartitions . |
__NUM_WRITABLE_PARTITIONS__ | The number of available partitions (i.e., __WRITABLE_PARTITIONS__.length ) |
__NAMESPACE__ | A string representing the namespace. |
__TABLE_NAME__ | A string representing the table name. |
__NS_TN__ | A string representing the namespace and table name, joined with slash as a separator (e.g., Ise/Quotes or Korea/Securities ) |
__PARTITIONING_COLUMN_VALUE__ | The partitioning column value for this merge operation. |
__DAY_OF_YEAR__ | The day of the year, only available when __PARTITIONING_COLUMN_VALUE__ appears to be a date in YYYY-MM-DD form. |
__PARTITION_AUTOBALANCE_SINGLE__ | A complete formula that automatically chooses a single partition for all data based on __NS_TN__ and __PARTITIONING_COLUMN_VALUE__, attempting to achieve balanced partition usage over long time horizons. Deprecated in favor of ${autobalance_single}. |
In addition, the following substitutions will be applied, in order to make writing appropriate partitioning formulas easier:
Token | Substitution Description |
---|---|
${autobalance_single} | A complete formula that chooses a single partition for all input data, attempting to achieve balanced partition usage over long time horizons. The preferred way to write a formula that only refers to the __PARTITION_AUTOBALANCE_SINGLE__ query scope parameter. |
${autobalance_by_first_grouping_column} | A complete formula that uses the first grouping column (if there is one) in order to spread data across all available partitions, attempting to achieve balanced partition usage over long time horizons. Assuming the first grouping column is the most significant, keeps all grouped data together. Falls back to the same behavior as ${autobalance_single} if there are no grouping columns. |
${autobalance_by_all_grouping_columns} | A complete formula that uses all grouping columns (if there are any) in order to spread data across all available partitions, attempting to achieve balanced partition usage over long time horizons. Effectively spreads grouped data across multiple partitions, which may be useful for queries. Falls back to the same behavior as ${autobalance_single} if there are no grouping columns. |
${first_grouping_column} | A string representing the name of the first grouping column (i.e., the first column definition with the attribute columnType="Grouping" ). Only available if there are grouping columns. |
${all_grouping_columns} | An array of strings representing the names of the grouping columns (i.e., column definitions with the attribute columnType="Grouping" ). Only available if there are grouping columns. |
Conventions and Recommendations
Some common conventions follow:
- Large tables: Tables with large amounts of data per column partition typically try to balance across all available storage partitions while keeping related data together, often using a grouping column as part of the formula to achieve this. The easiest way to do so is to rely on token substitution and use the formula ${autobalance_by_first_grouping_column}.
- For an expanded example, the Ise/Quotes table above might use the following formula (a full Partitions element using this formula is sketched after this list): __WRITABLE_PARTITIONS__[abs(Objects.hash(__NS_TN__, Ticker) + __DAY_OF_YEAR__) % __NUM_WRITABLE_PARTITIONS__] This serves to keep data with the same value for the Ticker column together, while spreading data across all available partitions and taking into account day of year, namespace, and table name in order to approximate long-term load balancing.
- Small tables: Tables with small amounts of data per column partition often simply use ${autobalance_single}, keeping data in a relatively compact format while utilizing all historical partitions over longer time horizons in order to approximate load balancing.
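Putting the expanded example above into a schema, the Partitions element for the Ise/Quotes table could look like the following sketch. The formula is the one shown in the expanded example; any expression over the available query scope parameters and table columns could be used instead.

<Partitions keyFormula="__WRITABLE_PARTITIONS__[abs(Objects.hash(__NS_TN__, Ticker) + __DAY_OF_YEAR__) % __NUM_WRITABLE_PARTITIONS__]"/>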
Merge Attributes
When merging data to historical partitions, Deephaven supports writing to traditional Deephaven files as well as writing to Parquet files. When writing to Parquet, a default compression codec can be chosen by adding a MergeAttributes element with an appropriate Parquet-supported codec. This codec can be changed by specifying a different one in the merge operation. If no codec is chosen, then the SNAPPY codec is used. This element does not apply to merges in Deephaven format.
Note
See: Parquet Codecs
For example, to use the uncompressed codec:
<MergeAttributes format="Parquet" codec="UNCOMPRESSED" />
Data Indexes
In addition to Grouping, Deephaven supports the traditional concept of indexing historical data. A Data Index is a table on disk that maps each unique key (defined by the set of index columns) to the row indices within the table at which that key appears. These tables allow the Deephaven Query Engine to optimize certain operations such as filtering, aggregating, and joining data.
Data Indexes are defined using the <DataIndex columns='...'>
XML tag in the schema. Each <DataIndex...>
entry directs the Merge process to group the data being merged by the set of columns (called keys) in the columns
attribute. The Data Index is then written to disk, sorted in the same order as the columns are specified in the <DataIndex...>
tag. Choosing an ordering of these
columns based on how they are expected to be used in queries can improve the performance of operations that can take advantage of Data Indexes.
For example, the following snippet defines two separate Data Indexes, one on USym, and one on USym and Exchange:
<DataIndex columns="USym">
<DataIndex columns="USym,Exchange">
Schema inference
Schema inference from source data can be performed from the command line or through the Schema Editor interface. The available schema generator tools, and the command lines used to invoke them, are listed below. You may use iris_exec to invoke any of the schema creators without arguments to display the available options. The arguments are also described in detail in the section covering the specific data type (CSV/JDBC/XML/JSON).
Note
See: To learn more about iris_exec commands, please refer to Running Local Scripts.
Command line interface
- iris_exec csv_schema_creator <launch args> -- <schema creator args>
- iris_exec jdbc_schema_creator <launch args> -- <schema creator args>
- iris_exec xml_schema_creator <launch args> -- <schema creator args>
- iris_exec json_schema_creator <launch args> -- <schema creator args>
The table schema files created from these processes also support merge operations, and allow specification of a grouping column and/or the partitioning column name. After a schema file is generated, it is highly recommended that the user review the schema and the log file, and update details as needed. You must then deploy the schema and generate loggers/listeners. See below for more details on the contents of a generated schema file and the types of changes that might be needed.
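For example, a CSV schema creator run might look like the following sketch. The flag names shown here (-ns for namespace, -tn for table name, -sf for the source file) are typical schema creator arguments, but they are assumptions for illustration; run the command without arguments to list the exact options for your version.

iris_exec csv_schema_creator -- -ns ExampleNamespace -tn ExampleTable -sf /path/to/sample_data.csv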
CSV
CSV refers to a variety of text file data formats. Although they are called "comma-separated", the delimiter could in fact be some other character (such as a tab or semicolon). Other details of formatting are similarly flexible. There is no single official standard for CSV formatting, so Deephaven provides flexibility when defining how to interpret and transform text data from CSV files into structured storage in Deephaven tables.
Because CSV is a text file format that provides no metadata, the schema generation tool must analyze the data that it finds in a sample data file. Typically column names are derived from column headers, but there are options for CSV files with and without headers. The schema generation process can be simplified and accelerated somewhat by using a sample of data, as long as there are enough rows to clearly demonstrate the formats and ranges of data values that will be included in normal imports.
For non-String columns, the schema generator will scan all the values in the supplied CSV file to ensure the inferred data type remains valid for all values in the column. As an example, the first value for a column might be 4. This would cause the schema generator to tentatively categorize this column as an integer type (short or long, depending on settings). If a 5.5 was encountered in a subsequent row, the generator would "promote" the column to a floating point datatype. If another value in the column included non-numeric characters, such as XYZ, the generator would then mark the column as a String column. Once a column has been designated as a String column, no other checking is done, as Strings are always valid to accept any data from a CSV.
If a column is empty for all rows, it is marked as a String and a warning message is written to the log. In this case, the administrator should examine the generated schema and validate the generated XML before adding it to Deephaven. Date/time values are also a special case, where the schema generator will attempt to match the format of the date/time string to a known format for which it can provide a conversion formula. Like the earlier example with numbers and Strings, if a date/time column has multiple different formats in it, it will end up being marked as a String column, since no single formula could be found that matched all of the column's data.
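The promotion behavior described above can be pictured with the following small Java sketch. This is purely illustrative and is not the Deephaven implementation; it only mirrors the integer-to-floating-point-to-String widening described in this section.

public class TypePromotionSketch {
    // Return the narrowest type (current numeric type, double, or String) that can hold
    // both the previously inferred type and the new value.
    static String promote(String currentType, String value) {
        if ("String".equals(currentType)) {
            return "String";              // Strings accept anything; no further checking is done
        }
        if (value.matches("-?\\d+")) {
            return currentType;           // integral value: keep the current (numeric) type
        }
        if (value.matches("-?\\d*\\.\\d+")) {
            return "double";              // decimal value: promote an integer type to floating point
        }
        return "String";                  // non-numeric value (e.g., "XYZ"): fall back to String
    }

    public static void main(String[] args) {
        String inferred = "int";
        for (String value : new String[] {"4", "5.5", "XYZ"}) {
            inferred = promote(inferred, value);
            System.out.println(value + " -> " + inferred);  // prints int, then double, then String
        }
    }
}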
Note
See: Importing CSV Files
JDBC
Java Database Connectivity is a standard that describes how data store drivers provide common Java interfaces to a wide variety of databases and other data storage mechanisms. Unlike CSV, JDBC provides data type information for columns, and the abilities to filter and manipulate data using a SQL query.
In most cases it is easier to generate schema files from JDBC than from CSV because of the result set metadata that JDBC data sources can provide. The default behavior of the JDBC schema generator is to request data types of the result set columns from the data source. These data types are then mapped to Deephaven data types to provide the schema details.
For cases where the data source cannot provide column data type information, or where the source database is not ideally configured to store the data (e.g., numbers or date/times in String fields), the JDBC schema generator provides options to scan data as is done in the CSV schema generator, or to use a combination of data scanning and metadata.
Note
See: Importing from JDBC
XML
XML (eXtensible Markup Language) is a container format that can contain sequential and/or hierarchical data. Like CSV, XML by itself does not provide or guarantee a particular layout to the data. Simpler data sets that use XML typically have one element per record, are not hierarchical, and use either attributes or element values to store record values. More complex XML feeds may include metadata, common data updates, and hierarchically related records. Deephaven provides "generic" XML data import capabilities that can accommodate most of the simpler forms. The more complex XML data feeds, including those that include data that must be imported to multiple tables in a single operation, require custom importers that are specific to the data format encapsulated within the XML (e.g., FIXML).
Although XML documents can include metadata, there is no standard as to how such column data types and other such information should be formatted. Therefore, for generic XML imports, Deephaven treats XML files similarly to how it handles CSV files. To find column data types, the schema generation tool analyzes all data that it finds in a sample data file. In addition, if named (rather than positional) values are used, the schema generator will also scan all importable elements to find column names that will be needed for the table.
Note
See: Importing XML Files
JSON
JSON is a hierarchical format representing data objects and arrays. In order to convert this format to a Deephaven table, JSON data is recursively converted into a stream of lines equivalent to a CSV file (with one "line" per top-level JSON object). The type inference then operates in the same way as it does on CSV data.
Column names for a table inferred from JSON are derived from the field names contained in the JSON objects. Since there is no "header" in JSON, all or a large portion of the data is explored to enumerate all columns (since any one JSON object may be missing a given field). JSON array elements are converted into column names by combining the field name with a numeric suffix reflecting the position in the array.
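As an illustration of that flattening, a hypothetical JSON record like the one below would produce one row, with each array position expanded into its own column. The exact naming of the array-derived columns may differ; the point is that the field name is combined with the position.

{ "sym": "AAPL", "price": 150.25, "sizes": [100, 200] }

This would yield columns for sym and price, plus two columns derived from the sizes array (for example, sizes0 and sizes1).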
Note
See: Importing JSON Files
From a Table
A schema can be inferred from the column names and types of a Deephaven table using the SchemaService
.
The following commands create and deploy a schema for the system table ExampleNamespace.TableName based on the definition of table
.
import com.illumon.iris.db.schema.SchemaServiceFactory
import com.illumon.iris.db.schema.NamespaceSet
import com.illumon.iris.db.tables.TableDefinition
namespace = "ExampleNamespace"
tableName = "TableName"
table = newTable(stringCol("Letter", "A", "C", "F", "B", "E", "D", "A"),
intCol("Number", NULL_INT, 2, 1, NULL_INT, 4, 5, 3),
stringCol("Color", "red", "blue", "orange", "purple", "yellow", "pink", "blue"),
intCol("Code", 12, 13, 11, NULL_INT, 16, 14, NULL_INT))
schemaService = SchemaServiceFactory.getDefault()
if(!db.getSystemNamespaces().contains(namespace)) {
schemaService.createNamespace(NamespaceSet.SYSTEM, namespace)
}
schema = schemaService.fromDefinition(table.getDefinition(), namespace, tableName,
TableDefinition.STORAGETYPE_NESTEDPARTITIONEDONDISK, NamespaceSet.SYSTEM)
schemaService.addSchema(schema)
Converters and adding custom converters
The schema creators have a number of built-in converters to recognize various date/time formats. Unrecognized formats will result in a date/time field being treated as String. If ambiguous data is included, such that multiple converters match all the data values, the column will also be treated as String, and a warning message will be written to the log.
Built-in Converter | Matching Data |
---|---|
ISO format with date only | 2017-08-30 |
ISO UTC date/time with optional sub-second precision (no T) | 2017-08-30 11:59:59.000Z |
ISO UTC date/time with optional sub-second precision (with T) | 2017-08-30T11:59:59.000Z |
ISO UTC two digit offset date/time with millisecond precision | 2017-08-30T11:59:59.000+04 |
ISO UTC four digit offset date/time with millisecond precision | 2017-08-30T11:59:59.000+0400 |
ISO UTC two digit offset date/time with microsecond precision | 2017-08-30T11:59:59.000123+04 |
ISO UTC four digit offset date/time with microsecond precision | 2017-08-30T11:59:59.000123+0400 |
ISO time zone date/time with optional sub-second precision | 2017-08-30T11:59:59.000 NY |
Exchange timestamp (will be converted to NY time) | 2017-08-30T11:59:59.000 |
Timestamp UTC with optional sub-second precision | 2017-08-30 11:59:59 or 2017-08-30 11:59:59.123456 |
SQL Server default format | AUG 30 2017 11:59AM |
Converters for other formats can be included by adding their details to a property file used by the schema creator process.
Here is an example converter that could be added in a property file:
SchemaCreator.converter.name.1=USDateTimeSeconds1
SchemaCreator.converter.pattern.1=^((0?\\d)|(1[0-2]))\\/(([0-2]?\\d)|(3[0-1]))\\/\\d{4} (([0-1]?\\d)|(2[0-3])):[0-5]\\d:[0-5]\\d$
SchemaCreator.converter.formula.1=new DBDateTime(new java.text.SimpleDateFormat("MM/dd/yyyy HH:mm:ss").parse($TARGETNAME$).getTime()*1000000)
SchemaCreator.converter.sourceType.1=String
SchemaCreator.converter.targetType.1=DateTime
There are five components that comprise the definition for the custom converter. All five are required. Note the format of the property names: SchemaCreator.converter.<attribute name>.<unique index of the converter>. The index value must match across the five properties. The values do not have to start at 1, or be sequential, but it is important to ensure index numbers across multiple property files do not collide.
These five pieces can appear in any order and do not have to be contiguous, but all five pieces must be present if any one of them is present. If an incomplete converter entry is detected when a schema creator process starts, the process will fail and will log details about which components of the definition are missing.
- name - The name attribute is mainly for convenience in identifying a particular converter.
- pattern - The pattern is a regular expression string used to match data the converter can process. It must be very specific so this converter does not overlap any other converters, and so this formula is not given any data that it cannot process. The example regex limits the month portion of strings it will match to numbers less than 13, which helps avoid matches to non-US date strings where the day would be written before the month. When trying to find a converter to use for complex data, the schema generator will try to find one and only one converter whose pattern matches all the non-empty values from the sample data. Note that all backslashes in any property values must be escaped by doubling them. In this example, the regex will match date/time strings like this one: 01/11/1975 11:10:23
- formula - The formula contains a Java expression that will be used to convert the source data to the target data type. It must include the $TARGETNAME$ token, which will be replaced with the column name by the schema generator. In this case, a Java SimpleDateFormat is being used to parse the date/time string into a Java Date. Then the .getTime() method is used to get an epoch milliseconds value for the Date. Deephaven uses epoch nanoseconds, so this value is then multiplied by one million and passed to the DBDateTime creator method to return a DBDateTime.
- sourceType - The sourceType controls how the importer will interpret a value before passing it to the formula. It will usually be String, but if the source value was already an epoch value, for instance, then the source type might be long instead of String.
- targetType - The targetType must match the type produced by the formula. This is used by the schema generator to set the appropriate column type for columns that match the regex pattern.
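As another illustration, the following sketch shows what a converter for 13-digit epoch-millisecond strings could look like. The name, pattern, and index here are assumptions for the example; the formula relies on DBTimeUtils.millisToTime, and DBTimeUtils is among the classes imported by default.

SchemaCreator.converter.name.2=EpochMillisString
SchemaCreator.converter.pattern.2=^\\d{13}$
SchemaCreator.converter.formula.2=DBTimeUtils.millisToTime(Long.parseLong($TARGETNAME$))
SchemaCreator.converter.sourceType.2=String
SchemaCreator.converter.targetType.2=DateTime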
ImportSource XML in a schema file
Additional metadata to control and modify the import process is added in an ImportSource XML block of the schema. This is demonstrated below:
<Table name="CsvNames5" namespace="Import" storageType="NestedPartitionedOnDisk">
<ImportSource name="JDBC1" type="JDBC">
<ImportColumn name="FirstName" sourceName="first_name" />
<ImportColumn name="LastName" sourceName="last_name" />
<ImportColumn name="Gender" sourceName="gender" />
<ImportColumn name="Race" sourceName="race" />
<ImportColumn name="FirstNameLength" sourceName="first_name" formula="FirstNameLength.length()" sourceType="String" />
</ImportSource>
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
<Column name="Date" intradayType="none" dataType="String" columnType="Partitioning" />
<Column name="FirstName" dataType="String" columnType="Normal" />
<Column name="LastName" dataType="String"columnType="Grouping" />
<Column name="Gender" dataType="String" columnType="Normal" />
<Column name="Race" dataType="String" columnType="Normal" />
<Column name="FirstNameLength" dataType="int" columnType="Normal" />
</Table>
Multiple ImportSource blocks can be added to allow various import instructions for different sources or types of import. The type attribute is required; currently supported values are "CSV", "JDBC", "JSON", and "XML". The name attribute is optional. When the import runs, if no ImportSource name is specified, the importer will use the first ImportSource block that matches its type. If no matching block is found (for type, and, if specified, for name), then the import will attempt to auto-match columns as it would if no ImportSource blocks existed.
To support multi-partition imports, you must have an ImportColumn for the partition column indicating how to partition each row (e.g., <ImportColumn name="Date" sourceName="trade_date"/>
where trade_date
is the CSV/JDBC column name).
An ImportSource block can contain multiple types of entries, including ImportColumn, Imports, maxError, arrayDelimiter, and ImportTransform. These options are described below:
ImportColumn
Each entry describes how to map values for one column from the source data to the target table's schema.
Attribute | Description |
---|---|
name | The name of the target column in the Deephaven table, which must exactly match the name attribute of one of this schema's Column elements. This attribute is required; others are optional. Note: An ImportColumn element with no attributes beside name provides no special handling for that column and is effectively ignored. |
sourceName | The name of the column in the source data whose values should be mapped into the target column specified by name. This is used for mapping from a different column name in the source table, such as when a source name is not a valid Deephaven column name. Also use this attribute on the partition column when configuring multi-partition imports. |
sourceType | The data type of the source data. This is used when data types must be translated during the import process. If sourceType is not specified, then the source data is expected to have the type specified in the Column element that corresponds to the ImportColumn . As an example, sourceType="long" may be required when importing a timestamp stored as a long value to a DBDateTime column. Such a conversion cannot be done implicitly and requires a sourceType to tell the importer to read the source value as a long. Note: sourceType is case-sensitive: long is a primitive long value, while Long is interpreted as java.lang.Long , an object type. |
nullable | Indicates whether null values will be accepted for this column. If the nullable attribute is present, it must be "true" or "false". False is the default. A column marked as nullable can be skipped during the import (i.e., if it doesn't exist in the source), and be set to null in all rows. A nullable column will also receive a null value when a null value is read from the source or no value is read from the source. Note: for primitive types, which do not natively support null values, the database uses reserved values to represent null. Null default values from QueryConstants will be used. If a default is defined, the default will be used instead of a null value. |
formula | This allows specification of Java code to do simple manipulations of data based on the source value itself. In the example noted earlier, the FirstNameLength ImportColumn uses a formula to calculate its value. Note that the value being read during the import will be passed to the formula using the ImportColumn 's name. Formulas can use built-in Java methods, like replace() for strings, functions that are built into the importer classes, or functions imported from other packages (see: Imports). When compiling the field writer, the importer will enclose the formula in a try/catch block, in case methods being called within the formula require exception handling. A caught exception will be rethrown as a RuntimeException with a message indicating that the exception was caught in a formula, and with the original message appended. |
transform | Similar to formula , transform creates a new function during the import using the details of the ImportColumn element. Transforms create Java Lambda functions that are then applied to the input value. An example of a built-in transform function is dbDateTimeFromLong . This function will create a Lambda function that converts a long to a DBDateTime, and will use another attribute (transformPrecision ) to select whether the resultant function expects milliseconds, microseconds, or nanoseconds from Unix epoch. |
default | The value to use when import or parsing of the source value fails or the source is null. The default value must be type compatible with the target column type; for boolean columns, this would be lowercase "true" or "false." The literal "null" may be used for a default, and will result in an appropriately typed null value (from com.illumon.iris.db.tables.utils.QueryConstants ) for primitive types, or an actual null value for object types. A specified default takes precedence over null handling when a column is defined as nullable. In the special case where the literal null, lowercase, is desired as the default value for a String, this could be accomplished with a default that sets some known value, or null, and a formula to replace the value with the literal String null. |
class | The full name of a custom class to use as the field writer for the column. Custom field writer classes must derive from the CsvFieldWriter or JdbcFieldWriter class, as appropriate to the import type. If class is specified, formula , transform , sourceName , and sourceType are ignored and not used by the importer itself, though they may be used within the custom field writer. See: details below. |
cacheable | (CSV only) If this attribute is set to "false", caching will be disabled for this column. If this attribute is not present, or is set to "true", caching will be enabled, assuming the column meets one of the other criteria for caching (complex non-String column type, has a formula, or has a transform.) Caching is useful for values that require conversion or calculation work during the import, but may not be helpful if values in the column are rarely duplicated. |
cacheSize | (CSV only) Allows overriding the cache size for columns that will be cached. The default size is 1024 entries. |
____ | Other attributes as needed to control transform functions (e.g., dbDateTimeFromLong uses transformPrecision to control its behavior). See details below. |
Imports
The packages to import. A default set of packages is imported to support built-in functions and import tasks, but additional packages can be specified to provide other functions or support custom transforms. Imports can be just the qualified package/class name, or the full import qualified.package.name.class;
line.
Besides the internal imports used by the field writers, these imports are automatically added in the field writer context:

- java.util.Properties
- com.illumon.iris.db.tables.utils.DBDateTime
- com.illumon.iris.db.tables.utils.DBTimeUtils
- com.fishlib.util.DateUtil
- java.util.regex.Pattern
- java.time.format.DateTimeFormatter
- java.time.*
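For example, an ImportSource block might add an import so that a formula can reference a class outside the default set. The column name and imported class here are hypothetical; any fully qualified class needed by a formula or transform can be listed this way.

<ImportSource name="CSV1" type="CSV">
  <Imports>java.util.Locale</Imports>
  <ImportColumn name="Description" sourceType="String" formula="Description.toUpperCase(Locale.US)" />
</ImportSource>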
maxError
The maximum number of allowed parsing errors before failing the import. The default is 0. Allowed errors that will increment maxError
are things like numeric conversions, overflows, and mismatched values (e.g., a Boolean value being set from a String column that doesn't match any known version of True or False.)
arrayDelimiter
A string to be used as a delimiter when parsing array data. The default delimiter is a comma.
ImportTransform
See Custom Transforms below.
- name - The name of a custom transform function; this will be the name sought when an ImportColumn specifies a custom transform function.
- body - The Java code that will be compiled as part of the setter for an ImportColumn that uses the custom transform.
Custom transforms
Custom transforms can be added to the ImportSource
block as ImportTransform
elements. These elements must:
- Have a Body element that contains the text of the transform function.
- Have a name attribute that matches the name used in the transform attribute of ImportColumns that use the transform.
- Implement a getFunction function that returns a function matching the specification of the interface.
At runtime, getFunction is passed the ImportColumnDefinition of the column that is being set, and a boolean value indicating whether strict (true) or permissive (false) data checking has been selected. The ImportColumnDefinition allows getFunction to access details like the target column name, the target column type, and any ImportColumn details by using getImportColumn(), which provides access to functions such as getAttributeValue().
The following shows an example of a custom transform element that could be specified in the schema (Note: this is actually the built-in dbDateTimeFromLong
transform):
<ImportTransform name="Sample">
<Body>
getFunction(ImporterColumnDefinition col, boolean strict)
{
String precision = col.getImportColumn().getAttributeValue("transformPrecision");
switch (precision) {
case "millis":
return (long value) -> new DBDateTime(DBTimeUtils.millisToNanos(value));
case "micros":
return (long value) -> new DBDateTime(DBTimeUtils.microsToNanos(value));
case "nanos": return DBDateTime::new;
default: return null;
}
</Body>
</ImportTransform>
When compiling the transform function, the target, and, if defined, source, data types are used to find a suitable Java functional interface with which to declare the getFunction
. In the case of the example above, the target column type is DBDateTime, and the sourceType="long"
attribute is added to the ImportColumn
. This combination of source and target types is then used to create and execute the resultant function as a LongFunction<DBDateTime>
.
A much simpler example follows:
<ImportTransform name="Sample">
<Body>
getFunction(ImporterColumnDefinition column, Boolean strict) {
return (long value) -> (2 * value);
}
</Body>
</ImportTransform>
If this is used for a column whose type is long, this will be mapped to a LongUnaryOperator
, which takes a long and returns a long. If the sourceType
is not specified, the importer will attempt to use the target column type for matching a function interface.
If the function being returned from getFunction
has a declared return type, this return type must exactly match the target column type for which the transform will be used. For example, a transform that will provide values for a Long column and that has a declared return type, must be declared Long, not long. However, the simpler, and recommended, implementation is to include only the transform body in the return from getFunction
, so the system can perform implicit type conversion when compiling the transform.
Custom transforms can be added from the Schema Editor File menu using the View/Edit Schema XML option.
Primitive types and boxed types in transforms
Not every combination of primitive types is supported by the Lambda interfaces used to pass transform functions into field writers. The supported combinations for unboxed primitive types are any combination of source and target of int, double, and long.
Therefore, a transform that reads a long value from the source and transforms it to a double value in the target table would use the primitive types of long and double.
However, a transform that reads a byte value and writes a short value would not be able to use primitive types. The importer will map this transform as a Function<Byte, Short>
. Note the capitalization. Byte and Short are the boxed (class-based) versions of the primitive types byte and short. This means there is extra overhead in the transform process to use the larger class-based versions of the data types.
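As a sketch of the boxed case, a transform like the following could be used for an ImportColumn with sourceType="byte" whose target column is a short. The transform name and the doubling logic are hypothetical; the point is only that the lambda receives a boxed Byte and its result is boxed to Short by the Function<Byte, Short> interface.

<ImportTransform name="ByteToShortExample">
  <Body>
    getFunction(ImporterColumnDefinition column, boolean strict) {
        // value arrives boxed as a Byte; the result is boxed to a Short
        return (Byte value) -> (short) (value * 2);
    }
  </Body>
</ImportTransform>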
Note: This concern with boxing of primitive types is only applicable to transforms. If the data can be implicitly converted (e.g., map a short to an int), or converted with a formula, then no boxing occurs. If the translation is too complex for a formula, another option for such primitive types is to write a custom FieldWriter
. Since the FieldWriter
classes are not interfaced through Lamdba functions, they have no limitations on which primitive data types they can use for source or target.
Another thing to be aware of is that primitive types that have a "null" default will be set to values from com.illumon.iris.db.tables.utils.QueryConstants
when their sources are null. While these may be boxed when they are passed into a transform, they will not be automatically converted to true null values. If special handling is needed for null values in the transform, either check for equality to the QueryConstants values, or use a boxed type for the sourceType
property of the column so these values will be preserved as null.
Array types
The importers support long[] and double[] array types. From the source, these are expected to be strings (either a string field from JDBC or a column value in CSV). The string is formatted as a bracketed list of values. Brackets can be square brackets [ ], parentheses ( ), or curly braces { }. The delimiter can be specified with arrayDelimiter; the default delimiter is a comma. Empty arrays [] are also supported.
An example array of double strings follows:
[1.5, 6.4, 8.0, 2.2322, 88, 16.02]
Custom FieldWriter classes
Note
Custom Column Setters are currently supported only with the legacy CSV importer. To use the legacy parser, set the configuration property Importer.useLegacyCsvForImport
to true
(e.g., by adding -DImporter.useLegacyCsvForImport=true
to extra JVM arguments).
Custom FieldWriter
classes can be used where a field writer needs to use values from multiple source columns, or where it is more desirable to package transform logic into a FieldWriter class rather than in an in-line transform or imported code. Custom field writers must extend the source-specific FieldWriter
class (i.e., CsvFieldWriter
for CSV imports or JdbcFieldWriter
for JDBC imports.) They must implement a specific constructor, and must also implement the processField
method.
A custom field writer class project will need the following Deephaven jars added as dependencies:
- IO-<version_number>.jar
- Base-<version_number>.jar
- Ingesters-<version_number>.jar
- DataObjects-<version_number>.jar
- BinaryStore-<version_number>.jar
- IrisDataObjects-<version_number>.jar
An example of a custom field writer for JDBC follows:
package io.deephaven.fieldwriters;

import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.TimeZone;

import com.fishlib.io.logger.Logger;
import com.illumon.iris.binarystore.RowSetter;
import com.illumon.iris.importers.ImporterColumnDefinition;
import com.illumon.iris.importers.JdbcFieldWriter;

/**
 * Sample custom field writer class that takes FirstName and LastName and writes out FirstName + ' ' + LastName
 */
public class FullNameWriter extends JdbcFieldWriter {

    private final RowSetter setter;
    private final ResultSet rs;
    private final ImporterColumnDefinition column;

    // This is the constructor format that is required for custom JdbcFieldWriters
    public FullNameWriter(Logger log, boolean strict, ImporterColumnDefinition column,
                          final TimeZone sourceTimeZone, RowSetter setter, ResultSet rs, String delimiter) {
        super(log, column.getName(), delimiter, sourceTimeZone);
        this.setter = setter;
        this.column = column;
        this.rs = rs;
    }

    @Override
    public void processField() throws IOException {
        try {
            setter.set(rs.getString("FirstName") + " " + rs.getString("LastName"));
        } catch (SQLException e) {
            throw new RuntimeException("Failed to concatenate FirstName and LastName for target field: " + column.getName());
        }
    }
}
A custom field writer for CSV would be similar, but would not include ResultSet
or TimeZone
in its constructor arguments, and would require CsvRecord
as an argument for processField
.
import com.fishlib.io.logger.Logger;
import com.illumon.iris.binarystore.RowSetter;
// CsvFieldWriter and ImporterColumnDefinition are assumed to live in the same package as JdbcFieldWriter
import com.illumon.iris.importers.CsvFieldWriter;
import com.illumon.iris.importers.ImporterColumnDefinition;
import org.apache.commons.csv.CSVRecord;

import java.io.IOException;

public class CsvMultiColumnExample extends CsvFieldWriter {

    private final RowSetter setter;
    private final ImporterColumnDefinition column;

    // This is the constructor format that is required for custom CsvFieldWriters
    public CsvMultiColumnExample(Logger log, boolean strict, ImporterColumnDefinition column, RowSetter setter, String delimiter) {
        super(log, column.getName(), delimiter);
        this.setter = setter;
        this.column = column;
    }

    @Override
    public void processField(CSVRecord record) throws IOException {
        setter.set(record.get("FirstName") + " " + record.get("LastName"));
    }
}
Internally, XML uses CSV import processing, so the CSV custom field writer structure would also be usable for XML imports.
To use the custom field writer in the table, the class attribute of the ImportColumn
is set, and an import entry will be needed to make the custom field writer class available to the import process:
<Imports>io.deephaven.fieldwriters.FullNameWriter</Imports>
...
<ImportColumn name="FullName" class="io.deephaven.fieldwriters.FullNameWriter" sourceName="FirstName" />
Caution
Note: Even though it is not used, sourceName
is required due to a known issue in ImportColumn
processing. Currently, this attribute must exist, and must indicate some other column that does exist in the data source as its value.
This can then be used to provide data for the FullName
column in the table:
<Column name="FullName" dataType="String" columnType="Normal" />
Typically, the custom field writer class will be compiled and packaged in a jar that should be placed under /etc/sysconfig/illumon.d/java_lib
on servers where data imports that need it will be run.
JdbcFieldWriters must extend JdbcFieldWriter and implement a constructor that accepts:

- Logger.class - logger instance passed down from main
- ImporterColumnDefinition.class - column definition of the column to which this field writer will be attached (including ImportColumn and target information)
- RowSetter.class - row setter for the target column
- ResultSet.class - JDBC result set with the current row. The field writer should only read values from result set metadata and current row data, not change the position of the result set or close it.
- String.class - delimiter value used for array parsing functions
- Boolean.class - strict value to indicate whether data checking is strict (true) or relaxed (false)
CsvFieldWriters must extend CsvFieldWriter and implement a constructor that accepts:

- Logger.class - logger instance passed down from main
- ImporterColumnDefinition.class - column definition of the column to which this field writer will be attached (including ImportColumn and target information)
- RowSetter.class - row setter for the target column
- String.class - delimiter value used for array parsing functions
- Boolean.class - strict value to indicate whether data checking is strict (true) or relaxed (false)
Built-in functions and transforms
Some functions and transforms are built into the importer framework and can be used when designing ImportSource
entries. Transforms often require other attributes to control their behavior. Functions can be used in formula attributes; transforms are used in transform attributes.
Transforms
Name | Description |
---|---|
dbDateTimeFromLong | Takes a long input (offset from Unix epoch) and returns a DBDateTime. The attribute transformPrecision should be set to "millis", "micros", or "nanos" to indicate the units of the long offset value. For example, transformPrecision="millis" means that the long value from the source is the number of milliseconds from Unix epoch. Requires sourceType=long to match the expected input of the transform. |
enumFormatter | Takes an int input and returns a String, based on lookup from an enum defined in the system columns file. The attribute transformColumnSet is the String column set name to match in the columns file. The attribute transformEnum is the String name of the enum under the column set to match in the columns file. Requires sourceType=int to match the expected input of the transform. |
Functions
Name | Input Type | Output Type | Description |
---|---|---|---|
DBTimeUtils.nanosToTime() | long | DBDateTime | Creates DBDateTime objects from a long offset of nanoseconds from Unix epoch. Additionally, DBTimeUtils.millisToTime() and DBTimeUtils.microsToTime() are available for parsing millisecond/microsecond offsets into timestamps. Standard Java date and time parsing and conversion functions can be used to create a long Unix epoch offset from other formats, the result of which can then be passed to one of these functions to store as a DBDateTime value. java.time.* and java.time.format.DateTimeFormatter are included in the standard imports. ZonedDateTime.parse can be used with various formats to parse strings into ZonedDateTimes, and the ZonedDateTime.toInstant().toEpochMilli() call can then be used, for example, to convert to long milliseconds to pass to DBTimeUtils.millisToTime(). |
dbDateTimeFromTimestamp | java.sql.Timestamp | DBDateTime | Takes a Java Timestamp and returns a DBDateTime. Used internally, automatically, when importing JDBC datetime types to DBDateTime destination columns. |
DBTimeUtils.convertDateTime | String | DBDateTime | Takes a String and returns a DBDateTime. Expects a string of the form: YYYY-MM-DDThh:mm:ss.nnnnnnnnn TZ. One use of this function would be to convert dates that had originally been exported to CSV from a Deephaven system. |
accountingInt | String | int | Takes a String and converts to an int. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType=String to match the expected input of the function. |
accountingLong | String | long | Takes a String and converts it to a long. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType=String to match the expected input of the function. |
accountingDouble | String | double | Takes a String and converts it to a double. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType=String to match the expected input of the function. |
stripFirstCharacter | String | String | Takes a String and returns a String. Strips the first character, assuming the String has at least one character. No overrides of sourceType needed because input and output types match. |
parseBoolean | String | Boolean | Takes a String and returns a Boolean. Single character T /F , t /f , or 1 /0 is converted to Boolean true or false. Strings (case-insensitive) true or false converted to Boolean true or false. Used internally, automatically, when reading a CSV to a destination column that is Boolean. Requires a sourceType=String for explicit use in custom formulas. |
getLongArray | String | long[] | Takes a String and returns an array of long values. Two arguments - first is the String of array values, second is a String delimiter to parse the values. Input string is expected to be enclosed - normally by square brackets. Only single-dimensional lists (arrays) are supported; no matrices. Used internally, automatically, when importing to a long[] column. Requires sourceType=String for explicit use in custom formulas. |
getDoubleArray | String | double[] | Takes a String and returns an array of double values. Two arguments: first is the String of array values, second is a String delimiter to parse the values. Input string is expected to be enclosed normally by square brackets. Only single-dimensional lists (arrays) are supported; no matrices. Used internally, automatically, when importing to a double[] column. Requires sourceType=String for explicit use in custom formulas. |
CopyTable schemas
Note
The CopyTable
feature applies only to System schemas. CopyTable
schemas are not supported for user schemas.
One table layout may be used for multiple tables. When this is required, it is not necessary to replicate the entire source schema definition for each new table. Instead, a schema definition can use the CopyTable
element instead of the Table
element, referring to another source table for its definition. The CopyTable
element then uses the source table's schema for all definitions, including columns, imports, loggers and listeners.
The schema file naming convention of <namespace>.<table>.schema
also applies to CopyTable
schema files. The namespace and table in the CopyTable
schema filename should be those of the duplicate table, not the source table.
A CopyTable
definition contains the following attributes:
- namespace - the duplicate table's namespace
- sourceNamespace - the source table's namespace
- name - the duplicate table's name
- sourceName - the source table's name
Note
The previous CopyTable
syntax of namespaceSource
and nameSource
has been deprecated.
For example, the following would use the definition of the table "ExampleTable" from the namespace "ExampleNamespace" for a table called "ExampleTableCopy" in the namespace "CopyTableNamespace":
<CopyTable namespace="CopyTableNamespace" sourceNamespace="ExampleNamespace" name="ExampleTableCopy" sourceName="ExampleTable" />
A duplicate table functions in the same manner as any other Deephaven table. Data can be imported through the importer classes or real-time ingestion, partitions will be generated in the same ways, intraday data can be merged to the historical database, and queries will work as expected.
Loggers and listeners are not generated for CopyTable definitions, as the loggers and listeners from the source table will be used for the duplicate table.
Warning
Chained CopyTable definitions are not supported at this time; a CopyTable must refer to a fully defined Table schema.
Managing schema files
Once schemas have been defined or changed, they must be made visible to Deephaven (deployed), and if any custom loggers and listeners are defined, these must be generated.
The logger/listener generation can be done from deployed schemas, or from another location prior to schema deployment so the generated loggers can be integrated with the customer application before being deployed to Deephaven.
Two scripts are provided for managing schema files, and are described below:
- dhconfig - both deploys schemas to a Deephaven database, making their described tables available for use, and exports schemas from etcd to a specified directory. See: Configuration Tools.
- generate_loggers - generates the logger and listener classes from schemas. For more information on loggers and listeners, see Streaming Data.
These scripts can be used during the initial setup of a new Deephaven system, as well as to update configuration when schemas are added or modified. Rerunning the scripts is safe, even if nothing has changed since the last time they were run.
Schema service types
Deephaven clusters use etcd to centralize storage of schema files, alongside additional configuration such as property files.
Schema locations and properties
On a Deephaven server, editing a schema file directly has no effect on the system. Edited schema files must be deployed before the system will use them. See below.
The following property controls aspects of schema behavior:
- SchemaConfig.defaultListenerPackagePrefix - defines the default listener package prefix used if one is not defined in a schema. Any generated listeners will use this value followed by the namespace converted to lowercase if they don't have their own package defined in their schema file. The default value is:
SchemaConfig.defaultListenerPackagePrefix=com.illumon.iris.intraday.gen
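For example, with this default prefix, listeners generated for a table in the namespace ExampleNamespace (and with no package defined in its schema) would be placed in the package com.illumon.iris.intraday.gen.examplenamespace.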
Deploying schemas to Deephaven
Once schemas are defined, or after they are modified, they must be imported into Deephaven's centralized configuration database (backed by etcd). This is done using the dhconfig utility, which uses the parameters defined at the command line to find the schema files. The Schema Editor also has an option to deploy schemas.
Until this step is done, schema changes are not visible to Deephaven.
Use the following command to perform schema deployment. Note that this command must be run as a user that has access to the Deephaven configuration server (e.g., sudo -u irisadmin).
/usr/illumon/latest/bin/dhconfig schemas import <options>
See dhconfig schemas for more details about the command options.
For example, on most Deephaven installations, the following command will redeploy the LearnDeephaven schemas:
/usr/illumon/latest/bin/dhconfig schemas import --directory /etc/sysconfig/illumon.d/schema/LearnDeephaven --force <auth options>
This will result in the following output:
Updating System.LearnDeephaven.EODTrades
Updating System.LearnDeephaven.StockQuotes
Updating System.LearnDeephaven.StockTrades
Migration Mode
When running the legacy deploy_schema tool in migration mode, if directories or files are not specified, the tool will instead use several properties to locate and process schema files. This is intended to be run one time, to import schema files from their legacy locations into etcd. Once this migration is complete, these properties will no longer be used by Deephaven to find schema files.
These legacy schema location properties include a list of locations (directories or JAR files) in which schemas can be found, delimited by semicolons or spaces. When looking for schema files, Deephaven (including the deployment and logger/listener generation scripts) will search all subdirectories of the specified locations for files with the .schema extension, as well as looking within JAR files in those directories and subdirectories for .schema files.
The schema location properties allow the following substitutions (see the example below):
- <devroot> - refers to the installation directory of the Deephaven software, usually /usr/illumon/latest.
- <workspace> - refers to the workspace of the process.
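As a minimal sketch, a hypothetical additional schema location using the <devroot> substitution might be configured as follows; the .exampleExtra suffix and path are placeholders, not standard settings:
SchemaConfig.resourcePath.exampleExtra=<devroot>/etc/example-schema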
The following properties specify the locations of schema files during migration mode:
- SchemaConfig.allowedNamespaces - If this property is set when deploying schemas, then schema files will be imported only for the namespaces specified, and all specified namespaces will be created even if no schema files are found. The value is a comma-delimited list of namespaces.
- SchemaConfig.resourcePath - a multi-valued property prefix which tells the system where to find schema files. You may add any suffix to the property name to add new paths without overwriting any other configured SchemaConfig.resourcePath values. You may add more than one directory by using a semicolon separator; you may also use wildcard matching with an asterisk * (note that when using wildcards, your paths must end with a trailing slash). An example of adding a custom suffix appears after this list. For example:
  - SchemaConfig.resourcePath.irisInternal - the configured schema directory for internal Deephaven schemas.
  - SchemaConfig.resourcePath.customer - by default, this path is the customer's schema directory: /etc/sysconfig/illumon.d/schema. If schemas will be placed in additional locations, you may freely add to this property, or create some other suffix with the additional locations.
  - SchemaConfig.resourcePath.plugins - by default, this is a list of schemas for Deephaven plugins. If new plugins are added which have schema files, you may freely add to this property, or create some other suffix for the new plugin schema locations. If this property does not exist (on older installations), you should add it with a value of /etc/sysconfig/illumon.d/plugins/*/schema/. Note the trailing slash.
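For instance, a site that keeps extra schema files outside the standard locations might add its own suffix along the lines of the following sketch; the .siteExtra suffix and paths are hypothetical, shown only to illustrate the semicolon separator and the wildcard form with its trailing slash:
SchemaConfig.resourcePath.siteExtra=/opt/example/schema;/opt/example/modules/*/schema/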
Deploying schemas during Intraday Data ingestion
Schema changes made while intraday data is being written to a table (for instance, through the tailer and data import server) require special attention, as Deephaven does not automatically handle schema changes while data is being ingested. Use the following procedure to deploy a schema change for a table that is actively receiving intraday data.
1. First, ensure the tailer is not sending data for the changed table by examining its log file (usually /var/log/deephaven/tailer/LogtailerMain1.log.current), and stop the tailer.
2. Move or rename any existing binary log files for that table for the day so they won't be found by the tailer. For example, if the UpdatePerformanceLog schema was being changed:
sudo mkdir /db/TempFiles/save
sudo mv /var/log/deephaven/binlogs/perflogs/DbInternal.UpdatePerformanceLog*.bin.* /db/TempFiles/save
3. Assuming the table's internal partition is determined by the tailer configuration, update the tailer's XML configuration file and change (or add) the internalPartitionSuffix attribute for the changed table so that new intraday data will be written to a different location. If the internal partition is instead determined by the internalPartitionRegex attribute, update the logger to ensure it's logging with a different filename. For example:
internalPartitionSuffix="TableVersion2"
4. If the new schema includes listener changes, the Data Import Server will need to be restarted so it can use the new listener classes.
5. Finally, once the new schema is deployed, the tailer can be restarted.
Note
Queries may not find the previously written intraday data for the day after this procedure is complete, but merges will find and write it correctly.
Editing schemas of existing tables
Once a table has data, there are multiple caveats regarding editing its schema:
- Adding a new column to schema will show values as null when viewing the table for partitions that already existed at the time the column was added.
- Removing a column from schema does not by itself remove data from disk, but the column will no longer be visible in queries.
- Schema updates should normally be deployed after hours. If a schema update is applied mid-day, there are two options to maintain compatibility (see the mid-day deployment details above):
  - Create a new internal partition to separate the data written with the "old" schema from new data being written with the "new" schema.
  - Add logFormat attributes to the Listener sections of the schema, as well as mappings, so "old" version log files can populate the "new" version schema columns.
- Changing the datatype of a column requires rewriting the data in that column for all partitions.
- Loggers that stream data to a table must match the schema in terms of which columns they provide.
Editing schemas with ColumnTools
Note
This feature is available starting in release 20210401.165
ColumnTools is a set of classes that automate and simplify schema editing tasks for tables that already have data. The ColumnTools class provides methods to modify or remove historical column data. Currently it supports only Deephaven format historical data, as Parquet files are not as conducive to replacing or modifying individual columns.
The main activities that are possible using ColumnTools are:
- Update the contents of existing historical columns
- Populate a new column for existing historical partitions
- Change the data type of an existing column
- Remove a column and its data files
- Rename a column and its data files
Like other data writing operations (mergeData, etc.), ColumnTools methods must be executed from the context of a merge server worker, so that the underlying process has write access to historical data files.
General considerations
When processing updates, ColumnTools methods will check several prerequisites. These are principally:
- All partitions to be updated are using Deephaven format
- All internal partitions to be updated are writable (see the Table Storage topic for details on writable vs read-only internal partitions)
- The column being updated or deleted is not a grouping or partitioning column
- The data type of new values matches the data type of the column being written
For multi-partition updates, the ColumnTools class first writes all of the new column files to hidden temporary directories and then, when all the writing is complete, moves the files to their permanent locations. The files' locations and their destinations are persisted to a state file on disk once all the writing is complete. This state file is also used to prevent multiple concurrent column changes to a table. See the troubleshooting section for steps to recover or cancel an update that was interrupted or failed after creating its state file.
Caution
No other write operations should be occurring on a historical table concurrently with ColumnTools modifications. This mainly means merge operations. ColumnTools changes should only be run at a time when they will not risk interacting with scheduled or ad-hoc merges of the table in question.
ColumnTools updates historical data and table schema, but does not update intraday data. If there is also unmerged intraday data, this data will be unreadable after changing a column data type. ColumnTools changes to tables that have intraday data should be done while related ingesters/tailers/loggers are offline. In some cases it will be necessary to delete and reimport intraday data after modifying a table with ColumnTools.
After modifying a table with ColumnTools, workers may have out-of-date details about the table. A refresh using db.reloadData()
may be sufficient, but, in some cases, workers may need to be restarted to pick up new data and schema details.
Initializing the ColumnTools class
To use ColumnTools, the implementation class must be imported and an instance created. In Groovy:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db);
This is a basic instantiation. It creates a ColumnTools instance that uses the logger, fatal error reporter, and database from the worker environment.
A more complex instantiation would be something like:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10);
This example passes two extra arguments.
The first (null) is an instance of com.illumon.util.progress.StatusCallback, which allows the ColumnTools methods to send progress updates as they write columns. Since null is passed here, a non-reporting status callback is created internally.
The second (10) defines how many threads to allocate for concurrent column write operations. For maximum throughput, this should equal the number of internal partitions that exist for the table being updated. Other operations, like deleting column files, happen on a single thread, because they do not take much time.
Use from Python requires a few extra steps: the logger that is automatically available in Groovy must be created explicitly, jpy.get_type() is used to import classes, and None is used instead of null:
ColumnToolsImpl = jpy.get_type("com.illumon.iris.importers.ColumnToolsImpl")
Level = jpy.get_type("com.fishlib.io.log.LogLevel").INFO
Logger = jpy.get_type("com.fishlib.io.logger.NullLoggerImpl")
log = Logger(Level)
fatalErrorReporter = \
jpy.get_type("com.fishlib.util.process.ProcessEnvironment").getGlobalFatalErrorReporter()
columnTools = ColumnToolsImpl(log, fatalErrorReporter, db, None, 10)
Updating the contents of existing columns
Updating the contents of an existing column and populating a new column are the same process in ColumnTools; only the scenarios differ. When a new column is added to a table that already has historical data, that column will be null for partitions that have already been written, and the updateColumn method can be used to fill the column for those historical partitions. Similarly, if a column has data that needs to be updated or corrected, the updateColumn method can do that as well.
updateColumn has three user-facing forms:
public void updateColumn(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String partitioningColumnValue,
@NotNull final String columnName,
@NotNull final String updateStatement)
public void updateColumn(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String columnName,
@NotNull final String updateStatement)
public void updateColumn(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String partitioningColumnValue,
@NotNull final String columnName,
@NotNull final Table sourceTable)
The first form allows specifying a namespace, table name, column name, and the partitioning column value (typically Date) for which an update should be run. The last argument is a Deephaven updateView expression that will be executed to provide the new values for the column.
Example:
columnTools.updateColumn("NS1","Sample","Value2","Value2=Value1*2");
This example would update the Sample table in the NS1 namespace to set the Value2 column to the result of Value1 times 2.
The second form omits the partitioningColumnValue argument. This form will instead update all historical partitions that exist for this table.
The third form takes a sourceTable instead of an updateStatement. This form is for use when the logic to produce values for the column being updated is more complex than what can be done in a simple updateView. To use this form, the user first calls getSourceTable to retrieve a version of the table with internal partitioning information. A table can then be created from this one, using any capabilities within the Deephaven query language (joins, function calls, etc.), and is passed back to updateColumn to provide the source for the new values.
public Table getSourceTable(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String partitioningColumnValue)
The sourceTable passed into updateColumn must meet several requirements relative to the original provided by getSourceTable:
- The internal partition column (named X_Internal_Partition_X by default) must be preserved
- The row count must be the same
- The number of internal partitions must be the same
- The number of rows per internal partition must be the same
- The data cannot have been re-sorted
- It must contain the column being updated
- The data type of the column being updated must be unchanged
Most of these requirements are checked by the updateColumn method call, and an exception will be thrown when any of them is not met.
A simple example of getting a source table:
import com.illumon.iris.importers.ColumnToolsImpl
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10)
st = columnTools.getSourceTable("LearnDeephaven","StockQuotes","2017-08-25")
This example returns a source table which can then be updated with complex table operations such as joins to populate columns with data to be used in subsequent updateColumn calls.
To update an entire historical table with logic that requires the use of getSourceTable, it will be necessary to write a loop that iterates over selectDistinct(<partitioningColumn>) values from the table, gets a source table, operates on it, and then calls updateColumn.
Outline of a looping source table updateColumn call:
import com.illumon.iris.importers.ColumnToolsImpl
// This example is partial, and assumes an AdjPrice column has been added to the schema
String namespace = "LearnDeephaven"
String tableName = "StockTrades"
String columnToUpdate = "AdjPrice"
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10)
dates = db.t(namespace, tableName).selectDistinct("Date").getColumn(0).getDirect()
for (String date in dates) {
Table st = columnTools.getSourceTable(namespace, tableName, date)
// Do some table operations on st to calculate AdjPrice values
// (see the sketch after this loop for one possibility)
// Update the column in the current date partition from the modified version in st
columnTools.updateColumn(
namespace,
tableName,
date,
columnToUpdate,
st)
}
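As one possibility for the placeholder step inside the loop, the table operations might use a join to bring in per-symbol adjustment factors. The following Groovy sketch is hypothetical: the AdjustmentFactors table and its Sym and Factor columns, as well as the Price column, are assumed for illustration only. naturalJoin and update preserve row count and ordering, which keeps the modified table valid for updateColumn.
// Hypothetical sketch of the "table operations" step above.
adj = db.t(namespace, "AdjustmentFactors").where("Date=`" + date + "`")
st = st.naturalJoin(adj, "Sym", "Factor")
       .update("AdjPrice = Price * Factor")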
Changing the data type of a column
This method is meant to be used for cases where existing data can be converted into a new type using a simple conversion expression (e.g., int to long, or long to int where all values are small enough for int). If more complex table operations are needed, it may be necessary to add a new column, populate that using the sourceTable form of updateColumn, and then remove the old column with deleteColumn.
Internally, when changing a column data type, ColumnTools iterates through the existing historical partitions and writes temporary column files for the new format. When all files have been written, the files are then moved to replace the previous ones that used the old data type, and, lastly, the schema is updated to reflect the new data type.
public void changeColumnType(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String columnName,
@NotNull final Class<?> newDataType,
@NotNull final String updateStatement)
Example:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10);
columnTools.changeColumnType("NS1","Sample","Value2",int.class,"Value2=(int)Value2");
This example changes the Value2 column of the Sample table in the NS1 namespace to the int primitive type. The fourth argument (newDataType) is a Java class of the new type. The fifth argument (updateStatement) is similar to the updateStatement argument used in updateColumn. This expression must provide a value that matches the data type to which the column is being changed. In some cases a simple cast will work, as in this example; in others, more complex expressions may be needed, for instance to convert a String value to a numeric or DateTime one.
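As a further illustration, a minimal sketch of widening a column without changing its values might look like the following; the Value3 int column is hypothetical and assumed to exist on the same NS1 Sample table:
// Hypothetical sketch: widen an assumed int column named Value3 to long.
columnTools.changeColumnType("NS1","Sample","Value3",long.class,"Value3=(long)Value3");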
Deleting a column
The deleteColumn method removes column data files from existing historical partitions and removes the column from the schema. This is a simpler and faster process than changing a column's data type, and it runs steps in a different order. Since the column will be removed regardless, the method first removes the column from the schema, and then iterates through the existing historical partitions to delete the column files. This method call will be rejected if there is already a columnUpdate running for the table.
Note that if this method is executed against a column that does not exist in the schema, it will still run and try to remove any data files on disk associated with the column. This allows this method to be easily rerun in case it is interrupted, and also allows it to be used to clean up column files left over after a column has been previously removed from schema using other methods.
public void deleteColumn(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String columnName)
Example:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db);
columnTools.deleteColumn("NS1","Sample","Value2");
This example deletes the Value2 column from the Sample table in the NS1 namespace, and also deletes all existing historical column files for this column.
Renaming a column
The renameColumn method changes a column name in the schema and for existing historical column files on disk. Like deleteColumn, this method updates the schema first and then modifies column files, since this should be a much faster process than writing new files. If data is accessed from the table while not all files have yet been renamed, partitions that still have the old names for column files will return null values. As with other ColumnTools operations, though, it is strongly recommended to have no other processes attempting to read data from the table while renaming a column.
The new column name must be unique to the table and a valid Deephaven column name. If it is not, an IllegalArgumentException will be thrown with details of the failure in validating the new column name.
renameColumn will also rename Logger/Listener columns and system inputs, and ImportSource columns. Note, however, that if an ImportSource column being renamed includes a formula, that formula will need to be updated if it includes any references to the old column name.
Note
Because renaming a column also renames corresponding Logger/Listener entries, log files created with the pre-rename version of the schema will not be replayable or loadable (e.g., with readBin) with the post-rename schema. To allow support for older log files, manually add a versioned Listener section that includes mappings from pre-rename to post-rename columns.
public void renameColumn(@NotNull final String namespace,
@NotNull final String tableName,
@NotNull final String columnName,
@NotNull final String newColumnName)
Example:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db);
columnTools.renameColumn("NS1","Sample","Value2","SomeNewColumnName");
This example renames the Value2 column, and its corresponding historical column data files, to SomeNewColumnName.
Troubleshooting and recovery steps
Most operations through ColumnTools will validate state and data before making changes. Problems such as invalid column names and most types of malformed input in a sourceTable will throw exceptions that provide clear details about why the request could not be processed.
An IllegalStateException, with little additional detail about what caused the exception, may occur if ColumnTools operations are attempted from a query worker instead of a merge worker, because query workers do not have write access to column files.
One erroneous scenario that cannot be detected automatically is a sourceTable whose rows have been reordered (e.g., by a sort or aggregation) before being passed to updateColumn. The updateColumn call will still run, but the updated column will contain incorrect values: the system ignores the other columns in the sourceTable, so it cannot detect that they no longer match the table on disk, and the newly written values will be misaligned with their rows.
When processing an updateColumn for all partitions, or changeColumnType, the execution proceeds in these phases:
1. Write a state file <table_name>_ColumnUpdate.map to the root of the historical namespace (typically /db/Systems/<namespace>/)
2. Write new column files for all partitions to hidden temporary directories
3. Update the state file with a map of temporary and permanent directories for the table
4. Iterate through all partitions from the map in the state file and replace original column files with the new ones from the hidden temporary directories
5. In the case of changing column type: update the schema with the new type
6. Delete the state file
All columnUpdate forms use the state file on disk to prevent multiple concurrent ColumnTools updates to a table. renameColumn and deleteColumn do not create a state file, but will reject a request if a state file is found to exist when they start.
If the process is interrupted in step 2, nothing has yet changed for the permanent table, and a fresh execution of the update/change statement can start over from the beginning. If the process is interrupted in step 4, some files may have already been replaced, and the overall change should be resumed at step 4 to complete the update.
The restartFinalization method is used to restart steps 4-6, or, when the state file is empty, to remove it so a new update can be started.
public void restartFinalization(@NotNull final String namespace, @NotNull final String tableName)
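For example, resuming an interrupted update of the NS1 Sample table used in the earlier examples would look like the following, using a previously initialized ColumnTools instance:
columnTools.restartFinalization("NS1","Sample");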
Other ColumnTools change methods, like renameColumn and deleteColumn, are designed to be idempotent in case they are interrupted while making file system changes. When they are re-run after interruption, renameColumn will attempt to find and rename any remaining column files with the old name, and deleteColumn will attempt to find and delete any remaining column files for the column that has been deleted.
In addition to failure events, which result in thrown exceptions, the ColumnTools methods log a fair amount of step detail as Info messages. These can be viewed in the ProcessEventLog, or by enabling Info logging in the Log panel of a console.
In the most severe failure scenarios, a fatal error may occur in the column writing code; this will kill the worker. More detail can be obtained about such a failure by querying the ProcessEventLog for the worker in question and sorting by Timestamp (since the fatal error reporter events will likely be out of order relative to regular worker events in the log). Events of interest in this case will typically be Errors, and will be very near the max Timestamp values.
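A minimal sketch of such a query is shown below. It assumes the affected worker's name is known and that a Process column in the ProcessEventLog identifies the worker; both the worker name and the filter column are assumptions to adjust for your installation.
// Hypothetical sketch: pull today's ProcessEventLog rows for one worker, ordered by Timestamp.
pel = db.i("DbInternal", "ProcessEventLog")
        .where("Date=currentDateNy()", "Process=`worker_42`")
        .sort("Timestamp")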
Exporting schemas from Deephaven
The dhconfig schemas export command exports schemas from etcd to a specified directory as .schema files, named <namespace>.<table name>.schema. These files can then be examined and edited, and re-imported with dhconfig schemas import.
For example, the following command will export the LearnDeephaven schema files to the directory /tmp/export, overwriting any files that are already there.
/usr/illumon/latest/bin/dhconfig schemas export --directory /tmp/export --namespace LearnDeephaven --force
Listing schemas from Deephaven
The dhconfig schemas list command lists schemas rather than exporting them.
For example, the following command will list the LearnDeephaven schema files.
/usr/illumon/latest/bin/dhconfig schemas list --namespace LearnDeephaven