Tables and schemas

A Deephaven table schema is an XML document that describes a table. A schema must include a table's namespace, name, column names and data types. In addition to specifying the structure of the data, a schema can include directives controlling how data is imported and stored.

Schema files can be created by directly constructing XML documents (e.g., using a text editor), or can be generated using built-in tools. In most cases, the easiest way to create or edit a schema file is the Deephaven Schema Editor, which can create a new schema from scratch, edit an existing schema, or derive a new schema from example data.

If a schema is created manually, this is typically done by picking an existing schema file to use as a starting point, and then copying, pasting, and modifying it to create a suitable layout for the new table. This does not require any special tools, but it can be tedious for complex tables, or when many tables are needed. To simplify this process, Deephaven includes tools that can analyze data sets and infer column details and import instructions, and that can also generate and validate sections and entries within schema files. The schema creation tools make schema creation faster and easier, and help avoid issues related to formatting and syntax. These tools are described in the schema inference documentation.

Once a schema is defined, it must be deployed to take effect. If the table is intended for use with streaming data, an additional step is required to generate loggers.

Note

This part of the documentation covers general schema requirements and features. Additional optional sections of schema files are detailed in Importing Batch Data and Importing Streaming Data.

Basic Table Attributes

Metadata is imported to the database using XML-based schema files that contain definitions like the following examples:

<Table namespace="ExampleNamespace" name="ExampleTable" storageType="NestedPartitionedOnDisk">
   <Partitions keyFormula="${autobalance_single}"/>
   <Column name="Date" dataType="String" columnType="Partitioning" />
   <Column name="Alpha" dataType="String" columnType="Grouping"/>
   <Column name="Bravo" dataType="byte" />
   <Column name="Charlie" dataType="int" />
   <Column name="Delta" dataType="double" />
   <Column name="Echo" dataType="java.util.List" />
</Table>
<Table namespace="Ise" name="Quotes" storageType="NestedPartitionedOnDisk" defaultMergeFormat="DeephavenV1">
   <Partitions keyFormula="${autobalance_by_first_grouping_column}"/>
   <Column name="Date" dataType="String" columnType="Partitioning" />
   <Column name="Ticker" dataType="String" columnType="Grouping" />
   <Column name="Size" dataType="int" columnType="Normal" />
   <Column name="Price" dataType="double" columnType="Normal" />
</Table>
<Table namespace="Korea" name="Securities" storageType="SplayedOnDisk">
   <Column name="UnderlyingId" dataType="int" columnType="Grouping" />
   <Column name="Id" dataType="int" columnType="Normal" />
   <Column name="Ticker" dataType="String" columnType="Normal" />
   <Column name="SecurityType" dataType="String" columnType="Normal" />
   <Column name="Expiration" dataType="DateTime" columnType="Normal" />
</Table>

Each schema file contains the definition for exactly one table, and must follow the naming convention <namespace>.<table>.schema. For example, the tables above would be defined in files named "ExampleNamespace.ExampleTable.schema", "Ise.Quotes.schema", and "Korea.Securities.schema".

The following attributes are necessary for all Table elements:

  • namespace - The system namespace (e.g., Ise or Korea) to which this table belongs. A namespace is analogous to a folder on the filesystem: it is used to organize the data, and one namespace typically contains many tables (just as a folder typically contains many files).
  • name - The name of this table (e.g., Quotes or Securities). Namespace and table names must start with a letter, underscore, or dollar sign; the remainder of the name can include numbers, plus or minus signs, or an at sign.
  • storageType - Specifies the way data for this table will be organized on disk. Options:
    • NestedPartitionedOnDisk - a hierarchically-partitioned format, suitable for nearly all use cases (the remainder of this document applies primarily to this format).
    • SplayedOnDisk - a table that exists in exactly one location, without any of the features enabled by partitioning (near real-time import, storage load balancing, etc.).
    • Extended - a table in one of the supported Extended Layouts.
  • defaultMergeFormat - Specifies the format in which merged (historical) data will be written:
    • DeephavenV1 - the data will be written in the proprietary Deephaven format.
    • Parquet - the data will be written in Parquet format.
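
For instance, a minimal table whose merged historical data is written in Parquet format might be declared as follows (a sketch with illustrative names):

<Table namespace="ExampleNamespace" name="ParquetExample" storageType="NestedPartitionedOnDisk" defaultMergeFormat="Parquet">
   <Partitions keyFormula="${autobalance_single}"/>
   <Column name="Date" dataType="String" columnType="Partitioning" />
   <Column name="Value" dataType="double" />
</Table>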

Columns

The <Column> elements define each of the table's columns, including the column's name, data type, and role as a partitioning or grouping column.

These elements may contain the following attributes:

  • name (required) - The name of the column as it will be stored in the database. Column names must be valid Java identifiers, and must not be one of the following reserved values: i, ii, k, in, and not. This is necessary for the engine's formula parser to identify them properly.
  • dataType (required) - The type of the data stored in the column. This can be one of the following:
    • A Java class name, including the package. Of particular note are BigDecimal, BigInteger, and byte[], since these types are supported with special codecs (see below).
    • One of the following type names:
      • DateTime — A date and time, stored as a 64-bit integer representing nanoseconds since the epoch. When read in a Legacy worker, the column is represented as a DBDateTime object; when read in a Community worker, it is represented as an Instant.
      • String — A Java String.
      • StringSet — An efficient Deephaven collection for storing multiple Strings. Note that StringSets are limited to 64 values per internal partition, and these values cannot be changed once the partition is created.
      • Boolean — A Java Boolean object. In order to support null values, boolean values are stored with Java's boxed type (java.lang.Boolean), not the primitive type (boolean).
      • byte — A Java primitive byte.
      • char — A Java primitive char.
      • double — A Java primitive double.
      • float — A Java primitive float.
      • integer — A Java primitive int.
      • long — A Java primitive long.
      • short — A Java primitive short.
    • An array of any of the above data types, specified with square brackets, for example, short[].
    Note: Deephaven also supports other friendly names for some types; for instance, Integer will be automatically treated as int. This treatment does not apply, however, to array types: an array of 32-bit values must be declared as int[].
  • columnType (optional; default Normal) - The role of the column in the table; one of the following:
    • Normal — A standard column with no special attributes.
    • Partitioning — A column whose value organizes data in the file system. The data for each partition is stored in a separate directory, named for the partition value. The partitioning column is not written as a column file to disk, since its value can be inferred from the directory name. Note that every schema with storageType="NestedPartitionedOnDisk" must specify a partitioning column, and partitioning columns must have dataType="String".
    • Grouping — A grouping column used when reordering data during the merge process.
    See Partition Key Formulas for details on the function of Partitioning and Grouping columns.
  • objectCodec (optional) - A Java class that controls how the column is stored on disk. See below for details on the provided codecs:
    • com.illumon.util.codec.BigDecimalCodec
    • com.illumon.util.codec.BigIntegerCodec
    • com.illumon.util.codec.ByteArrayCodec
  • objectCodecArguments (optional) - Argument(s) provided to the objectCodec. How this value is interpreted depends on which codec is specified.
  • symbolTable (optional; default ColumnLocation) - For String columns only. Specifies whether a symbol table will be used to store the data. Supported values are:
    • ColumnLocation - for columns with a limited number of possible unique values; stores each unique value once and stores only references to those values. This can save space in the table.
    • None - for columns with a large number of unique values; stores each value as its own entry in the column.
    See: Symbol Tables.
  • encoding (optional; default ISO_8859_1) - For String columns only. Determines the encoding used to store the value. Supported values are US_ASCII, ISO_8859_1, UTF_8, UTF_16BE, UTF_16LE, and UTF_16.
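
For example, the following column definitions (with illustrative names) sketch how several of these attributes combine: an array column, a high-cardinality String column with its symbol table disabled, and a String column stored as UTF-8:

<Column name="Prices" dataType="double[]" />
<Column name="OrderId" dataType="String" symbolTable="None" />
<Column name="Description" dataType="String" encoding="UTF_8" />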

Column Codecs

Column codecs provide custom serialization and deserialization logic for complex data types.

Codecs convert between raw bytes (i.e., the Java type byte[]) and the column's data type (as specified by the dataType attribute). The output type of the codec must match the column's type. Additional codecs can be provided by adding custom implementations of the com.illumon.util.codec.ObjectCodec<TYPE> interface to the classpath (for example, in /etc/sysconfig/illumon.d/java_lib).

Deephaven provides codecs for some common data types such as java.math.BigDecimal. Using these codecs is significantly more space-efficient than relying on default Java serialization. For a list of all provided column codecs, see the appendix below.

The following example schema includes columns using the provided BigDecimalCodec, BigIntegerCodec and ByteArrayCodec codecs:

<Table name="CodecsExampleTable" namespace="ExampleNamespace" storageType="NestedPartitionedOnDisk">
   <Partitions keyFormula="${autobalance_single}"/>
   <Column name="Date" dataType="String" columnType="Partitioning" />
   <Column name="Alpha" dataType="String" columnType="Grouping"/>
   <Column name="Bravo" dataType="int" />
   <Column name="Charlie" dataType="double" />
   <Column name="Delta" dataType="double" />
   <Column name="Echo" dataType="StringSet" />
   <Column name="Foxtrot" dataType="java.math.BigDecimal" objectCodec="com.illumon.util.codec.BigDecimalCodec" objectCodecArguments="13,7"/>
   <Column name="Golf" dataType="java.math.BigInteger" objectCodec="com.illumon.util.codec.BigIntegerCodec" objectCodecArguments="13"/>
   <Column name="Hotel" dataType="byte[]" objectCodec="com.illumon.util.codec.ByteArrayCodec" objectCodecArguments="32,notnull"/>
</Table>

For streaming data ingestion, the schema must also be updated so that the loggers and listeners use the codec. Please see the section on Column codecs in loggers and listeners for more details.

CopyTable schemas

Note

The CopyTable feature applies only to System schemas. CopyTable schemas are not supported for user schemas.

One table layout may be used for multiple tables. When this is required, it is not necessary to replicate the entire source schema definition for each new table. Instead, a schema definition can use the CopyTable element instead of the Table element, referring to another source table for its definition. The CopyTable element then uses the source table's schema for all definitions, including columns, imports, loggers and listeners.

The schema file naming convention of <namespace>.<table>.schema also applies to CopyTable schema files. The namespace and table in the CopyTable schema filename should be those of the duplicate table, not the source table.

A CopyTable definition contains the following attributes:

  • namespace - the duplicate table's namespace
  • sourceNamespace - the source table's namespace
  • name - the duplicate table's name
  • sourceName - the source table's name

Note

The previous CopyTable syntax of namespaceSource and nameSource has been deprecated.

For example, the following would use the definition of the table "ExampleTable" from the namespace "ExampleNamespace" for a table called "ExampleTableCopy" in the namespace "CopyTableNamespace":

<CopyTable namespace="CopyTableNamespace" sourceNamespace="ExampleNamespace" name="ExampleTableCopy" sourceName="ExampleTable" />

A duplicate table functions in the same manner as any other Deephaven table. Data can be imported through the importer classes or real-time ingestion, partitions will be generated in the same ways, intraday data can be merged to the historical database, and queries will work as expected.

Loggers and listeners are not generated for CopyTable definitions, as the loggers and listeners from the source table will be used for the duplicate table.

Warning

Chained CopyTable definitions are not supported at this time; a CopyTable must refer to a fully defined Table schema.

Table storage

Partitions

Deephaven has two distinct uses of the term "partition" as it applies to schemas. Both are required for partitioned tables (i.e., those using the NestedPartitionedOnDisk storage type).

  • A partitioning column is a String column that must contain an identifiable value for each set of data appended to the table. This is required for all partitioned tables.
  • The historical storage partitions, along with the partitioning key formula, are used to share table data across multiple storage locations as the data is merged into the historical database. This is required for all partitioned tables.

Typically, date strings (such as 2023-01-31) are used for the partitioning column value, but any string that meaningfully partitions the data can be used. Table data is stored on disk in separate directories based on the partitioning column value, which allows queries to filter data very efficiently based on partition. For example, if a table is partitioned by date, then a filter such as .where("Date=`2023-01-31`", "Sym=`AAPL`") would only need to read data from directories for the date 2023-01-31. Since most queries must filter by date when initially retrieving data, it is generally recommended to use dates as partitioning columns. Further, the ISO 8601 date format of yyyy-MM-dd (e.g., 2023-01-31) is recommended. Unlike other date formats, the yyyy-MM-dd format can easily be sorted because its lexicographical order matches its chronological order.
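
For example, in a Core+ Groovy worker, filtering on the partitioning column limits disk access to the matching partition directories. The sketch below assumes the Ise/Quotes table defined earlier and uses db.historicalTable, the Core+ method for opening a historical table:

// Reads only the directories for the 2023-01-31 column partition,
// rather than scanning the entire table.
quotes = db.historicalTable("Ise", "Quotes").where("Date=`2023-01-31`", "Ticker=`AAPL`")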

Partitioning columns are specified by using the columnType="Partitioning" attribute in a column definition. Generally, partitioned tables should have exactly one such column. Further information on column definitions is provided here.

When writing merged historical table data to disk, Deephaven will place partitions of the input data in a directory identified by the namespace, storage partition, column partition, and table name.

Partition Key Formulas

Historical storage partitions correspond to directories available for historical storage on the disk; for scaling and performance benefits, these are normally mapped to different physical storage devices, whether local or network-connected.

Historical storage partitions are specified using the Partitions element, whose keyFormula attribute maps table rows to the available storage partitions. Specifically, merge operations apply this formula in an updateView operation, the result of which is used to assign each row to a storage partition.

The following QueryScope parameters are available as input to the formula, in addition to the columns of the table:

  • __WRITABLE_PARTITIONS__ - An array of partition name strings corresponding to directories found at /db/Systems/Namespace/WritablePartitions.
  • __NUM_WRITABLE_PARTITIONS__ - The number of available partitions (i.e., __WRITABLE_PARTITIONS__.length).
  • __NAMESPACE__ - A string representing the namespace.
  • __TABLE_NAME__ - A string representing the table name.
  • __NS_TN__ - A string representing the namespace and table name, joined with a slash as separator (e.g., Ise/Quotes or Korea/Securities).
  • __PARTITIONING_COLUMN_VALUE__ - The partitioning column value for this merge operation.
  • __DAY_OF_YEAR__ - The day of the year; only available when __PARTITIONING_COLUMN_VALUE__ appears to be a date in yyyy-MM-dd form.
  • __PARTITION_AUTOBALANCE_SINGLE__ - A complete formula that automatically chooses a single partition for all data based on __NS_TN__ and __PARTITIONING_COLUMN_VALUE__. Deprecated in favor of ${autobalance_single}.

In addition, the following substitutions will be applied, in order to make writing appropriate partitioning formulas easier:

  • ${autobalance_single} - A complete formula that chooses a single partition for all input data, attempting to achieve balanced partition usage over long time horizons. This is the preferred way to write a formula that only refers to the __PARTITION_AUTOBALANCE_SINGLE__ query scope parameter.
  • ${autobalance_by_first_grouping_column} - A complete formula that uses the first grouping column (if there is one) to spread data across all available partitions, attempting to achieve balanced partition usage over long time horizons. Assumes the first grouping column is the most significant, and keeps all grouped data together. Falls back to the same behavior as ${autobalance_single} if there are no grouping columns.
  • ${autobalance_by_all_grouping_columns} - A complete formula that uses all grouping columns (if there are any) to spread data across all available partitions, attempting to achieve balanced partition usage over long time horizons. Effectively spreads grouped data across multiple partitions, which may be useful for queries. Falls back to the same behavior as ${autobalance_single} if there are no grouping columns.
  • ${first_grouping_column} - A string representing the name of the first grouping column (i.e., the first column definition with the attribute columnType="Grouping"). Only available if there are grouping columns.
  • ${all_grouping_columns} - An array of strings representing the names of the grouping columns (i.e., column definitions with the attribute columnType="Grouping"). Only available if there are grouping columns.

Conventions and Recommendations

Some common conventions follow:

  • Large tables: Tables with large amounts of data per column partition should typically be balanced across all available storage partitions while keeping related data together, often by using a grouping column as part of the formula. The easiest way to achieve this is the following token substitution: ${autobalance_by_first_grouping_column}
  • For an expanded example, the Ise/Quotes table above might use the following formula (shown in schema form below): __WRITABLE_PARTITIONS__[abs(Objects.hash(__NS_TN__, Ticker) + __DAY_OF_YEAR__) % __NUM_WRITABLE_PARTITIONS__]
    This distributes rows with the same Ticker value together, while spreading data across all available partitions and taking day of year, namespace, and table name into account to approximate long-term load balancing.
  • Small tables: Tables with small amounts of data per column partition often simply use ${autobalance_single}, keeping data in a relatively compact format while utilizing all historical partitions over longer time horizons to approximate load balancing.
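
In schema form, the expanded Ise/Quotes formula above would appear as follows (a sketch; Ticker is that table's first grouping column):

<Partitions keyFormula="__WRITABLE_PARTITIONS__[abs(Objects.hash(__NS_TN__, Ticker) + __DAY_OF_YEAR__) % __NUM_WRITABLE_PARTITIONS__]"/>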

Merge Attributes

When merging data to historical partitions, Deephaven supports writing to traditional Deephaven files as well as writing to Parquet files. When writing to Parquet, a default compression codec can be chosen by adding a MergeAttributes element with an appropriate Parquet-supported codec. This codec can be changed by specifying a different one in the merge operation. If no codec is chosen, then the SNAPPY codec is used. This element does not apply to merges in Deephaven format.

Note

See: Parquet Codecs

For example, to use the uncompressed codec:

<MergeAttributes format="Parquet" codec="UNCOMPRESSED" />

Data Indexes

In addition to Grouping, Deephaven supports the traditional concept of indexing historical data. A Data Index is a table on disk that maps each unique key, defined by the set of index columns, to the row indices within the table at which that key appears. These tables allow the Deephaven Query Engine to optimize certain operations such as filtering, aggregating, and joining data.

Data Indexes are defined using the <DataIndex columns='...'> XML tag in the schema. Each <DataIndex...> entry directs the Merge process to group the data being merged by the set of columns (called keys) in the columns attribute. The Data Index is then written to disk, sorted in the same order as the columns are specified in the <DataIndex...> tag. Choosing an ordering of these columns based on how they are expected to be used in queries can improve the performance of operations that can take advantage of Data Indexes.

For example, the following snippet defines two separate Data Indexes, one on USym, and one on USym and Exchange:

<DataIndex columns="USym" />
<DataIndex columns="USym,Exchange" />

Extended Layouts

Note

Extended Layouts are only supported in Core+ workers.

Extended Layouts let you store data in the historical store using more complex layouts, and also allow multiple partitioning columns. To configure a table to use an Extended Layout, set the storageType attribute of the Table element to Extended, and add an ExtendedStorage XML element to specify the type:

<Table name="ExtendedType" namespace="Example" defaultMergeFormat="DeephavenV1" storageType="Extended">
  <Partitions keyFormula="${autobalance_single}" />

  <ExtendedStorage type="coreplus:hive" />

  <Column name="Date" dataType="String" columnType="Partitioning" />
  <Column name="MyCol" dataType="Double" />
</Table>

See the table storage documentation for the full list of supported layouts.

You may use the following Groovy script in a Core+ worker to derive a schema from an existing Parquet table and add it to Deephaven:

import static io.deephaven.parquet.table.ParquetTools.readTable
import static io.deephaven.shadow.enterprise.com.illumon.iris.db.tables.TableDefinition.STORAGETYPE_EXTENDED
import io.deephaven.enterprise.compatibility.TableDefinitionCompatibility
import io.deephaven.shadow.enterprise.com.illumon.iris.db.schema.SchemaServiceFactory
import io.deephaven.shadow.enterprise.com.illumon.iris.db.schema.NamespaceSet
import io.deephaven.shadow.enterprise.com.illumon.iris.db.schema.xml.SchemaXmlFactory
import io.deephaven.shadow.enterprise.com.illumon.iris.db.schema.xml.ExtendedStorageXmlImpl
import io.deephaven.shadow.enterprise.org.jdom2.Element

NAMESPACE = "MyNamespace"

// Read the Parquet file, convert its definition to an Enterprise TableDefinition,
// mark it as Extended storage, and attach an ExtendedStorage element of the
// requested type to the resulting XML schema.
createSchemaFor = { String namespace, String tn, String templatePath, String st ->
    def template = readTable(templatePath)
    def enterpriseDef = TableDefinitionCompatibility.convertToEnterprise(template.getDefinition())
    enterpriseDef.setName(tn)
    enterpriseDef.setNamespace(namespace)
    enterpriseDef.setStorageType(STORAGETYPE_EXTENDED)
    def xmlSchema = SchemaXmlFactory.getXmlSchema(enterpriseDef, NamespaceSet.SYSTEM).getMutable()
    def extendedElem = new Element("ExtendedStorage").setAttribute("type", st)
    xmlSchema.setExtendedStorage(ExtendedStorageXmlImpl.fromImmutableElement(extendedElem))
    return xmlSchema
}

// Create the namespace if it does not already exist, then add the new schema.
ss = SchemaServiceFactory.getDefault()
if (!ss.containsNamespace(NAMESPACE)) {
    ss.createNamespace(NamespaceSet.SYSTEM, NAMESPACE)
}

ss.addSchema(createSchemaFor(NAMESPACE, "MyTable", "/path/to/table.parquet", "coreplus:hive"))

Data ingestion

Built-in date/time converters

The schema creators have a number of built-in converters to recognize various date/time formats. Unrecognized formats will result in a date/time field being treated as String. If ambiguous data is included, such that multiple converters match all the data values, the column will also be treated as String, and a warning message will be written to the log.

  • ISO format with date only - 2017-08-30
  • ISO UTC date/time with optional sub-second precision (no T) - 2017-08-30 11:59:59.000Z
  • ISO UTC date/time with optional sub-second precision (with T) - 2017-08-30T11:59:59.000Z
  • ISO UTC two-digit offset date/time with millisecond precision - 2017-08-30T11:59:59.000+04
  • ISO UTC four-digit offset date/time with millisecond precision - 2017-08-30T11:59:59.000+0400
  • ISO UTC two-digit offset date/time with microsecond precision - 2017-08-30T11:59:59.000123+04
  • ISO UTC four-digit offset date/time with microsecond precision - 2017-08-30T11:59:59.000123+0400
  • ISO time zone date/time with optional sub-second precision - 2017-08-30T11:59:59.000 NY
  • Exchange timestamp (will be converted to NY time) - 2017-08-30T11:59:59.000
  • Timestamp UTC with optional sub-second precision - 2017-08-30 11:59:59 or 2017-08-30 11:59:59.123456
  • SQL Server default format - AUG 30 2017 11:59AM

Custom date/time converters

Converters for other formats can be included by adding their details to a property file used by the schema creator process.

Here is an example converter that could be added in a property file:

SchemaCreator.converter.name.1=USDateTimeSeconds1
SchemaCreator.converter.pattern.1=^((0?\\d)|(1[0-2]))\\/(([0-2]?\\d)|(3[0-1]))\\/\\d{4} (([0-1]?\\d)|(2[0-3])):[0-5]\\d:[0-5]\\d$
SchemaCreator.converter.formula.1=new DBDateTime(new java.text.SimpleDateFormat("MM/dd/yyyy HH:mm:ss").parse($TARGETNAME$).getTime()*1000000)
SchemaCreator.converter.sourceType.1=String
SchemaCreator.converter.targetType.1=DateTime

Five components comprise the definition of a custom converter, and all five are required. Note the format of the property names: SchemaCreator.<attribute name>.<unique index of the converter>. The index value must match across all five properties. The values do not have to start at 1, nor be sequential, but it is important to ensure that index numbers across multiple property files do not collide (see the hypothetical example following the list below).

These five pieces can appear in any order and do not have to be contiguous, but all five pieces must be present if any one of them is present. If an incomplete converter entry is detected when a schema creator process starts, the process will fail and will log details about which components of the definition are missing.

  1. name - The name attribute is mainly for convenience in identifying a particular converter.
  2. pattern - The pattern is a regular expression string used to match data the converter can process. It must be very specific so this converter does not overlap any other converters, and so this formula is not given any data that it cannot process. The example regex limits the month portion of strings it will match to numbers less than 13, which helps avoid matches to non-US date strings where the day would be written before the month. When trying to find a converter to use for complex data, the schema generator will try to find one and only one converter whose pattern matches all the non-empty values from the sample data. Note that all backslashes in any property values must be escaped by doubling them. In this example, the regex will match date/time strings like this one: 01/11/1975 11:10:23
  3. formula - The formula contains a Java expression that will be used to convert the source data to the target data type. It must include the $TARGETNAME$ token, which will be replaced with the column name by the schema generator. In this case, a Java SimpleDateFormat is used to parse the date/time string into a Java Date, and the .getTime() method then yields an epoch milliseconds value for the Date. Deephaven uses epoch nanoseconds, so this value is multiplied by one million and passed to the DBDateTime constructor to return a DBDateTime.
  4. sourceType - The sourceType controls how the importer will interpret a value before passing it to the formula. It will usually be String, but if the source value was already an epoch value, for instance, then the source type might be long instead of String.
  5. targetType - The targetType must match the type produced by the formula. This is used by the schema generator to set the appropriate column type for columns that match the regex pattern.
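
For illustration, the following hypothetical second converter (its name, index, regex, and formula are invented for this example) would recognize 13-digit epoch-millisecond values. It uses index 17 to show that indexes need not be sequential, so long as all five properties share the same index:

SchemaCreator.converter.name.17=EpochMillis
SchemaCreator.converter.pattern.17=^1\\d{12}$
SchemaCreator.converter.formula.17=new DBDateTime($TARGETNAME$*1000000)
SchemaCreator.converter.sourceType.17=long
SchemaCreator.converter.targetType.17=DateTime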

ImportSource

The ImportSource component of a schema provides additional metadata to control and modify the import process.

<Table name="JDBCNames" namespace="Import" storageType="NestedPartitionedOnDisk">
     <ImportSource name="JDBC1" type="JDBC">
         <ImportColumn name="FirstName" sourceName="first_name" />
         <ImportColumn name="LastName" sourceName="last_name" />
         <ImportColumn name="Gender" sourceName="gender" />
         <ImportColumn name="Race" sourceName="race" />
         <ImportColumn name="FirstNameLength" sourceName="first_name" formula="FirstNameLength.length()" sourceType="String" />
    </ImportSource>

<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
     <Column name="Date" intradayType="none" dataType="String" columnType="Partitioning" />
     <Column name="FirstName" dataType="String" columnType="Normal" />
     <Column name="LastName" dataType="String" columnType="Grouping" />
     <Column name="Gender" dataType="String" columnType="Normal" />
     <Column name="Race" dataType="String" columnType="Normal" />
     <Column name="FirstNameLength" dataType="int" columnType="Normal" />
</Table>

ImportSources are highly customizable. They can be used to specify column mappings, custom formulas such as formula="FirstNameLength.length()" in the above example, custom FieldWriters that operate on multiple input columns, and more. See the Schema ImportSource documentation for more information.

Data validation

Data validation rules can be placed directly into a table's schema file by adding a <Validator> element which contains the desired validation methods. For example, the DbInternal.ProcessEventLog schema contains the following validator:

<Validator>
    <assertColumnTypes />
    <assertNotNull columns="Timestamp,Host,Level,Process" />
    <assertFracNull column="LogEntry" min="0" max="0.25" />
    <assertSize min="1" max="10000000000" />
</Validator>

See Data validation for more information.

Managing schema files

Once schemas have been defined or changed, they must be made visible to Deephaven (deployed). If any custom loggers are defined, these must be generated.

The logger generation can be done from deployed schemas, or from another location prior to schema deployment so the generated loggers can be integrated with the customer application before being deployed to Deephaven.

Two scripts are provided for managing schema files, and are described below:

  1. dhconfig both deploys schemas to a Deephaven database, making their described tables available for use, and exports schemas from etcd to a specified directory. See: Configuration Tools.
  2. generate_loggers generates the logger classes from schemas. For more information on loggers and listeners see Streaming Data.

These scripts can be used during the initial setup of a new Deephaven system, as well as to update configuration when schemas are added or modified. Rerunning the scripts is safe, even if nothing has changed since the last time they were run.

Deephaven clusters use etcd to centralize storage of schema files. For a detailed guide to managing schema files (deploying, editing, etc.), see Schema Management.

Appendix: Available column codecs

The following column codecs are provided with Deephaven. For columns of these types, it is much more space efficient to use the corresponding codec than to rely on the default Java serialization.

  • com.illumon.util.codec.BigDecimalCodec
    Column type: java.math.BigDecimal
    Arguments: [<precision> [, <scale>]]
    The precision argument specifies the maximum number of decimal digits that may be stored; the scale argument specifies the maximum number of decimal digits after the decimal point. If precision is specified but scale is not, a scale of 0 is inferred. If precision is specified, a space-efficient fixed-size encoding is used; otherwise, a variable-size encoding is used, with a maximum precision of 1000. Null values are permitted.
  • com.illumon.util.codec.BigIntegerCodec
    Column type: java.math.BigInteger
    Arguments: [<precision>]
    If specified, precision is the maximum number of decimal digits storable by this column. This codec uses the same encoding as BigDecimalCodec with a scale of zero. Null values are permitted.
  • com.illumon.util.codec.ByteArrayCodec
    Column type: byte[]
    Arguments: [<size> [, nullable | notnull]]
    If size is specified, values are encoded as fixed-size values; if not, a variable-size encoding is used. A flag indicating nullability may also be specified; the default is nullable. Specifying notnull saves one byte per value.
  • com.illumon.util.codec.StringBooleanMapCodec
    Column type: java.util.Map<String, Boolean>
    Arguments: none
    Encodes a map with String keys and Boolean values.
  • com.illumon.util.codec.StringDoubleMapCodec
    Column type: java.util.Map<String, Double>
    Arguments: none
    Encodes a map with String keys and Double values.
  • com.illumon.util.codec.StringFloatMapCodec
    Column type: java.util.Map<String, Float>
    Arguments: none
    Encodes a map with String keys and Float values.
  • com.illumon.util.codec.StringLongMapCodec
    Column type: java.util.Map<String, Long>
    Arguments: none
    Encodes a map with String keys and Long values.
  • com.illumon.util.codec.StringIntMapCodec
    Column type: java.util.Map<String, Integer>
    Arguments: none
    Encodes a map with String keys and Integer values.
  • com.illumon.util.codec.StringStringMapCodec
    Column type: java.util.Map<String, String>
    Arguments: none
    Encodes a map with String keys and String values.