Schema overview
All Deephaven tables stored in the database (for example, tables that can be read through db.live_table in a query) have a schema that defines the table's namespace, name, column names, and data types. In addition to specifying the structure of the data, a schema can also include the following (a sketch of a complete schema appears after this list):
- Directives controlling how data is imported and stored, such as encoding formats for String columns and codecs for custom serialization of complex data types.
- Metadata for data ingestion, such as custom DateTime converters.
- Data validation rules for ensuring data quality during a merge.
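To make these pieces concrete, below is a minimal sketch of a schema for a hypothetical ExampleNamespace.Orders table. The namespace, table name, columns, and partition keyFormula are invented for illustration; the elements and attributes they use are described in the sections that follow.
<Table name="Orders" namespace="ExampleNamespace" storageType="NestedPartitionedOnDisk">
  <Partitions keyFormula="${autobalance_single}" />
  <Column name="Date" dataType="String" columnType="Partitioning" />
  <Column name="Timestamp" dataType="DateTime" />
  <Column name="Symbol" dataType="String" columnType="Grouping" />
  <Column name="Price" dataType="double" />
  <Column name="Size" dataType="int" />
</Table>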
Columns
Deephaven schemas define the names and data types for each column in a table. Below are some columns from the DbInternal.AuditEventLog table:
<Column name="Date" dataType="String" columnType="Partitioning" />
<Column name="Timestamp" dataType="DateTime" />
<Column name="ClientHost" dataType="String" />
<Column name="ClientPort" dataType="int" />
<Column name="Details" dataType="String" symbolTable="None" encoding="UTF_8" />
The Column element can also specify how the data is stored on disk. For example, the DbInternal.AuditEventLog table is partitioned on Date. See the full list of available column attributes in the table and schemas concept guide.
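Another storage-related attribute is columnType="Grouping", which marks a column as a grouping column so that merged data is grouped by its values on disk. The column name in this sketch is hypothetical:
<Column name="EventType" dataType="String" columnType="Grouping" />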
Data types
Data types can generally be Java primitive types, arrays of primitive types, Strings, or other Java classes. Column codecs provide custom serialization logic for complex data types. See dataType for more information.
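For instance, a complex type can be stored with a custom codec via the objectCodec attribute. This is a sketch; the column name, codec class, and arguments below are illustrative and should be replaced with a codec appropriate for your data type:
<Column name="Value" dataType="java.math.BigDecimal" objectCodec="io.deephaven.util.codec.BigDecimalCodec" objectCodecArguments="20,10" />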
Historical data
There are two main categories of data storage in Deephaven: intraday and historical. Some historical storage options can be configured in the schema.
Merge attributes
Intraday data can be merged to historical storage in Deephaven or Apache Parquet formats. When merging data to Parquet, a default compression codec can be chosen by adding a MergeAttributes element with an appropriate Parquet-supported codec.
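For example, a schema could request Parquet storage with SNAPPY compression for merged data. This is a sketch; check the merge attributes documentation for the exact element and supported codec names:
<MergeAttributes format="Parquet" codec="SNAPPY" />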
Extended layouts
Extended layouts are available for users with complex Parquet layouts created by other tools, such as Apache Hadoop. They also allow you to use multiple partitioning columns.
Data ingestion
Schemas can be extended with metadata to control how data is ingested into Deephaven (see the sketch after this list). This includes:
- DateTime converters for parsing date strings
- Custom field writers for importing data from CSV, JSON, JDBC, and XML files
- Data validation rules for ensuring data quality during a merge
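As an illustration, ingestion metadata is commonly expressed in an ImportSource block inside the Table element. The sketch below is for a hypothetical CSV import; the source column names and the conversion formula are invented for the example:
<ImportSource name="CsvImport" type="CSV">
  <!-- Convert a String timestamp from the source file into the DateTime Timestamp column -->
  <ImportColumn name="Timestamp" sourceName="event_time" sourceType="String" formula="DBTimeUtils.convertDateTime(event_time)" />
  <!-- Map a differently named source column onto the schema's column name -->
  <ImportColumn name="ClientPort" sourceName="client_port" sourceType="int" />
</ImportSource>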
Managing schema files
Schemas are stored in etcd and can be imported to or exported from Deephaven using dhconfig schemas. Special care must be taken when updating a schema during the ingestion window.
Schema inference
Deephaven provides tools that make writing new schemas easier by automatically inferring the schema from the data source. Schema inference is available for data sources such as CSV, JSON, JDBC, and XML.
CopyTable schemas
One table layout may be used for multiple system tables. When this is required, it is not necessary to replicate the entire source schema definition for each new table. See CopyTable for more information.
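As a sketch, a CopyTable definition points a new table at an existing schema rather than repeating it. The namespaces and table names here are hypothetical; confirm the attribute names against the CopyTable documentation:
<CopyTable namespace="ExampleNamespace" sourceNamespace="ExampleNamespace" name="OrdersBackup" sourceName="Orders" />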