Table Storage Overview

How Deephaven stores data has a direct effect on the performance of a deployment. This guide gives an overview of the different types of data, how they are stored and accessed, and how to configure storage for optimal performance.

Partitioning

Partitioning is a key concept in Deephaven. It allows data to be split into smaller, more manageable pieces that can be stored and accessed independently. This can improve performance by allowing data to be accessed in parallel and by reducing the amount of data that needs to be loaded into memory at any given time.

Consider a table with 100 billion rows. If the table is partitioned by date and each partition contains 1 billion rows, the table has 100 partitions, and accessing a single partition requires loading only 1% of the data into memory. This can significantly reduce the memory and time required to work with large datasets.
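The arithmetic in this example can be checked with a small sketch. The code below is plain Python, not the Deephaven API; the `rows` data and the `partition_by` helper are hypothetical and exist only to illustrate how splitting a table by one column limits how much data a single-partition read touches.

```python
from collections import defaultdict

# Hypothetical rows: (date, symbol, price). A real table would hold billions of rows.
rows = [
    ("2024-01-02", "AAPL", 185.6),
    ("2024-01-02", "MSFT", 370.9),
    ("2024-01-03", "AAPL", 184.3),
    ("2024-01-04", "MSFT", 367.8),
]

def partition_by(rows, key_index):
    """Split rows into independent partitions keyed by one column (illustrative helper)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    return dict(partitions)

partitions = partition_by(rows, key_index=0)

# Reading one date touches only that partition's rows.
fraction_read = len(partitions["2024-01-02"]) / len(rows)

# The numbers from the text: 100 billion rows, 1 billion rows per partition.
total_rows = 100_000_000_000
rows_per_partition = 1_000_000_000
partition_count = total_rows // rows_per_partition
fraction_per_partition = rows_per_partition / total_rows
```

With the figures from the text, `partition_count` is 100 and `fraction_per_partition` is 0.01, which is the 1% quoted in the example.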

Namespace types

In Deephaven, there are two types of namespaces:

System namespaces are used for data that is important for business processes or used by many individuals.
User namespaces are used for data that is not as important or is only used by a few individuals.

System namespaces

System namespaces follow a structured administrative process for updating schemas and for importing, merging, and validating data. Their schemas are defined in Deephaven schema files, which are often kept in version control for collaboration and revision history. Administrative users update these schemas on a business-appropriate schedule. Queries cannot modify tables in these namespaces directly through the Database APIs; instead, the tables are modified by system-level import jobs or merge queries.

Data that is crucial for business processes or widely used should reside in a system namespace.

User namespaces

User namespaces are managed directly by users with limited privileges via the Database APIs. These namespaces generally do not utilize external schema files and are often not subject to the same administrative processes as system namespaces.

Typically, user namespaces are used to store intermediate query results or to experiment with research ideas. If the data in a user namespace becomes more significant, it can be migrated to a system namespace for better management and integration.

Availability types

Availability type categorizes data by its timeliness and stability. The intraday category is less meaningful for user namespaces than for system namespaces, though it is not entirely irrelevant there.

Intraday data

Intraday tables can be thought of as a "live" view of data that is continuously updated. Data is appended to the table as it becomes available, and the table is not compacted or optimized for query performance. Intraday tables are typically used for real-time analytics and monitoring.

Intraday data for system namespaces is internally partitioned by source.
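As a rough mental model only, an append-only table that keeps one internal partition per source might look like the sketch below. This is plain Python, not Deephaven's storage implementation; the class name `IntradayTableSketch`, the source names, and the row contents are all hypothetical.

```python
from collections import defaultdict

class IntradayTableSketch:
    """Toy append-only table whose rows are stored per source (hypothetical)."""

    def __init__(self):
        # One internal partition (a plain list) per data source.
        self._by_source = defaultdict(list)

    def append(self, source, row):
        # Rows are only ever added at the end of their source's partition.
        self._by_source[source].append(row)

    def sources(self):
        return sorted(self._by_source)

    def rows(self):
        # A reader sees all sources together; order within a source is preserved.
        return [row for source in self.sources() for row in self._by_source[source]]

table = IntradayTableSketch()
table.append("feed-a", {"sym": "AAPL", "price": 185.6})
table.append("feed-b", {"sym": "MSFT", "price": 370.9})
table.append("feed-a", {"sym": "AAPL", "price": 185.7})
```

Each source writes only to its own partition, so writers in this model never need to coordinate with one another.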

Historical data

Historical tables are used for data that is no longer being updated. Data is compacted and optimized for query performance. Historical tables are better suited than intraday tables to long-term storage and analysis of data.

Historical data is partitioned on query-visible columns. This partitioning is usually done by date, but it can be done on any column. Queries that filter on the partitioning column can then read only the matching partitions instead of the whole table.
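The effect of filtering on a partitioning column can be sketched in plain Python. The `partitions` dictionary and the `query` function are hypothetical stand-ins, not Deephaven calls; the sketch only counts how many rows each kind of filter has to scan.

```python
# Hypothetical historical table: rows stored per value of the partitioning column.
partitions = {
    "2024-01-02": [{"sym": "AAPL", "price": 185.6}, {"sym": "MSFT", "price": 370.9}],
    "2024-01-03": [{"sym": "AAPL", "price": 184.3}],
    "2024-01-04": [{"sym": "MSFT", "price": 367.8}],
}

def query(partitions, date=None, sym=None):
    """Return (matching rows, number of rows scanned)."""
    # A filter on the partitioning column selects partitions without reading any rows.
    selected = [date] if date is not None else list(partitions)
    scanned = 0
    matches = []
    for key in selected:
        for row in partitions.get(key, []):
            scanned += 1
            if sym is None or row["sym"] == sym:
                matches.append({"date": key, **row})
    return matches, scanned

by_date, scanned_by_date = query(partitions, date="2024-01-03")
by_sym, scanned_by_sym = query(partitions, sym="AAPL")
```

In this toy data set, the date filter scans one row, while the filter on the non-partitioning `sym` column scans all four.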

When the merge process converts intraday tables to historical tables, data is first partitioned based on storage load-balancing criteria, as detailed in Tables and schemas. Column partitioning is then applied. During the merge, data may be re-ordered by sorting or grouping rules, while the relative order of rows within each source is maintained. After the merge, validation ensures that the historical data accurately reflects the original intraday data and adheres to domain-specific requirements.
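The re-ordering guarantee can be illustrated with a stable sort. The sketch below is plain Python under simplifying assumptions: it omits the storage load-balancing step, reduces validation to a row-count check, and uses hypothetical names (`merge`, `validate`, `feed-a`, `feed-b`) that are not part of the Deephaven merge tooling.

```python
from collections import defaultdict

# Hypothetical intraday rows, stored per source in arrival order.
intraday = {
    "feed-a": [
        {"date": "2024-01-02", "sym": "MSFT", "seq": 1},
        {"date": "2024-01-02", "sym": "AAPL", "seq": 2},
        {"date": "2024-01-02", "sym": "MSFT", "seq": 3},
    ],
    "feed-b": [
        {"date": "2024-01-02", "sym": "AAPL", "seq": 1},
        {"date": "2024-01-02", "sym": "MSFT", "seq": 2},
    ],
}

def merge(intraday, partition_column, group_column):
    """Toy merge: column-partition the rows, then group them with a stable sort."""
    historical = defaultdict(list)
    # Visit sources in a fixed order so the result is deterministic.
    for source in sorted(intraday):
        for row in intraday[source]:
            historical[row[partition_column]].append({**row, "source": source})
    for rows in historical.values():
        # list.sort is stable: rows with equal group keys keep their prior order,
        # so the relative order within each source survives the re-ordering.
        rows.sort(key=lambda r: r[group_column])
    return dict(historical)

def validate(intraday, historical):
    """Toy validation: the merge neither dropped nor duplicated rows."""
    before = sum(len(rows) for rows in intraday.values())
    after = sum(len(rows) for rows in historical.values())
    return before == after

historical = merge(intraday, "date", "sym")
merged_order = [(r["sym"], r["source"], r["seq"]) for r in historical["2024-01-02"]]
```

After grouping by `sym`, the `feed-a` rows for `MSFT` still appear with `seq` 1 before `seq` 3, which is the within-source ordering property described in the text.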

Next steps

Filesystem Data Layout - Learn how data is stored on disk and how to set up storage for a new Deephaven installation.
Data Indexes - Learn how Deephaven can use indexes to improve query performance.
Indexing intraday data - Learn how to index intraday data for better performance.
Splayed Tables - Learn about Deephaven's proprietary table storage format. (Optional)
S3 - Learn how to use S3 as a data store for Deephaven. (Optional)
NFS - Learn how to use NFS as a data store for Deephaven. (Optional)