Filesystem data layout

Deephaven currently supports three possible table layouts on the filesystem:

  • Splayed tables are simple column-oriented data stores without any partitioning, meaning they are stored in one location directory. They are typically used for storing query result sets as tables in user namespaces.
  • Partitioned tables are hierarchical stores of many splayed tables, each representing a single data partition. The Deephaven query engine automatically combines partitions into a single table for presentation purposes, although this step is deferred for optimization purposes.
  • Extended layouts are more complicated layouts for Parquet-formatted and Deephaven-formatted tables.

In practice, splayed tables are used for storing a user's query result as a table in a user namespace, while partitioned tables are used for storing multiple user query results in a user namespace or for storing historical data in system namespaces. Extended layouts are used for system namespace tables that may have been created from other tools, such as Apache Hadoop.

Partitioning

Deephaven tables use two levels of partitioning:

  • Internal partitions separate data from different sources (intraday) or distribute it across multiple devices (historical).
  • Partitioning values (typically a date string) break large tables into manageable chunks. This type of partitioning works the same for both intraday and historical tables.

Each partition is placed in its own directory, creating a two-level hierarchy:

  • Top-level directory (e.g., IntradayPartition1, PartitionName1): This directory is for internal partitions and is hidden from queries. It serves multiple purposes:
    • For historical system data, it introduces storage boundaries for load balancing and leverages metadata for deferred indexing.
    • For intraday system data, it separates data from different sources (e.g., two processes writing to the same table must use different top-level partitions).
    • For user data, it can optionally support parallel publication or other partitioning goals.
  • Second-level directory (e.g., Date1, Date2): This directory is for partition values and is exposed as a partitioning column (often named Date) to user queries. This enables query optimizations by skipping unnecessary partitions. Placing filters on this column first in a where clause maximizes performance.

A location is the directory that corresponds to a particular combination of internal partition and partition value for a table. The location directory is the leaf of the table storage and contains the column data files and other table metadata files corresponding to that location's "slice" of the table.

Design goals

In specifying the layout for Deephaven data and metadata, two primary goals are emphasized:

  1. Decentralization: Deephaven eliminates the need for a centralized server process to manage access to database files.
  2. Ease of Administration: Deephaven databases, namespaces, and tables are organized to ensure:
  • Administrators can easily introduce filesystem boundaries for provisioning purposes, such as between user and system tables or among various storage partitions within a system namespace.
  • Database objects can be added, moved, or removed using standard file operations, whether operating on the entire database or on individual namespaces, partitions, tables, or columns.

Database root

Every Deephaven installation defines a database root directory, typically /db.

  • Intraday data for system namespaces is stored in the Intraday directory under the database root, separate from historical data.
  • Historical data for system namespaces is stored in the Systems directory under the database root.
  • User namespace data is stored in the Users directory under the database root.
  • Administrators may optionally create an intraday data directory for user namespaces to support centrally-appended user data.

Each of these directories contains a well-defined set of subdirectories.

Intraday system tables

Intraday data for system namespaces is stored in the subdirectories as shown below.

/db/Intraday
|--Namespace                  (one directory for each namespace)
|  |--TableName1              (one directory for each table)
|  |  |--IntradayPartition1   (one directory for each source of the data)
|  |  |  |--Date1             (one directory for each column partition value)
|  |  |  |  |--TableName1     (matches the directory name three levels up)
|  |  |  |  |                 (contains the files which store the data)
|  |  |  |--Date2
|  |  |  |  `--...            (matches the Date1 directory structure)
|  |  |  `--...               (additional dates)
|  |  |--IntradayPartition2
|  |  `--...                  (additional data sources)
|  |--TableName2              (directory structure will be similar to TableName1)
|  |--`...                    (additional tables)

Note that the table name is in the directory structure twice. This is intentional.

  • The first level (under the namespace) facilitates the management of all data for any table. For instance, if a table is no longer needed, it can be deleted at this level.
  • The second level (at the final level of the directory tree) mirrors the historical directory structure layout, allowing easy portability when needed.

For a concrete example, the directory containing the intraday data files for a single date partition for DbInternal.PersistentQueryStateLog table might be:

/db/Intraday/DbInternal/PersistentQueryStateLog/myhost.mydomain.com/2017-08-04/PersistentQueryStateLog/

Historical system tables

Historical data for system namespaces is stored in the subdirectories, as shown below.

/db/Systems
|--Namespace             (one directory for each namespace)
| |--MetadataIndex       (directory for table location lookup)
| | |--TableName1.tlmi   (TableName1 locations index file)
| | |--TableName2.tlmi
| | `--...               (location index files for each table))
| |--WritablePartitions  (directories for writing historical data)
| | |--PartitionName1    (usually links to Partitions/PartitionName1)
| | |--PartitionName2    (usually links to Partitions/PartitionName2)
| | `--...               (additional linked directories as appropriate)
| |--Partitions          (directories for reading historical data)
| | |--PartitionName1    (internal partition for reading data)
| | | |--Date            (one directory for each column partition value)
| | | |--Date1           (one directory for each column partition value)
| | | | |--TableName1    (one directory for each table)
| | | | | |--Index-Col   (one directory for each grouping or data index column)
| | | | |                (contains the files which store the data)
| | | | |--TableName2
| | | | `--...           (additional tables)
| | | |--Date2
| | | | `--...           (matches the Date1 directory structure)
| | | `--...             (additional dates)
| | |--PartitionName2
| | | `--...             (matches the PartitionName1 structure)
| | `--...               (additional partitions as needed)
| | |--Extended          (Directories for reading Extended layouts)
`--...                   (additional namespaces)

For example, an initial layout for the DbInternal namespace might be created as follows. An internal partition called 0 is created to hold the data. For simplicity, only the Partitions and WritablePartitions subdirectories are shown.

|--DbInternal
|  |--WritablePartitions
|  |  |--0                                        (links to Partitions/0)
|  |--Partitions
|  |  |--0
|  |  |  |--2017-08-03
|  |  |  |  |--PersistentQueryStateLog
|  |  |  |  |--PersistentQueryConfigurationLog
|  |  |  |  `--...
|  |  |  |--2017-08-04
|  |  |  |  |--PersistentQueryStateLog
|  |  |  |  |--PersistentQueryConfigurationLog
|  |  |  |  `--...
|  |  |  `--...

When the initial partitions become full, the sysadmin should create new partitions and links as shown below. New data will be added to the internal partition "1", while data will be read from both partitions "0" and "1".

|--DbInternal
|  |--WritablePartitions
|  |  |--1                                      (links to Partitions/1)
|  |--Partitions
|  |  |--0
|  |  |  |--2017-08-03
|  |  |  |  |--PersistentQueryStateLog
|  |  |  |  |--PersistentQueryConfigurationLog
|  |  |  |  `--...
|  |  |  |--2017-08-04
|  |  |  |  |--PersistentQueryStateLog
|  |  |  |  |--PersistentQueryConfigurationLog
|  |  |  |  `--...
|  |  |  `--...
|  |  |--1
|  |  |  |--2017-09-01
|  |  |  |  |--PersistentQueryStateLog
|  |  |  |  |--PersistentQueryConfigurationLog
|  |  |  |  `--...
|  |  |  |--2017-09-02
|  |  |  |  |--PersistentQueryStateLog
|  |  |  |  |--PersistentQueryConfigurationLog
|  |  |  |  `--...
|  |  |  `--...

For a concrete example, the directory containing the historical data files for a single date partition for DbInternal.PersistentQueryStateLog table might be:

/db/Intraday/DbInternal/Partitions/0/2017-08-04/PersistentQueryStateLog/

Extended layout historical system tables

Extended Layouts enable more complex table structures that can be created using tools like Apache Hadoop. To utilize this feature, Table Schemas must specify the Extended Layout type. These tables can include multiple partitioning columns.

Deephaven supports the following layouts for both Deephaven and Parquet tables:

  • coreplus:hive: This layout consists of a directory tree with one level for each partitioning column. Each directory is named PartitionColumnName=PartitionValue, and the leaf directories contain either Parquet files or Deephaven-format tables.

For Parquet tables, Deephaven supports these additional layouts:

  • parquet:flat: A single directory containing multiple Parquet files, each representing a fragment of the entire table.
  • parquet:kv: Equivalent to coreplus:hive layouts.
  • parquet:metadata: Similar to parquet:kv layouts, but includes a _metadata and an optional _common_metadata file in the root directory that explicitly defines the files for various table partitions.

parquet:flat

Tables in the parquet:flat layout consist of multiple Parquet files stored within the Extended directory. Each file represents a fragment of the entire table. When the data is loaded into a table, it is done so alphabetically by filename.

|--/db/Systems/MyNamespace
|  |-- Extended
|  |  |-- Companies
|  |  |  |-- part.0.parquet
|  |  |  |-- part.1.parquet
|  |  |  |-- part.2.parquet
|  |  |  |-- ...

parquet:kv and coreplus:hive

The parquet:kv and coreplus:hive layouts are hierarchical directory structures where each level represents a partitioning column. Each directory is named in the format PartitionColumnName=PartitionValue. The leaf directories contain either Parquet files or Deephaven tables.

|--/db/Systems/MyNamespace
|  |--Extended
|  |  |-- Companies
|  |  |  |-- PC1=Val1
|  |  |  |  |-- PC2=Val1
|  |  |  |  |  |-- data.parquet
|  |  |  |  |-- PC3=Val2
|  |  |  |  |  |-- data.parquet
|  |  |  |  |-- ...
|  |  |  |-- PC2=Val2
|  |  |  |-- PC3=Val3
|  |  |  |-- ...

parquet:metadata (Hive)

When using a standard Enterprise table layout, the Deephaven engine traverses the table's directory structure to discover partitions. Partition discovery can take significant time depending on your file system's performance characteristics (e.g., if it must traverse a WAN). To accelerate this process, standard layout tables can use a metadata index that supports a single internal partition and column partition. For Hive layout tables, instead of a metadata index, you must use a Locations table, which provides the engine with a list of all partitions and corresponding files. If the .locations_table subdirectory exists in the root of the table (e.g., /db/Systems/MyNamespace/Extended/MyTableName/.location_table), the worker reads the locations table instead of traversing directories.

The Deephaven merge process does not support creating Hive layout tables. Tables must be written into the correct directory structure manually. When using a Locations table, it must be updated along with the underlying data. The HiveLocationsTableKeyFinder.writeLocationsTable method scans the underlying storage, generates a new Locations table, and writes it to the correct location. The scanning process must traverse the entire directory structure of the table. As you change partitions in the data store, you can manually append newly created files to the Locations table instead. The last entry for a file in the Locations table is used during partition discovery.

Warning

If you use a Locations table, the system will not use the underlying data discovery mechanism. It is crucial to keep the Locations table synchronized with the actual table locations. If the Locations table is not in sync with the underlying data, you may encounter null rows (when the Locations table includes more rows than the underlying data) or missing rows (when the Locations table does not represent all the underlying data).

The parquet:metadata layout is similar to the parquet:kv layout, but it includes a _metadata file and an optional _common_metadata file. These files store the paths to each Parquet file that constitutes the table. The directory structure can be the same as the parquet:kv layout (also known as the coreplus:hive layout), or the directory names can simply be the partition values at each level, as shown below.

|--/db/Systems/
|  |--Extended
|  |  |--Companies
|  |  |  |-- _metadata
|  |  |  |-- _common_metadata
|  |  |  |-- Val1
|  |  |  |  |-- Val2
|  |  |  |  |  |-- data.parquet
|  |  |  |  |-- Val3
|  |  |  |  |  |-- data.parquet
|  |  |  |  |-- ...
|  |  |  |-- Val4
|  |  |  |-- PC3=Val3
|  |  |  |-- ...

User tables

The Users subdirectory mirrors the structure of the Systems subdirectory, with the addition of a Tables directory for non-partitioned splayed tables and the absence of WritablePartitions.

/db/Users
|--Namespace                  (one directory for each user namespace)
|  |--Definitions             (table definitions managed by Deephaven)
|  |  |--.TableName1-LOCKFILE (hidden)   (for internal Deephaven usage)
|  |  |--.TableName2-LOCKFILE (hidden)
|  |  `--                     (hidden lockfiles for each table)
|  |  |--TableName1.tbl       (metadata file for TableName1 table)
|  |  |--TableName2.tbl
|  |  `--...                  (metadata for each table)
|  |--Partitions              (matches the Systems partitions structure)
|  |--Tables                  (contains data for non-partitioned tables)
|  |  |--TableName1           (one directory for each table contains the files
|  |  |                       that store the data)
`--...                        (additional namespaces)