# Deephaven Data Lifecycle
Deephaven's ability to scale to large data sets is driven largely by its data lifecycle. Deephaven is designed to separate the write-intensive applications (`db_dis`, importers) from the read/compute-intensive applications (`db_query_server`, `db_query_workers`, etc.).
The diagram below shows a generalized version of the processes responsible for handling data in the Deephaven engine. An external data source can be imported as a stream, by generating binary logs that are fed to the Data Import Service (`db_dis`) process, or as a batch, by manually running one of Deephaven's many importers. Once in the system, data from either path can be queried by end users via the `db_query_server` and its workers.
## Two types of data
Deephaven views data as one of two types: intraday (near real-time) data and historical data. Each type is stored in a different location in the database filesystem.
### Intraday data
- Intraday data is stored in `/db/Intraday/<databaseNamespace>/<tableName>`. When deploying servers, it is advised that each of these locations be on low-latency, high-speed disks, connected either locally or via SAN. All reads and writes of this data go through this mount point. Depending on data size and speed requirements, one or more mount points can be used at the `/db/Intraday`, `/db/Intraday/<databaseNamespace>`, or `/db/Intraday/<databaseNamespace>/<tableName>` levels.
- The `db_dis` process reads and writes data in these directories.
- The `db_ltds` process reads data from these directories.
- If the administrator doesn't create mount points for new namespaces and/or tables before using them, Deephaven will automatically generate the required subdirectories when data is first written to the new tables (illustrated in the sketch below).
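As a minimal illustration of this layout (the `MarketData` namespace and `Quotes` table are hypothetical), the following Python sketch builds the intraday path for a table and pre-creates it, which is effectively what Deephaven does on first write when no mount point has been prepared:

```python
from pathlib import Path

INTRADAY_ROOT = Path("/db/Intraday")

def intraday_table_dir(namespace: str, table: str) -> Path:
    """Return the intraday location for a table: /db/Intraday/<ns>/<table>."""
    return INTRADAY_ROOT / namespace / table

def precreate(namespace: str, table: str) -> Path:
    # Deephaven creates these subdirectories automatically on first write;
    # an administrator would pre-create them only in order to attach a
    # dedicated mount point at the namespace or table level first.
    path = intraday_table_dir(namespace, table)
    path.mkdir(parents=True, exist_ok=True)
    return path

if __name__ == "__main__":
    print(precreate("MarketData", "Quotes"))  # -> /db/Intraday/MarketData/Quotes
```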
### Historical data
- Historical data is stored in `/db/Systems/<databaseNamespace>`.
- Intraday data is merged into historical data by a manual or cron merge process.
- If the required subdirectories don't exist, an attempted merge will fail.
- Each historical database namespace directory contains two directories that must be configured by the administrator:
  - `WritablePartitions` - used for all writes to historical data
  - `Partitions` - used for all reads from historical data

  The (historical) `<databaseNamespace>` is divided into a `Partitions` and `WritablePartitions` pair of directories, and the subdirectories of these two contain the data. Each of these subdirectories is either a mounted shared volume or a link to a mounted shared volume. `Partitions` should contain a strict superset of `WritablePartitions`. It is recommended that each `<databaseNamespace>` be divided across many shared volumes to increase I/O access to the data.

  When historical partitions are first set up for a namespace, the `WritablePartitions` and `Partitions` subdirectories will typically refer to the same locations. For example, if there are six `Partitions` named "0" through "5", then there will be six links named "0" through "5" in `WritablePartitions` pointing to those `Partitions` directories. Over time the devices holding these directories will fill up, and additional space will be required. Additional directories (such as "6" through "11") can be created in `Partitions` pointing to new storage, and the `WritablePartitions` links updated to point to these new directories: delete the old links in `WritablePartitions` and create new ones with the same names as the new `Partitions` directories. In this way the already-written historical locations become read-only, and future merges write to the newly allocated storage (see the sketch after this list).
- All volumes mounted under `WritablePartitions` and `Partitions` should be mounted on all servers. However, since these are divided by read and write functions, you could have a query server with only the read partitions mounted, or an import server with only the `WritablePartitions` mounted. Filesystem permissions can be controlled in a like manner: the `Partitions` volumes only need to allow read-only access, so a server that only performs queries needs only those mounted, without the `WritablePartitions`, if desired.
- A large historical data installation follows this same pattern, with each namespace's data spread across many shared volumes linked into its `Partitions` and `WritablePartitions` directories.
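The rotation described above can be sketched in code. The following Python script is a minimal, hypothetical sketch: it assumes a `MarketData` namespace, six new volumes mounted at `/mnt/histdata6` through `/mnt/histdata11`, and that merges are quiesced while the links change; a real site would adapt the names and mounts to its own layout.

```python
from pathlib import Path

# Hypothetical namespace; adjust to your installation.
NAMESPACE = Path("/db/Systems/MarketData")
PARTITIONS = NAMESPACE / "Partitions"
WRITABLE = NAMESPACE / "WritablePartitions"

def rotate_writable_partitions(new_targets: dict[str, Path]) -> None:
    """Repoint WritablePartitions at newly provisioned storage.

    `new_targets` maps a new partition name (e.g. "6") to the mounted
    volume backing it. Old WritablePartitions links are removed, so the
    already-written partitions become read-only for future merges.
    """
    for name, volume in new_targets.items():
        # Expose the new storage for reads via Partitions.
        (PARTITIONS / name).symlink_to(volume)
    for old_link in WRITABLE.iterdir():
        old_link.unlink()  # retire the old write targets
    for name in new_targets:
        # Future merges write only through these new links.
        (WRITABLE / name).symlink_to(PARTITIONS / name)

if __name__ == "__main__":
    # Six new partitions, "6" through "11", on six new shared volumes.
    rotate_writable_partitions(
        {str(n): Path(f"/mnt/histdata{n}") for n in range(6, 12)}
    )
```

Note how `Partitions` keeps every directory ever created, preserving the superset relationship for reads, while `WritablePartitions` only ever points at the storage that should receive new merges.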
## Data lifecycle summary
- Intraday disk volumes (or subdirectory partitions thereof) should be provided for each database namespace via local disk or SAN and be capable of handling the write and read requirements for the data set.
- Intraday data is merged into historical data by a configured merge process.
- Once merged into historical data, intraday files should be removed from the intraday disk by a manually configured data cleanup process (a sketch follows this list).
- Historical shared (NFS) volumes (or subdirectory partitions thereof) should be provided for each database namespace via a shared filesystem mounted under `/db/Systems/<databaseNamespace>/WritablePartitions` and `/db/Systems/<databaseNamespace>/Partitions` on all servers.
- Historical data for each database namespace has `WritablePartitions` for writing data and `Partitions` for reading data.
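Deephaven does not perform this cleanup itself, so sites typically script it. The following is a minimal sketch, not a supported Deephaven tool; the namespace, table, and the exact partition layout under a table directory are assumptions that depend on the installation, and the merge should be verified before anything is deleted.

```python
import shutil
from pathlib import Path

INTRADAY_ROOT = Path("/db/Intraday")

def remove_merged_intraday(namespace: str, table: str, partition: str) -> None:
    """Remove one intraday partition directory once its merge is verified.

    `partition` is the path below the table directory holding the data
    that has already been merged into /db/Systems storage; the exact
    layout varies by installation, so confirm it before running this.
    """
    target = INTRADAY_ROOT / namespace / table / partition
    if target.is_dir():
        shutil.rmtree(target)

if __name__ == "__main__":
    # Hypothetical example: a date partition that has already been merged.
    remove_merged_intraday("MarketData", "Quotes", "2023-01-31")
```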