Deephaven resilience planning

High availability, fault tolerance, and disaster recovery

These topics overlap to some degree. In the context of this document:

  • High availability - continued operations with short downtime and some manual actions needed by administrators and/or users.
  • Fault tolerance - continued operations with no user awareness of an issue other than possibly degraded performance.
  • Disaster recovery - resuming operations after a significant loss of data center functionality or connectivity.

Beyond system availability, planning must also cover recovery time, tolerable data loss, and how much data reloading, if any, is acceptable after a failure. These are typically referred to as:

  • Recovery Time Objective (RTO) - the maximum acceptable downtime after a system failure.
  • Recovery Point Objective (RPO) - the maximum acceptable data loss from a system failure.
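
For example, an installation might target a one-hour RTO and a fifteen-minute RPO: service must be restored within an hour, and no more than fifteen minutes of data may be lost. The sketch below checks a hypothetical incident against such objectives; all values are illustrative assumptions.

  from datetime import datetime, timedelta

  # Hypothetical objectives: restore service within 1 hour (RTO) and lose no
  # more than 15 minutes of data (RPO).
  rto = timedelta(hours=1)
  rpo = timedelta(minutes=15)

  # Hypothetical incident timeline.
  last_recoverable_write = datetime(2024, 1, 15, 9, 50)   # last data safely persisted/replicated
  failure_time = datetime(2024, 1, 15, 10, 0)
  service_restored = datetime(2024, 1, 15, 10, 45)

  downtime = service_restored - failure_time                 # measured against the RTO
  data_loss_window = failure_time - last_recoverable_write   # measured against the RPO

  print(f"RTO met: {downtime <= rto} (downtime {downtime})")
  print(f"RPO met: {data_loss_window <= rpo} (data loss window {data_loss_window})")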

Deephaven services and dependencies

Third-party services that support Deephaven clusters:

Service Name | Location | FT Capability
etcd | Any servers, even separate nodes. | Automatically fault tolerant with an odd number of nodes (three or more).
MySQL/MariaDB | Typically infra, but could be anywhere. | Some editions support clustering; the default is not fault tolerant.
envoy | Infra (optional) | Possible to build a cluster. Usually runs in Docker, so it is easy to replace.
jupyterhub | Infra (optional) | Possible to build a cluster.

Deephaven services:

Service Name | Location | FT Capability
configuration_server | Initially infra; can be on all DH nodes. | Adding instances allows for fault tolerance.
authentication_server | Typically infra, but others can be added. | Adding instances allows for fault tolerance.
web_api_service | Infra. | Not fault tolerant. Web consoles and OpenAPI depend on it; new Swing consoles may not start while it is down.
tailer | Can be anywhere in an environment. | Individual tailers are not fault tolerant, but multiple tailers can provide redundant processing. If a tailer fails, a new tailer typically starts where the previous one ended.
log_aggregator_service | One per DH server. | Not FT, but the scope of impact is one server.
iris_controller | Typically infra, but others can be added. | One controller process functions as the leader. Additional controllers are configured as hot spares and participate in leader election on startup.
db_tdcp | Usually one per DH server. | Not FT, but the scope of impact is one server. Its use is optional.
db_merge_server | Initially infra; additional instances can be added. | Individual merge servers are not fault tolerant, but multiple merge servers can provide redundant processing.
db_query_server | Each query server. | Individual query servers are not fault tolerant, but they also do not own data resources, so other query server instances can run tasks that had been running on a failed node.
db_ltds | Each merge server. | Redundant LTDSs can be configured for round robin and failover in the data routing setup.
db_dis | Initially infra, but others can be added. | Redundant DISes can be configured for round robin and failover in the data routing setup.
db_acl_writer | Usually wherever there is an authentication_server. | Not FT. Loss of this service would prevent changes to DB permissions and accounts.

In a default three-node Deephaven installation, the inherent fault tolerance is limited to losing one node of the three-node etcd cluster with no noticeable impact. It is, however, unlikely that a single etcd instance will fail without a corresponding failure of the server on which it is running. In a default Deephaven installation, etcd runs on each of the three nodes (the infrastructure node and two query servers).
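
The tolerance comes from etcd's quorum requirement: the cluster keeps operating only while a strict majority of members is available, so three nodes tolerate one failure and five tolerate two. A minimal sketch of that arithmetic:

  def etcd_fault_tolerance(cluster_size: int) -> int:
      """Number of member failures an etcd cluster of this size can survive.

      etcd remains available only while a quorum (strict majority) of
      members is reachable.
      """
      quorum = cluster_size // 2 + 1
      return cluster_size - quorum

  for nodes in (1, 3, 5):
      print(f"{nodes} node(s): quorum {nodes // 2 + 1}, tolerates {etcd_fault_tolerance(nodes)} failure(s)")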

Loss of the infrastructure node will make the entire installation unavailable, as no workers can be launched without the authentication and controller functions, which are hosted on the infrastructure node.

Loss of a single query server is a moderately disruptive high availability event. At a minimum, users would need to move their queries to the remaining query server. If the lost query server was Query1, then system PQs, such as the WebClientData query, would also need to be moved to re-enable functionality such as Web UI logins. There is also the question of whether the remaining query server has sufficient capacity to host at least all critical queries. A good starting point for planning high availability is an n-1 approach when selecting the number of query servers.
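
As a rough illustration of n-1 sizing, the sketch below checks whether the heap needed by critical queries still fits after losing the largest query server; the capacity and demand numbers are hypothetical.

  # Hypothetical heap available to workers on each query server (GB), and the
  # total heap required by critical persistent queries.
  query_server_heap_gb = [240, 240, 240]
  critical_query_heap_gb = 400

  def survives_single_server_loss(capacities, required):
      """True if the critical load still fits after losing the largest server."""
      worst_case_capacity = sum(capacities) - max(capacities)
      return worst_case_capacity >= required

  print(survives_single_server_loss(query_server_heap_gb, critical_query_heap_gb))  # True: 480 GB >= 400 GB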

Workload placement for queries allows PQs to have their query servers auto-selected. This provides transparent failover in the case of loss of a query server, as long as sufficient compute resources remain available in the environment.

If the Persistent Query Controller fails to renew the etcd lease for its leadership (or resigns), another running controller process is elected leader. A controller may fail to renew its lease because it has crashed or become non-responsive; when a controller is shut down cleanly, it resigns leadership. When the new leader is elected, it examines the current state of workers (stored in etcd). The controller attempts to connect to any running Core+ workers and establish bi-directional communication. If the controller cannot connect to a worker in a timely manner, that worker is terminated. Any worker not in the "Running" state is terminated and cannot be restored, because the controller does not have a consistent view of the worker's state. Legacy workers and workers without a "Running" state (such as batch queries) cannot be restored.
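
As a sketch of the lease-based leadership pattern described above (not the controller's actual implementation), the loop below campaigns for leadership, keeps renewing a lease while leading, and resigns on clean shutdown. The LeaseClient interface is hypothetical and stands in for the etcd election/lease APIs.

  import time

  class LeaseClient:
      """Hypothetical stand-in for an etcd lease/election client."""
      def campaign(self, election_name: str, ttl_seconds: int) -> None:
          """Block until this process is elected leader."""
      def renew(self) -> bool:
          """Renew the leadership lease; False if the lease was lost."""
          return False
      def resign(self) -> None:
          """Give up leadership so another controller is elected immediately."""

  def run_controller(lease: LeaseClient, ttl_seconds: int = 10) -> None:
      # Hot-spare controllers block here until they win the election.
      lease.campaign("iris_controller_leader", ttl_seconds)
      try:
          while True:
              if not lease.renew():
                  # Crash, partition, or non-responsiveness: the lease lapses
                  # and another controller wins the next election.
                  break
              time.sleep(ttl_seconds / 3)
      finally:
          # A clean shutdown resigns leadership rather than waiting for the
          # lease to expire.
          lease.resign()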

Data and configuration storage and sharing

The general recommendation for production Deephaven installations is at least three nodes. In this configuration, the sharing of configuration and data across nodes is:

  • Configuration (properties, routing settings, persistent queries, schemata) - stored in etcd, which provides fault tolerance automatically when it has an odd number of nodes (three or more).
  • Intraday data (including intraday user tables) - stored in the /db/Intraday and /db/IntradayUser directory structures for the main DIS, and under specific directories for in-worker DIS / ingester processes.
  • User tables - stored under /db/User - should be shared across servers using NFS or similar so users can access their data from any server. The array providing or backing NFS should also provide data protection and fault tolerance.
  • Historical tables - stored under /db/Systems - should be shared across servers using NFS or similar so users can access their data from any server. The array providing or backing NFS should also provide data protection and fault tolerance.
  • Web dashboards and other UI content - stored in the DbInternal.WorkspaceData table. Merges should be run for this table to consolidate dashboard changes from intraday to historical. Loss of intraday data will also result in loss of unmerged Web UI content.
  • Some aspects of configuration, such as ClientUpdateService settings and custom jars - stored under the /etc/sysconfig/deephaven path. It may be desirable to share some of these paths, such as /etc/sysconfig/deephaven/illumon.d.latest/java_lib/, so that things like custom jars are available from all servers.

Failure modes to plan for

In cases where Deephaven servers are running on virtual machines, the likely failure modes are somewhat different from those of physical machines.

Virtual machines can still lose locally attached storage through an underlying hardware failure. In such cases, restoring or replacing the machine from a recent snapshot image or backup should allow fairly quick restoration of functionality. If the server in question was a query server, there should normally be no data loss associated with the event. If it was a merge server (the infrastructure node in a default deployment), there will be loss of intraday data imported and/or ingested since the snapshot was taken. Ideally, local storage used for at least data import server nodes should be fault tolerant as well - e.g., RAID 1 or RAID 5. Other than a local storage failure, though, hardware failures should not be a concern for virtual machines - loss of the motherboard, bad memory, a failed NIC, etc., would all be handled by reprovisioning the VM to a different host in the data center.

For historical data, the shared storage system should be responsible for providing fault-tolerant protection of data. At a minimum, the storage underlying NFS, or similar shared storage, should provide some sort of single-device-loss tolerant RAID. At the more robust end of the spectrum would be a geo-cluster that synchronizes data between storage arrays in two data centers, allowing for a DR backup Deephaven installation that has access to a replicated copy of the same data available in the primary installation.

The more common types of issues that disrupt availability of one or more services in a Deephaven installation are software and configuration based:

  • Out of disk space (probably the most common cause of failures).
  • Expired certificates.
  • Misconfiguration of a key service.
  • Change of DNS causing inability to resolve names.
  • Change of IP addresses breaking the etcd cluster (etcd requires permanent IP addresses for its nodes).
  • Upgrade or patching of a dependency service or package making it incompatible with other components.
  • Configuration management software (Puppet, Chef, Ansible, etc.) disabling or uninstalling required components or accounts.
  • Endpoint protection software blocking packets or connections.
  • Stateful firewalls unnecessarily closing connections they see as inactive.

Most of the above are not problems that high availability is meant to combat; in most cases such issues will impact all instances of redundant services in a single environment, so high availability cannot address them. Additional installations, local and/or remote, can however provide a backup environment to use while the issue is analyzed and resolved. For these reasons, among others, it is strongly recommended that Deephaven customers have several installations, such as:

  • Production
  • Test/QA/UAT - for testing upgrades of Deephaven product, patches, and new customer code and queries.
  • Development - for developing new customer code and queries.
  • Prod backup - a backup environment in the same data center as the prod installation.
  • DR - a backup environment in a different data center from the prod installation.

In general, DR and Prod backup installations should have access to all of the same historical, user, and streaming data that production has, so these environments can provide business continuity in case the production installation, or even the entire production data center, is down. Depending on the types of validation and development being done, test and development installations may be able to work with subsets of data and fewer compute/storage resources.

A DR "environment" may be as simple and lightweight as snapshots of VMs that can be deployed in a different virtual data center in case of loss of the primary data center; assuming that needed data is replicated and snapshots are taken frequently enough to meet the recovery point objectives. At the other extreme would be a complete duplicate of the primary production cluster that runs continuously in parallel with the primary installation, so that users could switch to DR immediately when needed, with no loss of data. Intermediate solutions might involve having a continuously running infrastructure server and etcd nodes, but with query servers powered off until the DR environment is needed.

Data ingestion

Streaming data ingestion typically follows one of two models:

  1. A custom process subscribes to a data stream.
  • The custom process uses a Deephaven logger class to append events to binary log files.
  • A Deephaven tailer streams blocks of binary data from the binary log files to a data import server (DIS).
  • The DIS writes events to disk and publishes them to ticking table clients.
  2. An upstream process publishes events to a Kafka or Solace topic (a minimal sketch of this model follows the list).
  • A Kafka or Solace ingester running in a merge server query subscribes to the topic.
  • The ingester receives events from the topic, writes them to disk, and publishes them to ticking table clients.
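
As a rough sketch of the second model's shape only, and not of the Deephaven ingester API, the snippet below consumes a topic with the kafka-python client and appends each event to a local file standing in for intraday storage. The topic name, broker address, and output path are assumptions.

  from kafka import KafkaConsumer  # kafka-python client

  consumer = KafkaConsumer(
      "orders",                            # hypothetical topic
      bootstrap_servers=["broker1:9092"],  # hypothetical broker
      group_id="dh-ingester-sketch",
      enable_auto_commit=True,
  )

  with open("/tmp/orders.intraday.bin", "ab") as out:
      for message in consumer:
          # A real ingester would write Deephaven column data and publish
          # updates to ticking table clients; here the raw bytes are simply
          # persisted.
          out.write(message.value)
          out.write(b"\n")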

For either model, the intraday data received may be merged to historical at some point, such as end of day (EOD).

For the first model, redundancy can be set up at several levels. Usually the logger process and tailer are local to each other and are co-located on a system that has low-latency access to the source data stream. One option is to configure multiple data import server processes - at a minimum on multiple nodes of a single installation, but possibly also in other installations such as prod backup or DR - and have the single tailer send data to all of these destinations. A more resilient option is to have a separate logger and tailer system for each DIS destination, so that the entire data stream is redundant back to its source (which probably has its own fault tolerance).

For the second model, Kafka and Solace both have their own fault tolerance capabilities which can be configured to ensure uninterrupted processing of the topic. On the Deephaven side, multiple ingester PQs can be configured on multiple merge servers in one or several installations.

For the case of multiple DIS services or ingester PQs feeding data into a single Deephaven installation, table data services in the routing YAML can be configured to treat several DISes as a failover group, allowing automatic round robin handling in case of loss of one service. The configuration below load balances between db_dis and db_dis2 and, if one of them (for example db_dis) becomes unavailable, fails all activity over to the other (db_dis2). Each DIS instance needs its own data storage for this configuration, and all DIS instances in a group must handle the same data, meaning they must have the same claims or filters.

In this example, db_dis and db_dis2 join the group group1. This group name can be used in the tableDataServices section to refer to both DISes as a failover group.

  dataImportServers:
    db_dis:
      ...
      failoverGroup: group1
    db_dis2:
      ...
      failoverGroup: group1

  tableDataServices:
    tds_using_failover_group:
      sources:
        - name: group1

A failover group can also be created explicitly in the tableDataServices section, by referring to the DIS instances in an array:

  tableDataServices:
    tds_using_failover_group:
      sources:
        # any source that is an array defines a failover group;
        - name: [db_dis, db_dis2]

For the eventual merge of data to historical, suppose DIS1 and DIS2 both provide a particular ticking table. On a normal day, the data from DIS1 would be merged and the data from DIS2 discarded. On a day when DIS1 had failed, the partial data from DIS1 would be discarded and the complete data from DIS2 merged instead. In the unlikely event that both DIS1 and DIS2 were offline for part of the day, manual work might be needed to consolidate and deduplicate their combined partial tables before running a merge from the result. Even that is not necessarily required, because tailer binary logs and Kafka and Solace topics persist data even when a DIS is unable to receive it; a DIS outage therefore normally results in the DIS catching up once it is brought back online, leaving no missing data by the end of the day.
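
A minimal sketch of that selection logic, with completeness flags as hypothetical placeholders for whatever validation (row counts, checksums, end-of-stream markers) an operator would actually use:

  def choose_merge_source(dis1_complete: bool, dis2_complete: bool) -> str:
      """Decide which redundant intraday copy to merge to historical."""
      if dis1_complete:
          return "merge DIS1, discard DIS2"
      if dis2_complete:
          return "discard partial DIS1 data, merge DIS2"
      # Both copies are partial: consolidate and deduplicate before merging.
      return "consolidate DIS1 + DIS2, deduplicate, then merge the result"

  print(choose_merge_source(True, True))    # normal day
  print(choose_merge_source(False, True))   # DIS1 failed during the day
  print(choose_merge_source(False, False))  # both partial (rare)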