Failure modes

Understanding potential failure modes is essential for building a resilient Deephaven deployment. A failure mode describes how a component, system, or process can deviate from its intended function.

For a complex, distributed platform like Deephaven, identifying and analyzing failure modes helps you:

  • Design robust architectures
  • Implement effective monitoring
  • Develop targeted recovery strategies
  • Ensure business continuity

This document categorizes common failure modes across hardware, software, configuration, and environmental issues. It provides system administrators with foundational knowledge to mitigate risks and maintain operational stability.

Hardware and infrastructure failures

This category covers failures related to the physical or virtual infrastructure hosting the Deephaven environment.

Virtual machine failures

When Deephaven servers run on virtual machines, the likely failure modes differ somewhat from those of physical servers.

Virtual machines can still lose locally attached storage through an underlying hardware failure. In such cases, restoring or replacing the machine from a recent snapshot or backup image should restore functionality fairly quickly.

  • If the server in question was a query server, there should normally be no data loss associated with the event other than worker event logs.
  • If it was an infrastructure node hosting data import services, any intraday data imported and/or ingested since the snapshot was taken will be lost.

Ideally, local storage should also be fault tolerant (e.g., RAID 1 or RAID 5), at least for infrastructure nodes hosting data import services. Hardware failures affecting the underlying host (motherboard, memory, NIC, etc.) will still cause an outage while the VM is reprovisioned to a different host. The recovery time depends on your virtualization platform's failover capabilities and configuration: some environments support automatic VM restart on alternate hosts, while others require manual intervention.

Shared storage failures

For historical data, the shared storage system is responsible for providing fault-tolerant protection of data. At a minimum, the storage underlying NFS or similar shared storage should provide RAID that tolerates the loss of at least a single device. A more robust setup provides site redundancy by synchronizing data between storage arrays in two data centers, giving a DR backup Deephaven installation access to a replicated copy of the same data available in the primary installation.

Software, configuration, and environmental failures

The more common issues that disrupt service availability in a Deephaven installation are software- and configuration-based. Most of these are not problems that high availability is designed to address; in most cases, they affect all instances of redundant services within a single environment.

Resource exhaustion

  • Out of disk space: One of the most common causes of failures; a simple check like the sketch after this list can catch it early.
  • Out of memory or CPU: Can cause processes to become unresponsive or crash.
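
A lightweight disk-usage check can run alongside existing monitoring to warn before volumes fill. Below is a minimal Python sketch; the monitored paths and the 90% threshold are illustrative assumptions, not Deephaven defaults, and should be replaced with the volumes and limits appropriate for your installation.

    # Minimal disk-usage check sketch. Paths and threshold are assumptions;
    # point them at the volumes your installation actually writes to
    # (intraday data, logs, etc.).
    import shutil

    MONITORED_PATHS = ["/db/Intraday", "/var/log/deephaven"]  # assumed paths
    THRESHOLD_PCT = 90  # warn when a volume is more than 90% full

    def check_disk_usage(paths=MONITORED_PATHS, threshold=THRESHOLD_PCT):
        alerts = []
        for path in paths:
            usage = shutil.disk_usage(path)
            pct_used = 100 * usage.used / usage.total
            if pct_used > threshold:
                alerts.append(f"{path}: {pct_used:.1f}% used")
        return alerts

    if __name__ == "__main__":
        for alert in check_disk_usage():
            print("DISK ALERT:", alert)

Scheduling a check like this through cron or your monitoring system turns the most common failure cause into an early warning instead of an outage.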

Configuration drift and errors

  • Expired certificates: Can prevent secure communication between components; see the expiry check sketched after this list.
  • Misconfiguration of a key service: Incorrect settings in routing, authentication, or other core services.
  • Change of DNS or IP addresses: Can prevent name resolution and can break the etcd cluster, which requires stable IP addresses or hostnames for its nodes.
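
Certificate expiration in particular is easy to detect well in advance. Below is a minimal Python sketch that reports how many days remain before a TLS endpoint's certificate expires; the hostname and the 30-day warning window are placeholder assumptions, so substitute the endpoints your Deephaven components actually expose.

    # Certificate-expiry check sketch. Hostnames and the warning window are
    # placeholders; probe the TLS endpoints used in your installation.
    import socket
    import ssl
    from datetime import datetime, timezone

    def days_until_expiry(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # 'notAfter' looks like 'Jun  1 12:00:00 2025 GMT'
        expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
        expires = expires.replace(tzinfo=timezone.utc)
        return (expires - datetime.now(timezone.utc)).days

    if __name__ == "__main__":
        for endpoint in ["deephaven.example.com"]:  # assumed endpoint
            remaining = days_until_expiry(endpoint)
            if remaining < 30:
                print(f"CERT ALERT: {endpoint} expires in {remaining} days")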

Dependency and environmental issues

  • Incompatible upgrades: Upgrading or patching a dependency service or package can make it incompatible with other components.
  • Configuration management interference: Tools like Puppet, Chef, or Ansible can accidentally disable or uninstall required components or accounts.
  • Network blocking: Endpoint protection software, stateful firewalls, or other network security tools can block packets or close connections seen as inactive; see the reachability probe after this list.
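
When network blocking is suspected, a quick reachability probe between nodes can confirm whether a port is open or being filtered. Below is a minimal Python sketch; the host/port pairs are assumptions (2379 is the standard etcd client port), so substitute the actual addresses and ports used between your Deephaven nodes.

    # TCP reachability probe sketch. Endpoints are illustrative assumptions;
    # list the ports your components actually use to talk to each other.
    import socket

    ENDPOINTS = [
        ("infra-node.example.com", 2379),  # assumed etcd client port
        ("query-node.example.com", 22),    # assumed SSH port
    ]

    def is_reachable(host, port, timeout=5):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        for host, port in ENDPOINTS:
            status = "open" if is_reachable(host, port) else "blocked or unreachable"
            print(f"{host}:{port} -> {status}")

Note that a probe like this only verifies that connections can be established; firewalls that silently drop idle connections will still pass it, so long-lived connections may also need TCP keepalives enabled.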

Disaster recovery (DR) strategies

Maintaining additional installations, locally and/or remotely, provides a backup environment that can be used while a production issue is analyzed and resolved.

The importance of multiple environments

For these reasons and others, it is strongly recommended that Deephaven customers have several installations, such as:

  • Production: The live environment serving users.
  • Test/QA/UAT: For testing Deephaven upgrades, patches, and new customer code and queries.
  • Development: For developing new customer code and queries.
  • Prod backup: A DR backup environment in the same data center as the production installation.
  • DR: A DR backup environment in a different data center from the production installation.

DR models

A DR "environment" can range from simple and lightweight to a complete, parallel installation.

  • Cold DR: This model uses snapshots of VMs that can be deployed in a different virtual data center if the primary data center is lost. This assumes that needed data is replicated and that snapshots are taken frequently enough to meet recovery point objectives (RPO); for example, hourly snapshots mean up to an hour of intraday data could be lost.
  • Warm DR: This intermediate solution might involve having continuously running infrastructure servers (including data ingestion) and etcd nodes, but with query servers powered off until the DR environment is needed.
  • Hot DR: At the other extreme is a complete duplicate of the primary production cluster that runs continuously in parallel with the primary installation. This allows users to switch to DR immediately when needed (a low recovery time objective, or RTO), with minimal to no loss of data (a low recovery point objective, or RPO).

Data replication for DR

In general, DR and Prod backup installations should have access to all of the same historical, user, and streaming data that production has, so these environments can provide business continuity if the production installation, or even the entire production data center, is down. Depending on the types of validation and development being done, test and development installations may be able to work with subsets of data and fewer compute/storage resources.