Resilience planning
High availability, fault tolerance, and disaster recovery
Deephaven is a distributed platform designed for high availability and fault tolerance. Effective resilience planning requires understanding how Deephaven's architecture handles failures, protects data, and ensures continuous operation. This guide provides a comprehensive framework for designing and managing a resilient Deephaven deployment.
The guide is organized into the following articles, each focusing on a critical aspect of resilience:
- Failure Modes: An analysis of potential hardware, software, and environmental failures, and strategies for disaster recovery.
- Services and Dependencies: A detailed look at the roles of core Deephaven services and their high-availability mechanisms.
- Data and Configuration Storage: An overview of how different types of data and configuration are stored and protected against loss.
- Data Ingestion: A guide to building resilient data ingestion pipelines that can withstand component failures.
These topics overlap to some degree. This guide uses the following definitions:
- High availability means operations continue with only brief downtime, though administrators and/or users may need to take some manual action.
- Fault tolerance means continued operations with no user awareness of an issue other than possibly degraded performance.
- Disaster recovery means resuming operations after a significant loss of data center functionality or connectivity.
Beyond system availability, resilience planning also covers recovery time, tolerable data loss, and how much data reloading, if any, is acceptable after a failure. Typically, these are referred to as:
- Recovery Time Objective (RTO): the maximum acceptable downtime after a system failure.
- Recovery Point Objective (RPO): the maximum acceptable data loss from a system failure.
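As a rough illustration of how these two objectives can be checked against an incident, the following Python sketch compares measured downtime and data loss with the stated targets. It is not part of Deephaven; the names `ResilienceTargets`, `Incident`, and `evaluate` are hypothetical, the timestamps are invented for the example, and it assumes RPO is expressed as a time window, which is a common convention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ResilienceTargets:
    """Hypothetical container for the two planning targets."""
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, expressed as a time window


@dataclass(frozen=True)
class Incident:
    """Hypothetical record of a single failure and its recovery."""
    last_recoverable_data: datetime  # newest data that survived the failure
    failure_time: datetime           # when the system went down
    service_restored: datetime       # when operations resumed


def evaluate(incident: Incident, targets: ResilienceTargets) -> dict:
    """Compare an incident's downtime and data loss window with the targets."""
    downtime = incident.service_restored - incident.failure_time
    data_loss_window = incident.failure_time - incident.last_recoverable_data
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= targets.rto,
        "rpo_met": data_loss_window <= targets.rpo,
    }


# Illustrative values only: a 30-minute RTO and a 5-minute RPO.
targets = ResilienceTargets(rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
incident = Incident(
    last_recoverable_data=datetime(2024, 1, 1, 9, 58),
    failure_time=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 10, 45),
)
result = evaluate(incident, targets)
print(result["rto_met"], result["rpo_met"])
```

In this invented incident the system was down for 45 minutes, which exceeds the 30-minute RTO, while only 2 minutes of data were lost, which is within the 5-minute RPO. Tracking the two separately matters because a deployment can meet one objective and miss the other.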