Introduction to etcd
Etcd is a distributed, reliable key-value store. Deephaven uses etcd as a consistent source of system data, including configuration, authentication, and Persistent Query state, and for implementing fault tolerance on its key services. Etcd is mature, battle-tested, and used as the cornerstone service to implement highly available distributed systems such as Kubernetes. It is open source and available on GitHub.
How Deephaven uses etcd
Deephaven takes full advantage of etcd's strong consistency and distributed nature to ensure reliable and fault-tolerant operations across Deephaven's services.
Configuration management
- System properties, data routing, schema, and user settings are managed through the configuration service.
- Access Control Lists (ACLs) are written by the ACL writer and read by relevant components.
Persistent Query (PQ) handling
- The PQ controller service manages PQ information and state.
- Fault tolerance is achieved through an elected controller leader and followers.
Authentication
- The authentication service manages user authentication and tracks authentication state.
- Fault tolerance is provided by stateless server replicas, with all state stored in etcd.
- Ephemeral client authentication state is maintained using client cookies for active sessions.
System monitoring
- Dispatcher liveness is monitored to ensure query workers are not abandoned during dispatcher crashes.
- Controller leader liveness is tracked to prevent query worker isolation during election failures.
Dynamic endpoints
- Processes without fixed addresses can register ephemeral addresses.
- This is commonly used by in-worker DISes and table data service consumers; a sketch of this registration pattern follows below.
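As a rough illustration of how such ephemeral registration can be built on etcd, the sketch below uses the etcd v3 Go client to bind an address key to a lease and keep the lease alive; the endpoint, key name, address, and TTL are illustrative assumptions, not Deephaven's actual layout.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to etcd (the endpoint is an assumption for local testing).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a short-lived lease; if the process dies and stops renewing it,
	// every key attached to the lease is removed automatically.
	lease, err := cli.Grant(ctx, 10) // TTL in seconds (illustrative value)
	if err != nil {
		log.Fatal(err)
	}

	// Register an ephemeral address under a hypothetical key prefix.
	_, err = cli.Put(ctx, "/endpoints/in-worker-dis/worker-42", "10.0.0.7:23000",
		clientv3.WithLease(lease.ID))
	if err != nil {
		log.Fatal(err)
	}

	// Keep the lease alive for as long as this process is healthy.
	ka, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for range ka {
		// Each response extends the lease; when this loop exits, the lease
		// eventually expires and the registration disappears with it.
	}
}
```

Consumers that need the address can read or watch the same key prefix, which is what makes this pattern work for processes without fixed addresses.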
Automated failover, consensus, and etcd
Historically, stores of state like SQL databases (e.g., MySQL) have provided replication between a single master and one or more replicas, plus the means to manually "promote" a replica to master as a mechanism to achieve high availability. The operator decides when to promote and which replica to promote, so manual failover involves:
- automatically monitoring the master and detecting its failure.
- alerting a human operator to the failure.
- operator analysis to decide whether failover is needed and to select a healthy replica.
- operator action to trigger failover to that specific replica.
As scale increased, human intervention in this process became problematic: the number of services requiring monitoring grew, the number of alerts grew with them, and human reaction takes time. Google was early to address the need for automated failover backed by a reliable, fully consistent store while implementing its core cluster services, which motivated the creation of Chubby. Later, Apache ZooKeeper, etcd, and Consul provided open-source implementations of the same idea and were embraced by Internet companies outside of Google. Among these options, etcd offers a few benefits:
- A modern design around Raft, a newer consensus algorithm built for understandability and modularity, with broad support from the engineering community (Chubby is based on Paxos; ZooKeeper uses the Paxos-like Zab protocol).
- Client APIs designed around gRPC calls.
- Simplified administration and reduced operational cost.
- Wide industry adoption due to being a dependency for Kubernetes.
Underlying these implementations is the idea of distributed consensus: an algorithm that allows an odd number of servers to elect a leader automatically, without human intervention. The rest of the servers are "followers" and continue to run without acting on the state, but are ready to take over if the leader becomes unavailable. If an elected leader crashes or gets disconnected during a network outage, the followers that are still available run a new election and quickly promote one of themselves as the new leader. The implementation can advertise the new leader to the system's still available clients in an unambiguous way.
Using an odd number of servers simplifies reaching a majority and defines the minimum number of votes needed to win an election: half plus one, e.g., at least three votes in a set of five servers. This minimum number of votes also defines the minimum number of still available servers needed for the overall system to continue operation; if fewer servers than that number are available, the system has failed and stops. While this condition could be satisfied with any odd number, in practice, the number of servers is always 1, 3, or 5.
- One server is only good for testing clients and local development as it doesn't provide any fault tolerance.
- Three is the minimum number required for fault tolerance, allowing for at most one failure.
- Five is the preferred number since it allows for bringing one server down for maintenance while still being able to survive a failure in another server.
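The majority arithmetic is simple enough to write down; a minimal sketch in Go:

```go
// quorum returns the minimum number of votes needed to win an election,
// which is also the minimum number of live servers needed for the system
// to keep operating: a strict majority, i.e., n/2 + 1 (integer division).
func quorum(n int) int {
	return n/2 + 1
}

// quorum(1) == 1: any failure stops the system (no fault tolerance).
// quorum(3) == 2: tolerates one failed server.
// quorum(5) == 3: tolerates two failed servers.
```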
Caution
Using consensus with seven or more servers is possible in theory, but it offers little additional reliability for a comparatively high cost in latency and complexity, and it carries increased risk because such large configurations are rarely exercised in practice. Five servers are advised for production systems.
Note
An important family of failures: network partitions
Consensus solves the potential problem of a system partitioned in two, resulting in "split brain", where database state could diverge: two separate servers, each believing itself to be the master, with disjoint subsets of clients performing modifications on whichever server they are connected to. This risk is a big part of what motivated the need for human intervention in manual failover, where a common mantra for the operator was to ensure the former master was truly dead before promoting a replica.
In a service based on distributed consensus, there can be only one elected leader at a time. If servers and clients get on different sides of a network partition (where servers and clients in group 1 can reach each other but not servers and clients in group 2, and vice versa), then only one of these two groups has a majority of servers. Either the previous leader ends up on the majority group and manages to keep its elected status, or a new election is run in the majority group, and a new leader surfaces there. Every client and server in the majority group can continue to operate, and every client and server in the minority group knows it can't operate until the partition resolves.
A concrete example helps illustrate the issue. Consider racks in a datacenter, each containing some number of machines, say seven, with every machine in a rack connected to a top-of-rack (TOR) switch. Each TOR switch in turn connects to a central switch. Machines in the same rack can talk to each other through the TOR switch alone; to talk to a machine in a different rack, they must go through the central switch. Now imagine deploying a distributed system consisting of several servers and clients onto machines in two of these racks, and then severing the connection between one rack's TOR switch and the central switch. The result is a network partition: machines within each rack can still reach each other, but the two racks can no longer communicate.
Etcd features
Consistent configuration management
- Provides a fully consistent, highly available key-value store.
- Ensures a uniform view of configuration across a distributed system.
- Offers strong read-after-write consistency, solving a complex distributed systems problem (see the sketch below).
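As a minimal sketch of what read-after-write consistency looks like through the etcd v3 Go client (assuming a connected *clientv3.Client like the one created in the earlier sketch; the configuration key and value are hypothetical):

```go
// putAndGetConfig writes a hypothetical configuration key and reads it back.
// etcd v3 reads are linearizable by default, so the Get is guaranteed to
// observe the preceding Put, whether it was issued by this client or another.
func putAndGetConfig(ctx context.Context, cli *clientv3.Client) error {
	if _, err := cli.Put(ctx, "/config/example/max-heap", "4g"); err != nil {
		return err
	}
	resp, err := cli.Get(ctx, "/config/example/max-heap")
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value) // prints the value just written
	}
	return nil
}
```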
Real-time updates
- Supports subscriptions to changes, eliminating the need for periodic polling.
- Allows clients to receive immediate notifications when values change (see the watch sketch below).
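A minimal sketch of such a subscription with the etcd v3 Go client's watch API (again assuming a connected client; the key prefix is illustrative):

```go
// watchConfig blocks on a watch channel and prints every change under a
// hypothetical configuration prefix, instead of polling for updates.
func watchConfig(ctx context.Context, cli *clientv3.Client) {
	for wresp := range cli.Watch(ctx, "/config/example/", clientv3.WithPrefix()) {
		for _, ev := range wresp.Events {
			// ev.Type is PUT or DELETE; ev.Kv holds the changed key and its new value.
			fmt.Printf("%s %s -> %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```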
Simplified fault tolerance
Leader election
- Implemented by attempting to write to a specific key in etcd.
- Only one server succeeds, ensuring only one leader at a time.
High availability
- Key timeout if a server fails to "heartbeat".
- Client subscriptions to monitor key changes.
Deephaven
Deephaven is able to leverage etcd to get fault tolerance properties without needing to implement sophisticated fault tolerance algorithms. For example:
A Deephaven cluster can elect a leader simply by issuing a write to a key on the consistent store, conditioned on the previous value being empty. This operation is guaranteed to succeed for only one of the servers that attempt it, and as such, it provides a mechanism for leader election that is simple to program and reason about.
Combining this feature with both...
- Key timeout if a server fails to "heartbeat"
- Client subscriptions to the key
...provides the building blocks for implementing high availability.
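A rough sketch of how these building blocks combine, using the etcd v3 Go client; the key name, TTL, and identifiers are illustrative assumptions rather than Deephaven's actual implementation:

```go
// tryBecomeLeader attempts to claim leadership by writing a well-known key,
// conditioned on that key not existing, with the value tied to a lease
// that expires if this process stops heartbeating.
func tryBecomeLeader(ctx context.Context, cli *clientv3.Client, myID string) (bool, error) {
	// Heartbeat: a lease that must be renewed, or the key disappears.
	lease, err := cli.Grant(ctx, 10) // TTL in seconds (illustrative)
	if err != nil {
		return false, err
	}

	// Conditional write: succeeds only if the leader key does not currently
	// exist (comparing CreateRevision to 0 means "the key is absent").
	txn, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision("/leader"), "=", 0)).
		Then(clientv3.OpPut("/leader", myID, clientv3.WithLease(lease.ID))).
		Commit()
	if err != nil {
		return false, err
	}
	if !txn.Succeeded {
		return false, nil // another server is already the leader
	}

	// Keep the lease alive while we are healthy (a real implementation
	// should drain the returned channel and react if it closes).
	if _, err := cli.KeepAlive(ctx, lease.ID); err != nil {
		return false, err
	}
	return true, nil
}

// Followers subscribe to the leader key: a DELETE event (lease expiry or an
// explicit resignation) is the signal to call tryBecomeLeader again.
func watchLeader(ctx context.Context, cli *clientv3.Client) {
	for wresp := range cli.Watch(ctx, "/leader") {
		for _, ev := range wresp.Events {
			if ev.Type == clientv3.EventTypeDelete {
				// The leader key vanished; run for election.
			}
		}
	}
}
```

The etcd Go client also ships a higher-level concurrency package with ready-made election helpers; the sketch above simply mirrors the raw building blocks described in this section.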