Replace an etcd node

Replacing an etcd node is a delicate operation; an error during the process may render the cluster unusable. If possible, perform this operation during periods of Deephaven system downtime (e.g., at night or after trading hours for a system supporting trading operations) to mitigate the risk.

Etcd node replacement procedure

Note

In this guide, the administrative user is set in an environment variable:

export DH_ADMIN_USER=irisadmin

Set this variable to the administrative user for your installation so that the code blocks below work as written.

1. Preparation

The command below lists the nodes in the cluster; the -w table option formats the output as a table:

etcdctl.sh member list -w table

The output of that command for an example cluster is shown below. Note: the recommended size for a production etcd cluster is five machines, but this simple example uses a three-machine cluster.

Output:

+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| 81b9c31827d6fcbb | started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 |      false |
| 845d01a081fde043 | started | etcd-3 | https://10.128.0.203:2380 | https://10.128.0.203:2379 |      false |
| a43c4d038028f2c8 | started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+

These commands must be executed on a machine with the complete etcd client configuration, including the etcd root client credentials. An infrastructure node is suitable for this purpose, as query nodes typically do not have an etcd root client configuration. Machines running an Authentication Server or a Configuration Server can also be used.

To see detailed status about each available member, run etcdctl.sh endpoint status:

sudo -u $DH_ADMIN_USER etcdctl.sh endpoint status -w table

Output:

+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.0.199:2379 | a43c4d038028f2c8 |  3.5.12 |  5.7 MB |     false |      false |         4 |       1553 |               1553 |        |
| https://10.128.0.200:2379 | 81b9c31827d6fcbb |  3.5.12 |  5.7 MB |      true |      false |         4 |       1554 |               1554 |        |
| https://10.128.0.203:2379 | 845d01a081fde043 |  3.5.12 |  5.7 MB |     false |      false |         4 |       1555 |               1555 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

As shown above, these instructions start from a system with three healthy etcd nodes. To simulate an actual node failure, the third machine in the list (10.128.0.203) was shut down. Running the same command again with one etcd node down now outputs:

sudo -u $DH_ADMIN_USER etcdctl.sh endpoint status -w table

Output:

{"level":"warn","ts":"2025-01-08T00:34:24.52219Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003da8c0/10.128.0.199:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint https://10.128.0.203:2379 (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.0.199:2379 | a43c4d038028f2c8 |  3.5.12 |  5.7 MB |     false |      false |         4 |       1583 |               1583 |        |
| https://10.128.0.200:2379 | 81b9c31827d6fcbb |  3.5.12 |  5.7 MB |      true |      false |         4 |       1584 |               1584 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Note

When a node is down, checking etcd status takes several seconds because it waits for a timeout. The timeout error message appears above the table in the output, and the node is no longer in the table.

Running the etcdctl.sh endpoint health command shows the unhealthy node:

sudo -u $DH_ADMIN_USER etcdctl.sh endpoint health -w table

Output:

{"level":"warn","ts":"2025-01-08T00:49:31.767844Z","logger":"client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003c1c00/10.128.0.203:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
+---------------------------+--------+-------------+---------------------------+
|         ENDPOINT          | HEALTH |    TOOK     |           ERROR           |
+---------------------------+--------+-------------+---------------------------+
| https://10.128.0.200:2379 |   true | 15.121479ms |                           |
| https://10.128.0.199:2379 |   true |  9.691899ms |                           |
| https://10.128.0.203:2379 |  false |             | context deadline exceeded |
+---------------------------+--------+-------------+---------------------------+
Error: unhealthy cluster
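Because etcdctl exits with a non-zero status when the cluster is unhealthy (note the Error: unhealthy cluster line above), the health check can be scripted. The sketch below uses a hypothetical helper name; it is not part of the Deephaven tooling:

```shell
# Hypothetical helper: run any health-check command and report its
# result based on the exit code alone.
report_health() {
    if "$@" >/dev/null 2>&1; then
        echo "healthy"
    else
        echo "unhealthy"
    fi
}

# Usage on a machine with the root etcd client configuration:
#   report_health sudo -u "$DH_ADMIN_USER" etcdctl.sh endpoint health
```

This is convenient in cron jobs or monitoring scripts where parsing the table output is unnecessary.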

2. Remove the failed node

When a node becomes unavailable, etcd keeps it as part of the cluster definition because the node might reappear. For example, a transient network error might cause a node to become unavailable for a few minutes before coming back online. In this guide, we are considering a scenario where the node is not expected to return. Therefore, the first step is to remove the failed node from the current etcd cluster configuration.

To get the member ID of the failed node, run the command below and read the ID from its output:

sudo -u $DH_ADMIN_USER etcdctl.sh member list
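The ID can also be extracted programmatically. The helper below is a sketch with a hypothetical name; it assumes the default comma-separated member list output format (ID, status, name, peer URL, client URL, is-learner):

```shell
# Hypothetical helper: pull the member ID whose peer URL contains a
# given IP from `etcdctl.sh member list` output. The IP is treated as
# an awk regex, so dots match loosely; good enough for a quick lookup.
member_id_for_ip() {
    # $1 = member list output, $2 = IP address
    printf '%s\n' "$1" | awk -F', ' -v ip="$2" '$4 ~ ip { print $1 }'
}

# Example using this guide's sample output:
MEMBERS='81b9c31827d6fcbb, started, etcd-2, https://10.128.0.200:2380, https://10.128.0.200:2379, false
845d01a081fde043, started, etcd-3, https://10.128.0.203:2380, https://10.128.0.203:2379, false'

member_id_for_ip "$MEMBERS" 10.128.0.203   # prints 845d01a081fde043
```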

Then, run the following command, passing in the ID of the failed node:

sudo -u $DH_ADMIN_USER etcdctl.sh member remove 845d01a081fde043

Output:

Member 845d01a081fde043 removed from cluster 966b8ec752907e5b

Running etcdctl.sh member list again shows that the node is gone.

sudo -u $DH_ADMIN_USER etcdctl.sh member list -w table

Output:

+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| 81b9c31827d6fcbb | started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 |      false |
| a43c4d038028f2c8 | started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+

3. Replace the failed node

To replace the failed node, you need a new machine to run the etcd node. You can either provision a new machine for this purpose or use an existing machine with sufficient capacity. This example assumes that a new machine will replace the failed one and that it will be assigned the same IP address. Using the same IP address avoids the need to update the etcd client configuration on all Deephaven machines. If a different IP address is used, you will need to update the configuration later, as described in a later section of this guide.

The following steps assume that a Deephaven machine has been configured and deployed with the same IP address to replace the failed one, and that the Deephaven etcd package has been installed on the machine.

The new machine will have the etcd binary but will lack the configuration and service definition.

  • Copy the configuration from an existing, working etcd node.
    • Copy the contents of the /etc/etcd/dh directory to the new machine. Inside that directory is a subdirectory named with a unique cluster key; in our example, it is cdda65eca. This subdirectory contains one configuration file per etcd node, named config-N.yaml, where N is a number between 1 and the total number of nodes, plus a symbolic link called config.yaml that points to one of these files.
    • Because you copied the directory from another machine, the symbolic link points to that machine's config file. Remove the link and recreate it to point to the config file for the new machine. You can identify the correct file by its contents: the IP address in listen-client-urls (around line 5) should match the IP address of the new machine. Permissions and ownership of symbolic links do not matter on Linux systems; however, if you want the new link to keep the original ownership, create it with the following command:
    sudo -u etcd -g irisadmin ln -s ...
    
    • Edit the file that the config.yaml symbolic link (recreated in the previous step) points to, and change initial-cluster-state: new (approximately line 11) to read initial-cluster-state: existing.
    • Copy the contents of /var/lib/etcd/dh to the new machine. Then remove all the files in the resulting directory, keeping only the directory structure with the correct owners and permissions.
  • Create the systemctl service definition for dh-etcd. Run:
    /usr/illumon/latest/install/etcd/enable_dh_etcd_systemd.sh
    

4. Start the new etcd node as learner

A learner node is not a consensus participant in a cluster; it joins to learn the database state before participating in consensus. Once it receives the full database state from the participant nodes, it can be promoted. To start a node as a learner, you first need to add it to the cluster. On a machine with a root etcd client account, run the command below. The command is followed by an explanation of its different parts.

sudo -u $DH_ADMIN_USER ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member add etcd-3 --peer-urls=https://10.128.0.203:2380 --learner

Output:

Member 47b337abbab0351b added to cluster 966b8ec752907e5b
ETCD_NAME="etcd-3"
ETCD_INITIAL_CLUSTER="etcd-3=https://10.128.0.203:2380,etcd-2=https://10.128.0.200:2380,etcd-1=https://10.128.0.199:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.128.0.203:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
  • The environment variable ETCDCTL_ENDPOINTS lists the endpoints of the surviving nodes. Because the cluster is running with one node removed, the default endpoint configuration for the etcdctl.sh command is no longer correct, so the valid endpoints must be overridden. Defining the variable immediately before the command, as above, sets it only for that single invocation of etcdctl.sh.
  • The member add command takes the name of the node being replaced as its argument. This name should match the name that the earlier etcdctl.sh member list showed for the failed node; it also appears in the first line of the config.yaml file (name: etcd-3 in our example).
  • The --peer-urls argument gives the peer URL for the new learner node being added. The IP address should match the IP of the machine being added.
  • The --learner argument indicates that the node is being added as a learner. Note that the last line of the command output confirms an initial cluster state of existing, which explains the need to change that setting in the configuration file.
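The per-command scoping of the environment variable can be seen with a toy example, unrelated to etcd:

```shell
# A variable assignment placed directly before a command applies only
# to that command's environment, not to the current shell.
FOO=bar sh -c 'echo "$FOO"'                          # prints: bar
echo "FOO is '${FOO:-unset}' in the current shell"   # FOO stays unset here
```

This is why the ETCDCTL_ENDPOINTS override does not leak into later commands in the same session.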

Listing the etcd members again shows one learner, not started:

sudo -u $DH_ADMIN_USER ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member list -w table

Output:

+------------------+-----------+--------+---------------------------+---------------------------+------------+
|        ID        |  STATUS   |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+-----------+--------+---------------------------+---------------------------+------------+
| 47b337abbab0351b | unstarted |        | https://10.128.0.203:2380 |                           |       true |
| 81b9c31827d6fcbb |   started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 |      false |
| a43c4d038028f2c8 |   started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 |      false |
+------------------+-----------+--------+---------------------------+---------------------------+------------+

Now start the etcd service. On the new machine, run:

sudo -u $DH_ADMIN_USER systemctl start dh-etcd

Return to a machine with a root etcd client account and list the members again. You should now see the learner as started.

sudo -u $DH_ADMIN_USER ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member list -w table

Output:

+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| 47b337abbab0351b | started | etcd-3 | https://10.128.0.203:2380 | https://10.128.0.203:2379 |       true |
| 81b9c31827d6fcbb | started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 |      false |
| a43c4d038028f2c8 | started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+

5. Promote the new learner node

Now, wait for the new learner node to catch up, then promote it to a regular voting node.

The learner node needs to re-create its database from the data in the other nodes, which may take some time. Unfortunately, etcd does not provide a mechanism to monitor and confirm when a learner node has fully caught up. As a workaround, you can check the file size of the db file located at /var/lib/etcd/dh/cdda65eca/member/snap/db. This file should grow in size faster as the learner node receives information from the other nodes, compared to once it is already caught up. Do not compare file sizes between nodes, however: compaction can cause long-running nodes to have unused space in the file that is used for new write requests instead of growing the file, so there is no guarantee their file sizes would match.
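A rough way to take those size samples is sketched below. The helper name is hypothetical, and stat -c %s assumes GNU coreutils; the path in the usage comment is this guide's example cluster key:

```shell
# Hypothetical helper: report a file's size in bytes (GNU coreutils stat).
db_size_bytes() {
    stat -c %s "$1"
}

# On the new node, sample the learner's db file every few seconds and
# watch for the growth rate to level off:
#   watch -n 5 'stat -c %s /var/lib/etcd/dh/cdda65eca/member/snap/db'
```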

Once the learner node has caught up, use the promote command to promote the learner to a regular voting member:

sudo -u $DH_ADMIN_USER ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member promote 47b337abbab0351b

Output:

Member 47b337abbab0351b promoted in cluster 966b8ec752907e5b

There is no harm in trying to promote before the learner node is caught up, but in that case, the command will fail with the message:

Error: etcdserver: can only promote a learner member which is in sync with leader
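Because a premature promote fails harmlessly, promotion can simply be retried until it succeeds. A minimal sketch follows; retry_until_ok is a hypothetical helper, and the 30-second interval in the usage comment is an arbitrary choice:

```shell
# Hypothetical helper: rerun a command until it exits 0, pausing the
# given number of seconds between attempts.
retry_until_ok() {
    interval=$1
    shift
    until "$@"; do
        sleep "$interval"
    done
}

# Usage with this guide's example member ID and endpoints:
#   retry_until_ok 30 sudo -u "$DH_ADMIN_USER" \
#       ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 \
#       etcdctl.sh member promote 47b337abbab0351b
```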

Once the learner is promoted, you can return to using the defaults in etcdctl.sh, without defining ETCDCTL_ENDPOINTS=... on every invocation. Listing the endpoint status should now show the full cluster.

sudo -u $DH_ADMIN_USER etcdctl.sh endpoint status -w table

Output:

+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.0.199:2379 | a43c4d038028f2c8 |  3.5.12 |  5.8 MB |      true |      false |         5 |       1789 |               1789 |        |
| https://10.128.0.200:2379 | 81b9c31827d6fcbb |  3.5.12 |  5.8 MB |     false |      false |         5 |       1790 |               1790 |        |
| https://10.128.0.203:2379 | 47b337abbab0351b |  3.5.12 |  5.7 MB |     false |      false |         5 |       1791 |               1791 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

6. Revert the configuration to use initial-cluster-state: new

Edit the configuration in the new node under /etc/etcd/dh/latest/config.yaml and change initial-cluster-state: existing back to initial-cluster-state: new.

This concludes the replacement procedure.

Using a different IP address

If an etcd node needs to be replaced with a machine that has a different IP address, the same procedure applies. However, additional steps must be taken both before and after executing the procedure to ensure that you end up with a fully operational etcd cluster and a functioning Deephaven system.

Before replacing the node

The Deephaven system cannot use the new node without updating the etcd client configuration on each Deephaven machine. Although operations can continue with the remaining nodes, it is recommended to wait for a maintenance window when the Deephaven system can be halted to allow for the necessary configuration changes.

Replacing the node

The procedure described can be applied by substituting the correct new IP address in the shell commands and in the config.yaml file for the new node.

After replacing the node

  • On each etcd node, edit the configuration in /etc/etcd/dh/latest/config.yaml and update the initial-cluster: entry (around line 13) to list the correct set of IP addresses for the whole cluster, including the replacement. Note that this file is only read during etcd startup, so the voting nodes that kept running during the replacement procedure were not affected by this setting being stale while the cluster was modified and the new learner was added.

  • With the Deephaven system down, on each Deephaven machine, find all the files named endpoints under the directory /etc/sysconfig/deephaven/etcd/client.

    This command gets a list of files:

    find /etc/sysconfig/deephaven/etcd/client -type f -name endpoints
    

    Modify all these files to replace the IP address.

    As root, you can use a command similar to the one below to accomplish this. Note the command below is careful to avoid changing ownership and permissions of the files being modified:

    find /etc/sysconfig/deephaven/etcd/client -type f -name endpoints |
        while read -r F; do
            echo "== Modifying '$F' ..."
            sed -i 's/10\.128\.0\.203/10.128.0.209/g' "$F"
            echo "== Done modifying '$F'."
        done
    
    

    The example above replaces the old IP 10.128.0.203 with the new IP 10.128.0.209. Adjust those values to reflect the IPs in your case.
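The initial-cluster: update on each etcd node can be scripted the same way. The helper below is a sketch with a hypothetical name; it escapes the dots in the old IP so the sed substitution matches literally:

```shell
# Hypothetical helper: swap one peer IP for another in an etcd config
# file in place. Dots in the old IP are escaped for sed.
replace_peer_ip() {
    # $1 = config file, $2 = old IP, $3 = new IP
    old=$(printf '%s' "$2" | sed 's/\./\\./g')
    sed -i "s|$old|$3|g" "$1"
}

# Example on each etcd node, with this guide's example IPs:
#   replace_peer_ip /etc/etcd/dh/latest/config.yaml 10.128.0.203 10.128.0.209
```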