Replace an etcd node

Replacing an etcd node is a delicate operation; an error during the process may render the cluster unusable. If possible, perform this operation during periods of Deephaven system downtime (e.g., at night or after trading hours for a system supporting trading operations) to mitigate the risk.

Etcd node replacement procedure

Note

The commands shown in this guide must be run as the Deephaven administrative user. Use sudo or similar. The etcdctl.sh command is inside the /usr/illumon/latest/bin directory of the Deephaven installation; ensure that directory is in your path or modify the commands so that they can be executed from the correct location.

1. Preparation

The etcdctl.sh member list command lists the nodes in the cluster. The output of that command for an example cluster is shown below. Note: the recommended number of machines for a production etcd cluster is 5, but this simple example uses a three-machine etcd cluster.

etcdctl.sh member list -w table

Output:

+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| 81b9c31827d6fcbb | started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 |      false |
| 845d01a081fde043 | started | etcd-3 | https://10.128.0.203:2380 | https://10.128.0.203:2379 |      false |
| a43c4d038028f2c8 | started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+

This command must be executed on a machine that has the complete etcd client configuration, including the etcd root client credentials. An infrastructure node is suitable for this purpose, as query nodes typically do not have an etcd root client configuration. Machines running an Authentication Server or a Configuration Server can also be used.

To see detailed status about each available member, run etcdctl.sh endpoint status:

etcdctl.sh endpoint status -w table

Output:

+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.0.199:2379 | a43c4d038028f2c8 |  3.5.12 |  5.7 MB |     false |      false |         4 |       1553 |               1553 |        |
| https://10.128.0.200:2379 | 81b9c31827d6fcbb |  3.5.12 |  5.7 MB |      true |      false |         4 |       1554 |               1554 |        |
| https://10.128.0.203:2379 | 845d01a081fde043 |  3.5.12 |  5.7 MB |     false |      false |         4 |       1555 |               1555 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

As shown above, these instructions begin with a cluster of three healthy etcd nodes. To simulate a node failure, the third machine in the list (10.128.0.203) was shut down. Running the same command again with one etcd node down now outputs:

etcdctl.sh endpoint status -w table

Output:

{"level":"warn","ts":"2025-01-08T00:34:24.52219Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003da8c0/10.128.0.199:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint https://10.128.0.203:2379 (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.0.199:2379 | a43c4d038028f2c8 |  3.5.12 |  5.7 MB |     false |      false |         4 |       1583 |               1583 |        |
| https://10.128.0.200:2379 | 81b9c31827d6fcbb |  3.5.12 |  5.7 MB |      true |      false |         4 |       1584 |               1584 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Note

When a node is down, checking etcd status takes several seconds because it waits for a timeout. The timeout error message appears above the table in the output, and the node is no longer in the table.

Running the etcdctl.sh endpoint health command shows the unhealthy node:

etcdctl.sh endpoint health -w table

Output:

{"level":"warn","ts":"2025-01-08T00:49:31.767844Z","logger":"client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003c1c00/10.128.0.203:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
+---------------------------+--------+-------------+---------------------------+
|         ENDPOINT          | HEALTH |    TOOK     |           ERROR           |
+---------------------------+--------+-------------+---------------------------+
| https://10.128.0.200:2379 |   true | 15.121479ms |                           |
| https://10.128.0.199:2379 |   true |  9.691899ms |                           |
| https://10.128.0.203:2379 |  false |             | context deadline exceeded |
+---------------------------+--------+-------------+---------------------------+
Error: unhealthy cluster

2. Remove the failed node

When a node becomes unavailable, etcd keeps it as part of the cluster definition because the node might reappear. For example, a transient network error might cause a node to become unavailable for a few minutes before coming back online. In this guide, we are considering a scenario where the node is not expected to return. Therefore, the first step is to remove the failed node from the current etcd cluster configuration.

To get the member ID of the failed node, run the following and read the ID from its output:

etcdctl.sh member list
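If you prefer to pull the ID out programmatically, the sketch below runs awk over the default comma-separated member list format. This is an illustration, not part of the Deephaven tooling; the sample output stands in for a live invocation:

```shell
# Extract the member ID of the failed node from `etcdctl.sh member list`
# output in its default comma-separated form. The sample output below stands
# in for a live invocation.
member_list_output='81b9c31827d6fcbb, started, etcd-2, https://10.128.0.200:2380, https://10.128.0.200:2379, false
845d01a081fde043, started, etcd-3, https://10.128.0.203:2380, https://10.128.0.203:2379, false
a43c4d038028f2c8, started, etcd-1, https://10.128.0.199:2380, https://10.128.0.199:2379, false'

failed_node='etcd-3'

# Select the line whose NAME field matches the failed node and print field 1 (the ID).
failed_id=$(printf '%s\n' "$member_list_output" | awk -F', ' -v n="$failed_node" '$3 == n { print $1 }')
echo "$failed_id"
```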

Then, run the following command, passing in the ID of the failed node:

etcdctl.sh member remove 845d01a081fde043

Output:

Member 845d01a081fde043 removed from cluster 966b8ec752907e5b

Running etcdctl.sh member list again shows that the node is gone.

etcdctl.sh member list -w table

Output:

+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| 81b9c31827d6fcbb | started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 |      false |
| a43c4d038028f2c8 | started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+

3. Replace the failed node

To replace the failed node, you need a new machine to run the etcd node. You can either provision a new machine for this purpose or use an existing machine with sufficient capacity. This example assumes that a new machine will replace the failed one and be assigned the same IP address. Using the same IP address avoids the need to update the etcd client configuration on all Deephaven machines. If a different IP address is used, you must update the configuration later, as described in a later section of this guide.

The following steps assume that a Deephaven machine has been configured and deployed with the same IP address to replace the failed one, and the etcd binaries have been installed on the machine.

The new machine will lack etcd service configuration and service definition at this point. To get them:

  • Copy the configuration from an existing, working etcd node.
    • Copy the contents of the /etc/etcd/dh directory to the new machine. Inside that directory you will find a subdirectory named with a unique cluster key; in our example, it is cdda65eca. This subdirectory contains one configuration file per etcd node, named config-N.yaml, where N is a number between 1 and the total number of nodes. There is also a symbolic link called config.yaml that points to one of these files.
      Since you copied the directory from another machine, the symbolic link points to the config file for that machine. Remove the symbolic link and recreate it to point to the config file corresponding to the new machine. You can identify the correct file by checking its contents: the IP address in the listen-client-urls entry (around line 5) should match the IP address of the new machine.
      Permissions and ownership of symbolic links do not matter on Linux systems. However, if you want to maintain the original ownership on the new link, create it with the following command:
    sudo -u etcd -g irisadmin ln -s ...
    
    • Edit the config.yaml file (the symbolic link you recreated in the previous step points to the file) and change initial-cluster-state: new (approximately line 11) to read initial-cluster-state: existing.
    • Copy the contents of /var/lib/etcd/dh to the new machine. Then remove all the files in the resulting directory, keeping only the directory structure with the correct owners and permissions.
  • Create the systemctl service definition for dh-etcd. Run:
    /usr/illumon/latest/install/etcd/enable_dh_etcd_systemd.sh
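The copy-and-fix steps above (repoint the config.yaml symlink, flip initial-cluster-state, clear the data directory) can be sketched as shell commands. This is an illustration against a mock directory tree; the cluster key cdda65eca and node number 3 come from the running example, so adjust both for your installation:

```shell
# Mock the copied /etc/etcd/dh and /var/lib/etcd/dh trees so the steps can be
# demonstrated end to end; on a real system, operate on the actual directories.
ETCD_DH=$(mktemp -d)                      # stands in for /etc/etcd/dh
mkdir -p "$ETCD_DH/cdda65eca"
for n in 1 2 3; do
  printf 'name: etcd-%s\ninitial-cluster-state: new\n' "$n" \
    > "$ETCD_DH/cdda65eca/config-$n.yaml"
done
ln -s config-1.yaml "$ETCD_DH/cdda65eca/config.yaml"   # as copied from another node

# 1. Repoint config.yaml at this machine's config file (etcd-3 in the example).
ln -sfn config-3.yaml "$ETCD_DH/cdda65eca/config.yaml"

# 2. Flip initial-cluster-state from new to existing. Edit the link target
#    directly: plain `sed -i` on a symlink would replace the link with a file.
sed -i 's/^initial-cluster-state: new$/initial-cluster-state: existing/' \
  "$ETCD_DH/cdda65eca/config-3.yaml"

# 3. Clear the copied data directory but keep the directory structure.
DATA_DH=$(mktemp -d)                      # stands in for /var/lib/etcd/dh
mkdir -p "$DATA_DH/cdda65eca/member/snap"
touch "$DATA_DH/cdda65eca/member/snap/db"
find "$DATA_DH" -type f -delete
```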
    

4. Start the new etcd node as learner

A learner node is not a consensus participant in a cluster; it joins to learn the database state before participating in consensus. Once it receives the full database state from the participant nodes, it can be promoted. To start a node as a learner, you first need to add it to the cluster. On a machine with a root etcd client account, run the command below. The command is followed by an explanation of its different parts.

ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member add etcd-3 --peer-urls=https://10.128.0.203:2380 --learner

Output:

Member 47b337abbab0351b added to cluster 966b8ec752907e5b
ETCD_NAME="etcd-3"
ETCD_INITIAL_CLUSTER="etcd-3=https://10.128.0.203:2380,etcd-2=https://10.128.0.200:2380,etcd-1=https://10.128.0.199:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.128.0.203:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
  • Define an environment variable ETCDCTL_ENDPOINTS listing the endpoints of the surviving nodes. Because this example runs against a cluster with one node removed, the default endpoint configuration for etcdctl.sh is no longer valid, so the correct endpoints must be supplied explicitly. Placing the variable assignment immediately before the command on the same line sets it only for that single invocation (etcdctl.sh in our case).
  • Run the command member add and pass the name of the node to be replaced as the argument. This name should match the name the earlier etcdctl.sh member list gave for the failed node. The name is also in the first line of the config.yaml file (name: etcd-3 in our example).
  • The --peer-urls argument indicates the URL for the new learner node being added.
  • The IP address should match the IP of the machine being added.
  • The --learner argument indicates that the node is being added as a learner. Note that the last line of the command output confirms an initial cluster state of existing, which explains the need to change that setting in the configuration file.

Listing the etcd members again shows one learner, not started:

ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member list -w table

Output:

+------------------+-----------+--------+---------------------------+---------------------------+------------+
|        ID        |  STATUS   |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+-----------+--------+---------------------------+---------------------------+------------+
| 47b337abbab0351b | unstarted |        | https://10.128.0.203:2380 |                           |       true |
| 81b9c31827d6fcbb |   started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 |      false |
| a43c4d038028f2c8 |   started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 |      false |
+------------------+-----------+--------+---------------------------+---------------------------+------------+

Now start the etcd service. On the new machine, run:

systemctl start dh-etcd

Return to a machine with a root etcd client account and list the members again. You should now see the learner as started.

ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member list -w table

Output:

+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| 47b337abbab0351b | started | etcd-3 | https://10.128.0.203:2380 | https://10.128.0.203:2379 |       true |
| 81b9c31827d6fcbb | started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 |      false |
| a43c4d038028f2c8 | started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+

5. Promote the new learner node

Now, wait for the new learner node to catch up, then promote it to a regular voting node.

The learner node needs to re-create its database from the data held by the other nodes, which may take some time. Unfortunately, etcd does not provide a mechanism to monitor and confirm when a learner node has fully caught up. As a workaround, you can watch the size of the db file located at /var/lib/etcd/dh/cdda65eca/member/snap/db: it grows quickly while the learner is receiving state from the other nodes and much more slowly once it has caught up. Do not compare file sizes between nodes, however: compaction can leave long-running nodes with unused space in the file that is reused for new writes instead of growing the file, so there is no guarantee their file sizes will match.
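One way to apply that workaround is to sample the file size at intervals and watch the growth rate. A minimal sketch, assuming GNU stat; the temporary file and the simulated append stand in for the real db file and incoming replication traffic:

```shell
# Sample the learner's db file size before and after an interval to gauge
# whether replication is still streaming in. A temporary file stands in for
# the real path (/var/lib/etcd/dh/cdda65eca/member/snap/db).
DB_FILE=$(mktemp)

size_of() { stat -c %s "$1"; }   # GNU stat; on BSD/macOS use: stat -f %z

before=$(size_of "$DB_FILE")
# On a real node you would `sleep` between samples; the append below simulates
# replication traffic arriving while the learner catches up.
head -c 4096 /dev/zero >> "$DB_FILE"
after=$(size_of "$DB_FILE")

growth=$((after - before))
echo "db grew by ${growth} bytes"
```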

Once the learner node has caught up, use the promote command to promote the learner to a regular voting member:

ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member promote 47b337abbab0351b

Output:

Member 47b337abbab0351b promoted in cluster 966b8ec752907e5b

There is no harm in trying to promote before the learner node is caught up, but in that case, the command will fail with the message:

Error: etcdserver: can only promote a learner member which is in sync with leader

Once the learner is promoted, you can return to using the defaults in etcdctl.sh without the need to define ETCDCTL_ENDPOINTS=... on every invocation. Listing the endpoints status should now show the full cluster.

etcdctl.sh endpoint status -w table

Output:

+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.0.199:2379 | a43c4d038028f2c8 |  3.5.12 |  5.8 MB |      true |      false |         5 |       1789 |               1789 |        |
| https://10.128.0.200:2379 | 81b9c31827d6fcbb |  3.5.12 |  5.8 MB |     false |      false |         5 |       1790 |               1790 |        |
| https://10.128.0.203:2379 | 47b337abbab0351b |  3.5.12 |  5.7 MB |     false |      false |         5 |       1791 |               1791 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

6. Revert the configuration to use initial-cluster-state: new

Edit the configuration in the new node under /etc/etcd/dh/latest/config.yaml and change initial-cluster-state: existing back to initial-cluster-state: new.

This concludes the replacement procedure.

Use a different IP address

The same procedure applies if an etcd node needs to be replaced with a machine with a different IP address. However, additional steps must be taken both before and after executing the procedure to ensure that you end up with a fully operational etcd cluster and a functioning Deephaven system.

Before replacing the node

The Deephaven system cannot use the new node without updating the etcd client configuration on each Deephaven machine. Although operations can continue with the remaining nodes, it is recommended to wait for a maintenance window when the Deephaven system can be halted to allow for the necessary configuration changes.

Replace the node

The procedure described can be applied by substituting the correct new IP address in the shell commands and in the config.yaml file for the new node.

After replacing the node

  • On each etcd node, edit the configuration in /etc/etcd/dh/latest/config.yaml and update the entry for initial-cluster: (around line 13) to list the correct set of IP addresses for the whole cluster, considering the one that was replaced. Note this file is only used during etcd startup, so the voting nodes that were running during the replacement procedure were not affected by this setting being wrong at the time the cluster was modified and the new learner was added.

  • With the Deephaven system down, on each Deephaven machine, find all the files named endpoints under the directory /etc/sysconfig/deephaven/etcd/client.

    This command gets a list of files:

    find /etc/sysconfig/deephaven/etcd/client -type f -name endpoints
    

    Modify all these files to replace the IP address.

    Warning

    What follows are risky operations; ensure you create a backup of any files you are about to modify before actually changing them.

    As root, you can use a command similar to the one below to accomplish this. Note the command below is careful to avoid changing ownership and permissions of the files being modified:

    find /etc/sysconfig/deephaven/etcd/client -type f -name endpoints |
        while read -r F; do
          echo "== Modifying '$F' ..."
          sed -i 's/10\.128\.0\.203/10.128.0.209/g' "$F"
          echo "== Done modifying '$F'."
        done
    

    The example above replaces the old IP 10.128.0.203 with the new IP 10.128.0.209. Adjust those values to reflect the IPs in your case.
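After rewriting the files, it is worth confirming that no endpoints file still references the old IP. A sketch of such a check, demonstrated against a mock client directory; on a real system, point CLIENT_DIR at /etc/sysconfig/deephaven/etcd/client:

```shell
# Mock client config tree standing in for /etc/sysconfig/deephaven/etcd/client.
CLIENT_DIR=$(mktemp -d)
mkdir -p "$CLIENT_DIR/root"
echo 'https://10.128.0.199:2379,https://10.128.0.200:2379,https://10.128.0.203:2379' \
  > "$CLIENT_DIR/root/endpoints"

OLD_IP_RE='10\.128\.0\.203'   # dots escaped for sed/grep
NEW_IP='10.128.0.209'

# Rewrite every endpoints file, then verify the old IP is gone everywhere.
find "$CLIENT_DIR" -type f -name endpoints \
  -exec sed -i "s/${OLD_IP_RE}/${NEW_IP}/g" {} +

if grep -rq "$OLD_IP_RE" "$CLIENT_DIR"; then
  echo "WARNING: stale endpoints remain"
  stale=1
else
  echo "all endpoints files updated"
  stale=0
fi
```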

  • Generate new certificates and associated keys for the node, and distribute them to other nodes. The following instructions assume the naming and certificate generation settings for a default Deephaven installation.

    1. Set the environment variables:

      • ETCD_CONFIG_DIR: The directory where the etcd configuration lives.
      • NEW_IP_ADDR: The IP address of the new etcd replacement node.
      • CERT_DAYS: Number of days for the certificate validity period.
      • ETCD_SERVER_NUMBER: The number of the etcd server being replaced. In a 5-node cluster, this would be a number between 1 and 5, and should match the number used by the server being replaced. Most installations use a hostname that includes the server number as part of the machine name; otherwise, you can find the number from the position of the server's IP address in the /etc/sysconfig/deephaven/etcd/client/root/endpoints file.
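      The position lookup for ETCD_SERVER_NUMBER can be scripted. This sketch assumes the endpoints file holds a single comma-separated list of client URLs in server order; a sample file stands in for /etc/sysconfig/deephaven/etcd/client/root/endpoints:

```shell
# Derive the server number from the position of the server's IP in the
# endpoints file. A sample file stands in for
# /etc/sysconfig/deephaven/etcd/client/root/endpoints.
ENDPOINTS_FILE=$(mktemp)
echo 'https://10.128.0.199:2379,https://10.128.0.200:2379,https://10.128.0.203:2379' \
  > "$ENDPOINTS_FILE"

SERVER_IP='10.128.0.203'   # IP of the server being replaced

# Split the comma-separated list into lines; the matching line number is the position.
ETCD_SERVER_NUMBER=$(tr ',' '\n' < "$ENDPOINTS_FILE" | grep -n "$SERVER_IP" | cut -d: -f1)
echo "ETCD_SERVER_NUMBER=$ETCD_SERVER_NUMBER"
```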

      Example values:

      ETCD_CONFIG_DIR=/etc/etcd/dh/cdda65eca
      NEW_IP_ADDR=10.128.0.209
      CERT_DAYS=3650
      ETCD_SERVER_NUMBER=2
      

      Then, as root, run:

      mkdir -p ${ETCD_CONFIG_DIR}/ssl/peer
      cd ${ETCD_CONFIG_DIR}/ssl/peer
      openssl genrsa 2048 > etcd-${ETCD_SERVER_NUMBER}.private.key
      SAN="IP:${NEW_IP_ADDR},DNS:peer.etcd.deephaven.local,DNS:etcd-${ETCD_SERVER_NUMBER}.peer.etcd.deephaven.local" \
        RSA_BITS="2048" \
        SERVER_CN="peer.etcd.deephaven.local" \
        ORGANIZATION="Deephaven Data Labs LLC" \
        ORGANIZATION_UNIT="Operations" \
        LOCATION="Colorado Springs" \
        STATE="Colorado" \
        COUNTRY="US" \
            openssl req \
                -config /usr/illumon/latest/install/etcd/peer.cnf \
                -x509 \
                -days "${CERT_DAYS}" \
                -key etcd-${ETCD_SERVER_NUMBER}.private.key \
                -out etcd-${ETCD_SERVER_NUMBER}.public.crt
      
      
      mkdir -p ${ETCD_CONFIG_DIR}/ssl/server
      cd ${ETCD_CONFIG_DIR}/ssl/server
      openssl genrsa 2048 > etcd-${ETCD_SERVER_NUMBER}.private.key
      SAN="IP:${NEW_IP_ADDR},DNS:server.etcd.deephaven.local,DNS:etcd-${ETCD_SERVER_NUMBER}.server.etcd.deephaven.local" \
        RSA_BITS="2048" \
        SERVER_CN="server.etcd.deephaven.local" \
        ORGANIZATION="Deephaven Data Labs LLC" \
        ORGANIZATION_UNIT="Operations" \
        LOCATION="Colorado Springs" \
        STATE="Colorado" \
        COUNTRY="US" \
            openssl req \
                -config /usr/illumon/latest/install/etcd/server.cnf \
                -x509 \
                -days "${CERT_DAYS}" \
                -key etcd-${ETCD_SERVER_NUMBER}.private.key \
                -out etcd-${ETCD_SERVER_NUMBER}.public.crt
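      After generating the certificates, it is prudent to confirm that the subjectAltName actually includes the new node's IP. The sketch below, which assumes OpenSSL 1.1.1 or newer, generates a throwaway self-signed certificate with -addext (standing in for the one produced via peer.cnf above) and inspects its SAN:

```shell
# Sanity-check a certificate's subjectAltName. A throwaway self-signed cert
# stands in for the one generated with peer.cnf above.
WORK=$(mktemp -d)
cd "$WORK"
NEW_IP_ADDR='10.128.0.209'

openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj '/CN=peer.etcd.deephaven.local' \
  -addext "subjectAltName=IP:${NEW_IP_ADDR},DNS:peer.etcd.deephaven.local" \
  -keyout test.key -out test.crt 2>/dev/null

# Print the SAN extension; the new IP should appear as an "IP Address" entry.
san=$(openssl x509 -in test.crt -noout -ext subjectAltName)
echo "$san"
```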
      
    2. Step 1 generates a set of files corresponding to the server number that we are replacing. So, if the server number is 2, we should now have the following files:

      /etc/etcd/dh/cdda65eca/ssl/peer/etcd-2.private.key
      /etc/etcd/dh/cdda65eca/ssl/peer/etcd-2.public.crt
      /etc/etcd/dh/cdda65eca/ssl/server/etcd-2.private.key
      /etc/etcd/dh/cdda65eca/ssl/server/etcd-2.public.crt
      

      At this point, the files for the other server numbers need to be copied to this machine. Be careful not to overwrite the files we just generated; only copy the server and peer files corresponding to the other servers. For example, in a 5-server cluster, we would copy the files for etcd-1, etcd-3, etcd-4, and etcd-5.

    3. On the new node, and for each of the peer and server directories, generate the ca.crt file, which is simply a concatenation of all the .public.crt files. Using the ETCD_CONFIG_DIR variable set earlier (a shell glob cannot be used as a redirection target):

      cat ${ETCD_CONFIG_DIR}/ssl/peer/*.public.crt > ${ETCD_CONFIG_DIR}/ssl/peer/ca.crt
      cat ${ETCD_CONFIG_DIR}/ssl/server/*.public.crt > ${ETCD_CONFIG_DIR}/ssl/server/ca.crt
      
    4. The contents of the /etc/etcd/dh/*/ssl/peer and /etc/etcd/dh/*/ssl/server directories are now correct for all nodes, both pre-existing and new. Replace the contents of those directories on the pre-existing nodes with the contents of the respective directories just generated on the new node.