Replace an etcd node
Replacing an etcd node is a delicate operation; an error during the process may render the cluster unusable. If possible, perform this operation during periods of Deephaven system downtime (e.g., at night or after trading hours for a system supporting trading operations) to mitigate the risk.
Etcd node replacement procedure
Note
The commands shown in this guide must be run as the Deephaven administrative user. Use `sudo` or similar.
The `etcdctl.sh` command is inside the `/usr/illumon/latest/bin` directory of the Deephaven installation; ensure that directory is in your path, or modify the commands so that they are executed from the correct location.
1. Preparation
The `etcdctl.sh member list` command lists the nodes in the cluster.
The output of that command for an example cluster is shown below. Note: the recommended number of machines for a production etcd cluster is 5, but this simple example uses a three-machine etcd cluster.
etcdctl.sh member list -w table
Output:
+------------------+---------+--------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| 81b9c31827d6fcbb | started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 | false |
| 845d01a081fde043 | started | etcd-3 | https://10.128.0.203:2380 | https://10.128.0.203:2379 | false |
| a43c4d038028f2c8 | started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 | false |
+------------------+---------+--------+---------------------------+---------------------------+------------+
This command must be executed on a machine with the complete etcd client configuration and access to a root account. An infrastructure node is suitable for this purpose, as query nodes typically do not have an etcd root client configuration. Additionally, machines running an Authentication Server or a Configuration Server can also be used for this operation.
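A quick way to check whether a machine has the root etcd client configuration is to look for the root client directory referenced later in this guide (the path below assumes a default Deephaven installation):

# If this directory is missing, run the etcd commands from an infrastructure node instead.
ls /etc/sysconfig/deephaven/etcd/client/root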
To see detailed status for each available member, run `etcdctl.sh endpoint status`:
etcdctl.sh endpoint status -w table
Output:
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.0.199:2379 | a43c4d038028f2c8 | 3.5.12 | 5.7 MB | false | false | 4 | 1553 | 1553 | |
| https://10.128.0.200:2379 | 81b9c31827d6fcbb | 3.5.12 | 5.7 MB | true | false | 4 | 1554 | 1554 | |
| https://10.128.0.203:2379 | 845d01a081fde043 | 3.5.12 | 5.7 MB | false | false | 4 | 1555 | 1555 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
As shown above, the context for the command output in these instructions starts with a system containing 3 healthy etcd nodes. To simulate a node failure, the third machine in the list (10.128.0.203) was shut down, creating a scenario similar to an actual node failure. Running the same command again with one etcd node down now outputs:
etcdctl.sh endpoint status -w table
Output:
{"level":"warn","ts":"2025-01-08T00:34:24.52219Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003da8c0/10.128.0.199:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint https://10.128.0.203:2379 (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.0.199:2379 | a43c4d038028f2c8 | 3.5.12 | 5.7 MB | false | false | 4 | 1583 | 1583 | |
| https://10.128.0.200:2379 | 81b9c31827d6fcbb | 3.5.12 | 5.7 MB | true | false | 4 | 1584 | 1584 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Note
When a node is down, checking etcd status takes several seconds because it waits for a timeout. The timeout error message appears above the table in the output, and the node is no longer in the table.
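If the long waits get in the way of troubleshooting, `etcdctl` accepts a `--command-timeout` flag to shorten them; assuming `etcdctl.sh` passes extra flags through to the underlying `etcdctl`, a hedged example is:

# Fail faster when a node is down; 3s is an arbitrary value chosen for illustration.
etcdctl.sh --command-timeout=3s endpoint status -w table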
Running the `etcdctl.sh endpoint health` command shows the unhealthy node:
etcdctl.sh endpoint health -w table
Output:
{"level":"warn","ts":"2025-01-08T00:49:31.767844Z","logger":"client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003c1c00/10.128.0.203:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
+---------------------------+--------+-------------+---------------------------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+---------------------------+--------+-------------+---------------------------+
| https://10.128.0.200:2379 | true | 15.121479ms | |
| https://10.128.0.199:2379 | true | 9.691899ms | |
| https://10.128.0.203:2379 | false | | context deadline exceeded |
+---------------------------+--------+-------------+---------------------------+
Error: unhealthy cluster
2. Remove the failed node
When a node becomes unavailable, etcd keeps it as part of the cluster definition because the node might reappear. For example, a transient network error might cause a node to become unavailable for a few minutes before coming back online. In this guide, we are considering a scenario where the node is not expected to return. Therefore, the first step is to remove the failed node from the current etcd cluster configuration.
To get the member ID of the failed node from the command output, run:
etcdctl.sh member list
Then, run the following command, passing in the ID of the failed node:
etcdctl.sh member remove 845d01a081fde043
Output:
Member 845d01a081fde043 removed from cluster 966b8ec752907e5b
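As an alternative to copying the ID by hand, the lookup and removal can be scripted. A minimal sketch is below; it assumes the default comma-separated `member list` output and that you know the failed node's name (`etcd-3` in this example):

# The first comma-separated field of each member list line is the member ID.
FAILED_ID=$(etcdctl.sh member list | grep etcd-3 | cut -d, -f1)
etcdctl.sh member remove "$FAILED_ID"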
Running `etcdctl.sh member list` again shows that the node is gone.
etcdctl.sh member list -w table
Output:
+------------------+---------+--------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| 81b9c31827d6fcbb | started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 | false |
| a43c4d038028f2c8 | started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 | false |
+------------------+---------+--------+---------------------------+---------------------------+------------+
3. Replace the failed node
To replace the failed node, you need a new machine to run the etcd node. You can either provision a new machine for this purpose or use an existing machine with sufficient capacity. This example assumes that a new machine will replace the failed one and be assigned the same IP address. Using the same IP address avoids the need to update the etcd client configuration on all Deephaven machines. If a different IP address is used, you must update the configuration later, as described in a later section of this guide.
The following steps assume that a Deephaven machine has been configured and deployed with the same IP address to replace the failed one, and the etcd binaries have been installed on the machine.
The new machine will lack etcd service configuration and service definition at this point. To get them:
- Copy the configuration from an existing, working etcd node:
  - Take the contents of the `/etc/etcd/dh` directory and copy it to the new machine. Inside that directory, you will find a subdirectory with a unique cluster key; in our example, it is `cdda65eca`. This subdirectory contains configuration files for each etcd node named `config-N.yaml`, where `N` is a number between 1 and the total number of nodes. Additionally, there is a symbolic link called `config.yaml` that points to one of these files. Since you copied the directory from another machine, the symbolic link will point to the config file for that machine. Remove the symbolic link and recreate it to point to the config file corresponding to the new machine. You can identify the correct file by checking its contents: the IP address in `listen-client-urls` (around line 5) should match the IP address of the new machine. Permissions and ownership of symbolic links do not matter on Linux systems. However, if you want to maintain the original permissions on the new link, create it with the following command: `sudo -u etcd -g irisadmin ln -s ...`
  - Edit the `config.yaml` file (the symbolic link you recreated in the previous step points to it) and change `initial-cluster-state: new` (approximately line 11) to read `initial-cluster-state: existing`.
  - Take the contents of `/var/lib/etcd/dh` and copy it to the new machine. Remove all the files in the resulting directory; only keep the directory structure with the right owners and permissions. (A sketch of these copy steps follows this list.)
- Create the `systemctl` service definition for dh-etcd. Run: `/usr/illumon/latest/install/etcd/enable_dh_etcd_systemd.sh`
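The sketch below walks through the copy-and-relink steps end to end. It is a minimal example, not a definitive procedure: it assumes the cluster key is `cdda65eca`, the new machine is `etcd-3` (so its file is `config-3.yaml`), a working node is reachable at `10.128.0.199` with root SSH access, and that you verify owners and permissions against the source node afterwards.

# Run as root on the NEW machine after installing the etcd binaries.

# Copy the etcd configuration tree from a working node, preserving owners and permissions.
rsync -a root@10.128.0.199:/etc/etcd/dh/ /etc/etcd/dh/

# Re-point config.yaml at this machine's config file.
cd /etc/etcd/dh/cdda65eca
rm config.yaml                                   # still points at the source node's file
sudo -u etcd -g irisadmin ln -s config-3.yaml config.yaml

# Copy the data directory tree, then delete the files but keep the directory
# structure with its original owners and permissions.
rsync -a root@10.128.0.199:/var/lib/etcd/dh/ /var/lib/etcd/dh/
find /var/lib/etcd/dh -type f -delete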
4. Start the new etcd node as a learner
A learner node is not a consensus participant in a cluster; it joins to learn the database state before participating in consensus. Once it receives the full database state from the participant nodes, it can be promoted. To start a node as a learner, you first need to add it to the cluster. On a machine with a root etcd client account, run the command below. The command is followed by an explanation of its different parts.
ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member add etcd-3 --peer-urls=https://10.128.0.203:2380 --learner
Output:
Member 47b337abbab0351b added to cluster 966b8ec752907e5b
ETCD_NAME="etcd-3"
ETCD_INITIAL_CLUSTER="etcd-3=https://10.128.0.203:2380,etcd-2=https://10.128.0.200:2380,etcd-1=https://10.128.0.199:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.128.0.203:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
- Define an environment variable `ETCDCTL_ENDPOINTS` that lists the endpoints of the surviving nodes. Since this example is running a cluster with one node removed, the default endpoint configuration for the `etcdctl.sh` command is incorrect, so we need to override it with the valid endpoints. Defining the environment variable immediately before the command, as above, defines it just for that one invocation (of `etcdctl.sh` in our case).
- Run the `member add` command and pass the name of the node being replaced as the argument. This name should match the name the earlier `etcdctl.sh member list` gave for the failed node. The name is also on the first line of the `config.yaml` file (`name: etcd-3` in our example).
- The `--peer-urls=` argument indicates the peer URL for the new learner node being added. The IP address should match the IP of the machine being added.
- The `--learner` argument indicates that the node is being added as a learner.

Note that the last line of the command output confirms an initial cluster state of `existing`, which explains the need to change that setting in the configuration file.
Listing the etcd members again shows one learner, not started:
ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member list -w table
Output:
+------------------+-----------+--------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+-----------+--------+---------------------------+---------------------------+------------+
| 47b337abbab0351b | unstarted | | https://10.128.0.203:2380 | | true |
| 81b9c31827d6fcbb | started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 | false |
| a43c4d038028f2c8 | started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 | false |
+------------------+-----------+--------+---------------------------+---------------------------+------------+
Now start the etcd service. On the new machine, run:
systemctl start dh-etcd
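Before returning to the other machines, you may want to confirm that the service actually started; standard systemd commands work for this (the `dh-etcd` unit name comes from the service definition created earlier):

systemctl status dh-etcd                  # should report the unit as active (running)
journalctl -u dh-etcd -n 50 --no-pager    # recent service log lines, useful if startup failed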
Return to a machine with a root etcd client account and list the members again. You should now see the learner as started.
ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member list -w table
Output:
+------------------+---------+--------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| 47b337abbab0351b | started | etcd-3 | https://10.128.0.203:2380 | https://10.128.0.203:2379 | true |
| 81b9c31827d6fcbb | started | etcd-2 | https://10.128.0.200:2380 | https://10.128.0.200:2379 | false |
| a43c4d038028f2c8 | started | etcd-1 | https://10.128.0.199:2380 | https://10.128.0.199:2379 | false |
+------------------+---------+--------+---------------------------+---------------------------+------------+
5. Promote the new learner node
Now, wait for the new learner node to catch up, then promote it to a regular voting node.
The learner node needs to re-create its database from the data in the other nodes, which may take some time. Unfortunately, etcd does not provide a mechanism to monitor and confirm when a learner node has fully caught up. As a workaround, you can check the size of the `db` file located at `/var/lib/etcd/dh/cdda65eca/member/snap/db`. This file grows quickly while the learner node is receiving data from the other nodes and much more slowly once it has caught up. Do not compare file sizes between nodes, however: compaction can leave long-running nodes with unused space in the file that is reused for new write requests instead of growing the file, so there is no guarantee their file sizes would match.
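One simple way to watch this is to poll the file size on the new node; a minimal sketch, using the example path above (adjust the cluster key for your installation):

# Re-lists the db file every 10 seconds; growth slows noticeably once the learner has caught up.
watch -n 10 ls -lh /var/lib/etcd/dh/cdda65eca/member/snap/db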
Once the learner node has caught up, use the `member promote` command to promote the learner to a regular voting member:
ETCDCTL_ENDPOINTS=https://10.128.0.199:2379,https://10.128.0.200:2379 etcdctl.sh member promote 47b337abbab0351b
Output:
Member 47b337abbab0351b promoted in cluster 966b8ec752907e5b
There is no harm in trying to promote before the learner node is caught up, but in that case, the command will fail with the message:
Error: etcdserver: can only promote a learner member which is in sync with leader
Once the learner is promoted, you can return to using the defaults in `etcdctl.sh` without defining `ETCDCTL_ENDPOINTS=...` on every invocation. Listing the endpoint status should now show the full cluster.
etcdctl.sh endpoint status -w table
Output:
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.0.199:2379 | a43c4d038028f2c8 | 3.5.12 | 5.8 MB | true | false | 5 | 1789 | 1789 | |
| https://10.128.0.200:2379 | 81b9c31827d6fcbb | 3.5.12 | 5.8 MB | false | false | 5 | 1790 | 1790 | |
| https://10.128.0.203:2379 | 47b337abbab0351b | 3.5.12 | 5.7 MB | false | false | 5 | 1791 | 1791 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
6. Revert the configuration to use initial-cluster-state: new
Edit the configuration on the new node under `/etc/etcd/dh/latest/config.yaml` and change `initial-cluster-state: existing` back to `initial-cluster-state: new`.
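If you prefer to script this edit, a hedged sketch is below; it resolves the symlink first so the link itself is left untouched, and backs up the file before changing it.

CFG=$(readlink -f /etc/etcd/dh/latest/config.yaml)   # edit the real file, not the symlink
cp -p "$CFG" "$CFG.bak"
sed -i 's/^initial-cluster-state: existing/initial-cluster-state: new/' "$CFG"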
This concludes the replacement procedure.
Use a different IP address
The same procedure applies if an etcd node needs to be replaced with a machine with a different IP address. However, additional steps must be taken both before and after executing the procedure to ensure that you end up with a fully operational etcd cluster and a functioning Deephaven system.
Before replacing the node
The Deephaven system cannot use the new node without updating the etcd client configuration on each Deephaven machine. Although operations can continue with the remaining nodes, it is recommended to wait for a maintenance window when the Deephaven system can be halted to allow for the necessary configuration changes.
Replace the node
The procedure described above can be applied by substituting the correct new IP address in the shell commands and in the `config.yaml` file for the new node.
After replacing the node
- On each etcd node, edit the configuration in `/etc/etcd/dh/latest/config.yaml` and update the entry for `initial-cluster:` (around line 13) to list the correct set of IP addresses for the whole cluster, taking into account the node that was replaced. Note that this file is only read during etcd startup, so the voting nodes that were running during the replacement procedure were not affected by this setting being wrong at the time the cluster was modified and the new learner was added.
- With the Deephaven system down, on each Deephaven machine, find all the files named `endpoints` under the directory `/etc/sysconfig/deephaven/etcd/client`. This command gets a list of the files:

  find /etc/sysconfig/deephaven/etcd/client -type f -name endpoints

  Modify all of these files to replace the IP address.

  :::warning
  What follows are risky operations; ensure you create a backup of any files you are about to modify before actually changing them.
  :::

  As `root`, you can use a command similar to the one below to accomplish this. Note that the command is careful to avoid changing ownership and permissions of the files being modified:

  find /etc/sysconfig/deephaven/etcd/client -type f -name endpoints | \
    while read F; do
      echo "== Modifying '$F' ..."
      sed -i 's/10\.128\.0\.203/10.128.0.209/g' "$F"
      echo "== Done modifying '$F'."
    done

  The example above replaces the old IP `10.128.0.203` with the new IP `10.128.0.209`. Adjust those values to reflect the IPs in your case.
- Generate new certificates and associated keys for the node, and distribute them to the other nodes. The following instructions assume the naming and certificate generation settings for a default Deephaven installation.
  1. Set the environment variables:

     - `ETCD_CONFIG_DIR`: The directory where the etcd configuration lives.
     - `NEW_IP_ADDR`: The IP address of the new etcd replacement node.
     - `CERT_DAYS`: Number of days for the certificate validity period.
     - `ETCD_SERVER_NUMBER`: The number of the etcd server being replaced. In a 5-node cluster, this would be a number between 1 and 5, and should match the number used by the server being replaced. Most installations use a hostname that includes the server number as part of the machine name; otherwise, you can find the number by looking at the position of the server's IP address in the `/etc/sysconfig/deephaven/etcd/client/root/endpoints` file.

     Example values:

     ETCD_CONFIG_DIR=/etc/etcd/dh/cdda65eca
     NEW_IP_ADDR=10.128.0.209
     CERT_DAYS=3650
     ETCD_SERVER_NUMBER=2

     Then, as `root`, run:

     mkdir -p ${ETCD_CONFIG_DIR}/ssl/peer
     cd ${ETCD_CONFIG_DIR}/ssl/peer
     openssl genrsa 2048 > etcd-${ETCD_SERVER_NUMBER}.private.key
     SAN="IP:${NEW_IP_ADDR},DNS:peer.etcd.deephaven.local,DNS:etcd-${ETCD_SERVER_NUMBER}.peer.etcd.deephaven.local" \
       RSA_BITS="2048" \
       SERVER_CN="peer.etcd.deephaven.local" \
       ORGANIZATION="Deephaven Data Labs LLC" \
       ORGANIZATION_UNIT="Operations" \
       LOCATION="Colorado Springs" \
       STATE="Colorado" \
       COUNTRY="US" \
       openssl req \
         -config /usr/illumon/latest/install/etcd/peer.cnf \
         -x509 \
         -days "${CERT_DAYS}" \
         -key etcd-${ETCD_SERVER_NUMBER}.private.key \
         -out etcd-${ETCD_SERVER_NUMBER}.public.crt

     mkdir -p ${ETCD_CONFIG_DIR}/ssl/server
     cd ${ETCD_CONFIG_DIR}/ssl/server
     openssl genrsa 2048 > etcd-${ETCD_SERVER_NUMBER}.private.key
     SAN="IP:${NEW_IP_ADDR},DNS:server.etcd.deephaven.local,DNS:etcd-${ETCD_SERVER_NUMBER}.server.etcd.deephaven.local" \
       RSA_BITS="2048" \
       SERVER_CN="server.etcd.deephaven.local" \
       ORGANIZATION="Deephaven Data Labs LLC" \
       ORGANIZATION_UNIT="Operations" \
       LOCATION="Colorado Springs" \
       STATE="Colorado" \
       COUNTRY="US" \
       openssl req \
         -config /usr/illumon/latest/install/etcd/server.cnf \
         -x509 \
         -days "${CERT_DAYS}" \
         -key etcd-${ETCD_SERVER_NUMBER}.private.key \
         -out etcd-${ETCD_SERVER_NUMBER}.public.crt
  2. Step 1 generates a set of files corresponding to the server number that we are replacing. So, if the server number is 2, we should now have the following files:

     /etc/etcd/dh/cdda65eca/ssl/peer/etcd-2.private.key
     /etc/etcd/dh/cdda65eca/ssl/peer/etcd-2.public.crt
     /etc/etcd/dh/cdda65eca/ssl/server/etcd-2.private.key
     /etc/etcd/dh/cdda65eca/ssl/server/etcd-2.public.crt

     At this point, the files for the other server numbers need to be copied to this machine. Be careful not to replace the files we just generated; only copy the server and peer files corresponding to different servers. For example, in a 5-server cluster setup, we would copy the files for etcd-1, etcd-3, etcd-4, and etcd-5.
  3. On the new node, and for each of the peer and server directories, we now need to generate the `ca.crt` file, which is just a concatenation of all the `.public.crt` files:

     cat /etc/etcd/dh/*/ssl/peer/*.public.crt > /etc/etcd/dh/*/ssl/peer/ca.crt
     cat /etc/etcd/dh/*/ssl/server/*.public.crt > /etc/etcd/dh/*/ssl/server/ca.crt
  4. The contents of the `/etc/etcd/dh/*/ssl/peer` and `/etc/etcd/dh/*/ssl/server` directories are now correct with respect to all nodes, both pre-existing and new. Replace the contents of those directories on the pre-existing nodes with the contents of the respective directories on the new node that we just generated (a sketch of this step follows the list).
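A minimal sketch of that final distribution step, run from the new node. It assumes the example cluster key `cdda65eca`, pre-existing nodes at `10.128.0.199` and `10.128.0.200`, and root SSH access between the nodes; adjust all of these for your installation.

# --delete makes the destination match the new node exactly; -a preserves owners and permissions.
for H in 10.128.0.199 10.128.0.200; do
  rsync -a --delete /etc/etcd/dh/cdda65eca/ssl/peer/   root@$H:/etc/etcd/dh/cdda65eca/ssl/peer/
  rsync -a --delete /etc/etcd/dh/cdda65eca/ssl/server/ root@$H:/etc/etcd/dh/cdda65eca/ssl/server/
done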