Deephaven Etcd Cluster Recovery Guide

Rebuilding Etcd after Quorum Loss

An etcd cluster suffers “quorum loss” once 50% or more of its members are in an unhealthy state. Once a cluster has lost quorum, for example because more than half of the machines have been lost or have changed their IP addresses, the etcd cluster is considered unrecoverable, and a new cluster must be rebuilt from the old one.

While it is possible to forcibly resize the cluster down to a single node and then add nodes back one at a time, this process is complex, not always possible, and differs depending on the version of etcd you are using. The simpler and more universally applicable method of recovery is to rebuild a new cluster from a backup snapshot of the etcd database.

This guide details how to take regular snapshots manually, and how to recover your Deephaven etcd cluster from those snapshots.

Step 0: Take regular snapshots of your etcd cluster

Before you can recover an etcd cluster from quorum loss, you must have a snapshot of the etcd database, which you can obtain like so:
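
For example (a sketch; the wrapper script path is an assumption, and the etcdctl.sh wrapper is assumed to supply the client certificates and endpoints for you):

    # Write a point-in-time snapshot of the etcd database to /tmp/etcd.snap.
    sudo /usr/illumon/latest/bin/etcdctl.sh snapshot save /tmp/etcd.snap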

The file written to /tmp/etcd.snap contains everything you need to recover your cluster, and it is recommended to store a nightly backup in a secure location.

In case of emergencies where you do not already have a backup available, you may be able to retrieve a snapshot from any remaining functional etcd node, or as a last resort, by copying files from the disk of your etcd server.

  • To query a single server with etcdctl.sh, just add --endpoints=${YOUR_ETCD_IP}:2379 to the snapshot save command above.
  • To find an on-disk snapshot, look in /var/lib/etcd/dh/*/member/snap/db, where * is your etcd cluster token.
    • To see your etcd cluster token, grep cluster-token /etc/etcd/dh/latest/config.yaml
    • You should only use the files on disk as a last resort because they lack the necessary checksums to ensure the integrity of the database file.

When restoring your etcd cluster, you should upload your backup file to each machine that will be participating as an etcd server in your cluster. We will assume the location of /tmp/etcd.snap throughout this documentation.

Step 1: Remove any existing dh-etcd systemctl service

In cases where your etcd machines are still functional, but you wish, for example, to change a machine's IP address to use a separate interface, you should first log in to each machine and remove the current dh-etcd.service.
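
For example (a sketch; the unit name comes from /etc/etcd/dh/latest/dh-etcd.service):

    # Stop and disable the existing dh-etcd service on each etcd machine.
    sudo systemctl stop dh-etcd
    sudo systemctl disable dh-etcd
    # If your install links the unit into /etc/systemd/system (an assumption), remove
    # the link and reload systemd as well:
    # sudo rm -f /etc/systemd/system/dh-etcd.service && sudo systemctl daemon-reload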

It is also strongly recommended to run /usr/illumon/latest/bin/dh_monit down --block on all Deephaven servers, to turn all Deephaven processes off completely during the etcd disaster recovery process.

Step 2: Generate and distribute new etcd cluster configuration

Because Deephaven etcd clusters use peer and server SSL certificates, it is necessary to regenerate and redistribute the complete set of cluster configuration files at the same time.

It is also necessary to manually back up your etcd client passwords as part of this process. The newly generated SSL certificates and IP address endpoint files are delivered in an archive that also contains newly generated etcd user account passwords, and those new passwords will not work with a rebuilt etcd cluster. The script snippet below shows how to reconcile this by keeping the existing passwords:
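
A sketch of that reconciliation is below. The password directory, the config_generator.sh location, and its arguments are assumptions; substitute the paths and flags used by your installation.

    # Assumed location of the existing etcd client password files -- adjust to your install.
    PASSWORD_DIR=/etc/sysconfig/deephaven/etcd/client
    BACKUP_DIR=/tmp/etcd-client-passwords.bak

    # 1. Preserve the current client passwords before regenerating the configuration.
    sudo cp -a "$PASSWORD_DIR" "$BACKUP_DIR"

    # 2. Regenerate the cluster configuration (certificates, endpoints, and so on).
    #    The etcd_ips list must name the servers you are rebuilding, in order.
    etcd_ips="$ip_1 $ip_2 $ip_3"
    sudo /usr/illumon/latest/install/config_generator.sh    # plus your installation's flags

    # 3. Restore the original passwords over the freshly generated ones so they still
    #    match the user accounts stored inside the restored etcd database.
    sudo cp -a "$BACKUP_DIR"/. "$PASSWORD_DIR"/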

Once the config_generator.sh command has been run, the newly created etcd cluster configuration tar file will be located at /etc/sysconfig/deephaven/etcd/dh_etcd_config.tgz. You must copy this file to each etcd server (typically 3-5 machines).

Additionally, a “query config” package will be created at /tmp/$cluster_id/dh_query.tgz; it must be copied to, and unpacked on, all machines running any Deephaven processes.

Copy both of these files from your configuration server machine to any machine capable of transferring files between the other machines of your Deephaven cluster.
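
For example, using scp (the host names are illustrative):

    # Copy the etcd server config to each etcd server...
    for ip in $etcd_ips; do
        scp /etc/sysconfig/deephaven/etcd/dh_etcd_config.tgz "$ip:/tmp/"
    done

    # ...and the query config to every machine that runs Deephaven processes.
    for host in infra1 query1 query2; do
        scp "/tmp/$cluster_id/dh_query.tgz" "$host:/tmp/"
    done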

Step 3: Distribute the new etcd configuration files

To unpackage the etcd server configuration files:
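
A sketch of the unpacking step is below. The archive layout and target directory are assumptions; inspect the archive with tar -tzf first and follow the layout your release actually uses.

    # Inspect the archive to confirm its layout (directory names below are assumptions).
    tar -tzf /tmp/dh_etcd_config.tgz

    # Extract the archive and install the files for this machine's <SERVER_NUMBER>.
    cd /tmp && tar -xzf dh_etcd_config.tgz
    sudo cp -r /tmp/dh_etcd_config/server_<SERVER_NUMBER>/. /etc/etcd/dh/latest/
    sudo chown -R etcd:etcd /etc/etcd/dh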

The <SERVER_NUMBER> is the 1-indexed “etcd server number” based on the etcd_ips="$ip_1 $ip_2 $ip_3" list passed to config_generator.sh in the above snippet. On the machine matching ip_1, your <SERVER_NUMBER> is 1.

To unpackage the etcd client configuration files:
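
A corresponding sketch for the query package; again, confirm the archive layout and the directory your Deephaven processes read their etcd client configuration from (the target below is an assumption).

    tar -tzf /tmp/dh_query.tgz
    sudo tar -xzf /tmp/dh_query.tgz -C /etc/sysconfig/deephaven/etcd/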

You must include the configuration_server node that created the query package when performing this unpackage operation.

Step 4: Restore the etcd servers from snapshot

Once you have unpackaged the etcd client and server configuration files, your final step is to restore and start the etcd servers.

On each of the etcd servers (we will assume a 3-node cluster here, though 5 servers are recommended), run the appropriate snippet below.

Full code to run on etcd ip_1 machine:
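
The snippet below is a sketch built on the standard etcdctl snapshot restore workflow. The member names (etcd1, etcd2, etcd3), the peer port 2380, and the example IP addresses are assumptions; the names, peer URLs, and cluster token must match the values in the newly generated /etc/etcd/dh/latest/config.yaml.

    # --- common setup (identical on every etcd server) ---
    # etcdctl v3.4+ defaults to the v3 API; on newer releases, "etcdutl snapshot restore"
    # is the preferred equivalent of "etcdctl snapshot restore". If etcdctl is not on the
    # PATH, use the full path to the binary shipped with your install.
    ip_1=10.0.0.1; ip_2=10.0.0.2; ip_3=10.0.0.3        # illustrative addresses
    token=$(grep cluster-token /etc/etcd/dh/latest/config.yaml | awk '{print $2}')
    cluster="etcd1=https://$ip_1:2380,etcd2=https://$ip_2:2380,etcd3=https://$ip_3:2380"

    # Make sure the old service is stopped and the old data directory is out of the way.
    sudo systemctl stop dh-etcd || true
    sudo mv "/var/lib/etcd/dh/$token" "/var/lib/etcd/dh/$token.old.$(date +%s)" || true

    # --- restore this member (etcd1 on ip_1) from the snapshot ---
    sudo etcdctl snapshot restore /tmp/etcd.snap \
        --name etcd1 \
        --initial-cluster "$cluster" \
        --initial-cluster-token "$token" \
        --initial-advertise-peer-urls "https://$ip_1:2380" \
        --data-dir "/var/lib/etcd/dh/$token"

    # Fix ownership and bring the service back up.
    sudo chown -R etcd:etcd /var/lib/etcd/dh
    sudo systemctl enable dh-etcd
    sudo systemctl start dh-etcd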

The initial setup lines of the above snippet (everything before the etcdctl snapshot restore command) are the same on all machines and are omitted from the snippets below for brevity.

Snippet to run on etcd ip_2 machine:
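
Continuing the same sketch: run the common setup lines first, then restore this machine as the second member (the name etcd2 is an assumption and must match config.yaml on this machine).

    # --- restore this member (etcd2 on ip_2) from the snapshot ---
    sudo etcdctl snapshot restore /tmp/etcd.snap \
        --name etcd2 \
        --initial-cluster "$cluster" \
        --initial-cluster-token "$token" \
        --initial-advertise-peer-urls "https://$ip_2:2380" \
        --data-dir "/var/lib/etcd/dh/$token"

    sudo chown -R etcd:etcd /var/lib/etcd/dh
    sudo systemctl enable dh-etcd
    sudo systemctl start dh-etcd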

Snippet to run on etcd ip_3 machine:
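
Likewise for the third member (etcd3 is again an assumption):

    # --- restore this member (etcd3 on ip_3) from the snapshot ---
    sudo etcdctl snapshot restore /tmp/etcd.snap \
        --name etcd3 \
        --initial-cluster "$cluster" \
        --initial-cluster-token "$token" \
        --initial-advertise-peer-urls "https://$ip_3:2380" \
        --data-dir "/var/lib/etcd/dh/$token"

    sudo chown -R etcd:etcd /var/lib/etcd/dh
    sudo systemctl enable dh-etcd
    sudo systemctl start dh-etcd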

Step 5: Turn everything back on

Once the above process is complete, you should expect a clean bill of health from your etcd cluster.

Shell into your configuration_server node and run the following commands to test:
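
For example (a sketch assuming the etcdctl.sh wrapper path; plain etcdctl with the appropriate --endpoints and certificate flags works equally well):

    sudo /usr/illumon/latest/bin/etcdctl.sh endpoint health
    sudo /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out=table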

A healthy cluster should display:
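
Illustrative endpoint health output (addresses and timings will differ on your cluster):

    https://10.0.0.1:2379 is healthy: successfully committed proposal: took = 2.345ms
    https://10.0.0.2:2379 is healthy: successfully committed proposal: took = 2.781ms
    https://10.0.0.3:2379 is healthy: successfully committed proposal: took = 3.012ms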

Verify that your old data is present in your new etcd cluster:
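
For example, counting every key under the root prefix (a sketch; the wrapper path is an assumption):

    sudo /usr/illumon/latest/bin/etcdctl.sh get "" --prefix --keys-only | grep -c .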

    21640

On a healthy, restored system, the count should be in the tens of thousands. You may wish to check a subset, like /main/config/props, which should return only a few hundred results.

Assuming the above commands both work correctly, turn monit back on across your entire Deephaven cluster:
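
For example, mirroring the dh_monit down command from Step 1 (this assumes your dh_monit supports the corresponding up action):

    sudo /usr/illumon/latest/bin/dh_monit up --block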

Troubleshooting

If things do not go smoothly, the following suggestions can help you diagnose and solve problems.

Enable etcd debug logging

On your etcd server, edit the file /etc/etcd/dh/latest/config.yaml:
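
For example (a sketch; this assumes log-level is not already set in the file):

    # etcd >= 3.5: set the logger to debug, then restart the service.
    echo 'log-level: debug' | sudo tee -a /etc/etcd/dh/latest/config.yaml
    sudo systemctl restart dh-etcd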

In versions of etcd older than 3.5, you must set debug: true instead of log-level: debug.

Enabling debug logging causes journalctl to report more verbose log messages:
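
For example:

    # Follow the dh-etcd unit's log output.
    sudo journalctl -u dh-etcd -f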

For installations with restricted sudo permissions, you may need to add --no-pager, or otherwise adjust your command to match the sudoers rules reported by sudo -l -U your_user (note that the -U is capitalized).

The journalctl command above shows only dh-etcd process output. However, some additional logging related to etcd is only visible in the full journalctl log (no -u flag):
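
For example:

    # Follow the full journal and filter for etcd-related messages.
    sudo journalctl -f | grep -i etcd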

Running etcd directly

Even with debug logging enabled, some interesting output can be lost when going through systemd and journalctl, so you may wish to temporarily stop the dh-etcd service and invoke etcd directly.
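
A sketch of such a one-liner (stop the service first with sudo systemctl stop dh-etcd):

    # Extract the ExecStart command from the unit file and run it as the etcd user.
    sudo -u etcd bash -c "$(sed -n 's/^ExecStart=//p' /etc/etcd/dh/latest/dh-etcd.service)"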

The above snippet is a bash one-liner for “open the file /etc/etcd/dh/latest/dh-etcd.service, find the ExecStart= line, and run the ExecStart command as the etcd user account”.

This is typically:
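
Illustrative only; the etcd binary location on your install may differ:

    /usr/bin/etcd --config-file /etc/etcd/dh/latest/config.yaml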

Running etcd like this guarantees you will have all etcd output in your terminal without messing around in potentially noisy journalctl logs.

If you have been running etcd manually as the wrong user account, you can wind up with file permission errors that systemctl cannot reconcile for you.
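
To check for files not owned by the etcd account (a sketch; the data directory follows the convention noted in Step 0):

    sudo find /var/lib/etcd/dh ! -user etcd -ls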

If the above reports any files, consider fixing with:
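
For example (the etcd group name is an assumption; match the ownership used elsewhere in the data directory):

    sudo chown -R etcd:etcd /var/lib/etcd/dh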

Scrapping a failed etcd recovery

If something goes wrong while you are testing etcd recovery processes, and you have accidentally corrupted etcd’s internal metadata (for example, by adding multiple nodes at one time to a single node cluster and losing quorum), then you may wish to throw away everything and try again. In this case, rerun the appropriate section of Step 4, above.
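
For example, to wipe a failed member and start over (a sketch; the data directory path follows the cluster-token convention noted in Step 0):

    # On each etcd server: stop the service and remove the corrupted data directory,
    # then repeat the restore commands from Step 4.
    sudo systemctl stop dh-etcd
    token=$(grep cluster-token /etc/etcd/dh/latest/config.yaml | awk '{print $2}')
    sudo rm -rf "/var/lib/etcd/dh/$token"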