Deephaven Etcd Cluster Recovery Guide

Rebuilding Etcd after Quorum Loss

An etcd cluster suffers “quorum loss” once 50% or more of its members are in an unhealthy state. Once a cluster has lost quorum, for example because more than half of the machines have been lost or have changed their IP addresses, the etcd cluster is considered unrecoverable, and a new cluster must be rebuilt from the old one.

While it is possible to forcibly resize the cluster down to a single node and then add nodes back one at a time, this process is complex, not always possible, and differs depending on the version of etcd you are using. The simpler and more universally applicable method of recovery is to rebuild a new cluster from a backup snapshot of the etcd database.

This guide details how to take regular snapshots manually, and how to recover your Deephaven etcd cluster from those snapshots.

Step 0: Take regular snapshots of your etcd cluster

Before you can recover an etcd cluster from quorum loss, you must have a snapshot of the etcd database, which you can obtain like so:
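
For example (a sketch; the wrapper script path is an assumption, and the etcdctl.sh wrapper is assumed to supply the client certificates and endpoints for you):

    # Write a point-in-time snapshot of the etcd database to /tmp/etcd.snap.
    sudo /usr/illumon/latest/bin/etcdctl.sh snapshot save /tmp/etcd.snap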

The file written to /tmp/etcd.snap contains everything you need to recover your cluster, and it is recommended to store a nightly backup in a secure location.

In case of emergencies where you do not already have a backup available, you may be able to retrieve a snapshot from any remaining functional etcd node, or as a last resort, by copying files from the disk of your etcd server.

  • To query a single server with etcdctl.sh, just add --endpoints=${YOUR_ETCD_IP}:2379 to the snapshot save command above.
  • To find an on-disk snapshot, look in /var/lib/etcd/dh/*/member/snap/db, where * is your etcd cluster token.
    • To see your etcd cluster token, grep cluster-token /etc/etcd/dh/latest/config.yaml
    • You should only use the files on disk as a last resort because they lack the necessary checksums to ensure the integrity of the database file.

When restoring your etcd cluster, you should upload your backup file to each machine that will be participating as an etcd server in your cluster. We will assume the location of /tmp/etcd.snap throughout this documentation.

Step 1: Remove any existing dh-etcd systemctl service

In cases where your etcd machines are still functional, but you wish, for example, to change a machine's IP address to use a separate interface, you should first log in to each machine and remove the current dh-etcd.service.
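
For example (a sketch; the unit name comes from /etc/etcd/dh/latest/dh-etcd.service):

    # Stop and disable the existing dh-etcd service on each etcd machine.
    sudo systemctl stop dh-etcd
    sudo systemctl disable dh-etcd
    # If your install links the unit into /etc/systemd/system (an assumption), remove
    # the link and reload systemd as well:
    # sudo rm -f /etc/systemd/system/dh-etcd.service && sudo systemctl daemon-reload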

It is also strongly recommended to run /usr/illumon/latest/bin/dh_monit down --block on all Deephaven servers, to turn all Deephaven processes off completely during the etcd disaster recovery process.

Step 2: Generate and distribute new etcd cluster configuration

Because Deephaven etcd clusters use peer and server SSL certificates, it is necessary to regenerate and redistribute the complete set of cluster configuration files at the same time.

It is also necessary to manually back up your etcd client passwords as part of this process. The newly generated SSL certificates and IP address endpoint files are delivered in an archive that also contains newly generated etcd user account passwords, and those new passwords will not work with a rebuilt etcd cluster. The script snippet below shows how to reconcile this by keeping the existing passwords:
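
A sketch of that reconciliation is below. The password directory, the config_generator.sh location, and its arguments are assumptions; substitute the paths and flags used by your installation.

    # Assumed location of the existing etcd client password files -- adjust to your install.
    PASSWORD_DIR=/etc/sysconfig/deephaven/etcd/client
    BACKUP_DIR=/tmp/etcd-client-passwords.bak

    # 1. Preserve the current client passwords before regenerating the configuration.
    sudo cp -a "$PASSWORD_DIR" "$BACKUP_DIR"

    # 2. Regenerate the cluster configuration (certificates, endpoints, and so on).
    #    The etcd_ips list must name the servers you are rebuilding, in order.
    etcd_ips="$ip_1 $ip_2 $ip_3"
    sudo /usr/illumon/latest/install/config_generator.sh    # plus your installation's flags

    # 3. Restore the original passwords over the freshly generated ones so they still
    #    match the user accounts stored inside the restored etcd database.
    sudo cp -a "$BACKUP_DIR"/. "$PASSWORD_DIR"/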

Once the config_generator.sh command has been run, the newly created etcd cluster configuration tar file will be located at /etc/sysconfig/deephaven/etcd/dh_etcd_config.tgz. You must copy this file to each etcd server (typically 3-5 machines).

Additionally, a “query config” package will be created at /tmp/$cluster_id/dh_query.tgz; it must be copied to, and unpacked on, all machines running any Deephaven processes.

Copy both of these files from your configuration server machine to any machine capable of transferring files between the other machines of your Deephaven cluster.
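
For example, using scp (the host names are illustrative):

    # Copy the etcd server config to each etcd server...
    for ip in $etcd_ips; do
        scp /etc/sysconfig/deephaven/etcd/dh_etcd_config.tgz "$ip:/tmp/"
    done

    # ...and the query config to every machine that runs Deephaven processes.
    for host in infra1 query1 query2; do
        scp "/tmp/$cluster_id/dh_query.tgz" "$host:/tmp/"
    done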

Step 3: Distribute the new etcd configuration files

To unpackage the etcd server configuration files:
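
A sketch of the unpacking step is below. The archive layout and target directory are assumptions; inspect the archive with tar -tzf first and follow the layout your release actually uses.

    # Inspect the archive to confirm its layout (directory names below are assumptions).
    tar -tzf /tmp/dh_etcd_config.tgz

    # Extract the archive and install the files for this machine's <SERVER_NUMBER>.
    cd /tmp && tar -xzf dh_etcd_config.tgz
    sudo cp -r /tmp/dh_etcd_config/server_<SERVER_NUMBER>/. /etc/etcd/dh/latest/
    sudo chown -R etcd:etcd /etc/etcd/dh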

The <SERVER_NUMBER> is the 1-indexed “etcd server number” based on the etcd_ips="$ip_1 $ip_2 $ip_3" list passed to config_generator.sh in the above snippet. On the machine matching ip_1, your <SERVER_NUMBER> is 1.

To unpackage the etcd client configuration files:
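
A corresponding sketch for the query package; again, confirm the archive layout and the directory your Deephaven processes read their etcd client configuration from (the target below is an assumption).

    tar -tzf /tmp/dh_query.tgz
    sudo tar -xzf /tmp/dh_query.tgz -C /etc/sysconfig/deephaven/etcd/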

You must include the configuration_server node that created the query package when performing this unpackage operation.

Step 4: Restore the etcd servers from snapshot

Once you have unpackaged the etcd client and server configuration files, your final step is to restore and start the etcd servers.

On each of the etcd servers (we will assume a 3-node cluster here, though 5 servers are recommended), run the appropriate snippet below.

Full code to run on etcd ip_1 machine:
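
The snippet below is a sketch built on the standard etcdctl snapshot restore workflow. The member names (etcd1, etcd2, etcd3), the peer port 2380, and the example IP addresses are assumptions; the names, peer URLs, and cluster token must match the values in the newly generated /etc/etcd/dh/latest/config.yaml.

    # --- common setup (identical on every etcd server) ---
    # etcdctl v3.4+ defaults to the v3 API; on newer releases, "etcdutl snapshot restore"
    # is the preferred equivalent of "etcdctl snapshot restore". If etcdctl is not on the
    # PATH, use the full path to the binary shipped with your install.
    ip_1=10.0.0.1; ip_2=10.0.0.2; ip_3=10.0.0.3        # illustrative addresses
    token=$(grep cluster-token /etc/etcd/dh/latest/config.yaml | awk '{print $2}')
    cluster="etcd1=https://$ip_1:2380,etcd2=https://$ip_2:2380,etcd3=https://$ip_3:2380"

    # Make sure the old service is stopped and the old data directory is out of the way.
    sudo systemctl stop dh-etcd || true
    sudo mv "/var/lib/etcd/dh/$token" "/var/lib/etcd/dh/$token.old.$(date +%s)" || true

    # --- restore this member (etcd1 on ip_1) from the snapshot ---
    sudo etcdctl snapshot restore /tmp/etcd.snap \
        --name etcd1 \
        --initial-cluster "$cluster" \
        --initial-cluster-token "$token" \
        --initial-advertise-peer-urls "https://$ip_1:2380" \
        --data-dir "/var/lib/etcd/dh/$token"

    # Fix ownership and bring the service back up.
    sudo chown -R etcd:etcd /var/lib/etcd/dh
    sudo systemctl enable dh-etcd
    sudo systemctl start dh-etcd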

The initial setup lines of the above snippet (everything before the etcdctl snapshot restore command) are the same on all machines and are omitted from the snippets below for brevity.

Snippet to run on etcd ip_2 machine:
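
Continuing the same sketch: run the common setup lines first, then restore this machine as the second member (the name etcd2 is an assumption and must match config.yaml on this machine).

    # --- restore this member (etcd2 on ip_2) from the snapshot ---
    sudo etcdctl snapshot restore /tmp/etcd.snap \
        --name etcd2 \
        --initial-cluster "$cluster" \
        --initial-cluster-token "$token" \
        --initial-advertise-peer-urls "https://$ip_2:2380" \
        --data-dir "/var/lib/etcd/dh/$token"

    sudo chown -R etcd:etcd /var/lib/etcd/dh
    sudo systemctl enable dh-etcd
    sudo systemctl start dh-etcd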

Snippet to run on etcd ip_3 machine:
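
Likewise for the third member (etcd3 is again an assumption):

    # --- restore this member (etcd3 on ip_3) from the snapshot ---
    sudo etcdctl snapshot restore /tmp/etcd.snap \
        --name etcd3 \
        --initial-cluster "$cluster" \
        --initial-cluster-token "$token" \
        --initial-advertise-peer-urls "https://$ip_3:2380" \
        --data-dir "/var/lib/etcd/dh/$token"

    sudo chown -R etcd:etcd /var/lib/etcd/dh
    sudo systemctl enable dh-etcd
    sudo systemctl start dh-etcd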

Step 5: Turn everything back on

Once the above process is complete, you should expect a clean bill of health from your etcd cluster.

Shell into your configuration_server node and run the following commands to test:
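
For example (a sketch assuming the etcdctl.sh wrapper path; plain etcdctl with the appropriate --endpoints and certificate flags works equally well):

    sudo /usr/illumon/latest/bin/etcdctl.sh endpoint health
    sudo /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out=table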

A healthy cluster should display:
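
Illustrative endpoint health output (addresses and timings will differ on your cluster):

    https://10.0.0.1:2379 is healthy: successfully committed proposal: took = 2.345ms
    https://10.0.0.2:2379 is healthy: successfully committed proposal: took = 2.781ms
    https://10.0.0.3:2379 is healthy: successfully committed proposal: took = 3.012ms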

Verify that your old data is present in your new etcd cluster:
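
For example, counting every key under the root prefix (a sketch; the wrapper path is an assumption):

    sudo /usr/illumon/latest/bin/etcdctl.sh get "" --prefix --keys-only | grep -c .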

    21640

On a healthy, restored system, the count should be in the tens of thousands. You may wish to check a subset, like /main/config/props, which should return only a few hundred results.

Assuming the above commands both work correctly, turn monit back on across your entire Deephaven cluster:
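
For example, mirroring the dh_monit down command from Step 1 (this assumes your dh_monit supports the corresponding up action):

    sudo /usr/illumon/latest/bin/dh_monit up --block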

Troubleshooting

If things do not go smoothly, the following suggestions can help you diagnose and solve problems.

Enable etcd debug logging

On your etcd server, edit the file /etc/etcd/dh/latest/config.yaml:
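
For example (a sketch; this assumes log-level is not already set in the file):

    # etcd >= 3.5: set the logger to debug, then restart the service.
    echo 'log-level: debug' | sudo tee -a /etc/etcd/dh/latest/config.yaml
    sudo systemctl restart dh-etcd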

In versions of etcd older than 3.5, you must set debug: true instead of log-level: debug.

Enabling debug logging causes journalctl to report more verbose log messages:
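
For example:

    # Follow the dh-etcd unit's log output.
    sudo journalctl -u dh-etcd -f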

For installations with restricted sudo permissions, you may need to add --no-pager, or otherwise adjust your command to match the sudoers rules reported by sudo -l -U your_user (note that the -U is capitalized).

The journalctl command above shows only dh-etcd process output. However, some additional logging related to etcd is only visible in the full journalctl log (no -u flag):
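
For example:

    # Follow the full journal and filter for etcd-related messages.
    sudo journalctl -f | grep -i etcd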

Running etcd directly

Even with debug logging enabled, some interesting output can be lost when going through systemd and journalctl, so you may wish to temporarily stop the dh-etcd service and invoke etcd directly.
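
A sketch of such a one-liner (stop the service first with sudo systemctl stop dh-etcd):

    # Extract the ExecStart command from the unit file and run it as the etcd user.
    sudo -u etcd bash -c "$(sed -n 's/^ExecStart=//p' /etc/etcd/dh/latest/dh-etcd.service)"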

The above snippet is a bash one-liner for “open the file /etc/etcd/dh/latest/dh-etcd.service, find the ExecStart= line, and run the ExecStart command as the etcd user account”.

This is typically:
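
Illustrative only; the etcd binary location on your install may differ:

    /usr/bin/etcd --config-file /etc/etcd/dh/latest/config.yaml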

Running etcd like this guarantees you will have all etcd output in your terminal without messing around in potentially noisy journalctl logs.

If you have been running etcd manually as the wrong user account, you can wind up with file permission errors that systemctl cannot reconcile for you.
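
To check for files not owned by the etcd account (a sketch; the data directory follows the convention noted in Step 0):

    sudo find /var/lib/etcd/dh ! -user etcd -ls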

If the above reports any files, consider fixing with:
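
For example (the etcd group name is an assumption; match the ownership used elsewhere in the data directory):

    sudo chown -R etcd:etcd /var/lib/etcd/dh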

Scrapping a failed etcd recovery

If something goes wrong while you are testing etcd recovery processes, and you have accidentally corrupted etcd’s internal metadata (for example, by adding multiple nodes at one time to a single node cluster and losing quorum), then you may wish to throw away everything and try again. In this case, rerun the appropriate section of Step 4, above.
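
For example, to wipe a failed member and start over (a sketch; the data directory path follows the cluster-token convention noted in Step 0):

    # On each etcd server: stop the service and remove the corrupted data directory,
    # then repeat the restore commands from Step 4.
    sudo systemctl stop dh-etcd
    token=$(grep cluster-token /etc/etcd/dh/latest/config.yaml | awk '{print $2}')
    sudo rm -rf "/var/lib/etcd/dh/$token"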