Kubernetes etcd backup and recovery

Etcd backup and restore procedures in a Kubernetes installation differ slightly from the procedures in a default installation. Note that this page refers to the etcd installation used solely by Deephaven for its own purposes, which is completely separate from the etcd instance used by the Kubernetes system itself.

Overview

The procedure for restoring the Deephaven etcd cluster is as follows:

  • Take a backup snapshot of the existing etcd cluster, if necessary. A recent snapshot may already be available if your original etcd cluster was configured with disaster recovery.
  • Create a Persistent Volume (PV) and Persistent Volume Claim (PVC) that will serve as the restore and backup location for the new etcd cluster. It must be a read-write-many (RWX) volume. In this example, we use the NFS server that is part of the Deephaven installation, but it can be any RWX storage available to you.
  • Install a new etcd cluster from the snapshot file.
  • Update the Deephaven configuration to point to the new etcd cluster.

Prerequisites

It is assumed that you have kubectl and helm command-line tools installed and configured for your target namespace in your Kubernetes cluster. You will also need the following information to proceed with the etcd backup and restore:

  • Helm install name of etcd. Substitute this name wherever <etcd-install-name> appears in the examples below.

  • Kubernetes namespace. Substitute this name wherever <k8s-namespace> appears in the examples below.

  • Root password for etcd. Substitute this password wherever <etcd-root-password> appears in the examples below.

  • The etcd Helm chart package file bitnami-etcd-helm-11.3.6.tgz, which is contained within the deephaven-helm distribution used to install Deephaven, and the Docker images package bitnami-etcd-containers-11.3.6.tar.gz.

    Note

    If you installed Deephaven with an earlier version and do not have the bitnami files, please contact Deephaven support to obtain them.

Finding your etcd Helm install name, namespace, and root password

To find the Helm install name, run helm list and look for the etcd listing.
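For example, assuming your kubectl/helm context already targets the correct cluster:

```bash
# List Helm releases in your namespace and look for the etcd entry.
helm list -n <k8s-namespace>
```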

The etcd root user password is stored in a secret whose name contains the Helm install name.
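A sketch of retrieving it, assuming the Bitnami chart's default secret layout where the password is stored under the etcd-root-password key:

```bash
# Find the secret whose name contains your etcd Helm install name.
kubectl get secrets -n <k8s-namespace> | grep <etcd-install-name>

# Decode the root password. The key name etcd-root-password is the Bitnami
# chart default; confirm it with 'kubectl describe secret <etcd-secret-name>'.
kubectl get secret <etcd-secret-name> -n <k8s-namespace> \
  -o jsonpath='{.data.etcd-root-password}' | base64 -d
```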

Determine if etcd disaster recovery is enabled

If your etcd cluster was configured with disaster recovery, a recent snapshot may already be available. You can check by looking at the values used to configure the initial etcd installation. Run helm get values <etcd-install-name> and look for the disasterRecovery.enabled value. If it is set to true, a snapshot should be available in the location specified by either disasterRecovery.persistentVolume.existingClaim or startFromSnapshot.existingClaim.

An example section of the output of the helm get values <etcd-install-name> command is shown below, where disaster recovery is enabled.
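Output along these lines indicates that disaster recovery is enabled (the claim name here is only illustrative; yours will differ):

```yaml
disasterRecovery:
  enabled: true
  persistentVolume:
    existingClaim: etcd-backup-pvc   # Location where snapshots are stored.
```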

Take a snapshot of etcd

If disaster recovery snapshotting is enabled, the etcd nodes store backup snapshots in the /snapshots directory by default. First, list the available snapshots to identify their names, then copy the most recent snapshot from the node.
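For example, assuming the default StatefulSet pod naming (<etcd-install-name>-0):

```bash
# List the snapshots stored on one of the etcd nodes.
kubectl exec -n <k8s-namespace> <etcd-install-name>-0 -- ls -l /snapshots

# Copy the most recent snapshot file from the node to your local machine.
kubectl cp <k8s-namespace>/<etcd-install-name>-0:/snapshots/<snapshot-file-name> \
  ./<snapshot-file-name>
```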

If disaster recovery is not enabled on the etcd cluster, you will need to take a snapshot manually on one of the nodes and then copy it from the node.
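A sketch of taking a manual snapshot with etcdctl, again assuming the default StatefulSet pod naming and root authentication:

```bash
# Take a snapshot on one of the etcd nodes, authenticating as the root user.
kubectl exec -n <k8s-namespace> <etcd-install-name>-0 -- \
  etcdctl --user root:<etcd-root-password> snapshot save /tmp/etcd-snapshot.db

# Copy the snapshot file off the node.
kubectl cp <k8s-namespace>/<etcd-install-name>-0:/tmp/etcd-snapshot.db ./etcd-snapshot.db
```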

Restore from the etcd snapshot

Prepare the snapshot file on your RWX storage

To start a new etcd cluster that restores from a snapshot file, the snapshot file must be on a read-write-many (RWX) persistent volume. This example uses the NFS deployment that is part of the Deephaven installation. If you are using other RWX storage for your deployment, the process to place the file there would differ.

On the NFS pod, there should be a folder named either /exports/exports/dhsystem or just /exports/dhsystem, depending on the version you have installed. The correct folder is the one that contains a db directory.

Create another folder there named etcd-backup2 and copy the snapshot into it. Note that if etcd disaster recovery was configured, there is probably already a folder named etcd-backup alongside db.
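For example, assuming the /exports/exports/dhsystem path (use /exports/dhsystem if that is what your installation has) and the NFS pod name found with kubectl get pods:

```bash
# Find the NFS server pod name.
kubectl get pods -n <k8s-namespace> | grep nfs

# Create the new backup folder alongside the db directory.
kubectl exec -n <k8s-namespace> <nfs-pod-name> -- \
  mkdir -p /exports/exports/dhsystem/etcd-backup2

# Copy the snapshot file from your local machine into the new folder.
kubectl cp ./<snapshot-file-name> \
  <k8s-namespace>/<nfs-pod-name>:/exports/exports/dhsystem/etcd-backup2/<snapshot-file-name>
```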

Create a Persistent Volume and Persistent Volume Claim for the snapshot

A Persistent Volume Claim (PVC) and Persistent Volume (PV) are required to start a new etcd cluster from a snapshot. The section below is an example of a YAML file for a PVC and PV to use as a template. Read the comments, change the PVC and PV names to your chosen names (e.g. etcd-dr-snapshot-pv and etcd-dr-snapshot-pvc), and change the spec.nfs.server value to your cluster's NFS server name or IP address. If you are using other RWX storage, the YAML will differ.
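A minimal sketch of such a manifest is shown here, assuming the NFS-backed folder created above; the names, 1Gi size, and paths are placeholders to adjust for your environment.

```yaml
# Example PersistentVolume backed by the Deephaven NFS server.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: etcd-dr-snapshot-pv
spec:
  capacity:
    storage: 1Gi                        # Placeholder; size it to hold your snapshot.
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: <nfs-server-name-or-ip>     # Your cluster's NFS server name or IP address.
    path: <path-to-etcd-backup2>        # The etcd-backup2 folder as exported by the NFS server.
---
# Example PersistentVolumeClaim bound explicitly to the PV above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-dr-snapshot-pvc
  namespace: <k8s-namespace>
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""                  # Prevent dynamic provisioning; bind to the PV below.
  volumeName: etcd-dr-snapshot-pv
  resources:
    requests:
      storage: 1Gi
```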

Save as a file named etcd-restore-vol.yaml, then apply the file to create the volume and claim.
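For example:

```bash
kubectl apply -n <k8s-namespace> -f etcd-restore-vol.yaml

# Confirm the claim is bound.
kubectl get pvc etcd-dr-snapshot-pvc -n <k8s-namespace>
```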

Apply full access permissions to the snapshot directory

The etcd pods writing to the snapshot volume run with user and group 1001:1001, so the permissions on the snapshot directory must allow that user to read and write it. Run the following command to set full permissions.
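For example, from the NFS pod (adjust the pod name and path to match your installation):

```bash
kubectl exec -n <k8s-namespace> <nfs-pod-name> -- \
  chmod -R 777 /exports/exports/dhsystem/etcd-backup2
```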

Load the etcd Docker images to your repository

Ensure that your repository has the etcd Docker images available. If they need to be added to your repository, see this section of the installation guide. If you do not have the bitnami-etcd-containers-11.3.6.tar.gz file, contact Deephaven support to obtain it.
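A sketch of loading and pushing the images with a Docker-compatible CLI; the repository paths are placeholders, and the exact image names inside the archive may differ:

```bash
# Load the etcd images from the distribution archive.
docker load -i bitnami-etcd-containers-11.3.6.tar.gz

# Re-tag and push to your own repository (paths are placeholders).
docker tag bitnami/etcd:3.5.21-debian-12-r5 \
  my-repo.dev/my-project/images/bitnami/etcd:3.5.21-debian-12-r5
docker push my-repo.dev/my-project/images/bitnami/etcd:3.5.21-debian-12-r5
```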

Install the new etcd cluster

You are now ready to install the etcd Helm chart. The following command specifies properties to restore the cluster from a snapshot. Substitute a name for your new etcd installation (e.g., dh-etcd2), your repository name (e.g., my-repo.dev/my-project/images), and the password from your original etcd cluster.

Note

The image registry, repository, and tag values used below are combined to form the full image URL used by the pods.

If your repo has the images stored at my-repo.dev/my-project/images/bitnami/etcd:3.5.21-debian-12-r5, then you will use image.registry=my-repo.dev/my-project/images below, and the image.repository and image.tag values will remain as shown.
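A sketch of the install command, assuming the chart's startFromSnapshot values and the PVC created above; confirm the value names against the chart's documentation for your version:

```bash
helm install dh-etcd2 ./bitnami-etcd-helm-11.3.6.tgz -n <k8s-namespace> \
  --set image.registry=my-repo.dev/my-project/images \
  --set image.repository=bitnami/etcd \
  --set image.tag=3.5.21-debian-12-r5 \
  --set auth.rbac.rootPassword=<etcd-root-password> \
  --set startFromSnapshot.enabled=true \
  --set startFromSnapshot.existingClaim=etcd-dr-snapshot-pvc \
  --set startFromSnapshot.snapshotFilename=<snapshot-file-name>
```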

Monitor the installation with kubectl get pods -l app.kubernetes.io/instance=<new-helm-install-name> -w. Verify that all pods have a status of Running and that the READY column shows 1/1, indicating that 1 of 1 containers is ready. This may take a few minutes, depending on how much data is in the restore file.

Updating Deephaven for the new etcd cluster name

The endpoints data in Deephaven must now be updated to point to the new cluster. You will need the non-headless etcd service URL for the new cluster.
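To find the service name, list the etcd services and pick the non-headless one (the headless service typically has -headless in its name):

```bash
kubectl get svc -n <k8s-namespace> | grep etcd
```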

The format for the endpoint is http://<etcd-service-name>.<k8s-namespace>.svc.cluster.local:2379. Substitute your values to formulate the endpoint for your environment. For example, if your etcd service name is dh-etcd2 and your k8s namespace is dhns, the endpoint would be http://dh-etcd2.dhns.svc.cluster.local:2379.

Because Kubernetes secrets store data in base64-encoded format, encode your endpoint with the following command:
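Using the example endpoint above:

```bash
# -n prevents a trailing newline from being included in the encoded value.
echo -n "http://dh-etcd2.dhns.svc.cluster.local:2379" | base64
```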

Use this snippet to update the secrets with your encoded endpoint value.
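A hypothetical sketch using kubectl patch; the actual secret names and data keys that hold the etcd endpoints in your Deephaven installation may differ, so list them first and substitute the names you find:

```bash
# List secrets that may carry etcd endpoint data (names are installation-specific).
kubectl get secrets -n <k8s-namespace> | grep etcd

# Patch each relevant secret, replacing the secret name and data key with the ones
# found above. <encoded-endpoint> is the base64 value produced in the previous step.
kubectl patch secret <etcd-endpoints-secret-name> -n <k8s-namespace> \
  --type merge -p '{"data":{"endpoints":"<encoded-endpoint>"}}'
```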

You are now ready to perform the final Helm upgrade of Deephaven.

When Deephaven was initially installed, you most likely used a YAML file to define Helm chart values for your environment (such as the Deephaven URL, Docker image repository, and NFS server information). That file contains a YAML value for etcd.release with your original etcd installation name. Change that value to your new etcd installation name.
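For example, in your values file:

```yaml
etcd:
  release: dh-etcd2   # The new etcd installation name.
```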

If you do not have that file, retrieve the values by running helm get values <deephaven-install-name>. If you do not know the Deephaven Helm installation name, run helm list.
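For example, saving the current values to a file you can edit (the output file name is arbitrary):

```bash
helm list -n <k8s-namespace>   # If you need to find the Deephaven install name.
helm get values <deephaven-install-name> -n <k8s-namespace> -o yaml > deephaven-values.yaml
```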

The Deephaven Helm installation package includes a setupTools/scaleAll.sh script. Use it to scale the Deephaven deployments to 0:
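For example, assuming the script accepts the target replica count as its argument:

```bash
cd /path/to/deephaven-helm   # Wherever you unpacked the Deephaven Helm package.
./setupTools/scaleAll.sh 0
```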

Then do the Helm upgrade as you normally would, as described here.
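A sketch of the upgrade command, assuming your edited values file and the Deephaven chart from your distribution (paths and file names are placeholders):

```bash
helm upgrade <deephaven-install-name> /path/to/deephaven-chart \
  -n <k8s-namespace> -f deephaven-values.yaml
```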