Kubernetes etcd backup and recovery

Etcd backup and restore procedures in a Kubernetes installation differ slightly from those in a default installation. Note that this page refers to the etcd installation used solely by Deephaven for its own purposes; it is completely separate from the etcd instance used by the Kubernetes system itself.

Prerequisites

The Helm installation name, Kubernetes namespace, and root etcd password are needed to back up and restore the etcd cluster.

To find the Helm install name, run helm list and look for the etcd listing. In the example commands on this page, the etcd Helm install name is stor, the namespace is deephaven, and the root password is plaintext-root-password. Substitute your own values when running these commands in your cluster.

$ helm list
NAME            NAMESPACE   REVISION    UPDATED                                 STATUS      CHART           APP VERSION
stor            deephaven    1           2024-03-14 14:58:50.304509 -0400 EDT    deployed    etcd-9.14.2     3.5.12

The etcd root user password is stored in a secret.

# Find the name of the secret for the etcd root password
$ kubectl get secrets -l app.kubernetes.io/component=etcd
NAME        TYPE     DATA   AGE
stor-etcd   Opaque   1      100d

# Get the root password value
$ kubectl get secret stor-etcd -o jsonpath='{.data.etcd-root-password}' | base64 -d
plaintext-root-password
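
Optionally, you can capture the root password in a shell variable so you do not have to retype it in the etcdctl commands below. The variable name here is just an example; the commands on this page show the plaintext value for clarity.

# Optional: store the etcd root password in a shell variable for reuse in later etcdctl commands
$ etcd_root_password=$(kubectl get secret stor-etcd -o jsonpath='{.data.etcd-root-password}' | base64 -d)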

Taking a snapshot of etcd

Use one of the nodes in your etcd cluster to take a snapshot and store it temporarily.

# List all pods in the etcd cluster
$ kubectl get pods -l app.kubernetes.io/instance=stor
NAME          READY   STATUS    RESTARTS   AGE
stor-etcd-0   1/1     Running   0          100d
stor-etcd-1   1/1     Running   0          100d
stor-etcd-2   1/1     Running   0          100d

# Run command to save an etcd snapshot to a file on one of the nodes. The root password is required.
$ kubectl exec stor-etcd-0 -- etcdctl --user root:plaintext-root-password snapshot save /tmp/etcd-snapshot.db
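
Optionally, you can sanity-check the snapshot before copying it off the node. etcdctl can report the snapshot's hash, revision, key count, and size; newer etcd versions print a deprecation notice suggesting etcdutl for this subcommand, but the check still works.

# Optional: verify the snapshot file and show its hash, revision, and size
$ kubectl exec stor-etcd-0 -- etcdctl snapshot status /tmp/etcd-snapshot.db -w table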

Copy the snapshot from the etcd node to your local filesystem. After the snapshot file is copied, it is no longer needed on the etcd node and can be deleted.

# Copy the snapshot from the etcd node to your local filesystem
$ kubectl cp stor-etcd-0:/tmp/etcd-snapshot.db etcd-snapshot.db

# Delete the snapshot once it is safely copied
$ kubectl exec stor-etcd-0 -- rm /tmp/etcd-snapshot.db

Restoring from a snapshot of etcd

To start a new etcd cluster that restores from a snapshot file, the snapshot file must be on a read-write-many (RWX) persistent volume. This example uses the NFS deployment that is part of the Deephaven installation, though it can be any other RWX storage available to you.

On the NFS pod, there should be a folder named either /exports/exports/dhsystem, or just /exports/dhsystem, depending on the version you have installed. The correct folder has a db directory.

# Confirm the directory that has the 'db' directory in it
$ kubectl exec deploy/deephaven-nfs-server -- ls -l /exports/exports/dhsystem
drwxr-xr-x 6 root root 4096 Jun 17 14:58 db
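
If that path does not exist on your installation, check the shorter path instead, and use whichever directory contains db in the commands that follow. The rest of this page uses the /exports/exports/dhsystem form.

# Alternate location on installations that export /exports/dhsystem directly
$ kubectl exec deploy/deephaven-nfs-server -- ls -l /exports/dhsystem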

Create another folder alongside db and copy the snapshot to it.

# Create directory for the etcd snapshot file
$ kubectl exec deploy/deephaven-nfs-server -- mkdir -p /exports/exports/dhsystem/etcd-snap/restore

# Copy the snapshot file to the new directory
$ mynfspod=$(kubectl get pods -l role=deephaven-nfs-server -o custom-columns='NAME:.metadata.name' --no-headers)
$ kubectl cp etcd-snapshot.db ${mynfspod}:/exports/exports/dhsystem/etcd-snap/restore/etcd-snapshot.db
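
You can confirm that the snapshot file is now on the NFS export before creating the volume:

# Confirm the snapshot file is present on the NFS export
$ kubectl exec deploy/deephaven-nfs-server -- ls -l /exports/exports/dhsystem/etcd-snap/restore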

A persistent volume claim (PVC) and persistent volume (PV) are required to start a new etcd cluster from a snapshot. The YAML below is a template for both; see the # comments, and change the spec.nfs.server value to the NFS server name or IP address for your cluster. Remove the comments from your copy and save it as etcd-restore-vol.yaml, then create the volume and claim by running kubectl apply -f etcd-restore-vol.yaml, as shown after the template.

PV and PVC YAML example for etcd restore
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-etcd-restore-snapshot-pv
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 10Gi
  mountOptions:
    - hard
    - nfsvers=4.1
  nfs:
    # This path is correct even if the NFS server has /exports/exports, because /exports is
    # exposed as the NFS root; this is the correct path relative to that root.
    path: /exports/dhsystem/etcd-snap/restore
    # The server value should be: <deephaven-nfs-service-name>.<your-k8s-namespace>.svc.cluster.local.
    # However, the DNS name does not always resolve (AKS, EKS), so you can use the IP address directly
    # for the server value. Get the NFS service IP with 'kubectl get svc deephaven-nfs'.
    server: deephaven-nfs.deephaven.svc.cluster.local
  persistentVolumeReclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-etcd-restore-snapshot-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ''
  volumeName: my-etcd-restore-snapshot-pv
  resources:
    requests:
      storage: 10Gi
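
Create the volume and claim, then confirm that the claim reports a STATUS of Bound before continuing:

# Create the PV and PVC defined above
$ kubectl apply -f etcd-restore-vol.yaml

# The claim should show a STATUS of Bound
$ kubectl get pvc my-etcd-restore-snapshot-pvc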

You are now ready to install the etcd Helm chart. This is similar to the initial Deephaven installation, with a few additional properties specified so the new cluster is restored from the snapshot. Substitute the name of your new etcd installation for the stor2 name used in this example. Note that you need the etcd root password, as described above.

$ helm install stor2 bitnami/etcd -f dev-values/etcdValues.yaml        \
    --set startFromSnapshot.enabled=true                               \
    --set startFromSnapshot.existingClaim=my-etcd-restore-snapshot-pvc \
    --set startFromSnapshot.snapshotFilename=etcd-snapshot.db          \
    --set replicaCount=3                                               \
    --set auth.rbac.enabled=true                                       \
    --set auth.rbac.rootPassword=plaintext-root-password
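
Before reconfiguring Deephaven, wait for the new cluster's pods to become Ready. You can also spot-check cluster health with etcdctl; the stor2-etcd-0 pod name below follows the same naming pattern as the earlier stor examples.

# Wait until all of the new etcd pods are Running and Ready
$ kubectl get pods -l app.kubernetes.io/instance=stor2

# Optional: check endpoint health on one of the new nodes
$ kubectl exec stor2-etcd-0 -- etcdctl --user root:plaintext-root-password endpoint health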

Updating Deephaven for the new etcd cluster name

The endpoints data in Deephaven must be updated to point to the new cluster. For the new configuration, you need the non-headless etcd service name for the new cluster, which you can find by running kubectl get svc -l app.kubernetes.io/instance=stor2 (substitute the name of your installation for stor2).

# Get the non-headless service name for the new etcd instance
$ kubectl get svc -l app.kubernetes.io/instance=stor2
NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
stor2-etcd            ClusterIP   10.122.9.149   <none>        2379/TCP,2380/TCP   26d
stor2-etcd-headless   ClusterIP   None           <none>        2379/TCP,2380/TCP   26d

The format for the endpoint is http://<etcd-k8s-service-name>.<namespace>.svc.cluster.local:2379. For this example, that is http://stor2-etcd.deephaven.svc.cluster.local:2379. Substitute the values for your environment.

The next step is to update some of our secrets with this information. Data in Kubernetes secrets is base64-encoded, so first, encode the endpoint.

# Base64 encode the etcd non-headless endpoint
$ echo -n http://stor2-etcd.deephaven.svc.cluster.local:2379 | base64
aHR0cDovL3N0b3IyLWV0Y2QuZGVlcGhhdmVuLnN2Yy5jbHVzdGVyLmxvY2FsOjIzNzk=

Use this snippet to update the secrets with your encoded endpoint value.

# Update the secrets with the new encoded endpoint value
$ secret_names=$(kubectl get secret -l role=etcd-client-cred -o custom-columns='NAME:.metadata.name' --no-headers)
$ for sn in $secret_names; do
>   kubectl patch secrets $sn --type='json' -p='[{"op" : "replace" ,"path" : "/data/endpoints" ,"value" : "aHR0cDovL3N0b3IyLWV0Y2QuZGVlcGhhdmVuLnN2Yy5jbHVzdGVyLmxvY2FsOjIzNzk="}]'
> done
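
If you want to confirm the change, decode the patched value back out of each secret. This check is optional:

# Optional: verify that each secret now points at the new endpoint
$ for sn in $secret_names; do
>   printf '%s: ' $sn
>   kubectl get secret $sn -o jsonpath='{.data.endpoints}' | base64 -d
>   echo
> done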

We are now ready to do the final Helm upgrade of Deephaven. You most likely used a YAML file to define Helm chart token values for your environment when Deephaven was initially installed; these include things such as the URL of the Deephaven installation, the Docker image repository, and the Deephaven NFS server information. That file contains a YAML value for etcd.release set to your original etcd installation name. Change it to your new etcd installation name. If you do not have that file, you can retrieve the values you provided by running helm get values <deephaven installation name>. If you do not know the Deephaven Helm installation name, run helm list.

The Deephaven Helm installation package includes a setupTools/scaleAll.sh script. Scale the deployments to 0:

setupTools/scaleAll.sh 0

Then perform the Helm upgrade as you normally would, as described here.
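
For reference, a typical invocation is sketched below; the release name, chart reference, and values file are placeholders, so substitute the names from your own installation and follow the linked upgrade instructions.

# Illustrative only -- use your own Deephaven release name, chart reference, and values file
$ helm upgrade my-deephaven /path/to/deephaven-chart -f my-values.yaml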