Kubernetes etcd backup and recovery

Etcd backup and restore procedures in a Kubernetes installation differ slightly from the procedures in a default installation. Note that this page refers to the etcd installation used solely by Deephaven for its own purposes, which is completely separate from the etcd instance used by the Kubernetes system itself.

Overview

The procedure for restoring the Deephaven etcd cluster is as follows:

  • Take a backup snapshot of the existing etcd cluster, if necessary. A recent snapshot may already be available if your original etcd cluster was configured with disaster recovery.
  • Create a Persistent Volume and Persistent Volume Claim (PVC) that will serve as the restore and backup location for the new etcd cluster. It must be a read-write-many (RWX) volume. In this example, we use the NFS server that is part of the Deephaven installation, but it can be any RWX storage available to you.
  • Install a new etcd cluster from the snapshot file.
  • Update the Deephaven configuration to point to the new etcd cluster.

Prerequisites

It is assumed that you have the kubectl and helm command-line tools installed and configured for your target namespace in your Kubernetes cluster. You will also need the following information to proceed with the etcd backup and restore:

  • Helm install name of etcd. Substitute this name wherever <etcd-install-name> appears in the examples below.

  • Kubernetes namespace. Substitute this name wherever <k8s-namespace> appears in the examples below.

  • Root password for etcd. Substitute this password wherever <etcd-root-password> appears in the examples below.

  • The etcd Helm chart package file named bitnami-etcd-helm-11.3.6.tgz, which is contained in the deephaven-helm distribution used to install Deephaven, and the Docker images package bitnami-etcd-containers-11.3.6.tar.gz.

    Note

    If you installed Deephaven with an earlier version and do not have the bitnami files, please contact Deephaven support to obtain them.

Finding your etcd Helm install name, namespace, and root password

To find the Helm install name, run helm list and look for the etcd listing.

$ helm list
NAME                 NAMESPACE        REVISION    UPDATED                                 STATUS      CHART           APP VERSION
<etcd-install-name>  <k8s-namespace>  1           2024-03-14 14:58:50.304509 -0400 EDT    deployed    etcd-11.3.6     3.5.21

The etcd root user password is stored in a secret whose name contains the Helm install name.

# Find the name of the secret for the etcd root password
$ kubectl get secrets -l app.kubernetes.io/component=etcd
NAME                       TYPE     DATA   AGE
<etcd-secret-name>         Opaque   1      100d

# Get your root etcd password value with this command
$ kubectl get secret <etcd-secret-name> -o jsonpath='{.data.etcd-root-password}' | base64 -d
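
As an optional convenience, you can capture these values in shell variables for reuse in later steps. This is only a sketch: it assumes your etcd release name contains "etcd", that exactly one matching secret exists, and that your kubectl context has a namespace set. The variable names are illustrative.

# Capture the etcd install name, namespace, secret name, and root password (optional)
etcd_install_name=$(helm list --filter 'etcd' -q)
k8s_namespace=$(kubectl config view --minify -o jsonpath='{..namespace}')
etcd_secret_name=$(kubectl get secrets -l app.kubernetes.io/component=etcd -o jsonpath='{.items[0].metadata.name}')
etcd_root_password=$(kubectl get secret "$etcd_secret_name" -o jsonpath='{.data.etcd-root-password}' | base64 -d)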

Determine if etcd disaster recovery is enabled

If your etcd cluster was configured with disaster recovery, a recent snapshot may already be available. You can check by looking at the values used to configure the initial etcd installation. Run helm get values <etcd-install-name> and look for the disasterRecovery.enabled value. If it is set to true, a snapshot should be available in the location specified by either disasterRecovery.pvc.existingClaim or startFromSnapshot.existingClaim.

An example section of the output of the helm get values <etcd-install-name> command is shown below, where disaster recovery is enabled.

disasterRecovery:
  cronjob:
    schedule: '*/10 * * * *'
    snapshotHistoryLimit: 1
  enabled: true
  pvc:
    existingClaim: <backup-snapshot-pvc-name>
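
For a quick command-line check, you can filter the release values directly. This is a minimal sketch; it assumes the values render with a top-level disasterRecovery block as shown above.

# Print just the disasterRecovery block of the release values
helm get values <etcd-install-name> -o yaml | grep -A 6 'disasterRecovery:'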

Take a snapshot of etcd

If snapshotting is enabled, the etcd nodes store backup snapshots in the /snapshots directory by default. First, list the available snapshots to identify their names, then copy the most recent snapshot from the node.

# Find the names of your etcd pods
kubectl get pod -l app.kubernetes.io/component=etcd

# Find etcd snapshots stored on any one of the etcd nodes
kubectl exec <etcd-pod-name> -- ls /snapshots
db-2025-09-16_19-20
db-2025-09-16_19-30

# Copy the most recent snapshot to your local filesystem as etcd-snapshot.db
kubectl cp <etcd-pod-name>:/snapshots/db-2025-09-16_19-30 etcd-snapshot.db --retries=10

If disaster recovery is not enabled on the etcd cluster, you will need to take a snapshot manually on one of the nodes and then copy it from the node.

# Find the names of your etcd pods
kubectl get pod -l app.kubernetes.io/component=etcd

# Run command to save an etcd snapshot to a file on one of the nodes. The root password is required.
kubectl exec <etcd-pod-name> -- etcdctl --user root:<etcd-root-password> snapshot save /tmp/etcd-snapshot.db

# Copy the snapshot to your local filesystem as etcd-snapshot.db
kubectl cp <etcd-pod-name>:/tmp/etcd-snapshot.db etcd-snapshot.db --retries=10
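
Optionally, verify the snapshot file before restoring from it. This is a sketch that assumes the etcdutl binary is available in the etcd image (it ships with etcd 3.5 releases); if it is not present, the older, deprecated etcdctl snapshot status command provides the same report.

# Optional: report the snapshot's hash, revision, total keys, and size
kubectl exec <etcd-pod-name> -- etcdutl snapshot status /tmp/etcd-snapshot.db -w table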

Restore from the etcd snapshot

Prepare the snapshot file on your RWX storage

To start a new etcd cluster that restores from a snapshot file, the snapshot file must be on a read-write-many (RWX) persistent volume. This example uses the NFS deployment that is part of the Deephaven installation. If you are using other RWX storage for your deployment, the process to place the file there would differ.

On the NFS pod, there should be a folder named either /exports/exports/dhsystem, or just /exports/dhsystem, depending on the version you have installed. The correct folder contains a db directory.

# Confirm the directory that has the 'db' directory in it
$ kubectl exec deploy/deephaven-nfs-server -- ls -l /exports/exports/dhsystem
drwxr-xr-x 6 root root 4096 Jun 17 14:58 db

Create another folder here named etcd-backup2 and copy the snapshot to it. Note that if etcd disaster recovery was configured, there is probably another folder alongside db named etcd-backup.

# Create directory for the etcd snapshot file
$ kubectl exec deploy/deephaven-nfs-server -- mkdir -p /exports/exports/dhsystem/etcd-backup2

# Copy the snapshot file to the new directory
$ mynfspod=$(kubectl get pods -l role=deephaven-nfs-server -o custom-columns='NAME:.metadata.name' --no-headers)
$ kubectl cp etcd-snapshot.db ${mynfspod}:/exports/exports/dhsystem/etcd-backup2/etcd-snapshot.db --retries=10
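
Before creating the volume, it is worth confirming the snapshot file landed in the new directory. This quick check assumes the /exports/exports/dhsystem path used above; adjust it if your installation uses /exports/dhsystem.

# Confirm the snapshot file is present on the NFS volume
kubectl exec deploy/deephaven-nfs-server -- ls -l /exports/exports/dhsystem/etcd-backup2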

Create a Persistent Volume and Persistent Volume Claim for the snapshot

A Persistent Volume Claim (PVC) and Persistent Volume (PV) are required to start a new etcd cluster from a snapshot. The section below is an example of a YAML file for a PVC and PV to use as a template. Read the comments, change the PVC and PV names to your chosen names (e.g. etcd-dr-snapshot-pv and etcd-dr-snapshot-pvc), and change the spec.nfs.server value to your cluster's NFS server name or IP address. If you are using other RWX storage, the YAML will differ.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: <your-etcd-restore-snapshot-pv-name>
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 10Gi
  mountOptions:
    - hard
    - nfsvers=4.1
  nfs:
    # This path is correct even if the NFS server uses /exports/exports, because /exports is exposed
    # as the NFS root and this path is relative to that root.
    path: /exports/dhsystem/etcd-backup2
    # The server value should be: <deephaven-nfs-service-name>.<k8s-namespace>.svc.cluster.local.
    # However, the DNS name does not always resolve (e.g. in AKS, EKS), so you can use the IP address directly
    # for the server value. Get the NFS service IP with command 'kubectl get svc deephaven-nfs'.
    server: deephaven-nfs.<k8s-namespace>.svc.cluster.local
  persistentVolumeReclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <your-etcd-restore-snapshot-pvc-name>
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ''
  volumeName: <your-etcd-restore-snapshot-pv-name>
  resources:
    requests:
      storage: 10Gi

Save as a file named etcd-restore-vol.yaml, then apply the file to create the volume and claim.

kubectl apply -f etcd-restore-vol.yaml
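
A quick check that the claim bound to the volume can save troubleshooting later. Use whatever PV and PVC names you chose in the YAML above; the STATUS column should show Bound.

# Confirm the PVC and PV are bound
kubectl get pvc <your-etcd-restore-snapshot-pvc-name>
kubectl get pv <your-etcd-restore-snapshot-pv-name>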

Apply full access permissions to the snapshot directory

The etcd pods writing to the snapshot volume run with user and group 1001:1001, so the permissions on the snapshot directory must allow for that. Run the following commands to set full permissions and confirm them.

# Set full permissions on the snapshot directory
kubectl exec -it svc/deephaven-nfs -- bash -c "chmod 777 /exports/exports/dhsystem/etcd-backup2"

# And confirm with this command
kubectl exec -it svc/deephaven-nfs -- bash -c "ls -ld /exports/exports/dhsystem/etcd-backup2"

Load the etcd Docker images to your repository

Ensure that your repository has the etcd Docker images available. If they need to be added to your repository, see this section of the installation guide. If you do not have the bitnami-etcd-containers-11.3.6.tar.gz file, contact Deephaven support to obtain it.
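
If you need to push the images manually, the general pattern is to load the archive, retag the image for your repository, and push it. This is a sketch only: it assumes a Docker-compatible client with push access to your repository, and the image name and tag inside the archive may differ from those shown here.

# Load, retag, and push the etcd image to your repository (sketch)
docker load -i bitnami-etcd-containers-11.3.6.tar.gz
docker tag bitnami/etcd:3.5.21-debian-12-r5 <repository-url-and-path>/bitnami/etcd:3.5.21-debian-12-r5
docker push <repository-url-and-path>/bitnami/etcd:3.5.21-debian-12-r5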

Install the new etcd cluster

You are now ready to install the etcd Helm chart. The following command specifies properties to restore the cluster from a snapshot. Substitute a name for your new etcd installation (e.g., dh-etcd2), your repository name (e.g., urco-docker.pkg.dev/deephaven/images), and the password from your original etcd cluster.

Note

The image.registry, image.repository, and image.tag values used below are combined to form the full image URL used in the pods.

If your repository has the images stored at urco-docker.pkg.dev/deephaven/images/bitnami/etcd:3.5.21-debian-12-r5, then use image.registry=urco-docker.pkg.dev/deephaven/images below; the image.repository and image.tag values remain as shown.

# Note that the image registry, repository and tag values will be combined to form the full image
# url, so if your repo has urco-docker.pkg.dev/deephaven/images/bitnami/etcd:3.5.21-debian-12-r5
# then you would use image.registry=urco-docker.pkg.dev/deephaven/images
helm install <new-helm-install-name> bitnami-etcd-helm-11.3.6.tgz \
    --set image.registry=<repository-url-and-path> \
    --set image.repository=bitnami/etcd \
    --set image.tag=3.5.21-debian-12-r5 \
    --set global.security.allowInsecureImages=true \
    --set startFromSnapshot.enabled=true \
    --set startFromSnapshot.existingClaim=<your-etcd-restore-snapshot-pvc-name> \
    --set startFromSnapshot.snapshotFilename=etcd-snapshot.db \
    --set disasterRecovery.cronjob.schedule="*/10 * * * *" \
    --set disasterRecovery.cronjob.snapshotHistoryLimit=1 \
    --set disasterRecovery.enabled=true \
    --set replicaCount=3 \
    --set auth.rbac.enabled=true \
    --set resourcesPreset=medium \
    --set auth.rbac.rootPassword=<your-plaintext-root-password> \
    --timeout 10m

Monitor the installation with kubectl get pods -l app.kubernetes.io/instance=<new-helm-install-name> -w. Verify that all pods have a STATUS of Running and that the READY column shows 1/1, indicating 1 out of 1 containers is in a ready state. This may take a few minutes, depending on how much data is in the restore file.
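
Once the pods are ready, you can optionally confirm that the restored cluster is healthy and contains your data. This is a sketch; substitute one of the new etcd pod names and the root password you set in the install command.

# Optional: check cluster health and list a few restored keys
kubectl exec <new-etcd-pod-name> -- etcdctl --user root:<your-plaintext-root-password> endpoint health
kubectl exec <new-etcd-pod-name> -- etcdctl --user root:<your-plaintext-root-password> get / --prefix --keys-only --limit 10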

Updating Deephaven for the new etcd cluster name

The endpoints data in Deephaven must now be updated to point to the new cluster. You will need the non-headless etcd service URL for the new cluster.

# There will be two services for the etcd cluster; we need the non-headless service name
kubectl get svc -l app.kubernetes.io/instance=<new-helm-install-name>
NAME                              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
<new-etcd-service-name>           ClusterIP   10.122.9.149   <none>        2379/TCP,2380/TCP   6m31s
<new-etcd-service-name>-headless  ClusterIP   None           <none>        2379/TCP,2380/TCP   6m31s

The format for the endpoint is http://<etcd-service-name>.<k8s-namespace>.svc.cluster.local:2379. Substitute your values to formulate the endpoint for your environment. For example, if your etcd service name is dh-etcd2 and your k8s namespace is dhns, the endpoint would be http://dh-etcd2.dhns.svc.cluster.local:2379.

Because Kubernetes secrets store data in base64-encoded format, encode your endpoint with the following command:

# Base64 encode the etcd non-headless endpoint, substituting your values for the
# new helm install name and k8s namespace
echo -n http://<new-etcd-service-name>.<k8s-namespace>.svc.cluster.local:2379 | base64
aHR0cDov...bDoyMzc5

Use this snippet to update the secrets with your encoded endpoint value.

# Update the secrets with the new encoded endpoint value
my_secrets=$(kubectl get secret -l role=etcd-client-cred -o custom-columns='_:.metadata.name' --no-headers)
my_new_endpoint="aHR0cDov...bDoyMzc5"  # Substitute your base64 encoded endpoint value here
for s in $my_secrets; do
  kubectl patch secrets "$s" --type='json' -p="[{\"op\": \"replace\", \"path\": \"/data/endpoints\", \"value\": \"${my_new_endpoint}\"}]"
done
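
After patching, you can verify that each secret now decodes to the new endpoint. This quick check reuses the my_secrets variable from the snippet above.

# Confirm each secret now points at the new etcd service
for s in $my_secrets; do
  kubectl get secret "$s" -o jsonpath='{.data.endpoints}' | base64 -d
  echo
done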

You are now ready to perform the final Helm upgrade of Deephaven.

When Deephaven was initially installed, you most likely used a YAML file to define Helm chart values for your environment (such as the Deephaven URL, Docker image repository, and NFS server information). That file contains a YAML value for etcd.release with your original etcd installation name. Change that value to your new etcd installation name.
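
For example, the relevant portion of the values file would look something like the following; only the etcd.release value needs to change, and the surrounding keys in your file will differ.

etcd:
  release: <new-helm-install-name>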

If you do not have that file, retrieve the values by running helm get values <deephaven-install-name>. If you do not know the Deephaven Helm installation name, run helm list.
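
If you need to reconstruct the file, the current values can be written out directly. The output file name here is only an example.

# Recover the current Deephaven chart values into a file
helm get values <deephaven-install-name> -o yaml > deephaven-values.yaml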

The Deephaven Helm installation package includes a setupTools/scaleAll.sh script. Scale the deployments to 0:

setupTools/scaleAll.sh 0

Then do the Helm upgrade as you normally would, as described here.