Kubernetes etcd backup and recovery
Etcd backup and restore procedures in a Kubernetes installation differ slightly from the procedures in a default installation. Note that this page refers to the etcd installation used solely by Deephaven for its own purposes, which is completely separate from the etcd instance used by the Kubernetes system.
Overview
The procedure for restoring the Deephaven etcd cluster is as follows:
- Take a backup snapshot of the existing etcd cluster, if necessary. A recent snapshot may already be available if your original etcd cluster was configured with disaster recovery.
- Create a Persistent Volume and Persistent Volume Claim (PVC) that will serve as the restore and backup location for the new etcd cluster. It must be a read-write-many (RWX) volume. In this example, we use the NFS server that is part of the Deephaven installation, but it can be any RWX storage available to you.
- Install a new etcd cluster from the snapshot file.
- Update the Deephaven configuration to point to the new etcd cluster.
Prerequisites
It is assumed that you have the kubectl and helm command-line tools installed and configured for your target namespace in your Kubernetes cluster. You will also need the following information to proceed with the etcd backup and restore:
- Helm install name of etcd. Substitute this name wherever <etcd-install-name> appears in the examples below.
- Kubernetes namespace. Substitute this name wherever <k8s-namespace> appears in the examples below.
- Root password for etcd. Substitute this password wherever <etcd-root-password> appears in the examples below.
- The etcd Helm chart package file named bitnami-etcd-helm-11.3.6.tgz that is contained within the deephaven-helm distribution used to install Deephaven, and the Docker images package bitnami-etcd-containers-11.3.6.tar.gz.
Note
If you installed Deephaven with an earlier version and do not have the bitnami files, please contact Deephaven support to obtain them.
Finding your etcd Helm install name, namespace, and root password
To find the Helm install name, run helm list and look for the etcd listing.
$ helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
<etcd-install-name> <k8s-namespace> 1 2024-03-14 14:58:50.304509 -0400 EDT deployed etcd-11.3.6 3.5.21
The etcd root user password is stored in a secret whose name contains the Helm install name.
# Find the name of the secret for the etcd root password
$ kubectl get secrets -l app.kubernetes.io/component=etcd
NAME TYPE DATA AGE
<etcd-secret-name> Opaque 1 100d
# Get your root etcd password value with this command
$ kubectl get secret <etcd-secret-name> -o jsonpath='{.data.etcd-root-password}' | base64 -d
Determine if etcd disaster recovery is enabled
If your etcd cluster was configured with disaster recovery, a recent snapshot may already be available. You can check by looking at the values used to configure the initial etcd installation. Run helm get values <etcd-install-name> and look for the disasterRecovery.enabled value. If it is set to true, a snapshot should be available in the location specified by either disasterRecovery.pvc.existingClaim or startFromSnapshot.existingClaim.
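For a quick check, you can filter the values output for the disaster recovery settings. This is only a convenience sketch; the -n <k8s-namespace> flag is needed only if your current helm context is not already set to that namespace.
# Show the disasterRecovery section of the existing etcd install's values
helm get values <etcd-install-name> -n <k8s-namespace> | grep -A 6 disasterRecovery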
An example section of the output of the helm get values <etcd-install-name> command is shown below, where disaster recovery is enabled.
disasterRecovery:
  cronjob:
    schedule: '*/10 * * * *'
    snapshotHistoryLimit: 1
  enabled: true
  pvc:
    existingClaim: <backup-snapshot-pvc-name>
Take a snapshot of etcd
If snapshotting is enabled, the etcd nodes store backup snapshots in the /snapshots directory by default. First, list the available snapshots to identify their names, then copy the most recent snapshot from the node.
# Find the names of your etcd pods
kubectl get pod -l app.kubernetes.io/component=etcd
# Find etcd snapshots stored on any one of the etcd nodes
kubectl exec <etcd-pod-name> -- ls /snapshots
db-2025-09-16_19-20
db-2025-09-16_19-30
# Copy the most recent snapshot to your local filesystem as etcd-snapshot.db
kubectl cp <etcd-pod-name>:/snapshots/db-2025-09-16_19-30 etcd-snapshot.db --retries=10
If disaster recovery is not enabled on the etcd cluster, you will need to take a snapshot manually on one of the nodes and then copy it from the node.
# Find the names of your etcd pods
kubectl get pod -l app.kubernetes.io/component=etcd
# Run command to save an etcd snapshot to a file on one of the nodes. The root password is required.
kubectl exec <etcd-pod-name> -- etcdctl --user root:<etcd-root-password> snapshot save /tmp/etcd-snapshot.db
# Copy the snapshot to your local filesystem as etcd-snapshot.db
kubectl cp <etcd-pod-name>:/tmp/etcd-snapshot.db etcd-snapshot.db --retries=10
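Optionally, you can check a snapshot's metadata before copying it off the node. This is a sketch only; etcdctl snapshot status is deprecated in etcd 3.5 in favor of etcdutl, but it still reports the snapshot's hash, revision, total keys, and size. Adjust the path if you are checking a disaster-recovery snapshot in /snapshots instead.
# Optional: report the snapshot's hash, revision, total keys, and size
kubectl exec <etcd-pod-name> -- etcdctl snapshot status /tmp/etcd-snapshot.db --write-out=table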
Restore from the etcd snapshot
Prepare the snapshot file on your RWX storage
To start a new etcd cluster that restores from a snapshot file, the snapshot file must be on a read-write-many (RWX) persistent volume. This example uses the NFS deployment that is part of the Deephaven installation. If you are using other RWX storage for your deployment, the process to place the file there would differ.
On the NFS pod, there should be a folder named either /exports/exports/dhsystem or just /exports/dhsystem, depending on the version you have installed. The correct folder contains a db directory.
# Confirm the directory that has the 'db' directory in it
$ kubectl exec deploy/deephaven-nfs-server -- ls -l /exports/exports/dhsystem
drwxr-xr-x 6 root root 4096 Jun 17 14:58 db
Create another folder here named etcd-backup2 and copy the snapshot to it. Note that if etcd disaster recovery was configured, there is probably another folder alongside db named etcd-backup.
# Create directory for the etcd snapshot file
$ kubectl exec deploy/deephaven-nfs-server -- mkdir -p /exports/exports/dhsystem/etcd-backup2
# Copy the snapshot file to the new directory
$ mynfspod=$(kubectl get pods -l role=deephaven-nfs-server -o custom-columns='NAME:.metadata.name' --no-headers)
$ kubectl cp etcd-snapshot.db ${mynfspod}:/exports/exports/dhsystem/etcd-backup2/etcd-snapshot.db --retries=10
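Before creating the volume, you can confirm the file landed where expected. This check simply reuses the mynfspod variable set in the previous step.
# Confirm the snapshot file is present on the NFS export
kubectl exec ${mynfspod} -- ls -l /exports/exports/dhsystem/etcd-backup2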
Create a Persistent Volume and Persistent Volume Claim for the snapshot
A Persistent Volume Claim (PVC) and Persistent Volume (PV) are required to start a new etcd cluster from a snapshot.
The section below is an example of a YAML file for a PVC and PV to use as a template. Read the comments, change the PVC and PV names to your chosen names (e.g., etcd-dr-snapshot-pv and etcd-dr-snapshot-pvc), and change the spec.nfs.server value to your cluster's NFS server name or IP address. If you are using other RWX storage, the YAML will differ.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: <your-etcd-restore-snapshot-pv-name>
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 10Gi
  mountOptions:
    - hard
    - nfsvers=4.1
  nfs:
    # This path is correct even if the NFS server has /exports/exports because /exports is exposed as the
    # NFS root. This would be the correct path relative to that.
    path: /exports/dhsystem/etcd-backup2
    # The server value should be: <deephaven-nfs-service-name>.<k8s-namespace>.svc.cluster.local.
    # However, the DNS name does not always resolve (e.g. in AKS, EKS), so you can use the IP address directly
    # for the server value. Get the NFS service IP with command 'kubectl get svc deephaven-nfs'.
    server: deephaven-nfs.<k8s-namespace>.svc.cluster.local
  persistentVolumeReclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <your-etcd-restore-snapshot-pvc-name>
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ''
  volumeName: <your-etcd-restore-snapshot-pv-name>
  resources:
    requests:
      storage: 10Gi
Save this as a file named etcd-restore-vol.yaml, then apply the file to create the volume and claim.
kubectl apply -f etcd-restore-vol.yaml
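It is worth confirming that the claim bound to the volume before continuing. The name below is the placeholder PVC name from the example above; substitute your chosen name.
# The PVC status should show Bound
kubectl get pvc <your-etcd-restore-snapshot-pvc-name>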
Apply full access permissions to the snapshot directory
The etcd pods writing to the snapshot volume run with user and group 1001:1001, so the permissions on the snapshot directory must allow for that. Run the following command to set full permissions.
# Set full permissions on the snapshot directory
kubectl exec -it svc/deephaven-nfs -- bash -c "chmod 777 /exports/exports/dhsystem/etcd-backup2"
# And confirm with this command
kubectl exec -it svc/deephaven-nfs -- bash -c "ls -ld /exports/exports/dhsystem/etcd-backup2"
Load the etcd Docker images to your repository
Ensure that your repository has the etcd Docker images available. If they need to be added to your repository, see this section of the installation guide. If you do not have the bitnami-etcd-containers-11.3.6.tar.gz file, contact Deephaven support to obtain it.
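If you need to load and push the packaged images manually, the general shape is shown below. This is only a sketch: the image name and tag inside the tarball, and your registry path (shown as <repository-url-and-path>), are assumptions, so follow the installation guide's procedure for your environment.
# Load the packaged images into the local Docker daemon, then retag and push them to your repository
docker load -i bitnami-etcd-containers-11.3.6.tar.gz
docker tag bitnami/etcd:3.5.21-debian-12-r5 <repository-url-and-path>/bitnami/etcd:3.5.21-debian-12-r5
docker push <repository-url-and-path>/bitnami/etcd:3.5.21-debian-12-r5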
Install the new etcd cluster
You are now ready to install the etcd Helm chart. The following command specifies properties to restore the cluster from a snapshot. Substitute a name for your new etcd installation (e.g., dh-etcd2), your repository name (e.g., urco-docker.pkg.dev/deephaven/images), and the password from your original etcd cluster.
Note
The image registry, repository, and tag values used below will be combined to form the full image URL used in the pods. If your repo has the images stored at urco-docker.pkg.dev/deephaven/images/bitnami/etcd:3.5.21-debian-12-r5, then you will use image.registry=urco-docker.pkg.dev/deephaven/images below, and the image.repository and image.tag values will remain as shown.
# Note that the image registry, repository and tag values will be combined to form the full image
# url, so if your repo has urco-docker.pkg.dev/deephaven/images/bitnami/etcd:3.5.21-debian-12-r5
# then you would use image.registry=urco-docker.pkg.dev/deephaven/images
helm install <new-helm-install-name> bitnami-etcd-helm-11.3.6.tgz \
  --set image.registry=<repository-url-and-path> \
  --set image.repository=bitnami/etcd \
  --set image.tag=3.5.21-debian-12-r5 \
  --set global.security.allowInsecureImages=true \
  --set startFromSnapshot.enabled=true \
  --set startFromSnapshot.existingClaim=<your-etcd-restore-snapshot-pvc-name> \
  --set startFromSnapshot.snapshotFilename=etcd-snapshot.db \
  --set disasterRecovery.cronjob.schedule="*/10 * * * *" \
  --set disasterRecovery.cronjob.snapshotHistoryLimit=1 \
  --set disasterRecovery.enabled=true \
  --set replicaCount=3 \
  --set auth.rbac.enabled=true \
  --set resourcesPreset=medium \
  --set auth.rbac.rootPassword=<your-plaintext-root-password> \
  --timeout 10m
Monitor the installation with kubectl get pods -l app.kubernetes.io/instance=<new-helm-install-name> -w. Verify that all pods have a status of Running and the READY column shows 1/1, indicating 1 out of 1 container is in a ready state. This may take a few minutes, depending on how much data is in the restore file.
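As an optional sanity check, you can confirm that the restored cluster is healthy and contains data. This is a sketch; substitute one of the new etcd pod names and the root password.
# Check cluster health and list a few keys from the restored data
kubectl exec <new-etcd-pod-name> -- etcdctl --user root:<etcd-root-password> endpoint health
kubectl exec <new-etcd-pod-name> -- etcdctl --user root:<etcd-root-password> get "" --prefix --keys-only --limit=5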
Updating Deephaven for the new etcd cluster name
The endpoints data in Deephaven must now be updated to point to the new cluster. You will need the non-headless etcd service URL for the new cluster.
# There will be two services for the etcd cluster; we need the non-headless service name
kubectl get svc -l app.kubernetes.io/instance=<new-helm-install-name>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
<new-etcd-service-name> ClusterIP 10.122.9.149 <none> 2379/TCP,2380/TCP 6m31s
<new-etcd-service-name>-headless ClusterIP None <none> 2379/TCP,2380/TCP 6m31s
The format for the endpoint is http://<etcd-service-name>.<k8s-namespace>.svc.cluster.local:2379. Substitute your values to formulate the endpoint for your environment. For example, if your etcd service name is dh-etcd2 and your k8s namespace is dhns, the endpoint would be http://dh-etcd2.dhns.svc.cluster.local:2379.
Because Kubernetes secrets store data in base64-encoded format, encode your endpoint with the following command:
# Base64 encode the etcd non-headless endpoint, substituting your values for the
# new helm install name and k8s namespace
echo -n http://<new-etcd-service-name>.<k8s-namespace>.svc.cluster.local:2379 | base64
aHR0cDov...bDoyMzc5
Use this snippet to update the secrets with your encoded endpoint value.
# Update the secrets with the new encoded endpoint value
my_secrets=$(kubectl get secret -l role=etcd-client-cred -o custom-columns='_:.metadata.name' --no-headers)
my_new_endpoint="aHR0cDov...bDoyMzc5" # Substitute your base64 encoded endpoint value here
for s in $my_secrets; do
  kubectl patch secrets $s --type='json' -p='[{"op": "replace", "path": "/data/endpoints", "value": "'"${my_new_endpoint}"'"}]'
done
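To confirm the patch took effect, decode the endpoints value from any one of the patched secrets; substitute one of the secret names returned by the query above.
# The decoded value should be the new etcd endpoint URL
kubectl get secret <patched-secret-name> -o jsonpath='{.data.endpoints}' | base64 -d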
You are now ready to perform the final Helm upgrade of Deephaven.
When Deephaven was initially installed, you most likely used a YAML file to define Helm chart values for your environment (such as the Deephaven URL, Docker image repository, and NFS server information). That file contains a YAML value for etcd.release with your original etcd installation name. Change that value to your new etcd installation name.
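The relevant fragment of the values file looks like the following; the surrounding keys vary by environment, so this shows only the value to edit.
etcd:
  release: <new-helm-install-name>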
If you do not have that file, retrieve the values by running helm get values <deephaven installation name>. If you do not know the Deephaven Helm installation name, run helm list.
The Deephaven Helm installation package includes a setupTools/scaleAll.sh script. Scale the deployments to 0:
setupTools/scaleAll.sh 0
Then do the Helm upgrade as you normally would, as described here.
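For reference, the upgrade typically takes the shape below. The chart package name, values file name, and any additional flags are assumptions here; use the exact command from your standard Deephaven upgrade procedure.
# A sketch of the final upgrade; substitute your installation name, chart package, and values file
helm upgrade <deephaven-install-name> <deephaven-helm-chart.tgz> -f <your-values-file>.yaml -n <k8s-namespace>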