Kubernetes etcd backup and recovery
Etcd backup and restore procedures in a Kubernetes installation differ slightly from the procedures in a default installation. Note that this page refers to the etcd installation used solely by Deephaven for its own purposes; it is completely separate from the etcd instance used by the Kubernetes system itself.
Prerequisites
The Helm installation name, the Kubernetes namespace, and the root etcd password are needed to back up and restore the etcd cluster.
To find the Helm install name, run helm list and look for the etcd listing. For the example commands on this page, the etcd Helm install name is stor, the namespace is deephaven, and the root password is plaintext-root-password. Substitute your system's values when operating in your cluster.
$ helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
stor deephaven 1 2024-03-14 14:58:50.304509 -0400 EDT deployed etcd-9.14.2 3.5.12
The etcd root user password is stored in a secret.
# Find the name of the secret for the etcd root password
$ kubectl get secrets -l app.kubernetes.io/component=etcd
NAME TYPE DATA AGE
stor-etcd Opaque 1 100d
# Get the root password value
$ kubectl get secret stor-etcd -o jsonpath='{.data.etcd-root-password}' | base64 -d
plaintext-root-password
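Optionally, you can capture the password in a shell variable so you do not have to paste it into later commands. This is a convenience only and assumes a Bash-like shell; the variable name is arbitrary.
# Optional: capture the root password for reuse in later commands
$ ETCD_ROOT_PASSWORD=$(kubectl get secret stor-etcd -o jsonpath='{.data.etcd-root-password}' | base64 -d)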
Taking a snapshot of etcd
Use one of the nodes in your etcd cluster to take a snapshot and store it temporarily.
# List all pods in the etcd cluster
$ kubectl get pods -l app.kubernetes.io/instance=stor
NAME READY STATUS RESTARTS AGE
stor-etcd-0 1/1 Running 0 100d3h
stor-etcd-1 1/1 Running 0 100d8h
stor-etcd-2 1/1 Running 0 100d9h
# Run command to save an etcd snapshot to a file on one of the nodes. The root password is required.
$ kubectl exec stor-etcd-0 -- etcdctl --user root:plaintext-root-password snapshot save /tmp/etcd-snapshot.db
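Optionally, you can check the snapshot's integrity before copying it off the node. This check operates on the file itself and does not require the root password; newer etcd versions may print a notice suggesting etcdutl for this subcommand.
# Optional: report the snapshot's hash, revision, and size
$ kubectl exec stor-etcd-0 -- etcdctl snapshot status /tmp/etcd-snapshot.db --write-out=table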
Copy the snapshot from the etcd node to your local filesystem. After the snapshot file is copied, it is no longer needed on the etcd node and can be deleted.
# Copy the snapshot from the etcd node to your local filesystem
$ kubectl cp stor-etcd-0:/tmp/etcd-snapshot.db etcd-snapshot.db
# Delete the snapshot once it is safely copied
$ kubectl exec stor-etcd-0 -- rm /tmp/etcd-snapshot.db
Restoring from a snapshot of etcd
To start a new etcd cluster that restores from a snapshot file, the snapshot file must be on a read-write-many (RWX) persistent volume. This example uses the NFS deployment that is part of the Deephaven installation, though it can be any other RWX storage available to you.
On the NFS pod, there should be a folder named either /exports/exports/dhsystem or just /exports/dhsystem, depending on the version you have installed. The correct folder contains a db directory.
# Confirm the directory that has the 'db' directory in it
$ kubectl exec deploy/deephaven-nfs-server -- ls -l /exports/exports/dhsystem
drwxr-xr-x 6 root root 4096 Jun 17 14:58 db
Create another folder alongside db and copy the snapshot to it.
# Create directory for the etcd snapshot file
$ kubectl exec deploy/deephaven-nfs-server -- mkdir -p /exports/exports/dhsystem/etcd-snap/restore
# Copy the snapshot file to the new directory
$ mynfspod=$(kubectl get pods -l role=deephaven-nfs-server -o custom-columns='NAME:.metadata.name' --no-headers)
$ kubectl cp etcd-snapshot.db ${mynfspod}:/exports/exports/dhsystem/etcd-snap/restore/etcd-snapshot.db
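Before creating the volume, you can list the restore directory to confirm the snapshot file is in place (adjust the path if your NFS export root differs, as noted above).
# Confirm the snapshot file is present on the NFS server
$ kubectl exec deploy/deephaven-nfs-server -- ls -l /exports/exports/dhsystem/etcd-snap/restore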
A persistent volume claim (PVC) and persistent volume (PV) are required to start a new etcd cluster from a snapshot.
Below is an example YAML file for a PV and PVC to use as a template. See the # comments, and change the spec.nfs.server value to the NFS server name or IP address for your cluster. Remove the comments from your file and save it as etcd-restore-vol.yaml, then create the volume and claim by running kubectl apply -f etcd-restore-vol.yaml.
PV and PVC YAML example for etcd restore
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-etcd-restore-snapshot-pv
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 10Gi
  mountOptions:
    - hard
    - nfsvers=4.1
  nfs:
    # This path is correct even if the NFS server has /exports/exports, because /exports is exposed as the
    # NFS root; this is the correct path relative to that root.
    path: /exports/dhsystem/etcd-snap/restore
    # The server value should be: <deephaven-nfs-service-name>.<your-k8s-namespace>.svc.cluster.local.
    # However, the DNS name does not always resolve (AKS, EKS), so you can use the IP address directly
    # for the server value. Get the NFS service IP with 'kubectl get svc deephaven-nfs'.
    server: deephaven-nfs.deephaven.svc.cluster.local
  persistentVolumeReclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-etcd-restore-snapshot-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ''
  volumeName: my-etcd-restore-snapshot-pv
  resources:
    requests:
      storage: 10Gi
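After running kubectl apply -f etcd-restore-vol.yaml, it is worth confirming that the claim binds to the volume before installing etcd; if the PVC stays Pending, re-check the spec.nfs.server and path values.
# Confirm the claim is bound (STATUS should be Bound)
$ kubectl get pvc my-etcd-restore-snapshot-pvc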
You are now ready to install the etcd Helm chart. This is similar to when Deephaven is initially installed, with a few more properties specified so that etcd is restored from the snapshot. Substitute the name of your new etcd installation for the stor2 name used in this example. Note that you need the etcd root password, as described above.
$ helm install stor2 bitnami/etcd -f dev-values/etcdValues.yaml \
--set startFromSnapshot.enabled=true \
--set startFromSnapshot.existingClaim=my-etcd-restore-snapshot-pvc \
--set startFromSnapshot.snapshotFilename=etcd-snapshot.db \
--set replicaCount=3 \
--set auth.rbac.enabled=true \
--set auth.rbac.rootPassword=plaintext-root-password
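Once the chart is installed, you can confirm the restored cluster is up and healthy. The pod name stor2-etcd-0 follows the example naming on this page; substitute your own installation name and root password.
# Confirm the new etcd pods are running
$ kubectl get pods -l app.kubernetes.io/instance=stor2
# Check cluster health on one of the new nodes (requires the root password)
$ kubectl exec stor2-etcd-0 -- etcdctl --user root:plaintext-root-password endpoint health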
Updating Deephaven for the new etcd cluster name
The endpoints data in Deephaven needs to be changed to point to the new cluster. For the new configuration, we need the non-headless etcd service URL for the new cluster, obtained by running kubectl get svc -l app.kubernetes.io/instance=stor2, using the name of your installation in place of stor2.
# Get the non-headless service name for the new etcd instance
$ kubectl get svc -l app.kubernetes.io/instance=stor2
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
stor2-etcd ClusterIP 10.122.9.149 <none> 2379/TCP,2380/TCP 26d
stor2-etcd-headless ClusterIP None <none> 2379/TCP,2380/TCP 26d
The format for the endpoint is http://<etcd-k8s-service-name>.<namespace>.svc.cluster.local:2379. For this example, that is http://stor2-etcd.deephaven.svc.cluster.local:2379. Substitute the values for your environment.
The next step is to update some of our secrets with this information. Data in Kubernetes secrets is base64-encoded, so first, encode the endpoint.
# Base64 encode the etcd non-headless endpoint
$ echo -n http://stor2-etcd.deephaven.svc.cluster.local:2379 | base64
aHR0cDovL3N0b3IyLWV0Y2QuZGVlcGhhdmVuLnN2Yy5jbHVzdGVyLmxvY2FsOjIzNzk=
Use this snippet to update the secrets with your encoded endpoint value.
# Update the secrets with the new encoded endpoint value
$ secret_names=$(kubectl get secret -l role=etcd-client-cred -o custom-columns='NAME:.metadata.name' --no-headers)
$ for sn in $secret_names; do
> kubectl patch secrets $sn --type='json' -p='[{"op" : "replace", "path" : "/data/endpoints", "value" : "aHR0cDovL3N0b3IyLWV0Y2QuZGVlcGhhdmVuLnN2Yy5jbHVzdGVyLmxvY2FsOjIzNzk="}]'
> done
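To confirm the patch took effect, you can decode the endpoints value from any one of the patched secrets (use any name printed by echo $secret_names in place of <secret-name>).
# Spot-check a patched secret; it should decode to the new endpoint
$ kubectl get secret <secret-name> -o jsonpath='{.data.endpoints}' | base64 -d
http://stor2-etcd.deephaven.svc.cluster.local:2379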
We are now ready to do the final Helm upgrade of Deephaven. You most likely used a YAML file to define Helm chart token values for your environment when Deephaven was initially installed. These include things such as the URL of the Deephaven installation, the Docker image repository, and the Deephaven NFS server information. That file contains a YAML value for etcd.release with your original etcd installation name. Change that to your new etcd installation name. If you do not have that file, you can retrieve the values you provided at install time by running helm get values <deephaven installation name>. If you do not know the Deephaven Helm installation name, run helm list.
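As an illustrative sketch only (the surrounding structure of your values file will differ), the relevant entry might look like this:
# Illustrative fragment of your Deephaven Helm values file; only the etcd.release value needs to change
etcd:
  release: stor2 # previously 'stor'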
The Deephaven Helm installation package includes a setupTools/scaleAll.sh script. Scale the deployments to 0:
setupTools/scaleAll.sh 0
Then perform the Helm upgrade as you normally would, as described in the Deephaven installation documentation.
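As a hedged sketch of the final sequence, the steps look like the following; the chart reference and values file name are placeholders for your installation's actual values.
# Scale the Deephaven deployments down, then upgrade with your updated values file
$ setupTools/scaleAll.sh 0
$ helm upgrade <deephaven installation name> <deephaven chart reference> -f <your-values-file>.yaml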