Defragment an etcd node
Aside from compaction, you can run defragmentation on an etcd node to reclaim internal space that is not used by the database. This process is similar to defragmenting a file system. Fragmentation in etcd occurs as a result of updates to the database, which leave "holes" of unused space in the database file. This unused space is not wasted; over time, it gets used for new data writes.
Given how Deephaven uses etcd, small to medium-sized installations are unlikely to need defragmentation, as the organic growth of the database should use up fragmented space over time. The need to defragment would only arise in large installations or when very large updates to schemas and/or properties are done periodically.
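To judge whether defragmentation is worth the risk, it can help to estimate how much of the database file is actually unused. The sketch below is only illustrative: it assumes you have obtained the total file size and the in-use size in bytes, for example from the dbSize and dbSizeInUse fields that recent etcd versions report in the JSON output of endpoint status. Whether Deephaven's etcdctl.sh wrapper passes an output-format flag through is an assumption to verify in your installation. The function name fragmentation_pct and the sample numbers are hypothetical.

```shell
#!/bin/sh
# Percentage of the database file that is allocated but not in use.
# $1 = total file size in bytes, $2 = bytes in use.
# Integer arithmetic, so the result rounds down.
fragmentation_pct() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

# Illustrative numbers only: a 2 GiB file with 1.5 GiB in use.
fragmentation_pct 2147483648 1610612736   # prints 25
```

A low percentage suggests that normal growth will absorb the unused space, which matches the guidance above for small to medium-sized installations.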
Warning
Fragmentation does not negatively impact etcd performance; it only affects disk space utilization. However, running defragmentation has risks. Ideally, avoid the need for defragmentation by ensuring there is enough spare disk space available for etcd.
Defragment a single node
Nodes cannot perform writes to the database during defragmentation. This restriction applies not only to the node performing the defragmentation but to the entire cluster. In moderately large databases, defragmentation can take several minutes to complete. Avoid running defragmentation during regular system operation; instead, run it only when the Deephaven system is down.
- Pick one machine and stop the running etcd process by stopping its service:
sudo systemctl stop dh-etcd.service
Note that this is risky: it removes a node from the cluster. For example, a 3-node cluster will be left with only 2 nodes after this operation, meaning a single additional node failure would cause the entire cluster to fail. In contrast, a 5-node cluster can tolerate an additional node failure, making it a much safer option.
- The path /var/lib/etcd/dh/cd01a0636 is the configured data-dir specified in the config.yaml file located at /etc/etcd/dh/cd01a0636/config.yaml. This path is also shown in the ps output for etcd when it is running. Ensure that the member/snap subdirectory exists; if it doesn't, create it.
Note
cd01a0636 is an auto-generated ID. This ID will likely be different in your installation.
sudo -u etcd -g etcd mkdir -p /var/lib/etcd/dh/cd01a0636/member/snap
As the etcd user, run the etcdctl defrag command on the database file for the node. Note that the node is not running, so this operation is performed directly on the file. Also, use the native etcd command etcdctl instead of Deephaven's wrapper etcdctl.sh, since the etcd server on the machine is down.
sudo -u etcd -g etcd /bin/etcdctl defrag --data-dir=/var/lib/etcd/dh/cd01a0636
- Everything is now in place to restart the etcd node:
sudo systemctl start dh-etcd.service
Once restarted, the node rejoins the cluster and catches up on any changes it may have missed. Use the etcdctl.sh endpoint status command to check the status of the cluster and verify that the raft term of the restarted node has caught up.
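The risk described in the first step comes down to quorum arithmetic: a cluster needs a majority of its members running to accept writes. As a minimal sketch (the function names are hypothetical, and it assumes every member is a voting member), the following computes how many additional failures a cluster can absorb while one node is stopped for defragmentation:

```shell
#!/bin/sh
# Majority (quorum) required for a cluster of $1 voting members.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Additional node failures the cluster survives while one node
# is deliberately stopped for defragmentation.
spare_failures_during_defrag() {
    echo $(( $1 - 1 - $(quorum "$1") ))
}

for n in 3 5; do
    echo "$n nodes: quorum $(quorum "$n"), spare failures during defrag $(spare_failures_during_defrag "$n")"
done
# prints:
# 3 nodes: quorum 2, spare failures during defrag 0
# 5 nodes: quorum 3, spare failures during defrag 1
```

A result of 0 means that any further node failure during the procedure costs the cluster its quorum, which is the situation for a 3-node cluster.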
Defragment all nodes
Follow the instructions to defragment a single node, one node at a time. After each node is defragmented, check the cluster status before moving on to the next node:
etcdctl.sh endpoint status
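When working through several nodes, it can be useful to write out the full command sequence per node before starting. The sketch below only prints the commands from this section for each host; it executes nothing. The host names etcd-1, etcd-2, and etcd-3 and the function name print_defrag_plan are hypothetical, and the data directory uses the example ID from above, which will differ in your installation.

```shell
#!/bin/sh
# Hypothetical value: replace with the data-dir from your own config.yaml.
DATA_DIR="/var/lib/etcd/dh/cd01a0636"

# Print the per-node command sequence for each host given as an argument.
# Nothing is executed; review the output and run each step by hand,
# finishing one node completely before starting the next.
print_defrag_plan() {
    for host in "$@"; do
        echo "# --- $host ---"
        echo "sudo systemctl stop dh-etcd.service"
        echo "sudo -u etcd -g etcd mkdir -p $DATA_DIR/member/snap"
        echo "sudo -u etcd -g etcd /bin/etcdctl defrag --data-dir=$DATA_DIR"
        echo "sudo systemctl start dh-etcd.service"
        echo "etcdctl.sh endpoint status"
    done
}

# Hypothetical host names, for illustration only.
print_defrag_plan etcd-1 etcd-2 etcd-3
```

Printing rather than executing is a deliberate choice here: each node should be confirmed as caught up before the next one is stopped, and that check is best made by a person reading the status output.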