Defragment an etcd node

Aside from compaction, you can run defragmentation on an etcd node to reclaim internal space that is not used by the database. This process is similar to defragmenting a file system. Fragmentation in etcd occurs as a result of updates to the database, which leave "holes" of unused space in the database file. This unused space is not wasted; over time, it is reused for new data writes.
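
To gauge how much space a defragmentation would reclaim, compare the database's total size with the portion actually in use. The following is a minimal sketch, assuming the cluster is up, that Deephaven's etcdctl.sh wrapper passes extra flags through to etcdctl, and that your etcd version (3.4 or later) reports the dbSizeInUse field; the gap between dbSize and dbSizeInUse approximates the reclaimable space.

    # Print per-endpoint status as JSON; compare the "dbSize" and "dbSizeInUse" fields.
    etcdctl.sh endpoint status --write-out=json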

Given how Deephaven uses etcd, small to medium-sized installations are unlikely to need defragmentation, as the organic growth of the database should use up fragmented space over time. The need to defragment would only arise in large installations or when very large updates to schemas and/or properties are done periodically.

Warning

Fragmentation does not negatively impact etcd performance; it only affects disk space utilization. However, running defragmentation has risks. Ideally, avoid the need for defragmentation by ensuring there is enough spare disk space available for etcd.
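
A simple precaution is to watch the free space on the filesystem that holds the etcd data directory. A minimal check, assuming the data directory lives under /var/lib/etcd/dh as in the examples later on this page:

    # Free space on the filesystem backing the etcd data directory.
    df -h /var/lib/etcd/dh

    # Current on-disk size of the etcd data directory.
    du -sh /var/lib/etcd/dh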

Defragment a single node

Nodes cannot perform writes to the database during defragmentation. This restriction applies not only to the node performing the defragmentation but to the entire cluster. In moderately large databases, defragmentation can take several minutes to complete. Avoid running defragmentation during regular system operation; instead, run it only when the Deephaven system is down.

  1. Pick one machine and stop the running etcd process by stopping its service:

    sudo systemctl stop dh-etcd.service
    

    Note that this is risky: it removes a node from the cluster. For example, a 3-node cluster is left with only 2 nodes after this operation, so a single additional node failure would cost the cluster its quorum and make it unavailable. In contrast, a 5-node cluster can tolerate an additional node failure, making it a much safer option.

  2. The path /var/lib/etcd/dh/cd01a0636 is the data-dir configured in the config.yaml file located at /etc/etcd/dh/cd01a0636/config.yaml. This path is also shown in the ps output for etcd when it is running. Ensure that the member/snap subdirectory exists; if it doesn't, create it.

    Note

    cd01a0636 is an auto-generated ID. This ID will likely be different in your installation.

    sudo -u etcd -g etcd mkdir -p /var/lib/etcd/dh/cd01a0636/member/snap
    

    As the etcd user, run the etcdctl defrag command against the node's database file. Because the node is not running, the operation is performed directly on the file. Use the native etcd command etcdctl rather than Deephaven's etcdctl.sh wrapper, since the etcd server on this machine is down.

    sudo -u etcd -g etcd /bin/etcdctl defrag --data-dir=/var/lib/etcd/dh/cd01a0636
    
  3. Everything is now in place to restart the etcd node.

    sudo systemctl start dh-etcd.service
    

    Once restarted, the node will rejoin the cluster and catch up on any changes it may have missed. Use the etcdctl.sh endpoint status command to check the status of the cluster and verify that the raft term of the restarted node has caught up with the rest of the cluster; see the example below.
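
One way to review the whole cluster after the restart, assuming the etcdctl.sh wrapper passes extra flags through to etcdctl: the --cluster flag queries every member, and the table output includes RAFT TERM and RAFT INDEX columns that you can compare across nodes.

    # Query the status of every cluster member and print it as a table.
    etcdctl.sh endpoint status --cluster --write-out=table
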

Defragment all nodes

Follow the instructions above for defragmenting a single node, one node at a time. After each node is defragmented, check the cluster health using:

    etcdctl.sh endpoint status
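
Before moving on to the next node, you may also want to confirm that every member reports healthy. A minimal sketch, again assuming the etcdctl.sh wrapper passes flags through to etcdctl:

    # Ask every cluster member for a health check before defragmenting the next node.
    etcdctl.sh endpoint health --cluster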