etcd runbook

etcd is a distributed key-value store that serves as the foundational data store for all critical Deephaven cluster configuration and state. It provides strong consistency guarantees through the Raft consensus algorithm and is essential for cluster operation.

Impact of etcd failure

Level	Impact
Sev 1 - Critical	Schema, Persistent Queries, property files, routing configuration, and optionally ACLs are stored in etcd. etcd is used as a shared store for Authentication and Dispatcher runtime processing. Without etcd, the Deephaven system cannot function.

etcd dependencies

etcd has no dependencies on other Deephaven services. It is a standalone third-party service that must be running before any Deephaven processes can start.

Network requirements:

Client port (default: 2379) — Used by Configuration Server and other Deephaven processes.
Peer port (default: 2380) — Used for etcd cluster member communication.
All etcd nodes must be able to communicate with each other on the peer port.
The Configuration Server must be able to reach all etcd nodes on the client port.
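The reachability requirements above can be spot-checked from any host before digging into etcd itself. The following is a minimal sketch, assuming bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available; the function names check_port and report_ports are introduced here for illustration only.

```shell
# Sketch: report whether a TCP port accepts connections.
# Assumes bash and coreutils `timeout` are installed.
check_port() {
  local host=$1 port=$2 wait_s=${3:-3}
  # Opening /dev/tcp/HOST/PORT succeeds only if the connection is accepted.
  timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Print one line per host for the given port.
report_ports() {
  local port=$1 h
  shift
  for h in "$@"; do
    if check_port "$h" "$port"; then
      echo "$h:$port reachable"
    else
      echo "$h:$port UNREACHABLE"
    fi
  done
}
```

For example, `report_ports 2379 node1 node2 node3` run from the Configuration Server host checks the client port, and `report_ports 2380 node2 node3` run from node1 checks the peer port. The host names are placeholders.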

etcd deployment architecture

Quorum requirements: etcd requires a strict majority (>50%) of cluster members to be available for the cluster to function.

Recommended deployment patterns:

Single node (testing only): 1 instance — No redundancy, any failure causes total outage.
Standard production: 3 instances — Tolerates 1 node failure.
High availability: 5 instances — Tolerates 2 node failures.

Important: Always use an odd number of nodes. An even number provides no additional fault tolerance (e.g., a 4-node cluster still tolerates only 1 failure, the same as a 3-node cluster, because its quorum rises from 2 to 3).
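The majority rule can be worked through with simple integer arithmetic. This sketch only illustrates the calculation; the function names are introduced here and are not part of any etcd or Deephaven tooling.

```shell
# Quorum is a strict majority of members: floor(n / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Tolerated failures are the members left over once quorum is met.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 4 5; do
  echo "$n members: quorum $(quorum "$n"), tolerates $(tolerated_failures "$n")"
done
```

The loop shows that 3 and 4 members both tolerate a single failure, which is why the fourth node adds cost without adding resilience.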

etcd client configuration

Deephaven processes access etcd through etcd client configuration directories, one per role, located under /etc/sysconfig/deephaven/etcd/client. Each directory contains:

endpoints — List of etcd node addresses and ports
user — etcd RBAC username for this role
password — etcd RBAC password
cacert — Certificate authority for TLS verification
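When a process cannot authenticate to etcd, an incomplete client configuration directory is one thing to rule out. The sketch below assumes each of the four entries listed above is a regular file with exactly that name; check_client_dir is a hypothetical helper, not a Deephaven tool.

```shell
# Sketch: confirm a role's client configuration directory has the expected entries.
# Assumption: endpoints, user, password, and cacert are plain files in the directory.
check_client_dir() {
  local dir=$1 missing=0 f
  for f in endpoints user password cacert; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}
```

For example, `sudo check_client_dir /etc/sysconfig/deephaven/etcd/client/root` would need to run in a shell where the function is defined and with enough privilege to read the directory.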

Available etcd client roles:

root — Full administrative access (used by default in etcdctl.sh)
schema-rw — Read/write access to table schemas
schema-ro — Read-only access to schemas
pq-rw — Read/write access to Persistent Query definitions
Additional roles for ACLs, routing, properties, etc.
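Because each role has its own directory under the client configuration path, the full set of roles on a given host can be read from the filesystem. This is a minimal sketch under that assumption; list_roles is a name introduced here.

```shell
# Sketch: print the name of every role directory under the client configuration path.
list_roles() {
  local base=${1:-/etc/sysconfig/deephaven/etcd/client} d
  for d in "$base"/*/; do
    # Skip the unexpanded pattern when there are no subdirectories.
    [ -d "$d" ] || continue
    basename "$d"
  done
}
```

Calling `list_roles` with no argument reads the default path. Reading that path may require elevated privileges.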

The Configuration Server uses these client configuration files to communicate with etcd on behalf of other Deephaven services.

Checking etcd status

Check that the process is running with systemctl:

sudo systemctl status dh-etcd

Check endpoint status (cluster health):

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table

Expected output shows all cluster members, their IDs, versions, database sizes, and leader status:

+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://node1:2379     | a1b2c3d4e5f6g7h8 |   3.5.x |  xx MB  | true      | false      |         x |      xxxxx |              xxxxx |        |
| https://node2:2379     | b2c3d4e5f6g7h8i9 |   3.5.x |  xx MB  | false     | false      |         x |      xxxxx |              xxxxx |        |
| https://node3:2379     | c3d4e5f6g7h8i9j0 |   3.5.x |  xx MB  | false     | false      |         x |      xxxxx |              xxxxx |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
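A healthy cluster reports exactly one leader in the table above. As a rough sketch, the table can be piped through a small awk filter that counts the rows marked as leader. The filter locates the IS LEADER column by its header text instead of by position, since the column layout may differ between etcd versions. count_leaders is a hypothetical helper.

```shell
# Sketch: count rows whose IS LEADER column is "true" in table output read from stdin.
count_leaders() {
  awk -F'|' '
    /IS LEADER/ {
      # Header row: find which field holds the IS LEADER column.
      for (i = 1; i <= NF; i++) {
        h = $i
        gsub(/^ +| +$/, "", h)
        if (h == "IS LEADER") col = i
      }
      next
    }
    col && NF > col {
      # Data row: border lines contain no "|" and are skipped by the NF test.
      v = $col
      gsub(/^ +| +$/, "", v)
      if (v == "true") n++
    }
    END { print n + 0 }
  '
}
```

For example, `sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table | count_leaders` should print 1. A result of 0 suggests the cluster has no leader, for instance during an election or after quorum loss.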

Check cluster member health:

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint health --write-out table

Test connectivity and authentication:

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh role list

Viewing etcd logs

View all logs:

sudo journalctl -xu dh-etcd

Follow logs in real-time:

sudo journalctl -xefu dh-etcd

View logs from the last hour:

sudo journalctl -xu dh-etcd --since "1 hour ago"

Restart procedure

Restart the etcd service:

sudo systemctl restart dh-etcd

Important: When restarting multiple etcd nodes, always maintain quorum:

For a 3-node cluster, never restart more than 1 node at a time.
For a 5-node cluster, never have more than 2 nodes down at the same time. Restarting one node at a time is safer, because it leaves room for an unplanned failure.
Wait for the restarted node to rejoin and sync before restarting another.

Verify the node has rejoined after restart:

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table
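For a rolling restart, the wait between nodes can be scripted. The sketch below polls a health command until it succeeds or the attempts run out. It assumes that etcdctl.sh passes through etcdctl's exit status, which is non-zero when an endpoint is unhealthy. HEALTH_CMD and wait_healthy are names introduced here for illustration.

```shell
# Sketch: poll a health command until it succeeds or the attempts run out.
# HEALTH_CMD is a hypothetical variable so the command can be swapped out.
HEALTH_CMD=${HEALTH_CMD:-"sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint health"}

wait_healthy() {
  local attempts=${1:-30} delay=${2:-2} i=1
  while [ "$i" -le "$attempts" ]; do
    # Intentionally unquoted so the command string splits into words.
    if $HEALTH_CMD >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

A possible use on each node in turn: `sudo systemctl restart dh-etcd && wait_healthy 30 2 || echo "node did not become healthy; stop here"`. Move on to the next node only after the function returns success.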

Using etcdctl.sh

The etcdctl.sh script is a thin wrapper around etcdctl that passes in the correct username and password for a given Deephaven etcd role. Each user's credentials are stored in /etc/sysconfig/deephaven/etcd/client/<user>.

By default, the script uses the root user. To change the user, set the DH_ETCD_USER environment variable or specify the directory manually with DH_ETCD_DIR.

Examples:

Get a specific schema with the schema-ro user:

sudo -u irisadmin DH_ETCD_USER=schema-ro /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/schema/DbInternal/tables/AuditEventLog

Or equivalently:

sudo -u irisadmin DH_ETCD_DIR=/etc/sysconfig/deephaven/etcd/client/schema-ro /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/schema/DbInternal/tables/AuditEventLog

List all keys under a prefix:

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/ --keys-only

Monitoring etcd disk usage

Show current disk usage per node:

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out=table

Warning signs:

Database size approaching quota (default: 2GB)
Database size growing rapidly without expected configuration changes
Any node showing significantly different size than others
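The first warning sign can be turned into a simple threshold check. This sketch takes the default quota to be 2 * 1024^3 bytes, which is an assumption about the units behind "2GB". The 80% threshold, the WARN_PCT variable, and the function names are hypothetical choices made for illustration.

```shell
# Assumed default backend quota: 2 GiB expressed in bytes.
DEFAULT_QUOTA_BYTES=$((2 * 1024 * 1024 * 1024))

# Print the database size as a whole-number percentage of the quota.
quota_pct() {
  local size=$1 quota=${2:-$DEFAULT_QUOTA_BYTES}
  echo $(( size * 100 / quota ))
}

# Print a status line; return 1 when usage reaches WARN_PCT (default 80).
quota_warn() {
  local pct
  pct=$(quota_pct "$1" "${2:-$DEFAULT_QUOTA_BYTES}")
  if [ "$pct" -ge "${WARN_PCT:-80}" ]; then
    echo "WARNING: database at ${pct}% of quota"
    return 1
  fi
  echo "OK: database at ${pct}% of quota"
}
```

The size in bytes has to come from somewhere, for example from the endpoint status command with JSON output in place of the table. If the cluster runs with a non-default quota, pass it as the second argument.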

etcd backup and restore

Backup procedure

Create a snapshot backup of etcd:

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

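Verify the snapshot. The command accepts a single file, so substitute the name of the snapshot created above; a glob that matches several snapshots will fail:

sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh snapshot status /backup/etcd-snapshot-<timestamp>.db --write-out=table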

Best practices:

Take daily automated backups during maintenance windows.
Store backups on storage that is separate from the etcd data directory.
Retain backups according to your data retention policy.
Test restore procedures regularly.
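Retention can be enforced with a small pruning step after each backup. The following is a sketch only: it assumes snapshots follow the etcd-snapshot-*.db naming used in the backup command, and the 14-day default is a placeholder, not a recommendation. It relies on the -maxdepth and -delete options available in GNU and BSD find.

```shell
# Sketch: delete snapshot files older than the retention period and print what was removed.
# -mtime +N matches files last modified more than N whole days ago.
prune_snapshots() {
  local dir=$1 days=${2:-14}
  find "$dir" -maxdepth 1 -type f -name 'etcd-snapshot-*.db' -mtime +"$days" -print -delete
}
```

For example, `prune_snapshots /backup 30` keeps roughly the last month of snapshots. Running the find command without -delete first is a safe way to preview what would be removed.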

Restore procedure

Warning

Restoring etcd will revert all cluster configuration to the backup point. This should only be done in disaster recovery scenarios.

Stop all Deephaven processes that depend on etcd (essentially all processes).
Stop the etcd service on all nodes.
Restore from snapshot (consult etcd documentation for multi-node restore).
Start etcd on all nodes.
Verify cluster health before starting Deephaven processes.

Configuration files

systemd service file: /etc/systemd/system/dh-etcd.service

etcd data directory: Typically /var/lib/etcd/dh/ (verify in service file)

etcd configuration: Passed as command-line arguments in the systemd service file

Client configuration directory: /etc/sysconfig/deephaven/etcd/client/
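Since the data directory should be verified in the service file, the lookup can be scripted. This sketch assumes the directory is set with etcd's --data-dir flag on the command line in the unit file. If the installation sets it another way, such as through an environment variable, the function prints nothing. data_dir_from_unit is a name introduced here.

```shell
# Sketch: extract the value of --data-dir from a systemd unit file.
# Handles both "--data-dir=/path" and "--data-dir /path"; prints the first match.
data_dir_from_unit() {
  local unit=${1:-/etc/systemd/system/dh-etcd.service}
  sed -n 's/.*--data-dir[= ]\([^ \\]*\).*/\1/p' "$unit" | head -n 1
}
```

Running `data_dir_from_unit` with no argument reads the default service file path listed above.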