etcd runbook
etcd is a distributed key-value store that serves as the foundational data store for all critical Deephaven cluster configuration and state. It provides strong consistency guarantees through the Raft consensus algorithm and is essential for cluster operation.
Impact of etcd failure
| Level | Impact |
|---|---|
| Sev 1 - Critical | Schema, Persistent Queries, property files, routing configuration, and optionally ACLs are stored in etcd. etcd is used as a shared store for Authentication and Dispatcher runtime processing. Without etcd, the Deephaven system cannot function. |
etcd dependencies
etcd has no dependencies on other Deephaven services. It is a standalone third-party service that must be running before any Deephaven processes can start.
Network requirements:
- Client port (default: 2379) — Used by Configuration Server and other Deephaven processes.
- Peer port (default: 2380) — Used for etcd cluster member communication.
- All etcd nodes must be able to communicate with each other on the peer port.
- The Configuration Server must be able to reach all etcd nodes on the client port.
etcd deployment architecture
Quorum requirements: etcd requires a strict majority (>50%) of cluster members to be available for the cluster to function.
Recommended deployment patterns:
- Single node (testing only): 1 instance — No redundancy, any failure causes total outage.
- Standard production: 3 instances — Tolerates 1 node failure.
- High availability: 5 instances — Tolerates 2 node failures.
Important: Always use an odd number of nodes. Adding a node to reach an even count provides no additional fault tolerance (e.g., a 4-node cluster still tolerates only 1 failure, the same as a 3-node cluster).
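The quorum arithmetic behind these recommendations can be checked with a short sketch. The function names `quorum_size` and `fault_tolerance` are illustrative helpers, not part of etcd or Deephaven.

```shell
# Quorum is a strict majority of members: floor(n / 2) + 1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Fault tolerance is how many members can fail while quorum still holds.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the figures for the cluster sizes discussed in this section.
for n in 1 3 4 5; do
    echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

For 3 and 4 members the tolerated failure count is 1 in both cases, which is why the fourth node adds cost without adding resilience.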
etcd client configuration
Deephaven processes access etcd through "etcd client configuration files", stored in per-role directories under /etc/sysconfig/deephaven/etcd/client. Each directory contains:
- endpoints: List of etcd node addresses and ports
- user: etcd RBAC username for this role
- password: etcd RBAC password
- cacert: Certificate authority for TLS verification
Available etcd client roles:
- root: Full administrative access (used by default in etcdctl.sh)
- schema-rw: Read/write access to table schemas
- schema-ro: Read-only access to schemas
- pq-rw: Read/write access to Persistent Query definitions
- Additional roles for ACLs, routing, properties, etc.
The Configuration Server uses these client configuration files to communicate with etcd on behalf of other Deephaven services.
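When a process cannot authenticate, a quick first step is to confirm that a role's client directory is complete. The sketch below assumes that each of the four items described for these directories (endpoints, user, password, cacert) is a separate entry inside the role directory; `check_client_dir` is a hypothetical helper name.

```shell
# Report any expected entry that is missing from one role's client directory.
# Assumption: endpoints, user, password and cacert are separate entries
# inside the directory passed as $1.
check_client_dir() {
    dir="$1"
    missing=0
    for name in endpoints user password cacert; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $dir/$name"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_client_dir /etc/sysconfig/deephaven/etcd/client/root` prints nothing and returns 0 when all four entries are present.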
Checking etcd status
Check that the process is running with systemctl:
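A minimal sketch of that check follows. The unit name dh-etcd is an assumption inferred from the service file path /etc/systemd/system/dh-etcd.service; the `SYSTEMCTL` variable and the function name are illustrative.

```shell
# Assumption: the systemd unit is named dh-etcd (inferred from the
# service file path); adjust if your installation differs.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

etcd_service_status() {
    # --no-pager prints the status directly instead of opening a pager.
    "$SYSTEMCTL" status dh-etcd --no-pager
}
```

Run `etcd_service_status` on each etcd node. A healthy unit reports `active (running)`.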
Check endpoint status (cluster health):
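One way to run this check through the wrapper is sketched below. The install path of etcdctl.sh is an assumption and can be overridden through the illustrative `ETCDCTL_SH` variable; reading the root credentials may also require elevated privileges on your system.

```shell
# Assumption: etcdctl.sh is installed at this path; override if not.
ETCDCTL_SH="${ETCDCTL_SH:-/usr/illumon/latest/bin/etcdctl.sh}"

etcd_endpoint_status() {
    # --cluster queries every member of the cluster;
    # -w table prints one row per endpoint.
    "$ETCDCTL_SH" endpoint status --cluster -w table
}
```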
The expected output lists all cluster members with their IDs, versions, database sizes, and leader status.
Check cluster member health:
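A sketch of the health check, again assuming the etcdctl.sh install path shown (override with the illustrative `ETCDCTL_SH` variable):

```shell
# Assumption: etcdctl.sh is installed at this path; override if not.
ETCDCTL_SH="${ETCDCTL_SH:-/usr/illumon/latest/bin/etcdctl.sh}"

etcd_endpoint_health() {
    # Probes each member and reports whether it is healthy.
    "$ETCDCTL_SH" endpoint health --cluster
}
```

A member that fails this probe while the others pass is the first candidate to inspect in the logs.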
Test connectivity and authentication:
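A small authenticated read exercises both the network path and the credentials. This sketch assumes the etcdctl.sh path shown and relies on the wrapper's default root role; the function name is illustrative.

```shell
# Assumption: etcdctl.sh is installed at this path; override if not.
ETCDCTL_SH="${ETCDCTL_SH:-/usr/illumon/latest/bin/etcdctl.sh}"

etcd_auth_check() {
    # An empty key with --prefix matches every key; --keys-only and
    # --limit keep the output short. A successful read shows that the
    # endpoint is reachable and the credentials were accepted.
    "$ETCDCTL_SH" get "" --prefix --keys-only --limit=5
}
```

An authentication or permission error here points at the client configuration directory rather than at the etcd service itself.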
Viewing etcd logs
View all logs:
Follow logs in real-time:
View logs from the last hour:
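The three log views above map onto journalctl as sketched here, assuming etcd runs under the dh-etcd systemd unit and logs to the journal. The function names and the `JOURNALCTL` variable are illustrative.

```shell
# Assumption: etcd runs as the dh-etcd unit and logs to the systemd journal.
JOURNALCTL="${JOURNALCTL:-journalctl}"

# All retained log entries for the unit.
etcd_logs_all() {
    "$JOURNALCTL" -u dh-etcd
}

# Stream new entries as they arrive (stop with Ctrl-C).
etcd_logs_follow() {
    "$JOURNALCTL" -u dh-etcd -f
}

# Only entries from the last hour.
etcd_logs_last_hour() {
    "$JOURNALCTL" -u dh-etcd --since "1 hour ago"
}
```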
Restart procedure
Restart the etcd service:
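A minimal sketch of the restart, assuming the unit name dh-etcd from the service file path. Restarting a system unit normally requires root privileges, so the function is expected to be run as root or through sudo.

```shell
# Assumption: the systemd unit is named dh-etcd.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

restart_etcd() {
    # Restarts etcd on the local node only.
    "$SYSTEMCTL" restart dh-etcd
}
```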
Important: When restarting multiple etcd nodes, always maintain quorum:
- For a 3-node cluster, never restart more than 1 node at a time.
- For a 5-node cluster, never restart more than 2 nodes at a time.
- Wait for the restarted node to rejoin and sync before restarting another.
Verify the node has rejoined after restart:
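One way to confirm membership is to list the cluster members, sketched here under the assumption that etcdctl.sh is installed at the path shown (override with the illustrative `ETCDCTL_SH` variable):

```shell
# Assumption: etcdctl.sh is installed at this path; override if not.
ETCDCTL_SH="${ETCDCTL_SH:-/usr/illumon/latest/bin/etcdctl.sh}"

etcd_member_list() {
    # One row per member, including its status and peer/client addresses.
    "$ETCDCTL_SH" member list -w table
}
```

The restarted node should appear with a started status before the next node is restarted; an endpoint health check against the whole cluster is a useful second confirmation.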
Using etcdctl.sh
The etcdctl.sh script is a thin wrapper around etcdctl that passes in the correct username and password for a given Deephaven etcd role. Each user's credentials are stored in /etc/sysconfig/deephaven/etcd/client/<user>.
By default, the script uses the root user. To change the user, set the DH_ETCD_USER environment variable or specify the directory manually with DH_ETCD_DIR.
Examples:
Get a specific schema with the schema-ro user:
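A sketch of that read is shown below. The etcdctl.sh path is an assumption, and the etcd key under which a schema is stored is not documented here, so the hypothetical helper takes the key as its argument.

```shell
# Assumption: etcdctl.sh is installed at this path; override if not.
ETCDCTL_SH="${ETCDCTL_SH:-/usr/illumon/latest/bin/etcdctl.sh}"

# $1 = etcd key of the schema to read (supplied by the caller).
get_as_schema_ro() {
    # DH_ETCD_USER selects the role whose credentials the wrapper uses.
    DH_ETCD_USER=schema-ro "$ETCDCTL_SH" get "$1"
}
```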
Or equivalently:
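The same read can select the role by its client directory instead of by name. As before, the etcdctl.sh path is an assumption and the key is supplied by the caller.

```shell
# Assumption: etcdctl.sh is installed at this path; override if not.
ETCDCTL_SH="${ETCDCTL_SH:-/usr/illumon/latest/bin/etcdctl.sh}"

# $1 = etcd key of the schema to read (supplied by the caller).
get_with_schema_ro_dir() {
    # DH_ETCD_DIR points the wrapper directly at the role's credentials.
    DH_ETCD_DIR=/etc/sysconfig/deephaven/etcd/client/schema-ro "$ETCDCTL_SH" get "$1"
}
```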
List all keys under a prefix:
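A sketch of a prefix listing with the default root role; the prefix is passed in by the caller because the key layout is not described here, and the etcdctl.sh path is an assumption.

```shell
# Assumption: etcdctl.sh is installed at this path; override if not.
ETCDCTL_SH="${ETCDCTL_SH:-/usr/illumon/latest/bin/etcdctl.sh}"

# $1 = key prefix to list.
list_keys_under() {
    # --prefix matches every key that starts with $1;
    # --keys-only suppresses the values.
    "$ETCDCTL_SH" get "$1" --prefix --keys-only
}
```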
Monitoring etcd disk usage
Show current disk usage per node:
Warning signs:
- Database size approaching quota (default: 2GB)
- Database size growing rapidly without expected configuration changes
- Any node showing significantly different size than others
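To judge how close a database is to its quota, the per-node size reported by the endpoint status check can be expressed as a percentage. This is a sketch: `db_usage_percent` is a hypothetical helper, and the default quota is taken as 2 GiB (2 * 1024^3 bytes), which is etcd's default backend quota.

```shell
# etcd's default backend quota: 2 GiB expressed in bytes.
DEFAULT_QUOTA_BYTES=2147483648

# Print the database size as an integer percentage of the quota.
# $1 = database size in bytes, $2 = quota in bytes (optional).
db_usage_percent() {
    quota="${2:-$DEFAULT_QUOTA_BYTES}"
    echo $(( $1 * 100 / quota ))
}
```

For example, `db_usage_percent 1073741824` prints 50. Pass the quota explicitly as the second argument if your service file sets a non-default value.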
etcd backup and restore
Backup procedure
Create a snapshot backup of etcd:
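A sketch of taking the snapshot through the wrapper follows. The etcdctl.sh path is an assumption and the destination file is chosen by the caller. Note that etcdctl expects a snapshot to be requested from a single endpoint, so if the wrapper supplies several endpoints the command may need to be narrowed to one; that behavior is not described here.

```shell
# Assumption: etcdctl.sh is installed at this path; override if not.
ETCDCTL_SH="${ETCDCTL_SH:-/usr/illumon/latest/bin/etcdctl.sh}"

# $1 = destination file, ideally on storage separate from the data directory.
etcd_snapshot_save() {
    "$ETCDCTL_SH" snapshot save "$1"
}
```

A timestamped name such as `etcd_snapshot_save "/path/to/backups/etcd-$(date +%Y%m%d-%H%M%S).db"` keeps successive backups apart; the directory shown is a placeholder.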
Verify the snapshot:
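A sketch of inspecting the snapshot file, assuming the same etcdctl.sh path. The check reads the file locally and reports its hash, revision, key count, and size.

```shell
# Assumption: etcdctl.sh is installed at this path; override if not.
ETCDCTL_SH="${ETCDCTL_SH:-/usr/illumon/latest/bin/etcdctl.sh}"

# $1 = snapshot file produced by the backup step.
etcd_snapshot_status() {
    "$ETCDCTL_SH" snapshot status "$1" -w table
}
```

Newer etcd releases move this subcommand to the separate etcdutl tool, so check which tool matches your installed etcd version.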
Best practices:
- Take daily automated backups during maintenance windows.
- Store backups on storage separate from the etcd data directory.
- Retain backups according to your data retention policy.
- Test restore procedures regularly.
Restore procedure
Warning: Restoring etcd will revert all cluster configuration to the backup point. This should only be done in disaster recovery scenarios.
1. Stop all Deephaven processes that depend on etcd (essentially all processes).
2. Stop the etcd service on all nodes.
3. Restore from snapshot (consult etcd documentation for multi-node restore).
4. Start etcd on all nodes.
5. Verify cluster health before starting Deephaven processes.
Configuration files
systemd service file: /etc/systemd/system/dh-etcd.service
etcd data directory: Typically /var/lib/etcd/dh/ (verify in the service file)
etcd configuration: Passed as command-line arguments in the systemd service file
Client configuration directory: /etc/sysconfig/deephaven/etcd/client/