---
id: runbook-etcd
title: etcd runbook
---

[etcd](../core-components/etcd.md) is a distributed key-value store that serves as the foundational data store for all critical Deephaven cluster configuration and state. It provides strong consistency guarantees through the [Raft consensus algorithm](https://raft.github.io/) and is essential for cluster operation.

## Impact of etcd failure

| Level            | Impact                                                                                                                                                                                                                                              |
| :--------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Sev 1 - Critical | Schema, Persistent Queries, property files, routing configuration, and optionally ACLs are stored in etcd. etcd is used as a shared store for Authentication and Dispatcher runtime processing. Without etcd, the Deephaven system cannot function. |

## etcd dependencies

etcd has no dependencies on other Deephaven services. It is a standalone third-party service that must be running before any Deephaven processes can start.

**Network requirements:**

- Client port (default: 2379) — Used by [Configuration Server](./runbook-config-server.md) and other Deephaven processes.
- Peer port (default: 2380) — Used for etcd cluster member communication.
- All etcd nodes must be able to communicate with each other on the peer port.
- The Configuration Server must be able to reach all etcd nodes on the client port.
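
The reachability requirements above can be spot-checked with a plain TCP connect test. This is a minimal sketch, not a Deephaven tool: `check_port` and `report_ports` are hypothetical helper names, and the code assumes `bash` and the coreutils `timeout` command are installed. A successful connect only shows that the port is reachable; it says nothing about TLS or authentication.

```bash
# Return 0 if a TCP connection to host:port succeeds within the timeout.
# Uses bash's /dev/tcp pseudo-device; host and port are passed as arguments
# to the inner shell rather than interpolated into the command string.
check_port() {
  local host="$1" port="$2" secs="${3:-3}"
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Print the status of the default client (2379) and peer (2380) ports
# for every host given as an argument.
report_ports() {
  local host port
  for host in "$@"; do
    for port in 2379 2380; do
      if check_port "$host" "$port"; then
        echo "$host:$port reachable"
      else
        echo "$host:$port UNREACHABLE"
      fi
    done
  done
}
```

For example, `report_ports node1 node2 node3` (placeholder host names) run from an etcd node checks both ports on each member. From the Configuration Server host, only the client port result is meaningful.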

## etcd deployment architecture

**Quorum requirements:** etcd requires a strict majority (>50%) of cluster members to be available for the cluster to function.

**Recommended deployment patterns:**

- **Single node (testing only):** 1 instance — No redundancy, any failure causes total outage.
- **Standard production:** 3 instances — Tolerates 1 node failure.
- **High availability:** 5 instances — Tolerates 2 node failures.

**Important:** Always use an odd number of nodes. An even number provides no additional fault tolerance (e.g., a 4-node cluster still tolerates only 1 failure, the same as a 3-node cluster).
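
These figures follow from the majority rule: a cluster of `n` members needs `n / 2 + 1` members available (integer division), so it tolerates `(n - 1) / 2` failures. The sketch below prints the values for small clusters; the function names are illustrative only.

```bash
# Smallest number of members that forms a strict majority of n.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while a majority remains.
fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 1 2 3 4 5; do
  echo "$n nodes: quorum $(quorum_size "$n"), tolerates $(fault_tolerance "$n") failure(s)"
done
```

The lines for 3 and 4 nodes both report a tolerance of 1, which is why the fourth node adds no resilience.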

## etcd client configuration

Deephaven processes access etcd through etcd client configuration files stored under `/etc/sysconfig/deephaven/etcd/client`, with one subdirectory per role (for example, `/etc/sysconfig/deephaven/etcd/client/schema-ro`). Each role directory contains:

- `endpoints` — List of etcd node addresses and ports
- `user` — etcd RBAC username for this role
- `password` — etcd RBAC password
- `cacert` — Certificate authority for TLS verification

**Available etcd client roles:**

- `root` — Full administrative access (used by default in `etcdctl.sh`)
- `schema-rw` — Read/write access to table schemas
- `schema-ro` — Read-only access to schemas
- `pq-rw` — Read/write access to Persistent Query definitions
- Additional roles for ACLs, routing, properties, etc.

The Configuration Server uses these client configuration files to communicate with etcd on behalf of other Deephaven services.
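
When a process cannot authenticate to etcd, a quick first check is whether the role's client directory is complete. The following is a minimal sketch that assumes each of the four entries listed above is present by name in every role directory; `check_client_dir` is a hypothetical helper, not part of Deephaven, and some roles in your installation may legitimately differ.

```bash
# Report any expected entry that is missing from an etcd client role directory.
# Returns 0 when all four entries exist, 1 otherwise.
check_client_dir() {
  local dir="$1" missing=0 f
  for f in endpoints user password cacert; do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be `sudo -u irisadmin bash -c '...; check_client_dir /etc/sysconfig/deephaven/etcd/client/schema-ro'`, run as a user that is allowed to read the directory.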

## Checking etcd status

Check that the process is running with systemctl:

```bash
sudo systemctl status dh-etcd
```

Check endpoint status (cluster health):

```bash
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table
```

Expected output shows all cluster members, their IDs, versions, database sizes, and leader status:

```
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://node1:2379     | a1b2c3d4e5f6g7h8 |   3.5.x |  xx MB  | true      | false      |         x |      xxxxx |              xxxxx |        |
| https://node2:2379     | b2c3d4e5f6g7h8i9 |   3.5.x |  xx MB  | false     | false      |         x |      xxxxx |              xxxxx |        |
| https://node3:2379     | c3d4e5f6g7h8i9j0 |   3.5.x |  xx MB  | false     | false      |         x |      xxxxx |              xxxxx |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```
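
A healthy cluster reports exactly one leader. As a rough sketch, the table can be piped through a small filter that counts rows whose `IS LEADER` column is `true`. The helper name `count_leaders` is hypothetical, and the filter assumes the column order shown above, which can differ between etcd versions.

```bash
# Count rows in `endpoint status --write-out table` output whose
# IS LEADER column (fifth data column) is "true". Reads from stdin.
count_leaders() {
  awk -F'|' '
    NF >= 6 {
      v = $6
      gsub(/ /, "", v)      # strip the padding around the cell value
      if (v == "true") n++
    }
    END { print n + 0 }     # print 0 when no leader row was seen
  '
}
```

Usage: `sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table | count_leaders`. Any result other than `1` warrants a closer look at the logs.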

Check cluster member health:

```bash
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint health --write-out table
```

Test connectivity and authentication:

```bash
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh role list
```

## Viewing etcd logs

View all logs:

```bash
sudo journalctl -xu dh-etcd
```

Follow logs in real-time:

```bash
sudo journalctl -xefu dh-etcd
```

View logs from the last hour:

```bash
sudo journalctl -xu dh-etcd --since "1 hour ago"
```

## Restart procedure

Restart the etcd service:

```bash
sudo systemctl restart dh-etcd
```

**Important:** When restarting multiple etcd nodes, always maintain quorum:

- For a 3-node cluster, never restart more than 1 node at a time.
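- For a 5-node cluster, never have more than 2 nodes down at a time. Restarting one node at a time is safer, because with 2 nodes down any further failure loses quorum.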
- Wait for the restarted node to rejoin and sync before restarting another.

Verify the node has rejoined after restart:

```bash
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table
```
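
In a rolling restart, the wait between nodes can be scripted with a simple retry loop. The sketch below is a generic helper (the name `wait_for` is hypothetical) that reruns a command until it succeeds or the attempts run out. It assumes that `endpoint health` exits non-zero while any configured endpoint is unhealthy, which is worth confirming on your etcd version before relying on it.

```bash
# Run a command repeatedly until it succeeds.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
# Returns 0 on the first success, 1 if every attempt fails.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only between attempts, not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for 30 2 sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint health` polls for up to about a minute. Move on to the next node only after it returns success.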

## Using etcdctl.sh

The `etcdctl.sh` script is a thin wrapper around [`etcdctl`](https://etcd.io/docs/v3.5/dev-guide/interacting_v3/) that passes in the correct username and password for a given Deephaven etcd role. Each user's credentials are stored in `/etc/sysconfig/deephaven/etcd/client/<user>`.

By default, the script uses the `root` user. To change the user, set the `DH_ETCD_USER` environment variable or specify the directory manually with `DH_ETCD_DIR`.

**Examples:**

Get a specific schema with the `schema-ro` user:

```bash
sudo -u irisadmin DH_ETCD_USER=schema-ro /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/schema/DbInternal/tables/AuditEventLog
```

Or equivalently:

```bash
sudo -u irisadmin DH_ETCD_DIR=/etc/sysconfig/deephaven/etcd/client/schema-ro /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/schema/DbInternal/tables/AuditEventLog
```

List all keys under a prefix:

```bash
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/ --keys-only
```
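
For a quick picture of what is stored under `/main/config/`, the key listing can be summarized by the path component that follows the prefix. This is a minimal sketch: `count_by_section` is a hypothetical helper, and it assumes keys follow the `/main/config/<section>/...` layout seen in the schema example above.

```bash
# Summarize a key listing by the component after /main/config/.
# Reads keys from stdin (blank lines are ignored) and prints
# "<section> <count>" lines in sorted order.
count_by_section() {
  awk -F/ 'NF > 3 { n[$4]++ } END { for (k in n) print k, n[k] }' | sort
}
```

Usage: `sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh get --prefix /main/config/ --keys-only | count_by_section`.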

## Monitoring etcd disk usage

Show the current database size on each node (the `DB SIZE` column):

```bash
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out=table
```

**Warning signs:**

- Database size approaching the backend storage quota (etcd default: 2 GB, set with `--quota-backend-bytes`)
- Database size growing rapidly without expected configuration changes
- Any node showing significantly different size than others
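
The first warning sign can be turned into a threshold check. The sketch below works on a database size in bytes; how you obtain that number is up to you (one option, stated here as an assumption to verify, is the `dbSize` field of `endpoint status --write-out json`). The function names and the 80% default threshold are illustrative, not Deephaven or etcd settings.

```bash
# Default etcd backend quota: 2 GB, expressed as 2 * 1024^3 bytes.
DEFAULT_QUOTA_BYTES=$((2 * 1024 * 1024 * 1024))

# Print the integer percentage of the quota used by a database of the given size.
quota_pct() {
  local used="$1" quota="${2:-$DEFAULT_QUOTA_BYTES}"
  echo $(( used * 100 / quota ))
}

# Print a status line; return 1 when usage reaches the threshold percentage.
check_quota() {
  local used="$1" threshold="${2:-80}" pct
  pct="$(quota_pct "$used")"
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARNING: etcd database at ${pct}% of quota"
    return 1
  fi
  echo "OK: etcd database at ${pct}% of quota"
}
```

If your cluster sets a non-default quota, pass it as the second argument to `quota_pct` or adjust `DEFAULT_QUOTA_BYTES`.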

## etcd backup and restore

### Backup procedure

Create a snapshot backup of etcd:

```bash
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
```

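Verify the snapshot, passing the path of the single snapshot file you just created (`snapshot status` takes exactly one file, so a glob that matches several snapshots fails):

```bash
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh snapshot status /backup/etcd-snapshot-YYYYMMDD-HHMMSS.db --write-out=table
```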

**Best practices:**

- Take daily automated backups during maintenance windows.
- Store backups on storage separate from the etcd data directory.
- Retain backups according to your data retention policy.
- Test restore procedures regularly.
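
Retention can be automated with a small pruning step after each backup. This is a minimal sketch under two assumptions: snapshots use the `etcd-snapshot-YYYYMMDD-HHMMSS.db` naming from the backup command above (so alphabetical order is chronological order), and keeping the newest N files matches your retention policy. `prune_snapshots` is a hypothetical helper name.

```bash
# Delete all but the newest <keep> snapshots in <dir>.
# Usage: prune_snapshots <dir> <keep>
# The body runs in a subshell so the positional parameters and
# variables do not leak into the caller.
prune_snapshots() (
  dir="$1"
  keep="$2"
  # Globs expand in sorted order, so the oldest snapshot comes first.
  set -- "$dir"/etcd-snapshot-*.db
  # If nothing matched, the pattern is left unexpanded; do nothing.
  [ -e "$1" ] || exit 0
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"
    shift
  done
)
```

For example, `prune_snapshots /backup 14` keeps the fourteen most recent snapshots. Run it only after the new snapshot has been verified, so a failed backup never displaces a good one.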

### Restore procedure

> [!WARNING]
> Restoring etcd will revert all cluster configuration to the backup point. This should only be done in disaster recovery scenarios.

1. Stop all Deephaven processes that depend on etcd (essentially all processes).
2. Stop the etcd service on all nodes.
3. Restore from snapshot (consult etcd documentation for multi-node restore).
4. Start etcd on all nodes.
5. Verify cluster health before starting Deephaven processes.

## Configuration files

**systemd service file:** `/etc/systemd/system/dh-etcd.service`

**etcd data directory:** Typically `/var/lib/etcd/dh/` (verify in the service file)

**etcd configuration:** Passed as command-line arguments in the systemd service file

**Client configuration directory:** `/etc/sysconfig/deephaven/etcd/client/`

## Related documentation

- [Introduction to etcd](../core-components/etcd.md)
- [Troubleshooting etcd](../troubleshooting/troubleshooting-etcd.md)
- [etcd recovery guide](../ops-guide/etcd-recovery.md)
- [etcd security hardening](../security/hardening-technical-controls.md)
- [System processes overview](../architecture/architecture-overview.md)
- [Configuration Server runbook](./runbook-config-server.md)
