---
id: runbooks
title: Deephaven process runbooks
sidebar_label: Process runbooks
---

This landing page links to the operational runbooks for Deephaven system processes. The 16 service-specific runbooks are grouped by role and failure severity to help prioritize incident response. Each runbook covers status checks, log viewing, restart procedures, configuration, troubleshooting, and performance tuning.

Before diving into individual runbooks, review the [System processes overview](../architecture/architecture-overview.md) to understand how all services interact.

## Incident classification key

Use this severity classification to prioritize incident response:

| Severity     | Description                                                                          |
| :----------- | :----------------------------------------------------------------------------------- |
| 0 - None     | Process is running (or down as scheduled).                                           |
| 1 - Critical | Process is down when it should be up.                                                |
| 2 - Moderate | Process is up when it should be down, or process is up but configuration is missing. |
| 3 - Low      | Process is running but producing errors or performing poorly.                        |
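
The table can be read as a small decision function. The following is a minimal sketch, not part of any Deephaven tooling: `classify_severity` and its three arguments (expected state, actual state, and a health flag) are hypothetical names chosen for illustration.

```bash
# Hypothetical helper (not a Deephaven command): map a process's expected
# and observed state to the severity levels in the table above.
#
# Usage: classify_severity <expected: up|down> <actual: up|down> [health: ok|misconfigured|degraded]
# The body runs in a subshell, so its variables do not leak into the caller.
classify_severity() (
  expected="$1"
  actual="$2"
  health="${3:-ok}"

  case "$expected:$actual:$health" in
    up:down:*)           echo 1 ;;  # Critical: down when it should be up
    down:up:*)           echo 2 ;;  # Moderate: up when it should be down
    up:up:misconfigured) echo 2 ;;  # Moderate: up but configuration is missing
    up:up:degraded)      echo 3 ;;  # Low: running with errors or poor performance
    *)                   echo 0 ;;  # None: running normally, or down as scheduled
  esac
)
```

For example, `classify_severity up down` prints `1`, and `classify_severity down down` prints `0` because the process is down as scheduled.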

## Critical services (Sev 1)

These services are essential for cluster operation. Failure of any critical service causes immediate operational impact:

- [etcd runbook](runbook-etcd.md) — Distributed key-value store for all cluster configuration and state.
- [Authentication Server runbook](runbook-authentication-server.md) — User authentication and JWT token management.
- [Persistent Query Controller runbook](runbook-pq-controller.md) — Manages lifecycle of Persistent Queries with leader election.
- [Remote Query Dispatcher runbook](runbook-remote-query-dispatcher.md) — Spawns and manages worker processes (db_query, db_merge).
- [Data Import Server runbook](runbook-data-import-server.md) — Ingests streaming data and persists to Parquet format.
- [Web API Server runbook](runbook-web-api-server.md) — Hosts web UI, brokers connections, and serves Client Update Service.

## Supporting services (Sev 2)

These services support operations, but their failure has only moderate impact:

- [Log Aggregator Service runbook](runbook-log-aggregator-service.md) — Serializes log writes from multiple workers to prevent contention.
- [ACL Write Server runbook](runbook-acl-write-server.md) — Administrative interface for managing users, groups, and permissions.
- [Data Tailer runbook](runbook-data-tailer.md) — Monitors binary log files and streams to Data Import Server.
- [Local Table Data Server runbook](runbook-local-table-data-server.md) — Serves historical intraday data from local filesystem.
- [Table Data Cache Proxy runbook](runbook-table-data-cache-proxy.md) — Caches table data to reduce load on upstream servers.
- [Status Dashboard runbook](runbook-status-dashboard.md) — Grafana-based monitoring and Prometheus metrics endpoint.

## Infrastructure services

Core infrastructure and optional services:

- [Configuration Server runbook](runbook-config-server.md) — Mediates all access to etcd for cluster configuration.
- [Envoy runbook](runbook-envoy.md) — Optional reverse proxy for unified cluster ingress.
- [MySQL runbook](runbook-mysql.md) — Legacy ACL database (optional, replaced by etcd in modern deployments).
- [monit runbook](runbook-monit.md) — Process supervision tool for traditional and Podman deployments.

## Using these runbooks

Each runbook follows a consistent structure:

1. **Impact assessment** — Severity classification and failure impact.
2. **Service overview** — Purpose, responsibilities, and architecture.
3. **Dependencies** — What the service requires to function.
4. **Status checking** — Commands to verify service health.
5. **Log viewing** — How to access and interpret logs.
6. **Restart procedures** — Safe restart steps with warnings.
7. **Configuration** — Key properties and settings.
8. **Troubleshooting** — Common symptoms with check/resolution steps.
9. **Performance tuning** — Optimization guidance.
10. **Related documentation** — Links to additional resources.

## Quick reference commands

Common operations across all services:

```bash
# Check service status
dh_monit status <service_name>

# View current log
tail -f /var/log/deephaven/<service>/<Service>.log.current

# Restart service
dh_monit restart <service_name>

# Check etcd endpoint status (leader, DB size, raft term)
sudo -u irisadmin /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table
```
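
During triage, the status check and log view are often run back to back. The sketch below wraps the two commands above in one function. The names `log_path` and `triage` are hypothetical and not shipped with Deephaven, and the three arguments stand in for the `<service_name>`, `<service>`, and `<Service>` placeholders used above.

```bash
# Hypothetical wrappers around the quick reference commands; not Deephaven tools.

# Build the current log path from the log directory name and log file prefix.
log_path() {
  printf '/var/log/deephaven/%s/%s.log.current\n' "$1" "$2"
}

# Show monit status for a service, then the last 50 lines of its current log.
# Arguments: <service_name> <service> <Service>
triage() {
  dh_monit status "$1"
  tail -n 50 "$(log_path "$2" "$3")"
}
```

Call it as `triage <service_name> <service> <Service>`, substituting the values for the process under investigation. It uses `tail -n 50` rather than `tail -f` so that the command returns instead of following the log.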

## Related documentation

- [System processes overview](../architecture/architecture-overview.md)
- [Operations guide](../ops-guide/ops-guide-overview.md)
- [Troubleshooting guides](../troubleshooting/support.md)
- [High availability and resiliency](../architecture/resilience-planning/resilience-planning-overview.md)
- [Permissions overview](../permissions/permissions-overview.md)
