Deephaven process runbooks

This landing page links to the operational runbooks for managing Deephaven system processes. The 16 service-specific runbooks are grouped by severity to help prioritize incident response. Each runbook covers status checking, log viewing, restart procedures, configuration, troubleshooting, and performance tuning.

Before diving into individual runbooks, review the System processes overview to understand how all services interact.

Incident classification key

Use this severity classification to prioritize incident response:

Severity      Description
0 - None      Process is running (or down as scheduled).
1 - Critical  Process is down when it should be up.
2 - Moderate  Process is up when it should be down; or process is up but configuration is missing.
3 - Low       Process is running but producing errors or performing poorly.
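The classification key can be read as a small decision procedure. The sketch below encodes it in Python, assuming you already know, for each process, whether it is running, whether it is scheduled to run, whether its configuration is present, and whether it is logging errors or performing poorly. The `ProcessObservation` fields and the `classify` function are hypothetical names used only for illustration; they are not part of any Deephaven API. The checks run from most to least severe so that the most serious matching condition wins.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the incident classification key."""
    NONE = 0
    CRITICAL = 1
    MODERATE = 2
    LOW = 3


@dataclass
class ProcessObservation:
    """Hypothetical snapshot of one process; field names are illustrative only."""
    is_running: bool
    should_be_running: bool
    config_present: bool = True
    has_errors: bool = False
    performing_poorly: bool = False


def classify(obs: ProcessObservation) -> Severity:
    # Sev 1: the process is down but is expected to be up.
    if obs.should_be_running and not obs.is_running:
        return Severity.CRITICAL
    # Sev 2: the process is up when it should be down,
    # or it is up but its configuration is missing.
    if obs.is_running and (not obs.should_be_running or not obs.config_present):
        return Severity.MODERATE
    # Sev 3: the process is running but producing errors or performing poorly.
    if obs.is_running and (obs.has_errors or obs.performing_poorly):
        return Severity.LOW
    # Sev 0: running as expected, or down as scheduled.
    return Severity.NONE


if __name__ == "__main__":
    down_unexpectedly = ProcessObservation(is_running=False, should_be_running=True)
    print(classify(down_unexpectedly).name)  # CRITICAL
```

The numeric values mirror the key, so among incidents a lower nonzero number means higher urgency; a process that is down as scheduled falls through every check and is classified as Sev 0.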

Critical services (Sev 1)

These services are essential for cluster operation. Failure of any critical service causes immediate operational impact:

Supporting services (Sev 2)

These services support operations, but their failure has moderate impact:

Infrastructure services

Core infrastructure and optional services:

Using these runbooks

Each runbook follows a consistent structure:

  1. Impact assessment — Severity classification and failure impact.
  2. Service overview — Purpose, responsibilities, and architecture.
  3. Dependencies — What the service requires to function.
  4. Status checking — Commands to verify service health.
  5. Log viewing — How to access and interpret logs.
  6. Restart procedures — Safe restart steps with warnings.
  7. Configuration — Key properties and settings.
  8. Troubleshooting — Common symptoms with check/resolution steps.
  9. Performance tuning — Optimization guidance.
  10. Related documentation — Links to additional resources.

Quick reference commands

Common operations across all services: