Deephaven status dashboard
Deephaven includes a status dashboard process that exposes a Prometheus interface, providing data that a Prometheus installation can scrape. Installation of Prometheus is not detailed here; refer to the Prometheus GitHub page for instructions.
Deephaven Application Prometheus Configuration
The following properties (with the shown defaults specified in `iris-defaults.prop`) specify the basic operation of the Deephaven status dashboard process. The default address on which the dashboard provides data is `https://<server's fqdn>:8112/`.
Property Name | Property Meaning | Default Value |
---|---|---|
StatusDashboard.prometheus.port | The port on which the Prometheus data is exposed. | 8112 |
StatusDashboard.prometheus.namespace | The Prometheus namespace to be used for the data. | Deephaven |
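These can be overridden like any other Deephaven property. For example, a hypothetical override in `iris-environment.prop` (the values shown are illustrative only):

```
# Hypothetical overrides - adjust the port and namespace to suit your environment.
StatusDashboard.prometheus.port=9112
StatusDashboard.prometheus.namespace=DeephavenProd
```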
SSL and Authentication
By default, the Prometheus web interface uses SSL and requires authentication. The user must log in as a valid Deephaven user who is either a superuser or a member of a group specified by the `StatusDashboard.allowedGroups` property. If authentication is required, then SSL must be used.
Property Name | Property Meaning | Default Value |
---|---|---|
StatusDashboard.useSsl | If true, the Prometheus interface uses https. | true |
StatusDashboard.useAuthentication | If true, the Prometheus interface requires authentication. | true |
StatusDashboard.allowedGroups | If authentication is enabled, the user must be a superuser or a member of one of these groups. | dashboard |
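For example, to let members of an additional group scrape the dashboard while keeping SSL and authentication on, the defaults might be overridden as follows. The `monitoring` group is illustrative, and this sketch assumes the group list is comma-delimited like other Deephaven list properties:

```
StatusDashboard.useSsl=true
StatusDashboard.useAuthentication=true
StatusDashboard.allowedGroups=dashboard,monitoring
```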
Prometheus Node Exporter
The Prometheus node exporter provides status on various aspects of the host's health, such as available disk space and CPU utilization. Installation and configuration of the node exporter are beyond the scope of Deephaven documentation; a good starting point is the Prometheus node exporter GitHub page. The example dashboard discussed below assumes that a node exporter is configured and running on the server. If the Deephaven installation contains more than one server, a node exporter should be run on each one.
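As a minimal sketch (the install path is an assumption, and a production setup would normally run the exporter as a service), the exporter can be started and checked by hand:

```
# Start the node exporter on its default port, 9100 (install path is illustrative).
/opt/node_exporter/node_exporter --web.listen-address=":9100" &

# Confirm it is serving metrics.
curl -s http://localhost:9100/metrics | head
```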
Prometheus Server
To access the Prometheus interface effectively, configure a Prometheus server. A Prometheus server is usually set up on a system other than the one being monitored. It should be configured to scrape the Deephaven status dashboard's data and any node exporters that have been configured. Instructions for installing and running Prometheus can be found at the Prometheus documentation.
An example Prometheus configuration file is provided in the Deephaven installation at `/usr/illumon/latest/etc/prometheus.yml`. Copy it to the host where Prometheus runs, in a location where it won't be overwritten on each upgrade. This isn't needed if an existing Prometheus installation is being used.
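For example (the host name and destination path are illustrative):

```
# Copy the shipped example to the host where Prometheus runs.
scp /usr/illumon/latest/etc/prometheus.yml prometheus-host.example.com:/etc/prometheus/prometheus.yml
```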
The Prometheus configuration YAML must be edited to point to the appropriate locations. Alternatively, if an existing Prometheus installation is being modified, the file can serve as an example of options to add to the existing file.
- Update the `targets` to point to your server(s). You can monitor multiple installations by using multiple targets. If you use the example Grafana dashboard, each listed server will appear in the dashboard's `Server` dropdown. (A sketch of the edited file follows this list.)
- Update the username to match the user you're using for dashboard authentication. Choose a user created explicitly for the dashboard process, not a default superuser, and give it only the required privileges (by default this means only adding it to the `dashboard` group).
- Update the etcd section with your etcd server addresses and a reasonable `job_name`, which will be the value that appears in the etcd dashboard's `cluster` dropdown. If you want to monitor multiple etcd installations (i.e., different etcd clusters for multiple Deephaven installations), give each one a new entry with its own `job_name`.
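The snippet below is a rough sketch of what the edited scrape configuration might look like; it is not the shipped file. Host names, file paths, and job names are assumptions, and the copy of `prometheus.yml` from the Deephaven installation should remain your starting point.

```
scrape_configs:
  # Deephaven status dashboard (SSL and authentication are enabled by default).
  - job_name: "deephaven_dashboard"
    scheme: https
    basic_auth:
      username: "dashboard_user"              # a user in the 'dashboard' group
      password_file: /etc/prometheus/dashboard_password
    tls_config:
      insecure_skip_verify: true              # or point ca_file at your CA certificate
    static_configs:
      - targets: ["infra-1.example.com:8112"] # one entry per monitored installation

  # Node exporters, one target per monitored host.
  - job_name: "node"
    static_configs:
      - targets: ["infra-1.example.com:9100", "query-1.example.com:9100"]

  # etcd; the job_name is what appears in the etcd dashboard's 'cluster' dropdown.
  - job_name: "deephaven-etcd"
    static_configs:
      - targets: ["etcd-1.example.com:2379", "etcd-2.example.com:2379", "etcd-3.example.com:2379"]
```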
Put the password into a file and update `password_file` in `prometheus.yml` to point to that file. This file should be owned by the user that will run Prometheus.

Ensure that the file is only visible to the user running Prometheus. For example, as the user that runs Prometheus:

chmod 600 <location of password file>

Run Prometheus. For example, if you're in the directory where Prometheus is installed and logged in as the user under which it was installed:

./prometheus --config.file=<location of Prometheus configuration yaml file>
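Once Prometheus is up, a quick sanity check (assuming Prometheus is listening on its default port 9090 on the local host) is to ask its HTTP API whether the configured targets are healthy:

```
# Query the Prometheus targets API and check the health of each scrape target.
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'
```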
Grafana
Grafana is often used to provide visual representations of Prometheus data. Further information, including installation details, can be found at the Grafana website. Once it's installed and running, it can be accessed through a web browser, typically at `http://<fqdn of grafana server>:3000`.
Deephaven recommends setting up Grafana to use https. This changes the address to `https://<fqdn of grafana server>:3000`. Note that when following the linked instructions, merely adding the new properties to the Grafana property file (typically `/etc/grafana/grafana.ini`) does not work, because the earlier copies of those properties, commented out with a `;` at the beginning of the line, cause Grafana to ignore the later entries. You must edit the existing properties and remove the `;` from each changed line.
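As a sketch, after removing the leading `;` characters the relevant entries in the `[server]` section of `grafana.ini` might look like the following (the certificate paths are placeholders):

```
[server]
protocol = https
http_port = 3000
cert_file = /etc/grafana/grafana.crt
cert_key = /etc/grafana/grafana.key
```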
A simple example dashboard is provided in `/usr/illumon/latest/etc/grafanaDashboard.json`. It relies on the node exporter, as well as the Persistent Query data example below.
- It monitors the infrastructure server; on a multiple-node installation this is where the status dashboard process runs.
- It monitors two query servers. If you have more or fewer than two, it's best to adjust the dashboard in the Grafana GUI after importing it, by deleting the second query server's panels or copying panels for additional query servers.
- It monitors the status of several custom Persistent Queries, both scripts and batch queries, all starting with the name `My`. You can remove the panels from the dashboard or change and copy them to monitor other queries.
- It monitors the lag of the `ProcessEventLog` and `AuditEventLog` tables. To use this, you'll need to define Persistent Queries with names beginning with `DataLagWatcher` (for example, `DataLagWatcherCommunity`) that run the scripts defined in the data monitoring section.
- It only monitors the node exporter running on the same node as the status dashboard process. Adding and monitoring additional node exporters requires updating the Prometheus configuration file and the Grafana dashboard.
To use the example dashboard:
- Set up your data source using the Grafana web GUI. The data source is the location of the Prometheus server, and will be something like `http://<fqdn of prometheus server>:9090`.
- Find the `uid` of that data source. You should see this in the URL of the Grafana page where you are editing the data source. For example, `http://localhost:3000/datasources/edit/d188497b-3325-4890-ac3d-e42a5f0d1351` indicates that the uid is `d188497b-3325-4890-ac3d-e42a5f0d1351`.
- Update all the `uid` fields in the example Grafana JSON file to be the `uid` of your data source (a search-and-replace sketch follows this list).
- The example dashboard includes a dropdown for the servers being monitored. If you have multiple Deephaven installations, you can add more entries to the `Server` dropdown by adding more `targets` in the Prometheus configuration file.
- The `templating` section includes a regular expression in the `regex` section to restrict the servers shown in the dropdown to those that include `infra-1.fqdn`. Update it to match your servers' FQDNs, and update the name if needed. For example, if you have a single server, it probably has a different name and `infra-1` won't match.
- In the Grafana Dashboards menu, select Import, and import the edited JSON file.
- You can edit the imported dashboard in the Grafana GUI as needed. For example, it includes two query servers, but if you have only one, you can delete the second server's panels. If you have more than two, you can copy the panels and edit them to point to the additional servers.
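Rather than editing every `uid` occurrence by hand, a quick search-and-replace also works. The sketch below assumes you first check which uid values the example file currently contains and then substitute the uid of your own data source:

```
# List the uid values currently present in the example dashboard.
grep -o '"uid": *"[^"]*"' grafanaDashboard.json | sort -u

# Replace a uid found above with the uid of your Prometheus data source.
sed -i 's/OLD_UID_FROM_GREP_OUTPUT/d188497b-3325-4890-ac3d-e42a5f0d1351/g' grafanaDashboard.json
```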
Grafana and etcd
You can use Grafana to monitor etcd using the etcd documentation as a starting point. The following edits may be helpful before importing that page's example JSON template into Grafana.
Remove the following line and the corresponding close-brace at the end.
"etcd.json": {
Assuming you configured your Prometheus installation to monitor the etcd servers, update all the datasource lines to the following, with the UID that you determined above:
"datasource": {
"type": "prometheus",
"uid": "UID of your Prometheus data source",
},
Configuration
The status dashboard uses JSON configuration files to determine which certificates, persistent queries, and tables to monitor. These files are provided in a comma-delimited list with the property `StatusDashboard.configuration.files`. For example:
StatusDashboard.configuration.files=status-dashboard-defaults.json,status-dashboard-custom.json
`status-dashboard-defaults.json` is provided by Deephaven and should not be edited. Additional files can be imported like any other property files with the dhconfig utility.
Each JSON file uses the following format. The format of each monitor entry type is described below.
{
    "DashboardMonitors": {
        <monitor entries>
    }
}
As many entries as needed can be placed in a single file as long as the JSON format is maintained.
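For example, a hypothetical `status-dashboard-custom.json` could combine a Persistent Query monitor and a data monitor in one `DashboardMonitors` block; the owner, names, and prefix shown are illustrative, and the entry types themselves are described in the sections below:

```
{
    "DashboardMonitors": {
        "PQMonitors": [
            {
                "Name": "ProductionQueries",
                "PqOwners": ["prodservice"],
                "PrometheusPublisherPrefix": "PROD_"
            }
        ],
        "DataMonitors": [
            {
                "Name": "LagWatchers",
                "PqNameMatches": ["^DataLagWatcher.*"],
                "TableNameMatches": [".*Watch$"]
            }
        ]
    }
}
```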
Monitored Processes
Persistent Query Controller
The status dashboard provides automatic monitoring of the Persistent Query controller, which should always be running. Once it's connected to the controller, it will start monitoring any remote query dispatchers the controller knows about.
Remote Query Dispatchers
Once the status dashboard has connected to the Persistent Query controller, it monitors all the dispatchers configured in the controller.
Persistent Query Status
The dashboard can monitor the state of any Persistent Query (i.e., determine whether it's running/completed or not). This is controlled by JSON configuration using the following properties in a PQMonitors
block.
Property Name | Property Meaning |
---|---|
Name | The name of this monitor. This is only used for logging. |
PqOwners | An array containing the owners of the persistent queries to be monitored. |
PqNames | An array containing the names of the persistent queries to be monitored. |
PqNameMatches | An array containing regular expressions used to match persistent queries to be monitored. |
PrometheusPublisherPrefix | An optional prefix added to the Prometheus gauge name. |
The following example (from the default configuration) monitors the helper queries with a gauge prefix of `PQ_`. The final comma is because there is an additional entry following this `PQMonitors` entry.
"PQMonitors": [
{
"Name": "HelperQueries",
"PrometheusPublisherPrefix": "PQ_",
"PqNames": ["WebClientData", "RevertHelperQuery", "ImportHelperQuery", "TelemetryHelperQuery"]
}
],
The following example adds two monitors.

- The first monitors all persistent queries owned by `user1` with the gauge name prefix `USER1_`.
- The second monitors all persistent queries that have names ending in `DataGenerator`.
"PQMonitors": [
{
"Name": "User1",
"PqOwners": ["user1"],
"PrometheusPublisherPrefix": "USER1"
},
{
"Name": "AllDataGenerators",
"PqNameMatches": ["^.*DataGenerator$"]
}
]
Persistent Query Data
The status dashboard can monitor data latency by subscribing to a Persistent Query's data. The Persistent Query must publish a timestamp column containing the time of the table's most recent update. The following Groovy scripts do this for the Process Event Log and Audit Event Log tables; two variants are shown, one for each worker engine API.
import io.deephaven.engine.util.SortedBy

// Publish the most recent Timestamp seen in today's ProcessEventLog and AuditEventLog.
PELWatch = SortedBy.sortedLastBy(db.liveTable("DbInternal", "ProcessEventLog").where("Date=today()").view("Timestamp"), "Timestamp")
AELWatch = SortedBy.sortedLastBy(db.liveTable("DbInternal", "AuditEventLog").where("Date=today()").view("Timestamp"), "Timestamp")

import com.illumon.iris.db.util.SortedBy

// The same idea with the legacy API; preemptiveUpdatesTable(1000) pushes updates to subscribers once per second.
PELWatchBase = SortedBy.sortedLastBy(db.i("DbInternal", "ProcessEventLog").where("Date=currentDateNy()").view("Timestamp"), "Timestamp")
AELWatchBase = SortedBy.sortedLastBy(db.i("DbInternal", "AuditEventLog").where("Date=currentDateNy()").view("Timestamp"), "Timestamp")
PELWatch = PELWatchBase.preemptiveUpdatesTable(1000)
AELWatch = AELWatchBase.preemptiveUpdatesTable(1000)
The status dashboard uses JSON configuration files to determine what data to publish for Prometheus to scrape. Several options are available, and at least one Persistent Query restriction (Persistent Query owners, names, or name-matches) must be provided, as well as at least one table restriction (table names or name-matches).
Property Name | Property Meaning |
---|---|
Name | The name of this monitor. This is only used for logging. |
PqOwners | An array containing the owners of the persistent queries to be monitored. |
PqNames | An array containing the names of the persistent queries to be monitored. |
PqNameMatches | An array containing regular expressions used to match persistent queries to be monitored. |
TableNames | An array containing the names of the tables to be monitored. |
TableNameMatches | An array containing regular expressions used to match table names to be monitored. |
TimestampColumnName | The name of the column containing the timestamp. If this isn't provided, the column name Timestamp will be used. |
JobIntervalMillis | The number of milliseconds between examining the published data. If this isn't provided, a default of 30 seconds is used. |
PrometheusPublisherPrefix | An optional prefix added to the Prometheus gauge name. |
The following example monitors any `iris`-owned persistent queries with names beginning with `DataLagWatcher`, looking for table names ending in `Watch`, and publishing the data every five seconds. If the example scripts were saved with appropriate names, this would monitor their tables and create gauges for them.
"DataMonitors": [
{
"Name": "InternalTables",
"PqOwners": ["iris"],
"PqNameMatches": ["^DataLagWatcher.*"],
"TableNameMatches": [".*Watch$"],
"JobIntervalMillis": "5000"
},
]
Certificate Expiration
The status dashboard monitors the number of days until certificates expire; this is controlled by the JSON configuration files. The configuration is a simple property list, with each entry containing the gauge name and the certificate property prefix. The interval between certificate checks can also be defined globally. For example:
"CertificateMonitors": {
"MonitoredCertificates": {
"webcert": "StatusDashboard.tls",
"authservercert": "authserver.tls",
"configservercert": "configuration.server"
}
},
"CertificateJobIntervalHours": "1"
Standard Deephaven properties are then used to retrieve the certificate information. For the `webcert` example, the expected p12 file and passphrase file are defined by properties under the `StatusDashboard.tls` prefix.
StatusDashboard.tls.keystore=/etc/sysconfig/illumon.d/auth-user/webServices-keystore.p12
StatusDashboard.tls.passphrase.file=/etc/sysconfig/deephaven/auth-user/.webapi_passphrase
Similar keystore and passphrase files are provided for the `authservercert` and `configservercert` entries. See public and private keys for more information on these files and properties.
The status dashboard also monitors the root certificate as defined by the following properties (shown with their default values):
tls.truststore=/etc/sysconfig/illumon.d/resources/truststore-iris.p12
tls.truststore.passphrase.file=/etc/sysconfig/illumon.d/resources/truststore_passphrase
Envoy Configuration
If you're using Envoy, the installer should set up any required properties for the status dashboard. Properties for the node exporter are not automatically added, but here's an example that can be added to `iris-environment.prop`. Each node in an Envoy cluster running a node exporter will need its own set of properties using its own FQDN.
[service.name=configuration_server] {
envoy.xds.extra.routes.node_exporter1.host=<host's FQDN>
envoy.xds.extra.routes.node_exporter1.port=9100
envoy.xds.extra.routes.node_exporter1.prefix=/node_exporter/
envoy.xds.extra.routes.node_exporter1.prefixRewrite=/
envoy.xds.extra.routes.node_exporter1.tls=false
envoy.xds.extra.routes.node_exporter1.exactPrefix=false
}
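For a second node that runs a node exporter, a parallel set of properties with its own route name can be added. The sketch below mirrors the example above; the `node_exporter2` route name and `/node_exporter2/` prefix are assumptions, and any unique values will work:

```
[service.name=configuration_server] {
envoy.xds.extra.routes.node_exporter2.host=<second host's FQDN>
envoy.xds.extra.routes.node_exporter2.port=9100
envoy.xds.extra.routes.node_exporter2.prefix=/node_exporter2/
envoy.xds.extra.routes.node_exporter2.prefixRewrite=/
envoy.xds.extra.routes.node_exporter2.tls=false
envoy.xds.extra.routes.node_exporter2.exactPrefix=false
}
```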