Troubleshooting Envoy

This guide provides steps for troubleshooting common issues with Envoy when used as a front proxy for Deephaven.

General Diagnostic Checklist

Before diving into specific error codes, run these quick checks to assess the state of your Envoy instance. Working through them in order often pinpoints the issue right away.

  1. Is the Envoy container running?
    sudo docker ps -f name=deephaven_envoy
    
  2. Are there any obvious errors in the startup logs?
    sudo docker logs deephaven_envoy
    
  3. Are all backend clusters healthy?
    curl http://localhost:8001/clusters
    
    Every xds_cluster member should report health_flags::healthy. Any other value marks that member as unhealthy.
  4. Is the configuration loaded correctly?
    curl http://localhost:8001/config_dump
    
    Check that the listeners, routes, and clusters match your expectations.
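
Step 3 of the checklist can be narrowed to just the problem members. The following is a minimal sketch, assuming /clusters prints one cluster::address::health_flags::value line per member; the helper name unhealthy_members is hypothetical and not part of Envoy or Deephaven.

```shell
# Hypothetical helper: reads /clusters output on stdin and prints only the
# health_flags lines whose value is something other than "healthy".
unhealthy_members() {
  grep '::health_flags::' | grep -v '::health_flags::healthy$' || true
}

# Usage (admin port 8001, as in the checklist):
#   curl -s http://localhost:8001/clusters | unhealthy_members
```

Empty output means every member reported healthy at the time of the request.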

Using the Admin Interface

The Envoy admin interface is the main tool for debugging a running proxy. By default, it is accessible on port 8001. You can use curl to inspect the loaded configuration, check upstream health, and view statistics.

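  • /config_dump: Shows the entire loaded configuration. This is useful for verifying that your envoy3.yaml and dynamic xDS updates have been applied correctly. You can filter it with the resource query parameter, which takes the name of a repeated field in the dump, such as dynamic_route_configs for routes loaded over RDS (?resource=dynamic_route_configs).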
  • /clusters: Provides a detailed status of all upstream clusters, including IP addresses, health status, and connection statistics. This is the best way to check if Envoy can connect to the backend Deephaven services.
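  • /stats: Outputs a large number of performance metrics, one "name: value" pair per line. You can use grep to find specific stats, like upstream_cx_total for connection counts or downstream_rq_5xx for server errors. The full stat names include the cluster name or the HTTP connection manager's stat prefix.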
  • /server_info: Displays the running Envoy version and its uptime, which is useful for confirming that a restart was successful.
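
Because /stats prints hundreds of lines, it can help to show only the counters that have actually fired. The following is a sketch, assuming the plain-text "name: value" output format; the helper name nonzero_stats and the pattern in the usage line are illustrative choices, not something the Deephaven configuration defines.

```shell
# Hypothetical helper: reads /stats text on stdin and prints the lines whose
# stat name matches the given regular expression and whose integer value is > 0.
nonzero_stats() {
  awk -F': ' -v pat="$1" '$1 ~ pat && $2 ~ /^[0-9]+$/ && $2 > 0'
}

# Usage:
#   curl -s http://localhost:8001/stats | nonzero_stats 'downstream_rq_5xx|upstream_cx_connect_fail'
```

Histogram lines are skipped because their values are not plain integers.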

Common Issues and Resolutions

Connection Refused

  • Symptom: Your browser or client shows a "Connection Refused" error when trying to connect to the Envoy port (e.g., 8000).
  • Cause: This typically means the Envoy process is not running or not listening on the correct port.
  • Troubleshooting Steps:
    1. Verify that the Envoy process or container is running using the checklist above.
    2. Check the Envoy logs for startup errors, such as a port conflict or a syntax error in the configuration file.
    3. Ensure no firewall rules on the host or network are blocking access to the port.
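
To confirm whether anything is listening on the Envoy port at all, the listening sockets on the host can be inspected. This is a sketch that assumes a Linux host with the ss utility and that the port is bound on the host itself (host networking or a published container port); the helper name port_listening is hypothetical.

```shell
# Hypothetical helper: reads `ss -ltn` output on stdin and succeeds (exit 0)
# if any socket is listening on the given TCP port, otherwise fails (exit 1).
port_listening() {
  awk -v port="$1" '$1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 } END { exit !found }'
}

# Usage (8000 is the example Envoy port used in this guide):
#   ss -ltn | port_listening 8000 && echo "listening" || echo "nothing listening on 8000"
```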

503 Service Unavailable

  • Symptom: You receive a 503 Service Unavailable error. This is often accompanied by "no healthy upstream" messages in the response body or the logs.
  • Cause: Envoy is running but cannot establish a healthy connection to the backend Deephaven services.
  • Troubleshooting Steps:
    1. Use the /clusters admin endpoint to identify which cluster is unhealthy.
    2. Verify that the backend Deephaven services (e.g., web-api, xds_service) are running and accessible from the Envoy host.
    3. Check for network connectivity issues (e.g., firewall rules, incorrect IP addresses in envoy3.yaml).
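
For step 1, a per-cluster tally makes the unhealthy cluster easy to spot. The following is a sketch under the same assumption about the /clusters text format (fields separated by "::", with health_flags as the second-to-last field on health lines); the helper name cluster_health_summary is hypothetical.

```shell
# Hypothetical helper: reads /clusters output on stdin and prints, for each
# cluster, how many of its members report health_flags::healthy.
cluster_health_summary() {
  awk -F'::' '
    $(NF-1) == "health_flags" {
      total[$1]++
      if ($NF == "healthy") ok[$1]++
    }
    END { for (c in total) printf "%s %d/%d healthy\n", c, ok[c], total[c] }
  ' | sort
}

# Usage:
#   curl -s http://localhost:8001/clusters | cluster_health_summary
```

A cluster that shows 0 healthy members is the likely source of the 503 responses.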

404 Not Found

  • Symptom: You receive a 404 Not Found error for a specific URL.
  • Cause: Envoy is running and connected, but the requested URL path does not match any configured route.
  • Troubleshooting Steps:
    1. Verify the URL you are trying to access is correct.
    2. Dump the route configuration to ensure the routes are correctly defined and loaded from the Deephaven RDS.
      curl 'http://localhost:8001/config_dump?resource=dynamic_route_configs'
      
    3. Check the Deephaven Configuration Server logs to ensure it is correctly publishing routes to Envoy.
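
When comparing a failing URL against the loaded routes, a flat list of route prefixes is easier to scan than the full JSON. This is a rough sketch that assumes the routes use "prefix" matches and that the dump is pretty-printed JSON; it uses plain text matching rather than a JSON parser, so treat the output as a hint, and note that the helper name route_prefixes is hypothetical.

```shell
# Hypothetical helper: reads /config_dump JSON on stdin and prints the unique
# values of every "prefix" key it finds, one per line.
route_prefixes() {
  grep -oE '"prefix": *"[^"]*"' | sed -E 's/^"prefix": *"(.*)"$/\1/' | sort -u
}

# Usage:
#   curl -s http://localhost:8001/config_dump | route_prefixes
```

If the path you requested does not start with any listed prefix, Envoy has no route for it.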

WebSocket Connection Failures

  • Symptom: The Deephaven Web UI loads, but you cannot open a query console, or data does not update in real-time. Browser developer tools show a failed WebSocket handshake.
  • Cause: The WebSocket upgrade request is being blocked before it reaches the backend, or Envoy is not configured to allow upgrades.
  • Troubleshooting Steps:
    1. Verify that the upgrade_configs section is present in the http_connection_manager filter in your envoy3.yaml file.
    2. Check for any intermediate network devices (like corporate firewalls or other proxies) between the client and Envoy that might be blocking WebSocket traffic.
    3. Inspect the Envoy logs for errors related to upgrade failure.
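
The handshake can also be probed from the command line, which takes the browser out of the picture. The following is a sketch only: the path /your-websocket-path is a placeholder for a real WebSocket endpoint in your installation, the Sec-WebSocket-Key value is the sample nonce from RFC 6455, and the helper name handshake_status is hypothetical.

```shell
# Hypothetical helper: reads an HTTP response on stdin and prints the status
# code from its first line. 101 means the upgrade was accepted.
handshake_status() {
  awk 'NR == 1 { print $2; exit }'
}

# Usage (use http:// instead if TLS is not terminated at Envoy):
#   curl -s -i -N --http1.1 --max-time 5 \
#     -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
#     -H 'Sec-WebSocket-Version: 13' \
#     -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
#     https://your-envoy-host:8000/your-websocket-path | handshake_status
```

Running the same probe from the Envoy host and from a client machine can help show whether an intermediate device is stripping the upgrade.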

TLS/SSL Certificate Issues

  • Symptom: The browser shows a security warning (e.g., NET::ERR_CERT_AUTHORITY_INVALID), or connections fail with a TLS handshake error.
  • Cause: The TLS certificate is not correctly configured, trusted, or presented by Envoy.
  • Troubleshooting Steps:
    1. Verify that the paths to your TLS certificate (fullchain.pem) and private key (privkey.pem) in the docker run command's volume mounts are correct.
    2. Ensure that the files have the correct permissions and are readable by the user ID that Envoy is running as inside the container (e.g., 9002).
    3. Use a command-line tool like openssl to inspect the certificate that Envoy is presenting:
      openssl s_client -connect your-envoy-host:8000
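
The s_client output is verbose, so it can help to reduce it to the fields that usually matter: who the certificate was issued to, who issued it, and when it expires. This is a sketch that assumes the openssl command-line tool is installed; the helper name cert_summary is hypothetical.

```shell
# Hypothetical helper: reads text containing a PEM certificate on stdin and
# prints the subject, issuer, and validity dates of the first certificate.
cert_summary() {
  openssl x509 -noout -subject -issuer -dates
}

# Usage:
#   openssl s_client -connect your-envoy-host:8000 </dev/null 2>/dev/null | cert_summary
```

The same helper works on the certificate file directly (cert_summary < fullchain.pem), which lets you compare the file on disk with the certificate Envoy actually presents.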