Data ingestion

Ensuring continuous and reliable data ingestion is paramount for any Deephaven deployment that depends on real-time data for critical operations. A resilient ingestion pipeline is designed to prevent data loss, minimize service interruptions, and preserve data integrity, even in the face of component failures or network disruptions. This document explores Deephaven's core data ingestion models—file-based and message bus—and details the strategies for configuring high availability, including DIS failover groups and redundant persistent queries. It also covers common failure scenarios, recovery mechanisms, and best practices for monitoring and data reconciliation. By understanding these principles, administrators can build and maintain robust data ingestion pathways that form a critical layer of overall system resilience.

Core ingestion models and redundancy

Streaming data ingestion in Deephaven typically follows two models.

File-based ingestion

In this model, a custom process subscribes to a data stream.

  1. The custom process uses a Deephaven logger class to append events to binary log files.
  2. A Deephaven tailer streams blocks of binary data from the binary log files to a Data Import Server (DIS).
  3. The DIS writes events to disk and publishes them to ticking table clients.

For this model, redundancy can be configured at several levels. The logger process and tailer are usually co-located on a system with low-latency access to the source data stream. One option is to configure multiple DIS processes, at a minimum on multiple nodes of a single installation, and possibly also in other installations (such as a production backup or DR site), and have a single tailer send data to all of these destinations. A more resilient option is to run a separate logger and tailer system for each DIS destination, so that the entire data stream is redundant all the way back to its source (which typically has its own fault tolerance).
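The single-tailer fan-out option can be pictured with a small sketch. This is purely conceptual Python, not the Deephaven tailer: the real tailer handles multi-destination delivery, framing, and acknowledgements through its own configuration and wire protocol, and the endpoint addresses and block size below are made-up placeholders.

```python
import socket

# Hypothetical DIS endpoints; in a real deployment these come from the
# tailer's configuration, not from application code.
DIS_ENDPOINTS = [("dis1.example.com", 22021), ("dis2.example.com", 22021)]

BLOCK_SIZE = 64 * 1024  # read the binary log in fixed-size blocks

def fan_out_blocks(binlog_path: str) -> None:
    """Conceptual illustration: stream blocks of a binary log to every destination."""
    connections = [socket.create_connection(addr) for addr in DIS_ENDPOINTS]
    try:
        with open(binlog_path, "rb") as binlog:
            while True:
                block = binlog.read(BLOCK_SIZE)
                if not block:
                    break  # end of file; a real tailer would keep polling for new data
                for conn in connections:
                    conn.sendall(block)  # each DIS receives the same stream
    finally:
        for conn in connections:
            conn.close()
```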

Message bus ingestion

In this model, some process produces data to a Kafka or Solace topic.

  1. A Kafka or Solace ingester running in a merge server PQ subscribes to the topic.
  2. The ingester receives events from the topic, writes them to disk, and publishes them to ticking table clients.

For this model, Kafka and Solace both have their own fault tolerance capabilities which can be configured to ensure uninterrupted processing of the topic. On the Deephaven side, multiple ingester PQs can be configured on multiple merge servers in one or several installations.
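As a point of reference for how a topic-based ingester interacts with the broker, the sketch below uses kafka-python, a generic Kafka client; it is not the Deephaven Enterprise ingester API. The topic, broker address, consumer group, and the persist_event stand-in are all assumptions, and the sketch only illustrates the subscribe-then-commit pattern that lets the topic act as a replayable buffer.

```python
from kafka import KafkaConsumer  # kafka-python: a generic client, not the Deephaven ingester

def persist_event(key: bytes, value: bytes) -> None:
    """Stand-in for the ingester's durable write of the event to intraday storage."""
    ...

# Hypothetical topic, broker, and consumer group; real values come from your deployment.
consumer = KafkaConsumer(
    "trades",
    bootstrap_servers="kafka-broker:9092",
    group_id="dh-ingester",
    enable_auto_commit=False,      # commit offsets only after the event is durably written
    auto_offset_reset="earliest",  # on first start, begin from the oldest retained message
)

for message in consumer:
    persist_event(message.key, message.value)
    # Committing after persisting gives at-least-once delivery: if the process dies
    # between the write and the commit, the broker replays the message on restart.
    consumer.commit()
```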

Configuring high availability

DIS failover groups

When using multiple DIS services or ingester PQs feeding data into a single Deephaven installation, failover groups provide automatic round-robin load balancing and failover in case of service loss. This configuration ensures continuous data ingestion even when individual DIS instances become unavailable.

Key resilience requirements:

  • Each DIS instance needs its own data storage for this configuration.
  • All DIS instances in a group must handle the same data, meaning they must have identical claims or filters.
  • The table data service automatically redirects requests to available DIS instances when failures occur.

See failover groups for detailed setup instructions and YAML examples.
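Conceptually, the behavior a failover group provides can be illustrated with a short sketch. This is not the table data service implementation (that logic lives in the routing configuration and Deephaven services, not user code); it only demonstrates round-robin selection that skips instances marked as unavailable, using hypothetical instance names.

```python
import itertools

class FailoverGroup:
    """Conceptual round-robin selector that skips instances marked unavailable."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._cycle = itertools.cycle(self._instances)
        self._down = set()

    def mark_down(self, instance):
        self._down.add(instance)

    def mark_up(self, instance):
        self._down.discard(instance)

    def next_available(self):
        # Visit each instance at most once per call, in round-robin order.
        for _ in range(len(self._instances)):
            candidate = next(self._cycle)
            if candidate not in self._down:
                return candidate
        raise RuntimeError("no DIS instance in the group is available")

# Hypothetical instance names.
group = FailoverGroup(["dis1", "dis2", "dis3"])
group.mark_down("dis2")        # simulate a failed DIS
print(group.next_available())  # requests are routed around the failed instance
```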

Redundant ingester PQs

For message bus ingestion, the recommended approach is to use Persistent Query redundancy and failover with replicas and spares. This provides automatic load balancing across identical ingester PQs and seamless failover when individual workers fail.

Key benefits:

  • Automatic user assignment across replica slots using configurable assignment policies.
  • Immediate spare activation when a replica fails.
  • Compartmentalized failures that don't cascade to other replicas.
  • Dynamic capacity adjustment without restart.

Alternatively, you can manually run multiple identical ingester PQs on different merge servers, all subscribing to the same Kafka or Solace topic. In that case, your application logic must ensure either that only one ingester actively processes data at a time, or that duplicate processing is harmless (that is, the ingestion is idempotent), as sketched below.
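One common way to make duplicate delivery harmless is to deduplicate on a per-key sequence number before writing. The sketch below is a hypothetical illustration of that idea: the field names, the in-memory seen-sequence map, and the write_event stand-in are all assumptions, and a production ingester would persist its dedup state so it survives restarts.

```python
# Track the highest sequence number already written for each key.
# In-memory state is for illustration only; real dedup state must be durable.
last_seq_written: dict[str, int] = {}

def write_event(key: str, seq: int, payload: dict) -> None:
    """Stand-in for appending the event to intraday storage."""
    ...

def ingest_if_new(key: str, seq: int, payload: dict) -> bool:
    """Write the event only if it has not been seen before; return True if written."""
    if last_seq_written.get(key, -1) >= seq:
        return False  # duplicate from a second ingester (or a replay); safe to drop
    write_event(key, seq, payload)
    last_seq_written[key] = seq
    return True
```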

Failure and recovery scenarios

DIS or merge server failure

If a DIS instance within a failover group becomes unavailable, the routing configuration automatically redirects traffic to the other DIS instances in the group. Since each DIS maintains its own copy of intraday data, there is no data loss for new incoming data. When the failed DIS returns, it will catch up on any data it missed if the source (such as the binary logs and tailer that sends them) is still available.

Tailer or logger process failure

The Deephaven tailer-to-DIS interface is designed to be resilient. If the tailer process fails and is restarted, it will resume reading from the last checkpoint processed by the DIS, preventing data gaps.
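The checkpoint-and-resume idea can be illustrated with a small sketch. This is not the actual tailer/DIS checkpoint protocol; it is a hypothetical reader that records a byte offset after each block is handed off and seeks back to that offset after a restart, so no block is skipped.

```python
CHECKPOINT_FILE = "tailer.checkpoint"  # hypothetical checkpoint location
BLOCK_SIZE = 64 * 1024

def load_checkpoint() -> int:
    """Return the byte offset of the last acknowledged block, or 0 on first start."""
    try:
        with open(CHECKPOINT_FILE) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return 0

def save_checkpoint(offset: int) -> None:
    with open(CHECKPOINT_FILE, "w") as f:
        f.write(str(offset))

def stream_from_checkpoint(binlog_path: str, send_block) -> None:
    """Resume streaming a binary log from the last acknowledged offset."""
    offset = load_checkpoint()
    with open(binlog_path, "rb") as binlog:
        binlog.seek(offset)
        while True:
            block = binlog.read(BLOCK_SIZE)
            if not block:
                break                # caught up; a real tailer keeps polling
            send_block(block)        # stand-in for sending to the DIS and awaiting ack
            offset += len(block)
            save_checkpoint(offset)  # only acknowledged data advances the checkpoint
```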

Network disruption

Both the binary log files used by tailers and the topics in message brokers like Kafka act as persistent buffers. If a DIS or ingester PQ is temporarily unable to receive data due to a network error, the data accumulates in those files or topics. Once connectivity is restored, the DIS or PQ processes the backlog, ensuring no data is lost.

End-of-day merging and data reconciliation

For either ingestion model, the intraday data that has been received may be merged to historical storage at some point (e.g., at end of day).

Suppose, for example, that DIS1 and DIS2 both provide a particular ticking table. On a normal day, the data from DIS1 would be merged and the data from DIS2 discarded. On a day when DIS1 had failed, the partial data from DIS1 would be discarded and the complete data from DIS2 merged instead. In the extremely unlikely event that both DIS1 and DIS2 were offline for part of the day, manual work might be needed to consolidate and deduplicate their combined partial tables before running a merge from the result. Even then, this is not necessarily required: because tailer binary logs and Kafka and Solace topics persist data while a DIS is unable to receive it, a DIS outage normally just results in the DIS catching up once it is brought back online, leaving no missing data by the end of the day.
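For that rare double-outage case, the manual consolidation step amounts to unioning the partial data sets and dropping duplicate events before running the merge. Below is a minimal sketch with pandas, assuming the partial intraday data can be exported to Parquet files and that each row carries a unique (Symbol, SequenceNumber) key; the file names and key columns are assumptions.

```python
import pandas as pd

# Hypothetical partial intraday extracts from DIS1 and DIS2 for the same table.
dis1_partial = pd.read_parquet("dis1_partial.parquet")
dis2_partial = pd.read_parquet("dis2_partial.parquet")

# Union the two partial data sets, then drop duplicate events.
# Assumes each event carries a unique (Symbol, SequenceNumber) key.
combined = (
    pd.concat([dis1_partial, dis2_partial], ignore_index=True)
      .drop_duplicates(subset=["Symbol", "SequenceNumber"])
      .sort_values("SequenceNumber")
)

combined.to_parquet("consolidated_for_merge.parquet")  # input for the manual merge
```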

Monitoring and best practices

  • Monitoring: Actively monitor the health of your ingestion components. For message queues, track consumer lag to ensure your ingesters are keeping up with the data stream (a minimal lag check is sketched after this list). For DIS, monitor table update frequencies and error logs.
  • Testing: Regularly test your failover procedures in a non-production environment to ensure they work as expected.
  • Idempotency: Whenever possible, design your data ingestion and processing logic to be idempotent. This means that processing the same message multiple times will not have an adverse effect, which simplifies recovery from certain failure modes.
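For the consumer-lag check mentioned in the monitoring bullet above, here is a minimal sketch using kafka-python (a generic Kafka client, not a Deephaven API); the topic, broker address, and consumer group are assumptions.

```python
from kafka import KafkaConsumer, TopicPartition

# Hypothetical topic, broker, and consumer group.
TOPIC = "trades"
GROUP = "dh-ingester"

consumer = KafkaConsumer(bootstrap_servers="kafka-broker:9092", group_id=GROUP)
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]

end_offsets = consumer.end_offsets(partitions)  # latest offset per partition
for tp in partitions:
    committed = consumer.committed(tp) or 0     # last offset the ingester committed
    lag = end_offsets[tp] - committed
    print(f"partition {tp.partition}: lag = {lag}")
```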