Overview
Deephaven provides an efficient and user-friendly way to work with both streaming and batch data. Before you can work with that data, you must first ingest it into Deephaven. How do you get batch data into Deephaven? What about streaming data? How do you make data available to every worker in a Deephaven cluster? What does Deephaven Enterprise offer for data ingestion that Deephaven Community Core does not? This page answers those questions and links to more detailed pages on each of the available methods.
Deephaven Enterprise
Deephaven Enterprise is the full-featured version of the Deephaven platform. It's a comprehensive system designed for production use, large-scale processing, reliability, orchestration, and more. It provides data ingestion mechanisms that persist data across clusters and sessions, so data ingested once is available wherever it's needed.
Deephaven Database
Deephaven Enterprise includes a built-in database that persists data across clusters and sessions. A few key points explain what the database is and does. It:
- Makes tables available to any worker in a Deephaven cluster through the API.
- Supports both streaming and static datasets.
- Allows users to add tables to the database programmatically.
Deephaven's database provides a robust solution for managing batch and streaming data together in a distributed environment. Any worker in a Deephaven cluster can access the tables stored in the database. For example, the following query pulls historical stock trade data from August 23, 2017, from the LearnDeephaven namespace:
// Get stock trade data from August 23, 2017 from the LearnDeephaven namespace
trades = db.historicalTable("LearnDeephaven", "StockTrades").where("Date=`2017-08-23`")
# Get stock trade data from August 23, 2017 from the LearnDeephaven namespace
trades = db.historical_table("LearnDeephaven", "StockTrades").where("Date=`2017-08-23`")
Data Import Server (DIS)
The Data Import Server (DIS) is a central agent for ingesting streaming data and then serving it to queries. The DIS takes data from external sources and converts it to Deephaven-supported formats in order to make it available to the Deephaven database. If streaming data is available in the database, it comes from a DIS. Deephaven deployments have a centralized DIS, but you can also use individual workers as Data Import Servers (DISes).
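For example, once a DIS is serving a table, any worker can query it through the database. The following is a minimal Python sketch; the MarketUS namespace and Quotes table name are placeholders, not tables that ship with Deephaven.
# Read live intraday data served by a DIS; the namespace and table name are placeholders
quotes = db.live_table("MarketUS", "Quotes")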
Streaming binary logs (binlogs)
Streaming binary logs (commonly referred to as binlogs) are integral to handling streaming intraday data in Deephaven. Binlogs are a powerful tool for persisting streaming data and making it available across an entire Deephaven cluster.
Binlogs work as follows:
- An application logs its streaming data to a proprietary row-oriented format optimized for ingesting and processing high-frequency streaming data.
- A process called the tailer reads the binlogs and sends them to the DIS, where the data is stored persistently.
- The DIS makes that data available to workers via the Deephaven database as live intraday data.
For an example, see the Streaming binlogs crash course.
Streaming imports
Streaming imports can be thought of as in-worker Data Import Servers (DISes). They are an alternative to streaming binlogs that make streaming data available to all workers in a Deephaven Enterprise cluster. Use streaming imports when you do not want to persist streaming data via binlogs.
A streaming import is a query that ingests streaming data from an external source and writes it to a DIS running in the same process. The in-worker DIS makes the data available to the Deephaven database as live intraday data.
Streaming imports support a variety of data sources, including:
- Kafka
- Solace
- Websockets
- Derived table writers
See the Streaming Kafka crash course for an example of this workflow using Kafka.
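As a rough illustration of the Kafka side of this workflow, the sketch below uses the Community Core Kafka consumer API, which Core+ workers also provide. It only consumes a topic into an in-memory table; wiring the result to an in-worker DIS uses Enterprise-specific configuration covered in the crash course. The broker address, topic name, and columns here are hypothetical.
# Consume a Kafka topic into an append-only table (Community Core API).
# The broker address, topic name, and column definitions below are hypothetical.
from deephaven import kafka_consumer as kc
from deephaven import dtypes as dht
from deephaven.stream.kafka.consumer import TableType, KeyValueSpec

quotes = kc.consume(
    {"bootstrap.servers": "kafka-broker:9092"},
    "quotes",
    key_spec=KeyValueSpec.IGNORE,
    value_spec=kc.json_spec([("Sym", dht.string), ("Price", dht.double)]),
    table_type=TableType.append(),
)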
Batch imports
For streaming data, binlogs and streaming imports are the way to go. But what if you have batch data that you want to ingest into Deephaven? Deephaven Enterprise has you covered there as well.
Batch imports are a method of ingesting data into the Deephaven database as intraday data. Users can import data from a variety of sources and formats.
Batch imports are performed in a two-step process:
- Deploy a schema for the table. Schemas are XML files under source control in the Deephaven system that define the structure of the data (see the sketch after this list).
- Persist the data to the Deephaven database. For example, a batch CSV import is performed by creating a CSV Import Persistent Query (PQ). Other data formats have their own import PQs.
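The following is an illustrative sketch of a schema for a hypothetical ExampleNamespace.StockQuotes table. The namespace, table name, and columns are assumptions, and real schemas may include additional import-specific elements.
<Table name="StockQuotes" namespace="ExampleNamespace" storageType="NestedPartitionedOnDisk">
  <!-- Hypothetical table: partitioned by Date, with three data columns -->
  <Partitions keyFormula="${autobalance_single}" />
  <Column name="Date" dataType="String" columnType="Partitioning" />
  <Column name="Sym" dataType="String" />
  <Column name="Price" dataType="double" />
  <Column name="Size" dataType="int" />
</Table>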
The Deephaven UI makes creating and managing these batch import PQs easy. It provides a simple interface for configuring the import process, including selecting the data source, specifying the schema, and defining the import schedule.
For an example, see the Batch CSV crash course.
Deephaven Community Core
Deephaven Enterprise deployments have access to all of the features available in Deephaven Community Core, which is the free and open version of Deephaven. Enterprise clusters consist of "Core+" workers, which are Deephaven Community Core instances with additional Enterprise features, including those discussed in this guide.
Thus, an understanding of Community Core methods and features will greatly benefit Enterprise users. The following lists provide links to relevant data ingestion topics in Deephaven Community Core based on the language used: