An Overview of Deephaven's Architecture

Deephaven's power is largely due to the concept that everything is a table. Think of Deephaven tables like Pandas DataFrames, except that they support real-time operations! The Deephaven table is the key abstraction that unites static and real-time data for a seamless, integrated experience. This section will discuss the conceptual building blocks of Deephaven tables before diving into some real code.

The Deephaven Table API

If you're familiar with tabular data (think Excel, Pandas, and more), Deephaven will feel natural to you. Representing data as tables is only the starting point: you must also be able to manipulate, transform, extract, and analyze that data to produce valuable insight. Deephaven represents transformations to the data stored in a table as operations on that table. These table operations can be chained together to generate a new perspective on the data:

from deephaven import new_table
from deephaven.column import string_col, int_col

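# Build a small static table with a grouping column and a data column
source_table = new_table(
    [string_col("Group", ["A", "B", "A", "B"]), int_col("Data", [1, 2, 3, 4])]
)

# Chain table operations; the result gets its own name so that it does not
# shadow the imported new_table function
result_table = (
    source_table.update_view("NewData = Data * 2")
    .sum_by("Group")
    .rename_columns(["DataSum=Data", "NewDataSum=NewData"])
)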

This paradigm of representing data transformations with methods that act on tables is not new. However, Deephaven takes this a step further. In addition to its large collection of table operations, Deephaven provides a vast library of efficient functions and operations, accessible through query strings that are passed to table operations (such as "NewData = Data * 2" above). These query strings, once understood, are a superpower that sets Deephaven apart. More on these later.

Collectively, the representation of data as tables, the table operations that act upon them, and the query strings used within table operations are known as the Deephaven Query Language, or DQL for short. The data transformations written with DQL (like the code above) are called Deephaven queries. The fundamental principle of the Deephaven experience is this:

Deephaven queries are unaware of, and indifferent to, whether the underlying data source is static or streaming.

The magic here is hard to overstate. The same queries written to parse a simple CSV file can be used to analyze a similarly fashioned living dataset, evolving at millions of rows per second, without changing the query itself. Real-time analysis doesn't get simpler.
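The principle above can be illustrated outside Deephaven with a plain-Python analogy. This is only a sketch: `running_sum_by_group`, `static_rows`, and `streaming_rows` are hypothetical names, and nothing here uses or reflects Deephaven's API or engine. It shows one "query", written against an iterable of rows, producing the same results whether the rows already sit in memory or arrive one at a time.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, int]  # (group, value)


def running_sum_by_group(rows: Iterable[Row]) -> Iterator[Tuple[str, int]]:
    """Yield (group, running total) after each incoming row."""
    totals: Dict[str, int] = defaultdict(int)
    for group, value in rows:
        totals[group] += value
        yield group, totals[group]


# Static source: every row is available up front.
static_rows = [("A", 1), ("B", 2), ("A", 3), ("B", 4)]


def streaming_rows() -> Iterator[Row]:
    """Stand-in for a live feed: rows are produced one at a time."""
    for row in static_rows:
        yield row


# The same query function is applied to both kinds of source, unchanged.
static_result = list(running_sum_by_group(static_rows))
streaming_result = list(running_sum_by_group(streaming_rows()))
```

The query function never inspects where its rows come from, which is the property the principle describes. In Deephaven itself, that role is played by table operations rather than Python generators.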

Immutable schema, mutable data

When Deephaven tables are created, they follow a well-defined recipe that cannot be modified after creation. This means that the number of columns in a table, the column names, and their data types - collectively known as the table's schema - are immutable. Even so, the data stored in a table can change. Here's a simple example that uses time_table:

from deephaven import time_table

# Create a table that ticks once per second - defaults to a single "Timestamp" column
ticking_table = time_table("PT1s")

New data is added to the table each second, but the schema stays the same.

Some queries appear to modify a table's schema. They may add or remove a column, modify a column's type, or rename a column. For example, the following query appears to rename the Timestamp column in the table to RenamedTimestamp:

# Apply 'rename_columns' table operation to ticking_table
ticking_table = ticking_table.rename_columns("RenamedTimestamp=Timestamp")

Since the schema of ticking_table is immutable, its columns cannot be renamed. Instead, Deephaven creates a new table with a new schema. Deephaven is smart about creating the new table so that only a tiny amount of additional memory is used to make it. An important consequence of this design is that Deephaven does not support in-place operations:

# In-place operation - DON'T DO THIS! ticking_table is not changed.
ticking_table.update_view("NewColumn = 1")

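That operation has no effect on ticking_table: update_view returns a new table, and because the result is never assigned, it is simply discarded. Instead, always assign the output of any table operation to a variable, even if it reuses the same name: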

ticking_table = ticking_table.update_view("NewColumn = 1")
new_ticking_table = ticking_table.update_view("RowIndex = ii")
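The behavior described in this section can be mimicked with a small toy model in plain Python. This is a sketch under stated assumptions, not a description of Deephaven internals: `ToyTable` and its methods are hypothetical, and sharing column storage between tables is just one plausible way to make a "new" table cheap. It shows a schema that is fixed per table, operations that return new tables, and data that can still grow underneath.

```python
from typing import Dict, List, Tuple


class ToyTable:
    """Toy table: schema is fixed at construction; operations return new tables."""

    def __init__(self, columns: Dict[str, List]):
        self._columns = dict(columns)  # column name -> column storage
        self.schema: Tuple[str, ...] = tuple(self._columns)  # fixed column names

    def column(self, name: str) -> List:
        return self._columns[name]

    def rename_column(self, new: str, old: str) -> "ToyTable":
        # Build a new table with a different schema that reuses the same
        # column storage, so no data is copied (an illustrative assumption).
        renamed = {
            (new if name == old else name): data
            for name, data in self._columns.items()
        }
        return ToyTable(renamed)


ticking = ToyTable({"Timestamp": [1, 2, 3]})

# Result discarded: the original table and its schema are untouched.
ticking.rename_column("RenamedTimestamp", "Timestamp")
schema_after_discard = ticking.schema

# Result assigned: a new table with a new schema over the same storage.
renamed = ticking.rename_column("RenamedTimestamp", "Timestamp")
shares_storage = renamed.column("RenamedTimestamp") is ticking.column("Timestamp")

# "Mutable data": a new row in the shared storage is visible through both tables.
ticking.column("Timestamp").append(4)
renamed_length = len(renamed.column("RenamedTimestamp"))
```

The discarded call mirrors the "DON'T DO THIS" example above: the operation runs, but nothing observable changes because its return value is dropped.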

The engine

The Deephaven engine is the powerhouse that implements everything you've seen so far. The engine is responsible for processing streaming data as efficiently as possible, and all Deephaven queries must flow through the engine.

Internally, the engine represents queries as directed acyclic graphs (DAGs). This representation is key to Deephaven's efficiency. When a parent (upstream) table changes, a change summary is sent to child (downstream) tables. This compute-on-deltas model ensures that only updated rows in child tables are re-evaluated for each update cycle and allows Deephaven to process millions of rows per second.
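The compute-on-deltas idea can be sketched with a toy graph in plain Python. Everything here is hypothetical (`ToyNode`, `derive`, `on_added`), it models only added rows, and it is not how the Deephaven engine is implemented. It shows a parent node forwarding a change summary (the list of newly added rows) to its child, so the child evaluates only those rows instead of recomputing the whole table.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


class ToyNode:
    """One table in a toy query graph."""

    def __init__(self, transform: Callable[[Row], Row] = lambda row: dict(row)):
        self.transform = transform
        self.rows: List[Row] = []
        self.children: List["ToyNode"] = []
        self.rows_evaluated = 0  # rows this node has computed so far

    def derive(self, transform: Callable[[Row], Row]) -> "ToyNode":
        """Create a child (downstream) node, initialized from the current rows."""
        child = ToyNode(transform)
        child.on_added(self.rows)
        self.children.append(child)
        return child

    def on_added(self, added: List[Row]) -> None:
        """Process only the newly added rows, then forward the delta downstream."""
        computed = [self.transform(row) for row in added]
        self.rows_evaluated += len(computed)
        self.rows.extend(computed)
        for child in self.children:
            child.on_added(computed)


source = ToyNode()
doubled = source.derive(lambda row: {**row, "NewData": row["Data"] * 2})

source.on_added([{"Data": 1}, {"Data": 2}, {"Data": 3}])  # first update cycle
evaluated_before = doubled.rows_evaluated
source.on_added([{"Data": 4}])  # second update cycle
evaluated_in_second_cycle = doubled.rows_evaluated - evaluated_before
```

In the second cycle the child evaluates one row even though it already holds three, which is the saving the compute-on-deltas model is after. A fuller model would also carry removed and modified rows in the change summary.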

For the sake of efficiency and scalability, the engine is implemented mostly in Java. Although you primarily interface with tables and write queries from Python, all of the underlying data in a table is stored in Java data structures, and all low-level operations are implemented in Java. Most Deephaven users will interact with Java at least a little bit, and there are important things to consider when passing data between Python and Java.