Deephaven's table types
Deephaven tables are the core data structures supporting Deephaven's static and streaming capabilities. Deephaven implements several specialized table types that differ in how streaming data is stored and processed in the engine, how the UI displays data to the user, and how some downstream operations behave. This document covers static tables, standard streaming tables, and the four specialized streaming table types: append-only, add-only, blink, and ring.
Table type summary
This guide discusses the unique properties of each of Deephaven's table types. The main points are summarized in the following table.
| Table type | Index-based ops, special variables | Consistent inter-cycle row ordering | Bounded memory usage | Inter-cycle data persistence |
| --- | --- | --- | --- | --- |
| Static | ✅ | 🚫 | ✅ | 🚫 |
| Standard | ❌ | ❌ | ❌ | ✅ |
| Append-only | ✅ | ✅ | ❌ | ✅ |
| Add-only | ❌ | ❌ | ❌ | ✅ |
| Blink | ✅ | 🚫 | ✅ | ❌ |
| Ring | ❌ | ❌ | ✅ | ✅ |

✅ = guaranteed, ❌ = not guaranteed, 🚫 = not applicable.
The rest of this guide explores each of these table types, their properties, and the consequences of those properties.
Static tables
Static tables are the simplest type of Deephaven table and are analogous to pandas DataFrames, PyArrow Tables, and many other tabular representations of data. They represent data sources that do not update and therefore do not support any of Deephaven's streaming capabilities.
Because static tables do not update, they have the following characteristics:
- Index-based operations are fully supported. The row indices of a static table always range from `0` to `N-1`, so operations can depend on these values being stable.
- Operations that depend on or modify external state can be used with static tables. Stateful operations can present problems for some types of streaming tables.
- The use of special variables is fully supported. Deephaven's special variables `i` and `ii` represent the row indices of a table as `int` and `long` types, respectively. These variables are guaranteed to have consistent values in static tables.
Static tables can be created by reading from a static data source, such as CSV, Iceberg, Parquet, or SQL, or with Deephaven's table creation functions, like `new_table` or `empty_table`. This example uses `empty_table` to construct a static table:
```python
from deephaven import empty_table

# create a static table with 10 rows and 2 columns
t = empty_table(10).update(["IntIdx = i", "LongIdx = ii"])
```
Check whether a table is a static table with the `is_refreshing` property. This property is `False` for static tables:
```python
print(t.is_refreshing)  # expected output: False
```
Any streaming table can be converted to a static table by taking a `snapshot`. This produces a static copy of the streaming table at the moment in time the snapshot is taken:
```python
from deephaven import time_table

# create a streaming table with time_table
t = time_table("PT1s")

# at some point in the future, take a snapshot of the streaming table
t_snapshot = t.snapshot()
```
Verify that the snapshot is static with `is_refreshing`:
```python
print(t.is_refreshing)  # True: t is a streaming table
print(t_snapshot.is_refreshing)  # False: the snapshot is static
```
Standard streaming tables
Most streaming Deephaven tables are "standard" tables. These are the most flexible and least constrained types of tables, with the following key properties:
- Rows can be added, modified, deleted, or reindexed at any time, at any position in the table.
- The table's size can grow without bound.
These properties have some important consequences:
- Index-based operations, stateful operations, or operations using special variables may yield results that change unexpectedly between update cycles. By default, Deephaven throws an error in these cases.
- The rows in standard tables are not guaranteed to maintain their original order of arrival. Operations should not assume anything about the order of data in a standard table.
- Standard tables may eventually result in out-of-memory errors in data-intensive applications.
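For illustration, here is a minimal sketch (the column expression is illustrative) of one way a standard table arises: sorting a streaming table yields a table in which incoming rows can land at any position, so row positions are not stable between update cycles.

```python
from deephaven import time_table

# sorting a streaming table produces a standard table: each new row can be
# inserted at any position, so row positions shift between update cycles
t = time_table("PT1s").update("X = ii % 3")
t_standard = t.sort("X")
```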
These properties are not ideal for every use case. Deephaven's specialized table types provide alternatives.
Specialization 1: Append-only
Append-only tables are a highly constrained table type. They have the following key properties:
- Rows can only be added to the end of the table.
- Once a row is in an append-only table, it cannot be modified, deleted, or reindexed.
- The table's size can grow without bound.
These properties yield the following consequences:
- Append-only tables guarantee that old rows will not change, move, or disappear, so index-based operations, stateful operations, or operations using special variables are guaranteed to yield results that do not change unexpectedly between update cycles.
- The rows in append-only tables are guaranteed to maintain their original order of arrival.
- Append-only tables may eventually result in out-of-memory errors in data-intensive applications.
Append-only tables are useful when the use case needs a complete and ordered history of every record ingested from a stream. They are safe and predictable under any Deephaven query and are guaranteed to retain all the data they've seen.
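For example, `time_table` (without `blink_table=True`) produces an append-only table, so downstream operations that rely on stable row indices are safe. A minimal sketch:

```python
from deephaven import time_table

# time tables are append-only: rows are only ever added at the end
t = time_table("PT1s")

# special variables are safe here; RowNum never changes once assigned
t_indexed = t.update("RowNum = ii")
```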
Specialization 2: Add-only
Add-only tables are relaxed versions of append-only tables. They have the following key properties:
- Rows can only be added to the table, but they may be added at any position in the table.
- Existing rows cannot be deleted or modified, but may be reindexed.
- The table's size can grow without bound.
These properties yield the following consequences:
- Index-based operations, stateful operations, or operations using special variables may yield results that change unexpectedly between update cycles. By default, Deephaven throws an error in these cases.
- The rows in add-only tables are not guaranteed to maintain their original order of arrival. Operations should not assume anything about the order of data in an add-only table.
- Add-only tables may eventually result in out-of-memory errors in data-intensive applications.
Specialization 3: Blink
Blink tables keep only the set of rows received during the current update cycle. Users can create blink tables when ingesting Kafka streams, creating time tables, or using Table Publishers. They have the following key properties:
- The table consists only of rows added in the current update cycle.
- No rows persist for more than one update cycle.
- The table's size is bounded by the size of the largest update it receives.
These properties have the following consequences:
- Since blink tables see a brand new world at every update cycle, index-based operations, stateful operations, or operations using special variables are guaranteed to yield results that do not change unexpectedly between update cycles.
- The entire table changes every update cycle, so preserving row order from cycle to cycle is irrelevant.
- Blink tables can only cause memory problems if a single update receives more data than fits in available RAM. This is unusual, but not impossible.
Blink tables are the default table type for Kafka ingestion within Deephaven because they use little memory. They are most useful for low-memory aggregations, deriving downstream tables, or using programmatic listeners to react to data.
Check whether a table is a blink table with the `is_blink` property:
```python
from deephaven import time_table

# time_table can be used to create a blink table with blink_table=True
t = time_table("PT0.2s", blink_table=True).update("X = ii")

print(t.is_blink)  # expected output: True
```
Specialized semantics for blink tables
Aggregation operations such as `agg_by` and `count_by` operate with special semantics on blink tables: the result aggregates over the entire observed stream of rows from the time the operation is initiated. That means, for example, that a `sum_by` on a blink table will contain the resulting sums for each aggregation group over all rows observed since the `sum_by` was applied, rather than just the sums for the current update cycle. This allows aggregations over the full history of a stream to be performed with greatly reduced memory cost compared to the alternative strategy of holding the entire stream as an in-memory table.
Here is an example that demonstrates a blink table's specialized aggregation semantics:
```python
from deephaven import time_table

# create a blink table with two groups of data to sum
t = time_table("PT0.1s", blink_table=True).update(
    ["X = ii", "Group = ii % 2 == 0 ? `A` : `B`"]
)

# note that the sums continue to grow by including all previous data
t_sum = t.view(["X", "Group"]).sum_by("Group")
```
These special aggregation semantics may not always be desirable. Disable them by calling `remove_blink` on the blink table:
```python
t_no_blink = t.remove_blink()

# the sum is no longer over all data, but only over data in the current cycle
t_sum_no_blink = t_no_blink.view(["X", "Group"]).sum_by("Group")
```
Most operations on blink tables behave exactly as they do on other tables (see the exclusions below); that is, added rows are processed as usual. For example, `select` on a blink table will contain only the newly added rows from the current update cycle.
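Here is a minimal sketch of that behavior, reusing the blink time table pattern from the earlier examples:

```python
from deephaven import time_table

t = time_table("PT0.2s", blink_table=True).update("X = ii")

# t_select holds only the rows added during the current update cycle
t_select = t.select(["Timestamp", "X"])
```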
Because Deephaven does not need to keep all the history of rows read from the input stream in memory, table operations on blink tables may require less memory.
To disable blink table semantics, use `remove_blink`, which returns a child table that is identical to the parent blink table in every way but is no longer marked for special blink table semantics. The resulting table still exhibits the "blink" update pattern, removing all previous rows on each cycle, and thus only contains "new" rows.
Unsupported operations
Attempting to use the following operations on a blink table will raise an error:
- `group_by`
- `partition_by`
- `partition_agg_by`
- `head_pct`
- `tail_pct`
- `slice`
- `slice_pct`
- `agg_by` if either `group` or `partition` is used.
- `rollup` if `includeConstituents=true`.
- `tree`
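As a quick illustration (the exact error text may vary by version), attempting one of these operations on a blink table raises an error:

```python
from deephaven import time_table

t = time_table("PT1s", blink_table=True)

# group_by is unsupported on blink tables, so this raises an error
try:
    t_grouped = t.group_by()
except Exception as e:
    print(e)
```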
Create an append-only table from a blink table
It is common to create an append-only table from a blink table to preserve the entire data history. Use `blink_to_append_only` to do this:
```python
from deephaven import time_table
from deephaven.stream import blink_to_append_only

t = time_table("PT1s", blink_table=True)

# get an append-only table from t
t_append_only = blink_to_append_only(t)
```
Create a blink table from an add-only table
It may be useful to create a blink table from an add-only table. This only provides real benefit if the upstream add-only table is not fully in-memory; if it is, the operation will not fail, but there will be no memory savings. Use `add_only_to_blink` to accomplish this:
```python
from deephaven import time_table
from deephaven.stream import add_only_to_blink

# t is append-only, which is a subset of add-only
t = time_table("PT0.5s")

# get a blink table from t
t_blink = add_only_to_blink(t)
```
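As a quick sanity check, the converted table should report as a blink table via the `is_blink` property shown earlier:

```python
from deephaven import time_table
from deephaven.stream import add_only_to_blink

# convert an append-only time table to a blink table
t_blink = add_only_to_blink(time_table("PT0.5s"))

print(t_blink.is_blink)  # expected output: True
```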
Specialization 4: Ring
Ring tables are like standard tables, but are limited in how large they can grow. They have the following key properties:
- Rows can be added, modified, deleted, or reindexed at any time, at any position in the table.
- The table's size is strictly limited to the latest `N` rows, where `N` is set by the user. As new rows are added, the oldest rows are discarded so as not to exceed the limit.
These properties have the following consequences:
- Index-based operations, stateful operations, or operations using special variables may yield results that change unexpectedly between update cycles. By default, Deephaven throws an error in these cases.
- The rows in ring tables are not guaranteed to maintain their original order of arrival. Operations should not assume anything about the order of data in a ring table.
- Ring tables will not grow without bound and are strictly limited to a maximum number of rows. Once that limit is reached, the oldest rows are discarded and deleted from memory.
Ring tables are semantically the same as standard tables, and they do not get specialized aggregation semantics like blink tables do. However, operations use less memory because ring tables dispose of old data.
Create a ring table from a blink table
It is common to create a ring table from a blink table to preserve some, but not all, data history. Use `ring_table` to do this.
The following example creates a ring table that holds the five most recent observations from a blink table:
```python
from deephaven import time_table, ring_table

t = time_table("PT0.5s", blink_table=True)

# get a ring table from t that holds the last five rows
t_ring = ring_table(parent=t, capacity=5)
```
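Because ring tables do not get blink tables' specialized aggregation semantics, an aggregation on a ring table reflects only the rows currently retained. A minimal sketch, reusing the grouped blink table pattern from earlier:

```python
from deephaven import time_table, ring_table

t = time_table("PT0.2s", blink_table=True).update(
    ["X = ii", "Group = ii % 2 == 0 ? `A` : `B`"]
)
t_ring = ring_table(parent=t, capacity=10)

# the sums cover only the (at most) ten rows currently held in the ring
# table, not the full history of the stream
t_sum = t_ring.view(["X", "Group"]).sum_by("Group")
```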
Create a ring table from an append-only table
Creating a ring table from an append-only table does not give the memory savings that ring tables are useful for.
Ring tables can also be created from append-only tables using `ring_table`. This is a less common use case because the typical memory savings that ring tables afford are lost: an append-only table anywhere in the pipeline can grow until it consumes all available memory. A downstream ring table only appears to save memory and is effectively equivalent to applying a `tail` operation to the append-only table.
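For comparison, a sketch of the roughly equivalent `tail` formulation:

```python
from deephaven import time_table

# t is an append-only table
t = time_table("PT0.5s")

# roughly equivalent to a capacity-5 ring table over t: both show only the
# last five rows, and neither frees the memory retained by t itself
t_tail = t.tail(5)
```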
This example creates a ring table with a 5-row capacity from a simple append-only time table:
```python
from deephaven import time_table, ring_table

# t is an append-only table
t = time_table("PT0.5s")

# get a ring table from t that holds the last five rows
t_ring = ring_table(parent=t, capacity=5)
```
If the source append-only table already has rows in it when `ring_table` is called, the resulting ring table will include those rows by default:
```python
from deephaven import empty_table, time_table, ring_table, merge

# create an append-only table that starts with five rows
t_static = empty_table(5).update("X = ii")
t_dynamic = time_table("PT1s").update("X = ii + 5").drop_columns("Timestamp")
t = merge([t_static, t_dynamic])

# get a ring table from t that holds the last ten rows
t_ring_with_initial = ring_table(parent=t, capacity=10)
```
To disable this behavior, set `initialize=False`:
```python
t_ring_without_initial = ring_table(parent=t, capacity=10, initialize=False)
```