Deephaven's Python package
The Deephaven Python package is how users interact with the Deephaven engine. Deephaven's Python API offers a wide range of ways to create, transform, and analyze data, both static and ticking. This guide gives a brief overview of what you can do.
What can you do with deephaven?
In short: a lot. Here are a few examples that only scratch the surface:
Create tables from scratch
Tables can be created from scratch in a variety of ways. The code below creates tables with empty_table, new_table, and time_table.
from deephaven.column import int_col, double_col, string_col
from deephaven import empty_table, new_table, time_table
t1_static = empty_table(50).update(["X = 0.1 * i", "Y = sin(X)", "Z = cos(X)"])
t2_static = new_table(
    [
        string_col(
            "Strings",
            ["A", "Hello world!", "The quick brown fox jumps over the lazy dog."],
        ),
        int_col("Integers", [5, 17, -33422]),
        double_col("Doubles", [3.14, -1.1111, 99.999]),
    ]
)
t_ticking = time_table("PT1s").update("X = ii")
Ingest data from external sources
Ingest data into tables from external sources such as CSV, Apache Parquet, Kafka, and SQL:
# Consume data from CSV and Parquet
from deephaven.parquet import read as read_pq
from deephaven import read_csv
t_from_csv = read_csv(
    "https://media.githubusercontent.com/media/deephaven/examples/main/Insurance/csv/insurance.csv"
)
t_from_parquet = read_pq("/data/examples/SensorData/parquet/SensorData_gzip.parquet")
# Consume data from Kafka
from deephaven.stream.kafka import consumer as kc
from deephaven import dtypes as dht
t = kc.consume(
    {"bootstrap.servers": "redpanda:9092"},
    "test.topic",
    table_type=kc.TableType.append(),
    key_spec=kc.KeyValueSpec.IGNORE,
    value_spec=kc.simple_spec("Command", dht.string),
)
# Consume data from SQL
from deephaven.dbc import read_sql
import os
my_query = (
    "SELECT t_ts as Timestamp, CAST(t_id AS text) as Id, "
    "CAST(t_instrument as text) as Instrument, "
    "t_exchange as Exchange, t_price as Price, t_size as Size "
    "FROM CRYPTO_TRADES"
)
username = os.environ["POSTGRES_USERNAME"]
password = os.environ["POSTGRES_PASSWORD"]
url = os.environ["POSTGRES_URL"]
port = os.environ["POSTGRES_PORT"]
sql_uri = f"postgresql://{url}:{port}/postgres?user={username}&password={password}"
crypto_trades = read_sql(conn=sql_uri, query=my_query, driver="connectorx")
Export tables to Parquet, CSV, Kafka, and more
Table data can be exported to a wide variety of destinations, including Parquet files, CSV files, Kafka streams, and Uniform Resource Identifiers (URIs). The following code block writes a table to CSV and Parquet for later use.
from deephaven.parquet import write as write_pq
from deephaven import empty_table
from deephaven import write_csv
my_table = empty_table(100).update(["X = 0.1 * i", "Y = sin(X)", "Z = cos(X)"])
write_pq(my_table, "/data/my_table.parquet")
write_csv(my_table, "/data/my_table.csv")
Filter data
Data can be filtered out of tables based on conditions being met, or by row position, such as keeping only rows at the top or bottom of a table.
from deephaven import empty_table
def filter_func(y: float, z: str) -> bool:
    # Keep rows where Y < 2, or where Y > 8 and Z is "B"
    # (Python's 'and' binds more tightly than 'or')
    return y < 2 or (y > 8 and z == "B")
t = empty_table(50).update(
    ["X = i", "Y = randomDouble(0.0, 10.0)", "Z = (X % 2 == 0) ? `A` : `B`"]
)
t_filtered = t.where(["Y > 5.0", "Z = `B`"])
t_function_filtered = t.where("filter_func(Y, Z)")
t_head = t.head(20)
t_tail = t.tail(5)
Calculate cumulative and rolling aggregations
Aggregations can be cumulative or windowed (by either time or rows).
from deephaven.updateby import rolling_avg_tick
from deephaven import empty_table
from deephaven import agg
t = empty_table(100).update(
    [
        "Sym = (i % 2 == 0) ? `A` : `B`",
        "X = randomDouble(0.0, 10.0)",
        "Y = randomDouble(50.0, 150.0)",
    ]
)
t_agg = t.agg_by(aggs=agg.avg(["AvgX = X", "AvgY = Y"]), by="Sym")
t_updateby = t.update_by(
    ops=rolling_avg_tick(cols=["RollingAvgX = X", "RollingAvgY = Y"], rev_ticks=5),
    by="Sym",
)
Join tables together
Deephaven offers a wide variety of ways to join tables together. Joins fall into two categories: exact and relational joins, and inexact, time-series joins. Even ticking tables can be joined with no extra work.
from deephaven import new_table
from deephaven.column import string_col, int_col
from deephaven.constants import NULL_INT
t_left = new_table(
    [
        string_col(
            "LastName", ["Rafferty", "Jones", "Steiner", "Robins", "Smith", "Rogers"]
        ),
        int_col("DeptID", [31, 33, 33, 34, 34, NULL_INT]),
        string_col(
            "Telephone",
            [
                "(303) 555-0162",
                "(303) 555-0149",
                "(303) 555-0184",
                "(303) 555-0125",
                None,
                None,
            ],
        ),
    ]
)
t_right = new_table(
    [
        int_col("DeptID", [31, 33, 34, 35]),
        string_col("DeptName", ["Sales", "Engineering", "Clerical", "Marketing"]),
        string_col("DeptManager", ["Martinez", "Williams", "Garcia", "Lopez"]),
        int_col("DeptGarage", [33, 52, 22, 45]),
    ]
)
t = t_left.natural_join(table=t_right, on=["DeptID"])
Convert to and from NumPy and Pandas
NumPy and Pandas are two of Python's most popular packages, and both are used often in Deephaven Python queries. Mechanisms are available that make converting between Deephaven tables and these Python data structures a breeze.
from deephaven import pandas as dhpd
from deephaven import numpy as dhnp
from deephaven import empty_table
import pandas as pd
import numpy as np
t = empty_table(10).update("X = randomDouble(-10.0, 10.0)")
np_arr_from_table = dhnp.to_numpy(t)
df_from_table = dhpd.to_pandas(t)
print(np_arr_from_table)
print(df_from_table)
t_from_np = dhnp.to_table(np_arr_from_table, cols=["X"])
t_from_pd = dhpd.to_table(df_from_table)
What's in the deephaven package
The deephaven package is used to interact with tables and other Deephaven objects in queries. Each of the following subsections discusses a single submodule within the deephaven package. They are presented in alphabetical order.
agg
The agg submodule defines the combined aggregations that are usable in queries.
appmode
The appmode submodule supports writing Deephaven application mode Python scripts.
arrow
The arrow submodule supports conversions to and from PyArrow tables and Deephaven tables.
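For example, a round trip between a Deephaven table and a PyArrow table is a one-liner in each direction. A minimal sketch, assuming the to_arrow and to_table function names:

from deephaven import arrow as dharrow
from deephaven import empty_table

t = empty_table(5).update("X = i")
# Convert to a PyArrow table and back again
pa_table = dharrow.to_arrow(t)
t_roundtrip = dharrow.to_table(pa_table)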
barrage
The barrage submodule provides functions for accessing resources on remote Deephaven servers.
calendar
The calendar submodule defines functions for working with business calendars.
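A quick sketch of looking up a calendar, assuming a recent API with calendar_names and calendar, and that a calendar with the example name below is installed:

from deephaven import calendar as dhcal

# List the installed business calendars, then fetch one by name
# ("USNYSE_EXAMPLE" is an assumed example calendar name)
print(dhcal.calendar_names())
cal = dhcal.calendar("USNYSE_EXAMPLE")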
column
The column submodule implements columns in tables. It is primarily used when creating new tables.
constants
The constants submodule defines the global constants, including Deephaven's special numerical values. Values include the maximum, minimum, NaN, null, and infinity values for different data types.
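These constants are ordinary Python values that can be printed or compared directly. A small example using two of them (NULL_INT also appears in the join example above):

from deephaven.constants import MAX_LONG, NULL_INT

# Deephaven's special numerical values are plain Python constants
print(NULL_INT)  # the sentinel used for null 32-bit integers
print(MAX_LONG)  # the maximum value for long columns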
csv
The csv submodule supports reading an external CSV file into a Deephaven table and writing a table to a CSV file.
dbc
The dbc submodule enables users to connect to and execute queries on external databases in Deephaven.
dherror
The dherror submodule defines a custom exception for the Deephaven Python package. See the triage errors guide for more information.
dtypes
The dtypes submodule defines the data types supported by the Deephaven engine.
execution_context
The execution_context submodule gives users the ability to directly manage the Deephaven query execution context on threads.
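A common pattern is to capture the script session's context and enter it on a worker thread before creating tables. A minimal sketch, assuming get_exec_ctx:

from deephaven.execution_context import get_exec_ctx
from deephaven import empty_table
import threading

# Capture the script session's execution context on the main thread
ctx = get_exec_ctx()

def make_table():
    # Table operations on user-created threads must run inside an execution context
    with ctx:
        return empty_table(5).update("X = i")

thread = threading.Thread(target=make_table)
thread.start()
thread.join()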
experimental
The experimental submodule contains experimental features. Current experimental features include outer joins and AWS S3 support.
filters
The filters submodule implements various filters that can be used in filtering operations.
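For instance, a filter object can be passed to a where operation just like a filter string. A small sketch, assuming the is_null helper:

from deephaven.filters import is_null
from deephaven.column import string_col
from deephaven import new_table

t = new_table([string_col("X", ["A", None, "B"])])
# Keep only the rows where X is null
t_nulls = t.where(is_null("X"))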
html
The html submodule supports exporting Deephaven data in HTML format.
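A minimal sketch, assuming the to_html function:

from deephaven.html import to_html
from deephaven import empty_table

t = empty_table(3).update("X = i")
# Render the table as an HTML string
html_str = to_html(t)
print(html_str)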
jcompat
The jcompat submodule provides Java compatibility support and convenience functions to create Java data structures from corresponding Python ones.
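A quick sketch of building Java collections from Python ones; the helper names are assumptions:

from deephaven.jcompat import j_array_list, j_hashmap

# Build a Java ArrayList and HashMap from Python equivalents
jlist = j_array_list([1, 2, 3])
jmap = j_hashmap({"key": "value"})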
learn
The learn submodule provides utilities for efficient data transfer between tables and Python objects. It serves as a framework for using machine learning libraries in Deephaven Python queries.
liveness_scope
The liveness_scope submodule gives a finer degree of control over when to clean up unreferenced nodes in the query update graph instead of solely relying on garbage collection.
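A small sketch of a scope that releases its tables on exit unless they are explicitly kept, assuming the liveness_scope context manager and its preserve method:

from deephaven.liveness_scope import liveness_scope
from deephaven import time_table

# Tables created inside the scope are released when it exits,
# unless explicitly preserved
with liveness_scope() as scope:
    t = time_table("PT1s").update("X = ii")
    scope.preserve(t)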
numpy
The numpy submodule supports the conversion between Deephaven tables and NumPy arrays.
pandas
The pandas submodule supports the conversion between Deephaven tables and Pandas DataFrames.
pandasplugin
The pandasplugin submodule is used to display Pandas DataFrames as Deephaven tables in the UI.
parquet
The parquet submodule supports reading external Parquet files into Deephaven tables and writing Deephaven tables out as Parquet files.
perfmon
The perfmon submodule contains tools to analyze the performance of the Deephaven system and queries.
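For example, performance logs can be opened as live tables for inspection. A minimal sketch, assuming the query_performance_log and update_performance_log function names:

from deephaven.perfmon import query_performance_log, update_performance_log

# Open performance logs as live tables
qpl = query_performance_log()
upl = update_performance_log()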
plot
The plot submodule contains the framework for Deephaven's built-in plotting API.
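A short sketch of an XY series plot built from table columns, assuming the Figure builder API:

from deephaven.plot.figure import Figure
from deephaven import empty_table

t = empty_table(50).update(["X = 0.1 * i", "Y = sin(X)"])
# Build an XY series plot from the table's X and Y columns
fig = Figure().plot_xy(series_name="Sine", t=t, x="X", y="Y").show()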
plugin
The plugin submodule contains the framework for using plugins and creating your own plugins in Deephaven.
query_library
The query_library submodule allows users to import Java classes or packages into the query library, which they can then use in queries.
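For example, an imported Java class becomes usable inside query strings. A minimal sketch, assuming import_class:

from deephaven import query_library, empty_table

# Once imported, the Java class can be called from query strings
query_library.import_class("java.util.Random")
t = empty_table(5).update("X = new Random().nextInt(100)")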
replay
The replay submodule provides support for replaying historical data.
server
The server submodule contains the framework for running Python operations from within a JVM.
stream
The stream submodule contains Deephaven's Apache Kafka integration, Table Publisher, and a utility for converting between table types.
table
The table submodule implements the Deephaven table, partitioned table, and partitioned table proxy objects.
table_factory
The table_factory submodule provides various ways to create Deephaven tables, including empty tables, input tables, new tables, function-generated tables, time tables, ring tables, and DynamicTableWriter.
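For example, a ring table keeps only the newest rows of a ticking parent. A minimal sketch, assuming the ring_table signature shown:

from deephaven import time_table, ring_table

# Keep only the 5 newest rows of the ticking parent table
rt = ring_table(parent=time_table("PT1s"), capacity=5)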
table_listener
The table_listener submodule implements Deephaven's table listeners.
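A small sketch of a function listener attached to a ticking table; the handle behavior noted in the comments is an assumption:

from deephaven.table_listener import listen
from deephaven import time_table

t = time_table("PT1s").update("X = ii")

def on_update(update, is_replay):
    # update.added() returns the newly added rows as a dict of column arrays
    print("Added rows:", update.added())

# listen returns a handle that keeps the listener registered;
# handle.stop() (assumed) detaches it
handle = listen(t, on_update)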
time
The time submodule defines functions for handling Deephaven date-time data.
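A quick sketch, assuming the dh_now and to_j_instant function names from recent releases:

from deephaven.time import dh_now, to_j_instant

# The current time, and a parsed date-time string, as Java Instants
now = dh_now()
instant = to_j_instant("2024-01-01T00:00:00 ET")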
update_graph
The update_graph submodule provides access to the update graph's locks that must be acquired to perform certain table operations.
updateby
The updateby submodule supports building various cumulative and rolling aggregations to be used in the update_by table operation.
uri
The uri submodule implements tools for resolving Uniform Resource Identifiers (URIs) into objects.
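For example, resolve can fetch a table shared by another Deephaven worker. A minimal sketch; the host, port, and variable name below are placeholders:

from deephaven.uri import resolve

# Resolve a table from a remote Deephaven worker by its URI
t_remote = resolve("dh+plain://remote-host:10000/scope/my_table")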