Deephaven Community Core Quickstart
Deephaven Community Core can be installed with Docker or pip. Docker-installed Deephaven runs in a Docker container and requires Docker to be installed on your machine, while pip-installed Deephaven runs natively on your machine in a Python environment and requires the pip package manager. If you don't have a preference, we recommend starting with Docker.
1. Install and Launch Deephaven
With Docker
Install and launch Deephaven via Docker with a one-line command:
docker run --rm --name deephaven -p 10000:10000 -v data:/data --env START_OPTS="-Dauthentication.psk=YOUR_PASSWORD_HERE" ghcr.io/deephaven/server:latest
Replace "YOUR_PASSWORD_HERE" with a more secure passkey to keep your session safe.
For additional configuration options, see the install guide for Docker.
With Pip
For pip-installed Deephaven, we recommend using a Python virtual environment to decouple and isolate Python installs and associated packages.
To install Deephaven with pip, you must have Java installed on your computer. See this guide for OS-specific instructions.
To install Deephaven, install the deephaven-server Python package:
pip3 install deephaven-server
Then, launch Deephaven:
deephaven server --jvm-args "-Xmx4g -Dauthentication.psk=YOUR_PASSWORD_HERE"
Replace "YOUR_PASSWORD_HERE" with a more secure passkey to keep your session safe.
For more advanced configuration options, see our pip installation guide. This includes an additional instruction needed for users on an M2 Mac.
If you prefer not to use Docker or pip, you can install Deephaven natively.
2. The Deephaven IDE
Navigate to http://localhost:10000/ and enter your password in the token field.
You're ready to go! The Deephaven IDE is a fully-featured scripting IDE. Here's a brief overview of some of its basic functionalities.
- Write and execute commands
Use this console to write and execute Python and Deephaven commands.
- Create new notebooks
Click this button to create new notebooks where you can write scripts.
- Edit active notebook
Edit the currently active notebook.
- Run entire notebook
Click this button to execute all of the code in the active notebook, from top to bottom.
- Run selected code
Click this button to run only the selected code in the active notebook.
- Save your work
Save your work in the active notebook. Do this often!
To learn more about the Deephaven IDE, check out the navigating the UI guide for a tour of the available menus and tools, and the accompanying guides on graphical column manipulation, the IDE chart-builder, and more.
Now that you have Deephaven installed and open, the rest of this guide will briefly highlight some key features of using Deephaven.
For a more exhaustive introduction to Deephaven and an in-depth exploration of our design principles and APIs, check out our Crash Course series.
3. Import static and streaming data
Deephaven empowers users to wrangle static and streaming data with ease. It supports ingesting data from CSV files, Parquet files, and Kafka streams.
Load a CSV
Run the command below inside a Deephaven console to ingest a million-row CSV of crypto trades. All you need is a path or URL for the data:
from deephaven import read_csv
crypto_from_csv = read_csv(
"https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv"
)
The table widget now in view is highly interactive:
- Click on a table and press Ctrl + F (Windows) or ⌘F (Mac) to open quick filters.
- Click the funnel icon in the filter field to create sophisticated filters or use auto-filter UI features.
- Hover over column headers to see data types.
- Right-click headers to access more options, like adding or changing sorts.
- Click the Table Options hamburger menu at right to plot from the UI, create and manage columns, and download CSVs.
Replay Historical Data
Ingesting real-time data is one of Deephaven's superpowers, and you can learn more about supported formats from the links at the end of this guide. However, streaming pipelines can be complicated to set up and are outside the scope of this discussion. For a streaming data example, we'll use Deephaven's Table Replayer to replay historical cryptocurrency data back in real time.
The following code takes fake historical crypto trade data from a CSV file and replays it in real time based on timestamps. This is only one of multiple ways to create real-time data in just a few lines of code. Replaying historical data is a great way to test real-time algorithms before deployment into production.
from deephaven import TableReplayer, read_csv
fake_crypto_data = read_csv(
"https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/FakeCryptoTrades_20230209.csv"
)
start_time = "2023-02-09T12:09:18 ET"
end_time = "2023-02-09T12:58:09 ET"
replayer = TableReplayer(start_time, end_time)
crypto_streaming = replayer.add_table(fake_crypto_data, "Timestamp")
replayer.start()
4. Working with Deephaven Tables
In Deephaven, static and dynamic data are represented as tables. New tables can be derived from parent tables, and data efficiently flows from parents to their dependents. See the concept guide on the table update model if you're interested in what's under the hood.
Deephaven represents data transformations as operations on tables. This is a familiar paradigm for data scientists using pandas, Polars, R, MATLAB, and more. Deephaven's table operations are special: they are indifferent to whether the underlying data sources are static or streaming. This means that code written for static data will work seamlessly on live data.
There are a ton of table operations to cover, so we'll keep it short and give you the highlights.
Manipulating data
First, reverse the ticking table with reverse so that the newest data appears at the top:
crypto_streaming_rev = crypto_streaming.reverse()
Many table operations can also be done from the UI. For example, right-click on a column header in the UI and choose Reverse Table.
Add a column with update:
# Note the enclosing [] - this is optional when there is a single argument
crypto_streaming_rev = crypto_streaming_rev.update(["TransactionTotal = Price * Size"])
Use select or view to pick out particular columns:
# Note the enclosing [] - this is not optional, since there are multiple arguments
crypto_streaming_prices = crypto_streaming_rev.view(["Instrument", "Price"])
Remove columns with drop_columns:
# Note the lack of [] - this is permissible since there is only a single argument
crypto_streaming_rev = crypto_streaming_rev.drop_columns("TransactionTotal")
Next, Deephaven offers many operations for filtering tables. These include where, where_one_of, where_in, where_not_in, and more.
The following code uses where and where_one_of to filter for only Bitcoin transactions, and then for Bitcoin and Ethereum transactions:
btc_streaming = crypto_streaming_rev.where("Instrument == `BTC/USD`")
eth_btc_streaming = crypto_streaming_rev.where_one_of(
["Instrument == `BTC/USD`", "Instrument == `ETH/USD`"]
)
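To make the difference between the two filters concrete, here is a minimal plain-Python sketch of the row-selection logic. It is not Deephaven code: the sample rows and the helper names keep_if_all and keep_if_any are hypothetical. It illustrates that conditions passed together to where must all hold, while where_one_of keeps a row if any one of them holds.

```python
# Plain-Python illustration of filter semantics; this is not the Deephaven API.
# The sample rows and helper names are made up for this sketch.
rows = [
    {"Instrument": "BTC/USD", "Price": 22000.0},
    {"Instrument": "ETH/USD", "Price": 1600.0},
    {"Instrument": "DOGE/USD", "Price": 0.08},
]


def keep_if_all(rows, predicates):
    """Keep rows that satisfy every predicate (conjunctive, like where)."""
    return [r for r in rows if all(p(r) for p in predicates)]


def keep_if_any(rows, predicates):
    """Keep rows that satisfy at least one predicate (disjunctive, like where_one_of)."""
    return [r for r in rows if any(p(r) for p in predicates)]


def is_btc(row):
    return row["Instrument"] == "BTC/USD"


def is_eth(row):
    return row["Instrument"] == "ETH/USD"


btc_only = keep_if_all(rows, [is_btc])
btc_or_eth = keep_if_any(rows, [is_btc, is_eth])
# No single row is both BTC/USD and ETH/USD, so requiring both yields nothing.
btc_and_eth = keep_if_all(rows, [is_btc, is_eth])
```

This is why the Bitcoin-and-Ethereum filter uses where_one_of: requiring both conditions at once would return an empty table.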
Aggregating data
Deephaven's dedicated aggregations suite provides a number of table operations that enable efficient column-wise aggregations. These operations also support aggregations by group.
Use count_by to count the number of transactions from each exchange:
exchange_count = crypto_streaming.count_by("Count", by="Exchange")
Then, get the average price for each instrument with avg_by:
instrument_avg = crypto_streaming.view(["Instrument", "Price"]).avg_by(by="Instrument")
Find the largest transaction per instrument with max_by:
max_transaction = (
crypto_streaming.update("TransactionTotal = Price * Size")
.view(["Instrument", "TransactionTotal"])
.max_by("Instrument")
)
While dedicated aggregations are powerful, they only enable you to perform one aggregation at a time. However, you often need to perform multiple aggregations on the same data. For this, Deephaven provides the agg_by table operation and the deephaven.agg Python module.
First, use agg_by to compute the mean and standard deviation of price for each instrument on each exchange:
from deephaven import agg
summary_prices = crypto_streaming.agg_by(
[agg.avg("AvgPrice=Price"), agg.std("StdPrice=Price")],
by=["Instrument", "Exchange"],
).sort(["Instrument", "Exchange"])
Then, add a column containing the coefficient of variation (the standard deviation as a percentage of the mean) for each instrument and exchange pair, a measure of relative price variability:
summary_prices = summary_prices.update("PctVariation = 100 * StdPrice / AvgPrice")
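As a worked illustration of that formula, the following plain-Python sketch computes the same percentage for a small, made-up list of prices. It uses the sample standard deviation (n - 1 denominator); whether agg.std uses the same convention is an assumption here and should be checked against the Deephaven documentation.

```python
import statistics

# Hypothetical prices for one instrument on one exchange.
prices = [100.0, 102.0, 98.0, 104.0, 96.0]

avg_price = statistics.mean(prices)
# Sample standard deviation (n - 1 denominator). Assumed for this sketch;
# confirm the convention used by agg.std in the Deephaven docs.
std_price = statistics.stdev(prices)

# Same arithmetic as the PctVariation column: 100 * StdPrice / AvgPrice.
pct_variation = 100 * std_price / avg_price
```

Because the result is scaled by the mean, it can be compared across instruments whose prices differ by orders of magnitude.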
Finally, create a minute-by-minute Open-High-Low-Close table using the lowerBin built-in function along with first, max_, min_, and last:
ohlc_by_minute = (
crypto_streaming.update("BinnedTimestamp = lowerBin(Timestamp, MINUTE)")
.agg_by(
[
agg.first("Open=Price"),
agg.max_("High=Price"),
agg.min_("Low=Price"),
agg.last("Close=Price"),
],
by=["Instrument", "BinnedTimestamp"],
)
.sort(["Instrument", "BinnedTimestamp"])
)
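The binning-then-aggregating pattern above can be sketched in plain Python to show what each step contributes. This is a conceptual illustration only, not how Deephaven implements it: the lower_bin helper and the sample trades are hypothetical, and the trades are assumed to arrive in time order for a single instrument.

```python
from datetime import datetime, timezone


def lower_bin(ts, interval_seconds):
    """Round a timezone-aware datetime down to the start of its interval."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % interval_seconds, tz=timezone.utc)


# Hypothetical trades for one instrument, already in time order: (timestamp, price).
trades = [
    (datetime(2023, 2, 9, 17, 9, 18, tzinfo=timezone.utc), 100.0),
    (datetime(2023, 2, 9, 17, 9, 45, tzinfo=timezone.utc), 103.0),
    (datetime(2023, 2, 9, 17, 9, 59, tzinfo=timezone.utc), 99.0),
    (datetime(2023, 2, 9, 17, 10, 5, tzinfo=timezone.utc), 101.0),
]

ohlc = {}
for ts, price in trades:
    bucket = lower_bin(ts, 60)  # one-minute bins
    if bucket not in ohlc:
        # The first trade in a bin sets all four values.
        ohlc[bucket] = {"Open": price, "High": price, "Low": price, "Close": price}
    else:
        bar = ohlc[bucket]
        bar["High"] = max(bar["High"], price)
        bar["Low"] = min(bar["Low"], price)
        bar["Close"] = price  # the latest trade seen so far closes the bin
```

Every trade in the same minute maps to the same bin start, which is what lets the bin serve as a grouping key alongside the instrument.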
You may want to perform window-based calculations, compute moving or cumulative statistics, or look at pair-wise differences. Deephaven's update_by table operation and the deephaven.updateby Python module are the right tools for the job.
Compute the moving average and standard deviation of each instrument's price using rolling_avg_time and rolling_std_time:
import deephaven.updateby as uby
instrument_rolling_stats = crypto_streaming.update_by(
[
uby.rolling_avg_time("Timestamp", "AvgPrice30Sec=Price", "PT30s"),
uby.rolling_avg_time("Timestamp", "AvgPrice5Min=Price", "PT5m"),
uby.rolling_std_time("Timestamp", "StdPrice30Sec=Price", "PT30s"),
uby.rolling_std_time("Timestamp", "StdPrice5Min=Price", "PT5m"),
],
by="Instrument",
).reverse()
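For intuition about what a time-based rolling average computes, here is a minimal plain-Python sketch over (seconds, price) pairs. It is not the Deephaven implementation. The function name mirrors the operation only for readability, and the window boundaries used here (t - window <= time <= t) are an assumption; consult the update_by documentation for the exact inclusion rules.

```python
from collections import deque


def rolling_avg_time(points, window_seconds):
    """Trailing time-window average over (seconds, price) pairs sorted by time.

    Assumption for this sketch: the window for a row at time t covers
    t - window_seconds <= time <= t.
    """
    window = deque()
    total = 0.0
    averages = []
    for t, price in points:
        window.append((t, price))
        total += price
        # Evict rows that have fallen out of the look-behind window.
        # The current row is always inside, so the deque is never empty.
        while window[0][0] < t - window_seconds:
            _, old_price = window.popleft()
            total -= old_price
        averages.append(total / len(window))
    return averages


# Hypothetical prices for a single instrument.
points = [(0, 10.0), (10, 20.0), (25, 30.0), (50, 40.0)]
avg_30s = rolling_avg_time(points, 30)
```

Unlike a fixed row-count window, the number of rows averaged varies with how densely the data arrives, which suits irregularly spaced trades.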
These statistics can be used to determine "extreme" instrument prices, where the instrument's price is significantly higher or lower than the average of the prices preceding it in the window:
instrument_extremity = instrument_rolling_stats.update(
[
"Z30Sec = (Price - AvgPrice30Sec) / StdPrice30Sec",
"Z5Min = (Price - AvgPrice5Min) / StdPrice5Min",
"Extreme30Sec = Math.abs(Z30Sec) > 1.645 ? true : false",
"Extreme5Min = Math.abs(Z5Min) > 1.645 ? true : false",
]
).view(
[
"Timestamp",
"Instrument",
"Exchange",
"Price",
"Size",
"Extreme30Sec",
"Extreme5Min",
]
)
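The arithmetic behind the Extreme columns is a z-score test, sketched below in plain Python with made-up numbers. The helper names are hypothetical, and returning False when the standard deviation is zero is a choice made for this sketch rather than a description of how the query above behaves. For reference, under a normal distribution roughly 10% of values have a z-score magnitude above 1.645.

```python
def z_score(price, rolling_avg, rolling_std):
    """Number of standard deviations the price sits from the rolling average."""
    return (price - rolling_avg) / rolling_std


def is_extreme(price, rolling_avg, rolling_std, threshold=1.645):
    """Flag a price whose z-score magnitude exceeds the threshold."""
    if rolling_std == 0:
        # Sketch-only choice: treat a window with no variation as not extreme.
        return False
    return abs(z_score(price, rolling_avg, rolling_std)) > threshold


# Hypothetical values: rolling average 100, rolling standard deviation 2.
high_flag = is_extreme(105.0, 100.0, 2.0)   # z = 2.5
low_flag = is_extreme(95.0, 100.0, 2.0)     # z = -2.5
normal_flag = is_extreme(102.0, 100.0, 2.0)  # z = 1.0
```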
There's a lot more to update_by. See the user guide for more information.
Combining tables
Combining datasets can often yield powerful insights. Deephaven offers two primary ways to combine tables: the merge and join operations.
The merge operation stacks tables on top of one another. This is ideal when several tables have the same schema. They can be static, ticking, or a mix of both:
from deephaven import merge
combined_crypto = merge([fake_crypto_data, crypto_streaming]).sort("Timestamp")
The ubiquitous join operation is used to combine tables based on columns that they have in common. Deephaven offers many variants of this operation such as join, natural_join, exact_join, and many more.
For example, read in an older dataset containing price data on the same coins from the same exchanges. Then, use join to combine the aggregated prices to see how current prices compare to those in the past:
more_crypto = read_csv(
"https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv"
)
more_summary_prices = more_crypto.agg_by(
[agg.avg("AvgPrice=Price"), agg.std("StdPrice=Price")],
by=["Instrument", "Exchange"],
).sort(["Instrument", "Exchange"])
price_comparison = (
summary_prices.drop_columns("PctVariation")
.rename_columns(["AvgPriceFeb2023=AvgPrice", "StdPriceFeb2023=StdPrice"])
.join(
more_summary_prices,
on=["Instrument", "Exchange"],
joins=["AvgPriceSep2021=AvgPrice", "StdPriceSep2021=StdPrice"],
)
)
In many real-time data applications, data needs to be combined based on timestamps. Traditional join operations often fail this task, as they require exact matches in both datasets. To remedy this, Deephaven provides time series joins, such as aj and raj, that can join tables on timestamps with approximate matches.
Here's an example where aj is used to find the Ethereum price at or immediately preceding a Bitcoin price:
crypto_btc = crypto_streaming.where(filters=["Instrument = `BTC/USD`"])
crypto_eth = crypto_streaming.where(filters=["Instrument = `ETH/USD`"])
time_series_join = (
crypto_btc.view(["Timestamp", "Price"])
.aj(crypto_eth, on="Timestamp", joins=["EthTime = Timestamp", "EthPrice = Price"])
.rename_columns(cols=["BtcTime = Timestamp", "BtcPrice = Price"])
)
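The "at or immediately preceding" matching rule can be illustrated with a small plain-Python sketch based on binary search. This is a conceptual model only, not Deephaven's implementation: the as_of_join helper and the sample quotes are hypothetical, and both inputs are assumed to be sorted by time.

```python
from bisect import bisect_right

# Hypothetical (seconds, price) rows, each list sorted by time.
btc = [(5, 22000.0), (12, 22010.0), (20, 21990.0)]
eth = [(3, 1600.0), (10, 1605.0), (12, 1603.0), (30, 1610.0)]


def as_of_join(left, right):
    """For each left row, attach the latest right row with time <= left time."""
    right_times = [t for t, _ in right]
    joined = []
    for t, price in left:
        # bisect_right counts right rows with time <= t; the match is the last one.
        idx = bisect_right(right_times, t) - 1
        match_time, match_price = right[idx] if idx >= 0 else (None, None)
        joined.append((t, price, match_time, match_price))
    return joined


# Columns: BtcTime, BtcPrice, EthTime, EthPrice
result = as_of_join(btc, eth)
```

Note that the Bitcoin row at time 12 matches the Ethereum row at exactly time 12, while the row at time 20 falls back to that same Ethereum row because nothing newer has arrived yet.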
To learn more about our join methods, see the guides: Join: Exact and Relational and Join: Time-Series and Range.
5. Plot data via query or the UI
Deephaven has a rich plotting API that supports updating, real-time plots. It can be called programmatically:
from deephaven.plot import Figure
btc_data = instrument_rolling_stats.where("Instrument == `BTC/USD`").reverse()
btc_plot = (
Figure()
.plot_xy("Bitcoin Prices", btc_data, x="Timestamp", y="Price")
.plot_xy("Rolling Average", btc_data, x="Timestamp", y="AvgPrice30Sec")
.show()
)
Or with the web UI:
Additionally, Deephaven supports an integration with the popular plotly-express library that enables real-time plotly-express plots.
6. Export data to popular formats
It's easy to export your data out of Deephaven to popular open formats.
To export a table to a CSV file, use the write_csv function, passing the table and the path where you want to save the file. If you are using Docker, see managing Docker volumes for more information on how to save files to your local machine.
from deephaven import write_csv
write_csv(instrument_rolling_stats, "/data/crypto_prices_stats.csv")
If the table is dynamically updating, Deephaven will automatically snapshot the data before writing it to the file.
Similarly, for Parquet:
from deephaven.parquet import write
write(instrument_rolling_stats, "/data/crypto_prices_stats.parquet")
To create a static pandas DataFrame, use the to_pandas function:
from deephaven.pandas import to_pandas
data_frame = to_pandas(instrument_rolling_stats)
7. What to do next
Now that you've imported data, created tables, and manipulated static and real-time data, we suggest heading to the Crash Course in Deephaven to learn more about Deephaven's real-time data platform.