
24 powerful Deephaven tools for data analysis

· 9 min read
JJ Brosnan
Tricks for tables every data scientist should know

Deephaven is the data backbone that powers hedge funds, banks, exchanges, and more. How do Deephaven experts leverage its power? I'll show you through 24 examples of simple operations used regularly by some of the world's top data experts. Don't be fooled by their simplicity - they are the building blocks for the innovative analysis that large firms rely on time and time again.

Prerequisites

The only prerequisites for this blog are some version of Deephaven Community Core and Python installed on your machine. There are a few ways to deploy Deephaven - for example, running the pre-built Docker images or installing with pip - and any of them will get you started with the examples in this blog. In general, your choice of deployment depends on what you intend to do with it; most users will find that Docker or pip is more than sufficient for their needs.

Examples

In most examples below, we will use the metric century bike ride dataset found in our examples repository. There are many more datasets there for you to play with, so check them out!

1. Read and write CSV files

Deephaven tables and CSV files are both column oriented, making CSV a natural format for table storage. Use read_csv to read a CSV file into a table; the file can be local to your machine (make sure it's visible to Deephaven) or remote on the web. Use write_csv to write any table to a local CSV file.

from deephaven import read_csv, write_csv

metric_century_local = read_csv("/data/examples/MetricCentury/csv/metriccentury.csv")
metric_century_remote = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/MetricCentury/csv/metriccentury.csv")

write_csv(metric_century_remote, "/data/metric_century.csv")

2. Delete tables

There's no point in keeping two copies of the metric century table. You can delete a table by using del or by assigning it to None.

metric_century = metric_century_remote

metric_century_remote = None
del metric_century_local

3. Get the first or last N rows of a table

You don't always need all of the data. Sometimes you just want the first or last N rows of it.

n = 10 # Number of rows to keep

metric_century_head = metric_century.head(n)
metric_century_tail = metric_century.tail(n)

4. Get the first or last N percent of a table

What if you want a certain percentage of the data? This is common in machine learning, where datasets are often split into training and testing sets by percentage.

metric_century_head_pct = metric_century.head_pct(0.25)
metric_century_tail_pct = metric_century.tail_pct(0.10)

5. Get table metadata

Deephaven tables can store a wide variety of data types. Deephaven uses Java data types on the backend, and the metadata table shows you exactly what each column holds.

metric_century_meta = metric_century.meta_table

6. Drop columns

Just as you can filter out rows you don't need, you can drop columns you don't need.

cols_to_drop = ["Watts", "LatitudeDegrees", "LongitudeDegrees"]
metric_century_cols_dropped = metric_century.drop_columns(cols=cols_to_drop)

7. Get the number of rows in a table

Tables have a size attribute.

print(metric_century.size)

8. Filter data based on criteria

You can filter data out of tables on a per-column basis. Multiple conditions can be present in the same operation.

metric_century_where_moving = metric_century.where(["SpeedKPH > 0"])
metric_century_where_working = metric_century.where(["Cadence > 0", "Watts > 0"])

9. Select a subset of table data

As previously mentioned, you don't always need all of the data in a table. But you also don't always want to explicitly drop columns. Sometimes you only want to select a subset. select and view do the trick. For an explanation on the difference between the two and which to use for your queries, see Choose the right selection method for your query.

metric_century_select = metric_century.select(["Time", "DistanceMeters", "SpeedKPH", "AltitudeMeters"])
metric_century_view = metric_century.view(["Time", "Cadence", "Watts", "HeartRate"])

10. Update a table

Deephaven has multiple ways to update a table: update, update_view, and lazy_update. For an explanation of the difference between the three and when to use each one, see Choose the right selection method for your query.

metric_century_update = metric_century.update(["SpeedMPS = SpeedKPH * 1000 / 3600"])
metric_century_updateview = metric_century_update.update_view(["AccelMPS2 = SpeedMPS_[i] - SpeedMPS_[i - 1]"])
metric_century_lazyupdate = metric_century_updateview.lazy_update(["IsAccelerating = AccelMPS2 > 0 ? `Yes` : `No`"])
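The arithmetic in those formulas can be sanity-checked in plain Python. This is just a sketch of what the engine computes per row, not Deephaven code:

```python
# km/h to m/s: multiply by 1000 m/km, divide by 3600 s/h.
def kph_to_mps(speed_kph):
    return speed_kph * 1000 / 3600

# Row-wise difference, mirroring SpeedMPS_[i] - SpeedMPS_[i - 1].
def diffs(values):
    return [curr - prev for prev, curr in zip(values, values[1:])]

speeds_mps = [kph_to_mps(s) for s in [0.0, 18.0, 36.0, 36.0]]  # [0.0, 5.0, 10.0, 10.0]
accels = diffs(speeds_mps)  # [5.0, 5.0, 0.0]
```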

11. Select unique values in columns

Want to see how many unique values exist in a column? Or how many unique combinations of values exist in more than one column? select_distinct makes it simple.

unique_heartrates = metric_century.select_distinct(["HeartRate"])
unique_heartrate_cadences = metric_century.select_distinct(["HeartRate", "Cadence"])

12. Reverse table order

Table ordering can easily be reversed. This is particularly useful in real-time tables, where reversing a table causes the newest data to appear at its top.

metric_century_reversed = metric_century.reverse()

13. Rename columns

Column names aren't always indicative of a column's true meaning. In such a case, it's appropriate to rename them accordingly.

metric_century_renamed = metric_century.rename_columns(["RevsPerMinute = Cadence", "HeartRateBPM = HeartRate"])

14. Sort columns

Columns in tables can be sorted in ascending or descending order. Sort direction can even vary from column to column in the same table.

from deephaven import SortDirection

sorted_ascending = metric_century.sort(["SpeedKPH", "Watts"])
sorted_descending = metric_century.sort_descending(["AltitudeMeters"])
sorted_both = metric_century.sort(["SpeedKPH", "Watts"], [SortDirection.ASCENDING, SortDirection.DESCENDING])
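For the curious, mixed-direction sorting is just ordering on a composite key. A plain-Python sketch of the idea (not Deephaven code), assuming numeric columns:

```python
# (SpeedKPH, Watts) pairs: sort SpeedKPH ascending, then Watts descending.
rows = [(10, 200), (10, 150), (5, 300)]
sorted_both_py = sorted(rows, key=lambda r: (r[0], -r[1]))
# [(5, 300), (10, 200), (10, 150)]
```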

15. Time bin

It's often useful to bin observations into buckets of time, such as when downsampling. Buckets can be any number of seconds, minutes, hours, and so on.

binned_by_time = metric_century.update_view(["OneMinLowerBin = lowerBin(Time, MINUTE)", "FiveMinUpperBin = upperBin(Time, 5 * MINUTE)"])
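Under the hood, time binning is integer rounding on timestamps. Here's a plain-Python sketch of the idea using epoch seconds (Deephaven's actual functions operate on DateTimes in nanoseconds, and edge-case behavior on aligned timestamps may differ):

```python
MINUTE = 60  # seconds, standing in for Deephaven's MINUTE constant

def lower_bin(t, interval):
    # Round t down to the start of its bucket.
    return t - (t % interval)

def upper_bin(t, interval):
    # Round t up to the end of its bucket.
    return -(-t // interval) * interval

t = 125  # 2 minutes and 5 seconds past the epoch
lower_bin(t, MINUTE)      # 120: start of the one-minute bucket
upper_bin(t, 5 * MINUTE)  # 300: end of the five-minute bucket
```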

16. Color format columns and rows

It's useful, especially in real-time applications, to color columns and/or rows. Conditional coloring can make values of interest pop off the screen and be immediately obvious. Columns, rows, combinations, and even heat maps can be colored in either the background or foreground. A full list of named colors can be found in the Deephaven documentation.

metric_century_colorful = metric_century.format_columns(["Cadence = BLUE", "Watts = VIVID_YELLOW", "HeartRate = heatmap(HeartRate, 125, 145, LAWNGREEN, CRIMSON)"])
column_colors_conditional = metric_century.format_column_where("Cadence", "SpeedKPH = 0", "DARKORANGE")
row_colors_conditional = metric_century.format_row_where("SpeedKPH > 10", "fg(GREEN)")

17. Replay static data in real time

Real-time applications often need a proof-of-concept before they go live. A great way to test whether your query will work in real time is to replay a static table. The static table must have timestamps in Deephaven's DateTime format.

from deephaven.replay import TableReplayer
from deephaven.time import to_datetime

start_time = to_datetime("2019-08-25T15:34:55Z")
end_time = to_datetime("2019-08-25T17:10:22Z")

replayer = TableReplayer(start_time, end_time)
metric_century_replayed = replayer.add_table(metric_century, "Time")
replayer.start()

18. Aggregate table data

An aggregation breaks data into subgroups and performs calculations on the grouped data. It has applications far and wide. Aggregations can be dedicated (a single calculation on subgroups) or combined (multiple calculations on subgroups).

from deephaven import agg as agg

count_by_heartrate = metric_century.count_by("Count", by=["HeartRate"])
multiple_aggs = metric_century.agg_by([agg.avg(cols=["AvgSpeed = SpeedKPH", "AvgCadence = Cadence"])], by=["HeartRate"])
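Conceptually, the aggregation above groups rows by HeartRate and averages within each group. A plain-Python sketch of that grouping, with made-up data rather than the bike ride dataset:

```python
from collections import defaultdict

rows = [
    {"HeartRate": 120, "SpeedKPH": 20.0},
    {"HeartRate": 120, "SpeedKPH": 30.0},
    {"HeartRate": 140, "SpeedKPH": 28.0},
]

# Collect SpeedKPH values per distinct HeartRate, then average each group.
groups = defaultdict(list)
for row in rows:
    groups[row["HeartRate"]].append(row["SpeedKPH"])

avg_speed_by_hr = {hr: sum(v) / len(v) for hr, v in groups.items()}
# {120: 25.0, 140: 28.0}
```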

19. Read and write Parquet files

At the beginning of this blog, I showed how to read and write CSV files in Deephaven. Parquet serves the same purpose, but it's a compressed, columnar binary format, so it's far more space-efficient.

Want to store data efficiently? Read CSV files from the web and save them locally in Parquet format. You can store far more data in the same space by using Parquet over CSV.

from deephaven.parquet import read, write
from deephaven import read_csv

crypto_day1 = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/crypto_sept7.csv")
crypto_day2 = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/crypto_sept8.csv")

write(crypto_day1, "/data/crypto_sept7.parquet", compression_codec_name="GZIP")
write(crypto_day2, "/data/crypto_sept8.parquet", compression_codec_name="GZIP")

crypto_day1, crypto_day2 = None, None

crypto_sept7 = read("/data/crypto_sept7.parquet")
crypto_sept8 = read("/data/crypto_sept8.parquet")

20. Merge tables

Merging two or more tables vertically stacks them on top of one another. In the case of the two Parquet files we just read into memory, they contain the same set of crypto data for two different days. Thus, they can be easily merged together.

from deephaven import merge

crypto_merged = merge([crypto_sept7, crypto_sept8])

21. Create tables from scratch

Deephaven has three ways to create tables from scratch:

  • empty_table
    • Creates a table with a fixed number of rows and no columns. Columns are typically added afterwards with update or update_view.
  • new_table
    • Creates a new table with column definitions based on Java primitive data types such as int, double, and String.
  • time_table
    • Creates a real-time table with a single column of Deephaven timestamps. A new row is added once every N seconds, where N is specified.
from deephaven import empty_table, new_table, time_table
from deephaven.column import int_col, double_col, string_col

source1 = new_table([
    int_col("Index", [1, 2, 3, 4, 5]),
    double_col("Value", [1/2, 2/3, 3/4, 4/5, 5/6]),
    string_col("Name", ["Art", "Cade", "Eric", "Sam", "Will"])
])
source2 = empty_table(5).update_view(["Index = i", "Sine = sin(0.1 * Index)"])
tt1 = time_table("PT00:00:01").lazy_update(["X = i % 4"])
tt2 = time_table("PT00:00:02").lazy_update(["Y = i % 3"])

22. Join tables

Joining tables is different from merging tables. In the example below, I use natural_join. There are several joins available; for more information about which to use for your queries, see our guide.

sources_joined_on2 = source1.natural_join(table=source2, on="Index")
sources_joined_on1 = source2.natural_join(table=source1, on="Index")
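If you're new to natural joins, the semantics are easy to picture in plain Python: every left-table row keeps its data and picks up at most one match from the right table. A hedged sketch with made-up data:

```python
# Left table rows and a right-table lookup keyed on the join column.
left = [{"Index": 1, "Sine": 0.0}, {"Index": 2, "Sine": 0.1}]
names_by_index = {1: "Art", 2: "Cade"}  # Index -> Name

# natural_join appends the matching right-table columns to each left row;
# unmatched rows get None (null) for the appended columns.
joined = [{**row, "Name": names_by_index.get(row["Index"])} for row in left]
```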

23. Join time series data

Real-time tables typically have timestamps, as seen in tt1 and tt2 in the Create tables from scratch section. In this case, an as-of join or a reverse as-of join is the natural choice.

tt2.aj(table=tt1, on=["Timestamp"])
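As-of join semantics can also be sketched in plain Python: for each right-table timestamp, take the most recent left-table row at or before that time. This is illustrative only - Deephaven's aj also handles ticking tables and exact-match keys:

```python
from bisect import bisect_right

left_times = [1, 3, 5, 7]         # sorted timestamps from the left table
left_vals = ["a", "b", "c", "d"]  # values paired with left_times

def asof_lookup(t):
    # Index of the last left timestamp <= t, or None if none exists.
    idx = bisect_right(left_times, t) - 1
    return left_vals[idx] if idx >= 0 else None

asof_lookup(4)  # "b": the last left row at or before time 4 is at time 3
asof_lookup(0)  # None: no left row exists at or before time 0
```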

24. Create a real-time table

Why not create your own real-time data? It's easy with DynamicTableWriter. My query below is only meant to illustrate how to use it - the data is simple. This method of creating real-time data has many applications, including real-time crypto feeds, classifying Slack messages, and much more.

from deephaven import DynamicTableWriter
from deephaven import dtypes as dht

import threading
import time

import numpy as np

col_defs = {"X": dht.double, "Y": dht.double}

table_writer = DynamicTableWriter(col_defs)
real_time_table = table_writer.table

def write_to_table():
    for i in range(1000):
        val = 0.1 * i
        table_writer.write_row(val, np.sin(val))
        time.sleep(0.1)

thread = threading.Thread(target=write_to_table)
thread.start()

Want more?

Our documentation has loads of information, examples, and explanations to help you quickly become a Deephaven ninja. Reach out to us on Slack with feedback, questions, or more!