Mingle Pandas, NumPy, and Deephaven

Using Deephaven from within Python is very easy. Let’s dive in.

Key Difference Between Deephaven and Pandas

Both Deephaven and Pandas provide functionality for working with tables of data -- Pandas dataframes and Deephaven tables. While both products can perform many of the same operations, the syntax and a few approaches are different.

The biggest conceptual difference between Pandas dataframes and Deephaven tables is mutability. There are two types of mutability to consider: recipe and data.

PandasDeephaven
RecipeMutableImmutable
DataImmutableMutable

First, let’s consider the recipe used to create a table. In Pandas, the recipe for creating a dataframe is mutable and can be changed. In this example, Pandas dataframe df is created. Next, df is modified by adding a new column.

import pandas as pd
df = pd.DataFrame({'A': [1,2,3]})
df['B'] = 3 + df['A']

The exact same scenario using Deephaven tables looks like:

from deephaven.TableTools import col, newTable
t1 = newTable(col('A',1,2,3))
t2 = t1.update("B=A+3")

Here Deephaven table t1 is created. Next, a new column is added to t1. Because the recipe to create Deephaven tables cannot be changed, a new table t2 is created, instead of modifying t1.

Next, consider data mutability. In Pandas, data within a table does not change. Dataframe content is static. To change data in a dataframe, the recipe for the dataframe must change. On the other hand, Deephaven tables support real-time, dynamic data. Once a Deephaven table is created, the data inside can dynamically change.

tDynamic = db.timeTable("00:00:01").view("X=i")

Here a Deephaven table is created which adds one new row per second. The code above defines the immutable recipe for creating the table. This recipe is then used to update the contents of the resulting table as new data becomes available.

Common Operations

Conversions Between Pandas and Deephaven

In many cases, it can be beneficial to convert between Deephaven tables and Pandas dataframes.

from deephaven import tableToDataFrame, dataFrameToTable, columnToNumpyArray
from deephaven.TableTools import emptyTable
from math import sqrt
t1 = emptyTable(10).update("X=i","Y=X*X+X")
y = columnToNumpyArray(t1,"Y")
print(y)
df = tableToDataFrame(t1)
df["Z"] = [sqrt(i) for i in range(10)]
t2 = dataFrameToTable(df)

Create a Table

Table creation tools are located in the deephaven.TableTools module.

import pandas as pd
df = pd.DataFrame({'A': [1,2,3]})

from deephaven.TableTools import col, newTable
t = newTable(col('A',1,2,3))

Load a Table

Deephaven tables can be loaded from files, such as CSV, or they can be loaded from tables already stored in the database.

import pandas as pd
df = pd.read_csv('foo.csv')
from deephaven.TableTools import readCsv
t1 = readCsv('foo.csv')
t2 = db.t("LearnDeephaven", "StockTrades")

It is also possible to load tables directly into the web user interface:

img

Add and Remove Columns

Deephaven tables have methods for adding and removing columns. The most commonly used methods are update, view, updateView, select, renameColumns, and dropColumns. Each of these methods has different memory and performance characteristics. See Select for details.

import pandas as pd
df = pd.DataFrame({'A': [1,2,3]})
df['B'] = 3 + df['A']
df = df.drop(columns=['A'])

from deephaven.TableTools import col, newTable
t = newTable(col('A',1,2,3)).update("B=A+3").dropColumns("A")

Deephaven’s query language can be used to make complex logic very readable. Here, a table is created which

  1. contains a subset of the original columns,
  2. has renamed the Last column to Price,
  3. added a new column Dollars, which is computed on demand, and
  4. added a new in-memory column Importance, which is computed from a user-defined Python function.
from math import sqrt

def compute_importance(price,size):
    return price*sqrt(size)

t = db.t("LearnDeephaven", "StockTrades") \
 .view("Date", "Sym", "Price=Last", "Size", "Dollars=Price*Size") \
 .update("Importance=(double)compute_importance.call(Price,Size)")

Filtering

Selecting a subset of a table’s data is very straightforward. For Deephaven, these selections will work for both real-time and static tables.

def is_even(x):
    return x % 2 == 0

import pandas as pd
df = pd.DataFrame({'A': [*range(10)]})
df2 = df[df['A'] % 2 == 0]
df3 = df[is_even(df['A'])]

from deephaven.TableTools import emptyTable
t = emptyTable(10).update("A=i")
t2 = t.where("A % 2 == 0")
t3 = t.where("(boolean)is_even.call(A)")

Here, the data selection is performed using a simple statement as well as a user-defined function. In addition to basic filtering, Deephaven provides more advanced filtering options.

from deephaven.TableTools import emptyTable, col, newTable
t = emptyTable(10).update("A=i", "B=i*i")
x = newTable(col('A',1,2,3))
t1 = t.where("A % 2 = 0", "B>3")
t2 = t.where("A % 2 = 0 || B>3")
t3 = t.where("A in 0, 3, 5")
t4 = t.where("A not in 0, 3, 5")
t5 = t.head(3)
t6 = t.tail(3)
t7 = t.whereIn(x, "A")
t8 = t.whereNotIn(x, "A")

Metadata

When working with tables, it is important to understand the data contained within the table. In Pandas, table metadata can be obtained through df.frames. In Deephaven, a metadata table can be obtained through getMeta().

import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [1.1,2.2,3.3]})
print(df.dtypes)

from deephaven.TableTools import col, newTable
t = newTable(col('A',1,2,3), col('B',1.1,2.2,3.3))
m = t.getMeta()

img

In addition to basic column name and column type information, getMeta() provides details about how columns are stored, so that you can make informed decisions about how to structure your query. See Think like a Deephaven ninja for details.

Sorting

Sort tables by the contents of columns. For Deephaven, these sorts will work for both real-time and static tables.

import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4], 'B': [7,2,2,4], 'C': [0, 1, 2, 3]})
df1 = df.sort_values(by='B')
df2 = df.sort_values(by='B', ascending=False)
df3 = df.sort_values(by=['B','C'], ascending=[True,False])

from deephaven.TableTools import col, newTable
from deephaven import SortPair
t = newTable(col('A',1,2,3,4), col('B',7,2,2,4), col('C', 0, 1, 2, 3))
t1 = t.sort("B")
t2 = t.sortDescending("B")
t3 = t.sort(SortPair.ascending("B"), SortPair.descending("C"))

Concatenate

Concatenate multiple tables into a single table. For Deephaven, these concatenations will work for both real-time and static tables.

import pandas as pd
df1 = pd.DataFrame({'A': [1,2]})
df2 = pd.DataFrame({'A': [3,4]})
df = pd.concat([df1, df2])

from deephaven.TableTools import col, newTable, merge
t1 = newTable(col('A',1,2))
t2 = newTable(col('A',3,4))
t = merge(t1, t2)

Aggregations

Data can be aggregated and analyzed in many different ways. The most simple case is averaging the values in a table.

import pandas as pd
df = pd.DataFrame({'A': [1,1,2,2], 'B': [7,2,2,4], 'C': [0, 1, 2, 3]})
print(df.mean())

The same operation is easy in Deephaven. They work with both real-time and static tables. Additionally, the data can be grouped by key columns and averaged (t2 below).

from deephaven.TableTools import col, newTable
t = newTable(col('A',1,1,2,2), col('B',7,2,2,4), col('C',0,1,2,3))
t1 = t.avgBy()
t2 = t.avgBy("A")

Deephaven tables have many of these dedicated aggregators. The most commonly used are firstBy, lastBy, sumBy, avgBy, stdBy, varBy, medianBy, minBy, maxBy, countBy, headBy, and tailBy.

Sometimes you will want to compute more than one type of aggregation on a table at once. For example, you would like to compute a min and a max of a column. This can be done using combined aggregations. Combined aggregations work with both real-time and static tables, and data can be grouped by key columns (t2 below).

from deephaven.TableTools import col, newTable
from deephaven.ComboAggregateFactory import AggCombo, AggMin, AggMax, AggAvg
t = newTable(col('A',1,1,2,2), col('B',7,2,2,4), col('C',0,1,2,3))
t1 = t.by(AggCombo(AggMin("MinB=B"),AggMax("MaxB=B"),AggAvg("AvgB=B")))
t2 = t.by(AggCombo(AggMin("MinB=B"),AggMax("MaxB=B"),AggAvg("AvgB=B")), "A")

The most commonly used combined aggregations are AggMin, AggMax, AggSum, AggVar, AggAvg, AggWAvg, AggStd, AggFirst, AggLast, AggMed, AggPct, AggCount, and AggArray.

Deephaven tables also have the ability to group data into and ungroup data from arrays. This is somewhat similar to groupby in Pandas.

from deephaven.TableTools import col, newTable
t = newTable(col('A',1,1,2,2), col('B',7,2,2,4), col('C',0,1,2,3))
t1 = t.by("A")
t2 = t1.update("Sum = sum(B)")
t3 = t2.ungroup()

Finally, Deephaven tables support user-defined aggregations.

Note

See also: Aggregate

from deephaven.TableTools import col, newTable
t = newTable(col('A',1,1,2,2), col('B',1.0,4,9,16))
t1 = t.applyToAllBy("sqrt(sum(each))", 'A')

Joins

There are many ways to join data from multiple tables into a single table. The most common Deephaven join operations are naturalJoin, exactJoin, aj (as-of join), raj (reverse as-of join), leftJoin, and join. All of these joins work for both real-time and static tables. Many of these operations do not have analogs in Pandas. Full coverage of joins is beyond the scope of this document.

Note

See also: Join

import pandas as pd
dfLeft = pd.DataFrame({'A': ['A','B'], 'B': [1,2]})
dfRight = pd.DataFrame({'A': ['A','B'], 'C': [6,7]})
df = pd.merge(dfLeft, dfRight, on='A')

from deephaven.TableTools import col, newTable
tLeft = newTable(col('A','A','B'), col('B',1,2))
tRight = newTable(col('A','A','B'), col('C',6,7))
t = tLeft.exactJoin(tRight,"A")

Missing Data

Each Deephaven datatype has a special value to indicate missing data. Primitive data types use NULL_FLOAT, NULL_DOUBLE, NULL_CHAR, NULL_BYTE, NULL_SHORT, NULL_INT, and NULL_LONG. Complex data types use Python’s None / Java’s null.

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1,2,3], 'B': [1.1,np.nan,3.3]})
df1 = df.isna()
df2 = df.dropna()
df3 = df.fillna(value=7.3)

from deephaven.TableTools import col, newTable
from deephaven.conversion_utils import NULL_DOUBLE
t = newTable(col('A',1,2,3), col('B',1.1,NULL_DOUBLE,3.3))
t1 = t.update("IsMissing = isNull(B)")
t2 = t.where("!isNull(B)")
t3 = t.update("B=isNull(B) ? 7.3 : B")

String Operations

In the Deephaven Query Language, Java String methods can be used.

from deephaven.TableTools import col, newTable
t = newTable(col('A',"aBc","X/Y/Z")) \
 .update("Len=A.length()", "Append=A+`_suffix`","Split=A.split(`/`)", "First=A.split(`/`)[0]", "Lower=A.toLowerCase()", "Upper=A.toUpperCase()", "Replace=A.replace(`/`,`_`)")

Plotting

Just as Pandas has extensive plotting functionality built-in and via matplotlib, Deephaven has a powerful plotting library which can operate on real-time, static, and OneClick tables.

Note

See also: Plotting

from deephaven.TableTools import emptyTable
from deephaven import Plot
t = emptyTable(100).update("X=i","Y=sqrt(X)")
plot = Plot.plot("Sqrt", t, "X", "Y").show()

img