Skip to main content
Version: Python

How to use Pandas in Python queries

This guide shows you how to use Pandas in your Python queries in Deephaven.

Pandas is the most used library for data analysis and manipulation in Python. Using Pandas with Deephaven can make your queries more flexible.

Table to DataFrame

A DataFrame is a two-dimensional tabular data structure that is native to Pandas. With Deephaven, we can convert between Deephaven tables and Pandas DataFrames.

First, let's create a Deephaven table.

from deephaven import new_table

from deephaven.column import string_col, int_col, float_col, double_col

source = new_table([
string_col("Strings", ["String 1", "String 2", "String 3"]),
int_col("Ints", [4, 5, 6]),
float_col("Floats", [9.9, 8.8, 7.7]),
double_col("Doubles", [0.1, 0.2, 0.3])
])

To convert the Deephaven table to a data frame, import the deephaven.pandas.to_pandas method and then perform the conversion. To see the data frame, we print it.

from deephaven import pandas as dhpd

data_frame = dhpd.to_pandas(source)
print(data_frame)

DataFrame to Table

Users often perform analysis which results in a Pandas data frame. To convert a data frame to a Deephaven table, we start with the data frame created above and map that to a Deephaven table using the pandas.to_table method.

from deephaven import pandas as dhpd

new_table = dhpd.to_table(data_frame)

The new Deephaven table will display in the IDE and the data will match the original data. To check that the data type conversions are accurate, we can look at the table metadata.

For the data frame, we print the data types in the Console. For the Deephaven table, we create a new table containing the metadata information.

print(data_frame.dtypes)

meta_table = new_table.meta_table

Pandas uses float32 and float64 data types, which are equivalent to float and double in Deephaven. These are the same type and require the same memory. A String in Deephaven is an object in Pandas.

How to do common Pandas operations in Deephaven

Deephaven tables and Pandas data frames both contain tabular data. In both cases, users want to perform the same kinds of operations, such as creating tables, filtering tables, and aggregating tables. Below we present how to do the same operations with both Pandas and Deephaven.

In these examples, keep in mind that Pandas data frames are mutable while Deephaven tables are immutable but can have data that changes dynamically. This results in differences in how some operations are approached.

Create a table

Creating a Pandas data frame or Deephaven table is very similar.

import pandas as pd

data_frame = pd.DataFrame(
{'A': [1, 2, 3],
'B': ['X', 'Y', 'Z']}
)

print(data_frame)


from deephaven import new_table
from deephaven.column import int_col, string_col

table = new_table([
int_col('A', [1, 2, 3]),
string_col('B', ['X', 'Y', 'Z'])
])

Metadata

Let's explore how the data is handled by looking at the metadata:

print(data_frame.dtypes)

meta_DH_table = table.meta_table

Pandas has fewer data types than Deephaven, so notice that column B in table is appropriately a char, while the column type in data_frame is an object.

Deephaven supports many data types. To learn more about creating tables, see our guide How to create a table with new_table.

Column manipulation

You'll often want to perform operations on whole columns. Deephaven has various methods for viewing, selecting, updating, eliminating, changing, and creating columns of data in tables. The choice of each can result in performance differences. See our guide Choosing the right selection method for your query or How to select, view, and update data in tables for detailed advice.

Add

In this case, we wish to add a column C that is equal to column A plus 5.

added_data_frame = data_frame.assign(C=data_frame['A'] + 5)
print(added_data_frame)


added_table = table.update(formulas=["C = A + 5"])

Remove

We can remove whole columns with drop in Pandas or drop_columns in Deephaven.

dropped_data_frame = data_frame.drop(columns=['A'])
print(dropped_data_frame)


dropped_table = table.drop_columns(cols=["A"])

Rename

Renaming columns in a data frame or Deephaven table is simple:

data_frame.rename(columns={"A": "X", "B": "B"}, inplace=True)
print(data_frame)


renamed_table = table.rename_columns(cols=["X = A"])

Filter

Deephaven offers several types of filters. See our article on how to use filters.

Filtering in Deephaven and Pandas has the same logic.

First, define a data set:

import pandas as pd

data_frame = pd.DataFrame(
{'A': [1, 2, 3],
'B': ['X', 'Y', 'Z']}
)

print(data_frame)


from deephaven import new_table
from deephaven.column import int_col, string_col

table = new_table([
int_col('A', [1, 2, 3]),
string_col('B', ['X', 'Y', 'Z'])
])

where

We can limit the columns to certain values that match a formula. See our article on how to use formulas.

filtered_data_frame = data_frame[data_frame.A < 2]
print(filtered_data_frame)

filtered_table = table.where(filters=["A < 2"])

Head and tail

We can also return just the head or tail of the data frame or table. Below, we request the first three rows:

head_data_frame = data_frame.iloc[:3]
print(head_data_frame)


head_table = table.head(3)

Below, we request the last three rows:

tail_data_frame = data_frame.iloc[-3:]
print(tail_data_frame)

tail_table = table.tail(3)

Sort

Sorting changes the order of values in a data set based upon comparison operations. All data is still in the data set but in a different order. In Deephaven, data can be sorted by query or from the UI. UI sorting only changes how the data is displayed. It does not change the underlying data.

For this example, we want a slightly larger data set:

import pandas as pd

data_frame = pd.DataFrame(
{'A': [1, 2, 3, 4, 5, 6],
'B': ['Z', 'Y', 'X', 'X', 'Y', 'Z'],
'C': [7, 2, 1, 5, 3, 4]}
)

print(data_frame)


from deephaven import new_table, SortDirection
from deephaven.column import int_col, string_col

table = new_table([
int_col('A', [1, 2, 3, 4, 5, 6]),
string_col('B', ['Z', 'Y', 'X', 'X', 'Y', 'Z']),
int_col('C', [7, 2, 1, 5, 3, 4])
])

Sort ascending

We can sort in ascending order on a data set for a data frame or table:

sorted_data_frame = data_frame.sort_values(by='B')
print(sorted_data_frame)


sorted_table = table.sort(order_by=["B"])

Sort descending

We can sort descending on a data set for a data frame or table:

sorted_data_frame = data_frame.sort_values(by='B', ascending=False)
print(sorted_data_frame)


sorted_table = table.sort(order_by=["B"], order=[SortDirection.DESCENDING])

Specify sort direction for each column

To sort on different directions with one query, use SortColumns in the sort argument:

sorted_data_frame = data_frame.sort_values(by=['B','C'], ascending=[True,False])
print(sorted_data_frame)

sort_columns = [
SortDirection.ASCENDING,
SortDirection.DESCENDING
]

sorted_table = table.sort(order_by=['B','C'], order=sort_columns)

Combine data

Concat

In Pandas, concat allows tables to be vertically combined, stacked on top of each other. The same operation can be performed using merge on Deephaven tables. The combined columns should have the same data type.

import pandas as pd

data_frame1 = pd.DataFrame({'A': [1, 2]})
data_frame2 = pd.DataFrame({'A': [3, 4]})
data_frame = pd.concat([data_frame1, data_frame2])

print(data_frame)


from deephaven import merge, new_table
from deephaven.column import int_col


table1 = new_table([int_col('A', [1, 2])])
table2 = new_table([int_col('A', [3, 4])])
table = merge([table1, table2])

Join

Deephaven's many join methods combine data by appending the columns of one data set to another. See our guide How to join tables to learn more.

Pandas and Deephaven provide many of the same join methods, but there is not a one-to-one mapping of methods. In addition to the common join methods, Deephaven also provides inexact joins , such as aj (as-of join) and raj (reverse as-of join), for analyzing time series, which are not present in Pandas.

import pandas as pd

data_frameLeft = pd.DataFrame({'A': [1, 2, 3], 'B': ['X', 'Y', 'Z']})
data_frameRight = pd.DataFrame({'A': [3, 4, 5], 'C': ['L', 'M', 'N']})
data_frame = pd.merge(data_frameLeft, data_frameRight, on='A')

print( data_frame)


from deephaven import new_table
from deephaven.column import int_col, string_col

tableLeft = new_table([int_col("A", [1, 2, 3]), string_col("B", ['X', 'Y', 'Z'])])
tableRight = new_table([int_col("A", [3, 4, 5]), string_col("C", ['L', 'M', 'N'])])
table = tableLeft.join(table=tableRight, on=["A"])

Group data

Aggregate data

You'll often want to partition your data into groups and then compute values for the groups. Deephaven supports many kinds of data aggregations. There are more methods than can be covered here, so see our guides How to perform dedicated aggregations for groups and How to perform combined aggregations.

import pandas as pd

data_frame = pd.DataFrame(
{'A': [1, 3, 5],
'B': [5, 7, 9]}
)

avg_data_frame = data_frame.mean()
print(avg_data_frame)


from deephaven import new_table
from deephaven.column import int_col

table = new_table([
int_col("A", [1, 3, 5]),
int_col("B", [5, 7, 9])
])

avg_table = table.avg_by()

groupby

In this example, we first group the data, then apply a sum on that group. For more information on grouping, see our How to group and ungroup data guide.

import pandas as pd

data_frame = pd.DataFrame(
{'A': [1, 2, 1, 2, 1, 2],
'B': [2, 2, 5, 1, 3, 4]}
)

grouped_data_frame = data_frame.groupby(['A']).sum()
print(grouped_data_frame)


from deephaven import new_table
from deephaven.column import int_col

table = new_table([
int_col('A', [1, 2, 1, 2, 1, 2]),
int_col('B', [2, 2, 5, 1, 3, 4])
])

grouped_table1 = table.group_by(by=["A"]).view(formulas=["Sum = sum(B)"])

from deephaven import agg as agg

grouped_table2 = table.agg_by([agg.sum_(cols=["B"])], by=["A"])

Clean data

If your data set has null or NaN values, you'll probably want to remove or replace them before performing analysis. See our guide How to handle null, infinity, and not-a-number values for information on these data types in Deephaven.

In this example, we define a data set with a missing value. Pandas uses np.nan to represent missing double values, while Deephaven uses NULL_DOUBLE.

import pandas as pd
import numpy as np

data_frame = pd.DataFrame(
{'A': [1.0, 2.0, 3.0],
'B': [4.0, 2.0, np.nan]}
)

print(data_frame)

from deephaven import new_table
from deephaven.column import double_col
from deephaven.constants import NULL_DOUBLE

table = new_table([
double_col('A', [1.0, 2.0, 3.0]),
double_col('B', [4.0, 2.0, NULL_DOUBLE])
])

We can filter the data sets to remove the missing values.

remove_values_data_frame = data_frame.dropna()
print(remove_values_data_frame)


remove_values_table = table.where(filters=["!isNull(B)"])

Or we can replace the missing values.

replace_values_data_frame = data_frame.fillna(value=0.0)
print(replace_values_data_frame)


replace_values_table = table.update(formulas=["B = isNull(B) ? 0.0 : B"])