Filter table data

This guide covers filtering table data in Deephaven. Many different table operations can be used to filter unwanted data out of a table. Some remove data based on conditional formulas, whereas others remove data based on row indices or columns.

The following table operations remove data that does not meet the conditions set forth by one or more conditional filter formulas:

The following table operations remove data based on row indices:

The following table operations remove data based on equality in one or more columns of a separate table:

Conditional filtering

Conditional filtering applies one or more comparison formulas to remove data that does not meet the specified criteria. Comparison formulas can use match filters, range filters, string filters, and regular expression filters to remove unwanted data. The comparison formulas used in conditional filtering must equate to either true or false. When true, the data is kept in the table. When false, the data is removed.

The following code block imports the Iris dataset found in Deephaven's example repository. Subsequent examples filter this data using comparison formulas:

from deephaven import read_csv

iris = read_csv(
    "https://media.githubusercontent.com/media/deephaven/examples/main/Iris/csv/iris.csv"
)

Match filters

Match filters set forth one of the following criteria:

Equality (= and ==): If a value is equal to another value, the result is true. Otherwise, the result is false.
Inequality (!=): If a value is not equal to another value, the result is true. Otherwise, the result is false.
in: If a value exists within a set of values, the result is true. Otherwise, the result is false.
not in: If a value does not exist within a set of values, the result is true. Otherwise, the result is false.
icase in: The same as in, but capitalization of letters is ignored. This should only be used to filter string columns.
icase not in: The same as not in, but capitalization of letters is ignored. This should only be used to filter string columns.

The following code block applies an equality filter and an inequality filter:

filtered_by_sepal_width = iris.where(filters=["SepalWidthCM = 3.5"])
not_virginica = iris.where(filters=["Class != 'Iris-virginica'"])

The following code block applies in, not in, icase in, and icase not in:

setosa_and_virginica = iris.where(filters=["Class in `Iris-setosa`, `Iris-virginica`"])
not_setosa_or_virginica = iris.where(
    filters=["Class not in `Iris-setosa`, `Iris-virginica`"]
)
virginica = iris.where(filters=["Class icase in `iris-virginica`"])
not_versicolor = iris.where(filters=["Class icase not in `iris-versicolor`"])

Range filters

Range filters remove data that does not fall into a range of values. Range filters use any of the following operators:

<: Less than
<=: Less than or equal to
>: Greater than
>=: Greater than or equal to
inRange: Checks if a value is in a given range

The following code block applies each operator to remove unwanted data:

sepalwidth_lessthan_3 = iris.where(filters=["SepalWidthCM < 3.0"])
petallength_greaterthan_2 = iris.where(filters=["PetalLengthCM > 2.0"])
sepallength_greaterthan_or_equalto_6 = iris.where(filters=["SepalLengthCM >= 6.0"])
petalwidth_lessthan_or_equalto_1 = iris.where(filters=["PetalWidthCM <= 1"])
petalwidth_inrange = iris.where(filters=["inRange(PetalWidthCM, 0, 1)"])

String filters

Strings in Deephaven tables are stored as Java strings. Any method that can be called on a Java string can be used to filter string data. Methods such as startsWith, endswith, contains, and matches are particularly useful.

The following code block applies these methods to filter the iris table on its String column, Class:

class_startswith = iris.where(filters="Class.startsWith(`Iris`)")
class_endswith = iris.where(filters="Class.endsWith(`setosa`)")
class_contains = iris.where(filters="Class.contains(`vir`)")
class_matches = iris.where(filters="Class.matches(`Iris-versicolor`)")

Regular expression (regex) filters

Regular expression filtering is another type of string filtering that uses regular expressions to remove unwanted data. Deephaven's filters submodule enables the use of regex in filtering operations.

When performing regex filtering with the filters submodule, the filters are applied with a specific pattern mode. Two pattern modes are available:

FIND looks for a subsequence match in the string.
MATCHES matches an entire string against the given pattern.

The following code block performs regular expression filtering on the iris table with both pattern modes. The first finds strings that are eleven characters long, whereas the second looks for the subsequence virginica in the Class column:

from deephaven.filters import PatternMode, pattern

filter_eleven_chars = pattern(PatternMode.MATCHES, "Class", "...........")
filter_regex_match = pattern(PatternMode.FIND, "Class", "virginica")

iris_eleven_chars = iris.where(filter_eleven_chars)
iris_virginica = iris.where(filter_regex_match)

Combine filters

Multiple match and/or conditional statements can be combined to filter data in a table. These combinations can be either conjunctive or disjunctive.

Conjunctive

Conjunctive filtering is used to return a table where all conditional filters in a where clause return true. Conjunctive filtering can be achieved by passing multiple filters into where via a list, or by using the && operator in a single filter.

In the following example, a conjunctive filter is applied to the iris table to produce a new table of only Iris setosa flowers with a petal length in a specific range.

conjunctive_filtered_iris = iris.where(
    filters=["Class in `Iris-setosa`", "PetalLengthCM >= 1.3 && PetalLengthCM <= 1.6"]
)

Disjunctive

Disjunctive filtering is used to return a table where one or more of the statements return true. This can be achieved using the || operator in a where statement or with where_one_of.

In the following example, two filters work disjunctively to return a new table where the petal length is greater than 1.9 cm or less than 1.3 cm.

or_filtered_iris = iris.where(filters=["PetalLengthCM > 1.9 || PetalWidthCM < 1.3"])
iris_where_one_of = iris.where_one_of(
    filters=["PetalLengthCM > 1.9", "PetalWidthCM < 1.3"]
)

Filter one table based on another

The where_in and where_not_in methods enable filtering of one table based on another table. These two methods are evaluated whenever either table changes, whereas where is only evaluated when the filtered table ticks. In the example below, the where_in and where_not_in methods are used to find Iris virginica sepal widths that match and do not match Iris versicolor sepal widths:

virginica = iris.where(filters=["Class in `Iris-virginica`"])
versicolor = iris.where(filters=["Class in `Iris-versicolor`"])
virginica_matching_petal_widths = virginica.where_in(
    filter_table=versicolor, cols=["PetalWidthCM"]
)
virginica_non_matching_petal_widths = virginica.where_not_in(
    filter_table=versicolor, cols=["PetalWidthCM"]
)

Caution

where_in and where_not_in are inefficient if the filter table updates frequently.

Tip

Unlike natural_join, where_in can be used when there is more than one matching value in the right table for values in the left table. This is true of join as well, but where_in is faster at returning matching rows than join.

where_in only provides filtering and does not allow columns to be added from the right table. In some cases, it may be desirable to use where_in to filter and then join to add columns from the right table. This provides similar performance to natural_join while still allowing matches from the right table.

Filter by row index

Filtering by row index removes unwanted data at the top, middle, or end of a table. Row indices can be chosen on their own or by percentage of the total size of a table.

The following example uses head, tail, and slice to keep only the first, middle, and last 10 rows of iris, respectively.

iris_head = iris.head(10)
iris_slice = iris.slice(70, 80)
iris_tail = iris.tail(10)

The following example uses head_pct, slice_pct, and tail_pct to keep only 10% of the rows at the top, middle, and end of the table, respectively.

iris_head_pct = iris.head_pct(0.1)
iris_slice_pct = iris.slice_pct(0.45, 0.55)
iris_tail_pct = iris.tail_pct(0.1)