Filter table data
This guide covers filtering table data in Deephaven. Many different table operations can be used to filter unwanted data out of a table. Some remove data based on conditional formulas, whereas others remove data based on row indices or columns.
The following table operations remove data that does not meet the conditions set forth by one or more conditional filter formulas:
The following table operations remove data based on row indices:
The following table operations remove data based on equality in one or more columns of a separate table:
Conditional filtering
Conditional filtering applies one or more comparison formulas to remove data that does not meet the specified criteria. Comparison formulas can use match filters, range filters, string filters, and regular expression filters to remove unwanted data. The comparison formulas used in conditional filtering must equate to either true
or false
. When true
, the data is kept in the table. When false
, the data is removed.
The following code block imports the Iris dataset found in Deephaven's example repository. Subsequent examples filter this data using comparison formulas:
from deephaven import read_csv
iris = read_csv(
"https://media.githubusercontent.com/media/deephaven/examples/main/Iris/csv/iris.csv"
)
- iris
Match filters
Match filters set forth one of the following criteria:
- Equality (
=
and==
): If a value is equal to another value, the result istrue
. Otherwise, the result isfalse
. - Inequality (
!=
): If a value is not equal to another value, the result istrue
. Otherwise, the result isfalse
. in
: If a value exists within a set of values, the result istrue
. Otherwise, the result isfalse
.not in
: If a value does not exist within a set of values, the result istrue
. Otherwise, the result isfalse
.icase in
: The same asin
, but capitalization of letters is ignored. This should only be used to filter string columns.icase not in
: The same asnot in
, but capitalization of letters is ignored. This should only be used to filter string columns.
The following code block applies an equality filter and an inequality filter:
filtered_by_sepal_width = iris.where(filters=["SepalWidthCM = 3.5"])
not_virginica = iris.where(filters=["Class != 'Iris-virginica'"])
- filtered_by_sepal_width
- not_virginica
The following code block applies in
, not in
, icase in
, and icase not in
:
setosa_and_virginica = iris.where(filters=["Class in `Iris-setosa`, `Iris-virginica`"])
not_setosa_or_virginica = iris.where(
filters=["Class not in `Iris-setosa`, `Iris-virginica`"]
)
virginica = iris.where(filters=["Class icase in `iris-virginica`"])
not_versicolor = iris.where(filters=["Class icase not in `iris-versicolor`"])
- setosa_and_virginica
- not_setosa_or_virginica
- virginica
- not_versicolor
Range filters
Range filters remove data that does not fall into a range of values. Range filters use any of the following operators:
<
: Less than<=
: Less than or equal to>
: Greater than>=
: Greater than or equal toinRange
: Checks if a value is in a given range
The following code block applies each operator to remove unwanted data:
sepalwidth_lessthan_3 = iris.where(filters=["SepalWidthCM < 3.0"])
petallength_greaterthan_2 = iris.where(filters=["PetalLengthCM > 2.0"])
sepallength_greaterthan_or_equalto_6 = iris.where(filters=["SepalLengthCM >= 6.0"])
petalwidth_lessthan_or_equalto_1 = iris.where(filters=["PetalWidthCM <= 1"])
petalwidth_inrange = iris.where(filters=["inRange(PetalWidthCM, 0, 1)"])
- sepalwidth_lessthan_3
- petallength_greaterthan_2
- sepallength_greaterthan_or_equalto_6
- petalwidth_lessthan_or_equalto_1
- petalwidth_inrange
String filters
Strings in Deephaven tables are stored as Java strings. Any method that can be called on a Java string can be used to filter string data. Methods such as startsWith
, endswith
, contains
, and matches
are particularly useful.
The following code block applies these methods to filter the iris
table on its String column, Class
:
class_startswith = iris.where(filters="Class.startsWith(`Iris`)")
class_endswith = iris.where(filters="Class.endsWith(`setosa`)")
class_contains = iris.where(filters="Class.contains(`vir`)")
class_matches = iris.where(filters="Class.matches(`Iris-versicolor`)")
- class_startswith
- class_endswith
- class_contains
- class_matches
Regular expression (regex) filters
Regular expression filtering is another type of string filtering that uses regular expressions to remove unwanted data. Deephaven's filters
submodule enables the use of regex in filtering operations.
When performing regex filtering with the filters
submodule, the filters are applied with a specific pattern mode. Two pattern modes are available:
FIND
looks for a subsequence match in the string.MATCHES
matches an entire string against the given pattern.
The following code block performs regular expression filtering on the iris
table with both pattern modes. The first finds strings that are eleven characters long, whereas the second looks for the subsequence virginica
in the Class
column:
from deephaven.filters import PatternMode, pattern
filter_eleven_chars = pattern(PatternMode.MATCHES, "Class", "...........")
filter_regex_match = pattern(PatternMode.FIND, "Class", "virginica")
iris_eleven_chars = iris.where(filter_eleven_chars)
iris_virginica = iris.where(filter_regex_match)
- iris_eleven_chars
- iris_virginica
Combine filters
Multiple match and/or conditional statements can be combined to filter data in a table. These combinations can be either conjunctive or disjunctive.
Conjunctive
Conjunctive filtering is used to return a table where all conditional filters in a where
clause return true
. Conjunctive filtering can be achieved by passing multiple filters into where
via a list, or by using the &&
operator in a single filter.
In the following example, a conjunctive filter is applied to the iris
table to produce a new table of only Iris setosa flowers with a petal length in a specific range.
conjunctive_filtered_iris = iris.where(
filters=["Class in `Iris-setosa`", "PetalLengthCM >= 1.3 && PetalLengthCM <= 1.6"]
)
- conjunctive_filtered_iris
Disjunctive
Disjunctive filtering is used to return a table where one or more of the statements return true
. This can be achieved using the ||
operator in a where
statement or with where_one_of
.
In the following example, two filters work disjunctively to return a new table where the petal length is greater than 1.9 cm or less than 1.3 cm.
or_filtered_iris = iris.where(filters=["PetalLengthCM > 1.9 || PetalWidthCM < 1.3"])
iris_where_one_of = iris.where_one_of(
filters=["PetalLengthCM > 1.9", "PetalWidthCM < 1.3"]
)
- or_filtered_iris
- iris_where_one_of
Filter one table based on another
The where_in
and where_not_in
methods enable filtering of one table based on another table. These two methods are evaluated whenever either table changes, whereas where
is only evaluated when the filtered table ticks. In the example below, the where_in
and where_not_in
methods are used to find Iris virginica sepal widths that match and do not match Iris versicolor sepal widths:
virginica = iris.where(filters=["Class in `Iris-virginica`"])
versicolor = iris.where(filters=["Class in `Iris-versicolor`"])
virginica_matching_petal_widths = virginica.where_in(
filter_table=versicolor, cols=["PetalWidthCM"]
)
virginica_non_matching_petal_widths = virginica.where_not_in(
filter_table=versicolor, cols=["PetalWidthCM"]
)
- virginica
- versicolor
- virginica_matching_petal_widths
- virginica_non_matching_petal_widths
where_in
and where_not_in
are inefficient if the filter table updates frequently.
Unlike natural_join
, where_in
can be used when there is more than one matching value in the right table for values in the left table. This is true of join
as well, but where_in
is faster at returning matching rows than join
.
where_in
only provides filtering and does not allow columns to be added from the right table. In some cases, it may be desirable to use where_in
to filter and then join
to add columns from the right table. This provides similar performance to natural_join
while still allowing matches from the right table.
Filter by row index
Filtering by row index removes unwanted data at the top, middle, or end of a table. Row indices can be chosen on their own or by percentage of the total size of a table.
The following example uses head
, tail
, and slice
to keep only the first, middle, and last 10 rows of iris
, respectively.
iris_head = iris.head(10)
iris_slice = iris.slice(70, 80)
iris_tail = iris.tail(10)
- iris_head
- iris_slice
- iris_tail
The following example uses head_pct
, slice_pct
, and tail_pct
to keep only 10% of the rows at the top, middle, and end of the table, respectively.
iris_head_pct = iris.head_pct(0.1)
iris_slice_pct = iris.slice_pct(0.45, 0.55)
iris_tail_pct = iris.tail_pct(0.1)
- iris_head_pct
- iris_slice_pct
- iris_tail_pct