How to use filter query methods
This guide discusses how to properly use filters to exclude unwanted data from analysis in Deephaven. Topics covered include match and conditional filtering, conjunctive and disjunctive filtering, and filtering with head
and tail
.
For this how-to guide, we'll use the example data found in Deephaven's examples repository.
In this guide, we read data from Deephaven's examples repository. You can also load files that are in a mounted directory at the base of the Docker container. See Docker data volumes to learn more about the relation between locations in the container and the local file system.
To illustrate filtering in Deephaven, we'll use the Iris data set from the examples. This data set contains observations about Iris flowers from R. A. Fisher's classic 1936 paper, "The Use of Multiple Measurements in Taxonomic Problems". The paper describes categorizing plant varieties by using observable metrics. The data is often used to demonstrate machine learning categorization algorithms.
from deephaven import read_csv
iris = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/Iris/csv/iris.csv")
- iris
This produces the iris
table, which has five columns and 150 rows. The first four columns contain Iris measurement data, while the fifth column, Class
, is the Iris species name. The image below shows the first few entries:
Next, we'll show various ways to filter the data.
Match filters
Match filters use where
to filter out unwanted data. They come in six different flavors:
- equals (
=
and==
) - not equals (
!=
) in
not in
icase in
icase not in
equals (=
and ==
)
This method returns rows that have a matching value in a specified column. In the example below, the new table filtered_by_sepal_width
contains only the rows from the iris
table with a 3.5 cm sepal width.
filtered_by_sepal_width = iris.where(filters=["SepalWidthCM = 3.5"])
- filtered_by_sepal_width
The single equals (=
) and double equals (==
) can be used interchangeably in filters.
not equals (!=
)
This method returns rows that do not have a matching value in a specified column. In the example below, the new table not_filtered_by_virginica
contains all rows from the iris
table except those where the Class
row is equal to "Iris-virginica"
.
not_filtered_by_virginica = iris.where(filters=["Class != 'Iris-virginica'"])
- not_filtered_by_virginica
in
This method returns rows that contain a match of one or more values in a specified column. In the example below, the new table contains only Iris setosa and virginica flowers.
setosa_and_virginica = iris.where(filters=["Class in `Iris-setosa`, `Iris-virginica`"])
- setosa_and_virginica
not in
This method returns rows that do not contain a match of one or more values in a specified column. In the example below, the new table versicolor
contains only Iris versicolor flowers.
not_setosa_or_virginica = iris.where(filters=["Class not in `Iris-setosa`, `Iris-virginica`"])
- not_setosa_or_virginica
icase in
This method returns rows that contain a match of one or more values in a specified column, regardless of capitalization. In the example below, the new table virginica
contains only Iris virginica flowers.
virginica = iris.where(filters=["Class icase in `iris-virginica`"])
- virginica
icase not in
This method returns rows that do not contain a match of one or more values in a specified column, regardless of capitalization. In the example below, the new table not_versicolor
contains data for Iris setosa and virginica flowers.
not_versicolor = iris.where(filters=["Class icase not in `iris-versicolor`"])
- not_versicolor
Conditional filters
Like match filters, conditional filters use where
to filter out unwanted data. Conditional filters are used to filter data based on formulas other than those provided by match filters. These can be an arbitrary boolean formula.
Conditional filters frequently use:
=
and==
: is equal to!=
: is not equal to>
and<
: greater than and less than>=
and<=
: greater than or equal to and less than or equal to- Methods on strings (e.g.,
startsWith
,endsWith
,matches
,contains
)
Equality and inequality filtering
While filtering for equality is an example of match filtering, it becomes a conditional filter when adding other operations. In the example below, the equality filter becomes conditional when it checks the result of a modulo operation. The filter returns a table containing Iris flower data with petal width that is a multiple of 0.5 cm.
conditional_equality_filtered = iris.where(filters=["PetalWidthCM % 0.5 == 0"])
- conditional_equality_filtered
Range filtering
It's common to filter for data that falls with a range of values. Using one or more of >
, <
, >=
, <=
, and inRange
is the best way to achieve this.
In the example below, <
is used to filter by sepal width in a range. Then, inRange
is used to filter by petal width in a range.
sepal_width_less_than_three_CM = iris.where(filters=["SepalWidthCM < 3.0"])
petal_width_one_CM_or_less = iris.where(filters=["inRange(PetalWidthCM, 0, 1)"])
- sepal_width_less_than_three_CM
- petal_width_one_CM_or_less
String filtering
Methods on objects can be used to filter. Strings in Deephaven are represented as Java strings. Any methods on java.lang.String
can be called from within a query string. Methods such as startsWith
, endsWith
, contains
, and matches
can be useful for performing partial string matches.
In the two examples below, each operator is used to filter Iris data based on substring matches. startsWith
searches for a prefix, endsWith
searches for a suffix, contains
searches for a substring, and matches
searches for a regular expression match.
new_iris = iris.where(filters=["Class.startsWith(`Iris`)"])
setosa = iris.where(filters=["Class.endsWith(`setosa`)"])
- new_iris
- setosa
contains_versicolor = iris.where(filters=["Class.contains(`versicolor`)"])
matches_versicolor = iris.where(filters=["Class.matches(`.*versicolor.*`)"])
- contains_versicolor
- matches_versicolor
Combine filters
Multiple match and/or conditional statements can be combined to filter data in a table. These combinations can be either conjunctive or disjunctive.
Conjunctive filtering (AND)
Conjunctive filtering is used to return a table where all conditional filters in a where
clause return true.
In the following example, a conjunctive filter is applied to the iris
table to produce a new table of only Iris setosa flowers with a petal length in a specific range.
conjunctive_filtered_iris = iris.where(filters=["Class in `Iris-setosa`", "PetalLengthCM >= 1.3 && PetalLengthCM <= 1.6"])
- conjunctive_filtered_iris
Disjunctive filtering (OR)
Disjunctive filtering is used to return a table where one or more of the statements return true. This can be using the or_
function.
In the following example, two filters work disjunctively to return a new table where the petal length is greater than 1.9 cm or less than 1.3 cm.
or_filtered_iris = iris.where_one_of(filters=["PetalLengthCM > 1.9", "PetalWidthCM < 1.3"])
- or_filtered_iris
Filter with head
, tail
, and slice
The following methods all filter rows from a table:
head
returns a number of rows at the top of a table. When applied to a blink table, the new table will not update.tail
returns a number of rows at the bottom of a table. When applied to a blink table, the new table will not update.slice
returns a number of rows based on a start and end row index. This is not currently supported for blink tables.head_pct
returns a percentage of rows at the top of a table. This is not currently supported for blink tables.tail_pct
returns a percentage of rows at the bottom of a table. This is not currently supported for blink tables.slice_pct
returns a number of rows based on the difference between specified start and end row percentages. This is not currently supported for blink tables.
The following example illustrates using head
and tail
to return the first and last 10 rows.
iris_head = iris.head(10)
iris_tail = iris.tail(10)
- iris_head
- iris_tail
The following example uses head_pct
and tail_pct
to return the first and last 10% (15) rows. Note that the values passed into head_pct
or tail_pct
must be floating-point numbers between 0 and 1, representing percentages.
iris_head_pct = iris.head_pct(0.10)
iris_tail_pct = iris.tail_pct(0.10)
- iris_head_pct
- iris_tail_pct
The following example illustrates using slice
to return subsets of rows from a table. If the start index is 0, the result is identical to the one we got from head
. If the end index is 0, the result is identical to to the one we got from tail
.
iris_slice = iris.slice(30,120)
iris_slice_head = iris.slice(0, 10)
iris_slice_tail = iris.slice(-10, 0)
- iris_slice
- iris_slice_head
- iris_slice_tail
Negative start and/or end indices can be used as inputs to slice
. A negative index counts backwards from the end of a table. If the start index is negative, the end index must be greater than or equal to it but cannot be greater than 0.
iris_slice_negative = iris.slice(-60, -30)
- iris_slice_negative
Filter one table based on another
The where_in
and where_not_in
methods enable filtering of one table based on another table. These two methods are evaluated whenever either table passed in as input changes, whereas where
is only evaluated when the filtered table ticks.
In the example below, the where_in
and where_not_in
methods are used to find Iris virginica sepal widths that match and do not match Iris versicolor sepal widths:
virginica = iris.where(filters=["Class in `Iris-virginica`"])
versicolor = iris.where(filters=["Class in `Iris-versicolor`"])
virginica_matching_petal_widths = virginica.where_in(filter_table=versicolor, cols=["PetalWidthCM"])
virginica_non_matching_petal_widths = virginica.where_not_in(filter_table=versicolor, cols=["PetalWidthCM"])
- virginica
- versicolor
- virginica_matching_petal_widths
- virginica_non_matching_petal_widths
where_in
and where_not_in
are inefficient if the filter table updates frequently.
Unlike natural_join
, where_in
can be used when there are more than one matching value in the right table for values in the left table. This is true of join
as well, but where_in
is faster to return matching rows than join
.
where_in
only provides filtering, and does not allow adding columns from the right table. In some cases, it may be desirable to use where_in
to filter and then join
to add columns from the right table. This provides similar performance to natural_join
, while still allowing multiple matches from the right table.