Skip to main content

How to use filters

This guide discusses how to properly use filters to exclude unwanted data from analysis in Deephaven. Topics covered include match and conditional filtering, conjunctive and disjunctive filtering, and filtering with head and tail.

For this how-to guide, we'll use the example data found in Deephaven's example repository. Follow the instructions in the README to download the data to the proper location for use with Deephaven.

note

In this guide, we read data from a mounted directory at the base of the Docker container. See Docker data volumes to learn more about the relation between locations in the container and the local file system.

To illustrate filtering in Deephaven, we'll use the Iris data set from the examples. This data set contains observations about Iris flowers from R. A. Fisher's classic 1936 paper, "The Use of Multiple Measurements in Taxonomic Problems". The paper describes categorizing plant varieties by using observable metrics. The data is often used to demonstrate machine learning categorization algorithms.

from deephaven.TableTools import readCsv
iris = readCsv("/data/examples/Iris/csv/iris.csv")

This produces the iris table, which has five columns and 150 rows. The first four columns contain Iris measurement data, while the fifth column, Class, is the Iris species name. The image below shows the first few entries:

Next, we'll show various ways to filter the data.

Match filters#

Match filters use where to filter out unwanted data. They come in six different flavors:

equals (= and ==)#

This method returns rows that have a matching value in a specified column. In the example below, the new table filteredBySepalWidth contains only the rows from the iris table with a 3.5 cm sepal width.

filteredBySepalWidth = iris.where("SepalWidthCM = 3.5")
note

The single equals (=) and double equals (==) can be used interchangeably in filters.

in#

This method returns rows that contain a match of one or more values in a specified column. In the example below, the new table filteredByClass contains only Iris setosa and virginica flowers.

setosaAndVirginica = iris.where("Class in `Iris-setosa`, `Iris-virginica`")

not in#

This method returns rows that do not contain a match of one or more values in a specified column. In the example below, the new table versicolor contains only Iris versicolor flowers.

notSetosaOrVirginica = iris.where("Class not in `Iris-setosa`, `Iris-virginica`")

icase in#

This method returns rows that contain a match of one or more values in a specified column, regardless of capitalization. In the example below, the new table virginica contains only Iris virginica flowers.

virginica = iris.where("Class icase in `iris-virginica`")

icase not in#

This method returns rows that do not contain a match of one or more values in a specified column, regardless of capitalization. In the example below, the new table notVersicolor contains data for Iris setosa and viriginca flowers.

notVersicolor = iris.where("Class icase not in `iris-versicolor`")

Conditional filters#

Like match filters, conditional filters use where to filter out unwanted data. Conditional filters are used to filter data based on formulas other than those provided by match filters. These can be an arbitrary boolean formula.

Conditional filters frequently use:

  • = and ==: is equal to
  • !=: is not equal to
  • > and <: greater than and less than
  • >= and <=: greater than or equal to and less than or equal to
  • Methods on strings (e.g. startsWith, endsWith, matches, contains)

Equality and inequality filtering#

While filtering for equality is an example of match filtering, it becomes a conditional filter when adding other operations. In the example below, the equality filter becomes conditional when it checks the result of a modulo operation. The filter returns a table containing Iris flower data with petal width that is a multiple of 0.5 cm.

conditionalEqualityFiltered = iris.where("PetalWidthCM % 0.5 == 0")

Range filtering#

It's common to filter for data that falls with a range of values. Using one or more of >, <, >=, <=, and inRange is the best way to achieve this.

In the example below, < is used to filter by sepal width in a range. Then, inRange is used to filter by petal width in a range.

sepalWidthLessThanThreeCM = iris.where("SepalWidthCM < 3.0")petalWidthOneCMorLess = iris.where("inRange(PetalWidthCM, 0, 1)")

String filtering#

Methods on objects can be used to filter. Strings in Deephaven are represented as Java strings. Any methods on java.lang.String can be called from within a query string. Methods such as startsWith, endsWith, contains, and matches can be useful for performing partial string matches.

In the two examples below, each operator is used to filter Iris data based on substring matches. startsWith searches for a prefix, endsWith searches for a suffix, contains searches for a substring, and matches searches for a regular expression match.

newIris = iris.where("Class.startsWith(`Iris`)")setosa = iris.where("Class.endsWith(`setosa`)")
containsVersicolor = iris.where("Class.contains(`versicolor`)")matchesVersicolor = iris.where("Class.matches(`.*versicolor.*`)")

Combine filters#

Multiple match and/or conditional statements can be combined to filter data in a table. These combinations can be either conjunctive or disjunctive.

Conjunctive filtering (AND)#

Conjunctive filtering is used to return a table where all conditional filters in a where clause return true.

In the following example, a conjunctive filter is applied to the iris table to produce a new table of only Iris setosa flowers with a petal length in a specific range.

conjunctiveFilteredIris = iris.where("Class in `Iris-setosa`", "PetalLengthCM >= 1.3 && PetalLengthCM <= 1.6")

Disjunctive filtering (OR)#

Disjunctive filtering is used to return a table where one or more of the statements return true. This can be achieved by either using the logical OR operator (||) or using whereOneOf.

In the following example, two filters work disjunctively to return a new table where the petal length is greater than 1.9 cm or less than 1.3 cm.

orFilteredIris = iris.where("PetalLengthCM > 1.9 || PetalWidthCM < 1.3")whereOneOfFilteredIris = iris.whereOneOf("PetalLengthCM > 1.9", "PetalWidthCM < 1.3")

Filter with head and tail#

​ The filtering methods head, tail, headPct, and tailPct all filter the top or bottom rows of a table:

  • head and tail return a specified number of rows.
  • headPct and tailPct return a percent of the table. ​ The following example illustrates using head and tail to return the first and last 10 rows. ​ ​
iris_head = iris.head(10)iris_tail = iris.tail(10)

​ The following example illustrates using headPct and tailPct to return the first and last 10% (15) rows. ​

iris_head_pct = iris.headPct(0.10)iris_tail_pct = iris.tailPct(0.10)

Filter one table based on another#

​ The whereIn and whereNotIn methods enable filtering of one table based on another table. These two methods are evaluated whenever either table passed in as input changes, whereas where is only evaluated when the filtered table ticks. ​ In the example below, the whereIn and whereNotIn methods are used to find Iris virginica sepal widths that match and do not match Iris versicolor sepal widths:


virginica = iris.where("Class in `Iris-virginica`")versicolor = iris.where("Class in `Iris-versicolor`")virginica_matching_petal_widths = virginica.whereIn(versicolor, "PetalWidthCM")virginica_non_matching_petal_widths = virginica.whereNotIn(versicolor, "PetalWidthCM")

caution

whereIn and whereNotIn are inefficient if the filter table updates frequently. ​

tip

Unlike naturalJoin, whereIn can be used when there are more than one matching value in the right table for values in the left table. This is true of join as well, but whereIn is faster to return matching rows than join.

whereIn only provides filtering, and does not allow adding columns from the right table. In some cases, it may be desirable to use whereIn to filter and then join to add columns from the right table. This provides similar performance to naturalJoin, while still allowing multiple matches from the right table.

Related documentation#