Skip to main content
Version: Python

When should I use `where_in` or `natural_join`?

When should I use where_in versus natural_join in my queries?

Generally, it's best to use where_in when your list of values (your right table) stays relatively stable. This is because the where_in call maintains a minimal amount of state compared to natural_join. With where_in, the set of unique values in the right table is maintained in memory. Only the Index of rows that currently match the where_in filter is maintained for the left table. When the valid set changes, the columns that you are using to filter the left table must be scanned. When the right table changes infrequently, this is a useful trade-off. The left table is seldom scanned, and less memory and computation are used on each update.

natural_join maintains a state entry for all values in both the left and right tables. This includes all keys that currently pass the filter, but also all keys that do not pass the filter. For each key, an Index data structure is maintained for the left and right side. When a value enters the right table, all of the matching left-hand side rows are available and can be augmented with the corresponding right-hand side row. When a value leaves the right table, all of the matching left-hand side rows are available, and the augmentation can be removed. This means that processing each individual left-hand side update is more expensive, but the left-hand table doesn't have to be scanned when the right-hand side table changes.

Traders or firms often have a fairly stable set of symbols in which they have a position. Using a where_in clause to filter other tables to those positions is generally efficient. The query doesn't have to maintain excess state for the rare cases where valid symbols change. However, full table scans are required when the positions do change.

On the other hand, the set of open order IDs changes frequently. In this case, filtering a table by open order ID using where_in would require frequent scans. A natural_join would require more in-memory state, but would be able to compute the results more incrementally.

note

These FAQ pages contain answers to questions about Deephaven Community Core that our users have asked in our Community Slack. If you have a question that is not in our documentation, join our Community and we'll be happy to help!