Data Indexes
Important
This feature is currently experimental. The API and performance characteristics are subject to change.
This guide covers what data indexes are, how to create and use them, and how they can improve query performance. A data index maps key values in one or more key columns to the row sets containing the relevant data. Data indexes are used to improve the speed of data access operations in tables. Many table operations will use a data index if it is present. Every data index is backed by a table, with the index stored in a column called dh_row_set.
Data indexes can improve the speed of filtering operations. Additionally, data indexes can be useful if multiple query operations need to compute the same data index on a table. This is common when a table is used in multiple joins or aggregations. If the table does not have a data index, each operation will internally create the same index. If the table does have a data index, the individual operations will not need to create their own indexes and can execute faster and use less RAM.
Create a data index
A data index can be created from a source table and one or more key columns using data_index:
Important
When a new data index is created, it is not immediately computed. The index is computed when a table operation first uses it or the table attribute is called on the index. This is an important detail when trying to understand performance data.
Every data index is backed by a table. The backing table can be retrieved with the table attribute:
The source table in the above example doesn't create a very interesting data index. Each key value in X only appears once. The following example better illustrates a data index:
When looking at source, we can see that the value 100 in X appears in rows 0 and 5. The value 101 appears in rows 1, 6, and so on. Each row in dh_row_set contains the rows where each unique key value appears. Let's look at an example for more than one key column.
There are only four unique keys in X and Y together. The resultant column dh_row_set shows the row sets in which each unique key appears.
All of the previous examples create tables with flat data indexes. A flat data index is one in which the row sets are sequential. However, it is not safe to assume that all data indexes are flat, as this is uncommon in practice. Consider the following example where the data index is not flat:
Retrieve and verify a data index
You can verify whether a data index exists for a particular table using one of two methods:
has_data_index: Returns a boolean indicating whether a data index exists for the table and the given key column(s).data_index: Ifcreate_if_absentisFalse, it returns the existing data index if there is one orNoneif there is no index.
The following example creates data indexes for a table and checks if they exist based on different key columns. has_data_index is used to check if the index exists.
Usage in queries
Warning
Data indexes are weakly reachable. Assign the index to a variable to ensure it does not get garbage collected.
If a table has a data index and an engine operation can leverage it, it will be automatically used. The following table operations can use data indexes:
- Join:
- Group and aggregate:
- Filter:
- Select:
- Sort:
The following subsections explore different table operations. The code blocks below use the following tables:
Filter
The engine will use single and multi-column indexes to accelerate exact match filtering. Range filtering does not benefit from an index.
Note
The Deephaven engine only uses a DataIndex when the keys exactly match what is needed for an operation. For example, if a data index is present for the columns X and Y, it will not be used if the engine only needs an index for column X.
The following filters can use the index created atop the code block:
The following example can only use the index for the first filter:
This is true for filtering one table based on another as well. The first filter in the following example can use source_k1_k2_index, whereas the second example cannot use source_right_k1_index because the index does not match the filter:
Join
Different joins will use indexes differently:
| Method | Can use indexes? | Can use left table index? | Can use right table index? |
|---|---|---|---|
exact_join | ✅ | ✅ | 🚫 |
natural_join | ✅ | ✅ | 🚫 |
aj | ✅ | ✅ | ✅ |
raj | ✅ | ✅ | ✅ |
join | 🚫 | 🚫 | 🚫 |
left_outer_join | 🚫 | 🚫 | 🚫 |
full_outer_join | 🚫 | 🚫 | 🚫 |
multi_join | 🚫 | 🚫 | 🚫 |
range_join | 🚫 | 🚫 | 🚫 |
Natural join
When performing a natural_join without data indexes, the engine performs a linear scan of the source (left) table's rows. Each key is hashed and a map is created to the specified joining (right) table. If a data index exists for the source (left) table, the engine skips the scan and uses the index directly to create the mapping.
Consider a natural_join on source and source_right. The following example can use source_k1_k2_index to accelerate the join, but not source_right_k1_k2_index because the operation cannot use the right table index:
As-of join
The exact match in an aj or an raj can be accelerated using data indexes, eliminating the task of grouping the source table rows by key. The following example can use the indexes created because they match the exact match column used in the join:
Sort
Sorting operations can be accelerated using both full and partial indexes.
- A full index matches all sorting columns.
- A partial index matches a subset of sorting columns.
When a full index is present, the engine only needs to sort the index rows.
A sort operation can only use a partial index if the partial index matches the first sorting column. When a sort operation uses a partial index, the engine sorts the first column using the index and then sorts any remaining columns naturally.
The following sort operation uses the index because it matches the sort columns (ordering does not matter):
The following sort operation uses the index when sorting the first key column only:
The following sort operation does not use any indexes, as the only partial index (on Key1) does not match the first sort column:
Aggregate
Aggregations without data indexes require the engine to group all similar keys and perform the calculation on subsets of rows. The index allows the engine to skip the grouping step and immediately begin calculating the aggregation. The following aggregation uses the index because it matches the aggregation key columns:
Persist data indexes with Parquet
Deephaven can save data indexes along with your data when you write a table to Parquet files. By default, it saves all available indexes. You can also choose to save only some indexes. If you try to save an index that doesn't exist yet, Deephaven will create a temporary index just for the saving process. When you load data from a Parquet file, it also loads any saved indexes.
The following example writes a table plus indexes to Parquet and then reads a table plus indexes from the Parquet files.
Performance
Data indexes improve performance when used in the right context. If used in the wrong context, data indexes can make performance worse. The following list provides guidelines to consider when determining if data indexes will improve performance:
- A data index is more likely to help performance if the data within an index group is located near (in memory) other data in the group. This avoids index fragmentation and poor data access patterns.
- Reusing a data index multiple times is necessary to overcome the cost of initially computing the index.
- Generally speaking, the higher the key cardinality, the better the performance improvement.
Benchmark scripts
If you're interested in how data indexes impact query performance, the scripts below collect performance metrics for each operation. You will notice that performance improves in some instances and is worse in others.
This first script sets up the tests:
Here's a follow-up script that runs tests for both low- and high-cardinality cases, both with and without data indexes:
It can also be helpful to plot the results: