Data Indexes

This guide covers what data indexes are, how to create and use them, and how they can improve query performance. A data index maps key values in one or more key columns to the row sets containing the relevant data. Data indexes are used to improve the speed of data access operations in tables. Many table operations will use a data index if it is present. Every data index is backed by a table, with the index stored in a column called dh_row_set.

Data indexes can improve the speed of filtering operations. Additionally, data indexes can be useful if multiple query operations need to compute the same data index on a table. This is common when a table is used in multiple joins or aggregations. If the table does not have a data index, each operation will internally create the same index. If the table does have a data index, the individual operations will not need to create their own indexes and can execute faster and use less RAM.

Create a data index

A data index can be created from a source table and one or more key columns using getOrCreateDataIndex:

import static io.deephaven.engine.table.impl.indexer.DataIndexer.getOrCreateDataIndex

source = emptyTable(10).update("X = i")
sourceIndex = getOrCreateDataIndex(source, "X")

Important

When a new data index is created, it is not immediately computed. The index is computed when a table operation first uses it or the table method is called on the index. This is an important detail when trying to understand performance data.

Every data index is backed by a table. The backing table can be retrieved with the table method:

result = sourceIndex.table()

The source table in the above example doesn't create a very interesting data index. Each key value in X only appears once. The following example better illustrates a data index:

import static io.deephaven.engine.table.impl.indexer.DataIndexer.getOrCreateDataIndex

source = emptyTable(10).update("X = 100 + i % 5")
sourceIndex = getOrCreateDataIndex(source, "X").table()
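
Given the formula 100 + i % 5, each of the five key values appears in exactly two of the ten rows, so the backing table should look roughly like this (the exact rendering of dh_row_set may differ):

X   | dh_row_set
100 | 0, 5
101 | 1, 6
102 | 2, 7
103 | 3, 8
104 | 4, 9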

When looking at source, we can see that the value 100 in column X appears in rows 0 and 5. The value 101 appears in rows 1 and 6, and so on. Each row in dh_row_set holds the set of rows in which a unique key value appears. Let's look at an example with more than one key column.

import static io.deephaven.engine.table.impl.indexer.DataIndexer.getOrCreateDataIndex

source = emptyTable(20).update("X = i % 4", "Y = i % 2")
sourceIndex = getOrCreateDataIndex(source, "X", "Y").table()

There are only four unique keys in X and Y together. The resultant column dh_row_set shows the row sets in which each unique key appears.

All of the previous examples create tables with flat data indexes. A flat data index is one in which the row sets are sequential. However, it is not safe to assume that a data index is flat; in practice, most are not. Consider the following example, where the data index is not flat:

import static io.deephaven.engine.table.impl.indexer.DataIndexer.*

// Random keys scatter each key's rows throughout the table
source = emptyTable(25).update("Key = 100 + randomInt(0, 4)", "Value = randomDouble(0, 100)")

sourceIndex = getOrCreateDataIndex(source, "Key").table()

// Filter the index table itself to inspect the row sets for two of the keys
result = sourceIndex.where("Key in 100, 102")

Retrieve and verify a data index

You can verify whether a data index exists for a particular table using one of two methods:

  • hasDataIndex: Returns a boolean indicating whether a data index exists for the table and the given key column(s).
  • getDataIndex: Returns the data index if it exists. Returns null if it does not exist, if the cached index is invalid, or if the refreshing cached index is no longer live.

The following example creates a data index for a table and checks if data indexes exist for different key columns:

import static io.deephaven.engine.table.impl.indexer.DataIndexer.*

source = emptyTable(100).update(
    "Timestamp = '2024-04-25T12:00:00Z' + (ii * 1_000_000)",
    "Key1 = ii % 8",
    "Key2 = ii % 11",
    "Key3 = ii % 19",
    "value = -ii"
)

index13 = getOrCreateDataIndex(source, "Key1", "Key3")

hasIndex1 = hasDataIndex(source, "Key1")
hasIndex23 = hasDataIndex(source, "Key2", "Key3")
hasIndex13 = hasDataIndex(source, "Key1", "Key3")

println "There is a data index for source on Key 1: ${hasIndex1}"
println "There is a data index for source on Keys 2 and 3: ${hasIndex23}"
println "There is a data index for source on Keys 1 and 3: ${hasIndex13}"


// getDataIndex does not create an index - it returns null because no index exists for Key2
def index2 = getDataIndex(source, "Key2")
def hasIndex2AfterGet = hasDataIndex(source, "Key2")

println "There is a data index for source on Key 2: ${hasIndex2AfterGet}"

Usage in queries

Warning

Data indexes are weakly reachable. Assign the index to a variable to ensure it does not get garbage collected.
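
For example (a minimal sketch using a throwaway table; the names are illustrative):

import static io.deephaven.engine.table.impl.indexer.DataIndexer.getOrCreateDataIndex

t = emptyTable(10).update("K = i % 2")

// Assigned to a variable, so the index remains strongly reachable
kIndex = getOrCreateDataIndex(t, "K")

// Not assigned to anything - the index may be garbage collected before use:
// getOrCreateDataIndex(t, "K")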

If a table has a data index and an engine operation can leverage it, the index is used automatically. The following table operations can use data indexes: filters, joins, sorts, and aggregations.

The following subsections explore each of these operations. The code blocks below use the following tables:

import static io.deephaven.engine.table.impl.indexer.DataIndexer.*

source = emptyTable(100_000).update(
    "Timestamp = '2024-04-25T12:00:00Z'",
    "Key1 = (int)i % 11",
    "Key2 = (int)i % 47",
    "Key3 = (int)i % 499",
    "Value = (int)i"
)

sourceRight = emptyTable(400).update(
    "Timestamp = '2024-04-25T12:00:00Z'",
    "Key1 = (int)i % 11",
    "Key2 = (int)i % 43",
    "Value = (double)(int)i / 2.0"
)

sourceRightDistinct = sourceRight.selectDistinct("Key1", "Key2")

Filter

The engine will use single and multi-column indexes to accelerate exact match filtering. Range filtering does not benefit from an index.

Note

The Deephaven engine only uses a DataIndex when the keys exactly match what is needed for an operation. For example, if a data index is present for the columns X and Y, it will not be used if the engine only needs an index for column X.

The following filters can use the index created at the top of the code block:

sourceK1Index = getOrCreateDataIndex(source, "Key1")

resultK1 = source.where("Key1 = 7")
resultK1In = source.where("Key1 in 2,4,6,8")
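
By contrast, a range filter on the same column does not benefit from the index (a brief sketch; the variable name is illustrative):

resultK1Range = source.where("Key1 > 5")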

The following example can only use the index for the first filter:

sourceK1Index = getOrCreateDataIndex(source, "Key1")

resultK1K2 = source.where("Key1 = 7", "Key2 = 11")

This is true when filtering one table based on another as well. The whereIn filter in the following example can use sourceK1K2Index, whereas the whereNotIn filter cannot use sourceRightK1Index because the index does not match the filter:

sourceK1K2Index = getOrCreateDataIndex(source, "Key1", "Key2")
sourceRightK1Index = getOrCreateDataIndex(sourceRight, "Key1")

resultWhereIn = source.whereIn(sourceRightDistinct, "Key1", "Key2")
resultWhereNotIn = source.whereNotIn(sourceRightDistinct, "Key1")

Join

Different joins will use indexes differently:

Method         Can use indexes?   Can use left table index?   Can use right table index?
exactJoin      ✅                 ✅                          🚫
naturalJoin    ✅                 ✅                          🚫
aj             ✅                 ✅                          ✅
raj            ✅                 ✅                          ✅
join           🚫                 🚫                          🚫
leftOuterJoin  🚫                 🚫                          🚫
fullOuterJoin  🚫                 🚫                          🚫
multiJoin      🚫                 🚫                          🚫
rangeJoin      🚫                 🚫                          🚫

Natural join

When performing a naturalJoin without data indexes, the engine performs a linear scan of the source (left) table's rows, hashing each key to build a mapping into the joining (right) table. If a data index exists for the source (left) table, the engine skips the scan and uses the index directly to create the mapping.

Consider a naturalJoin on source and sourceRight. The following example can use sourceK1K2Index to accelerate the join, but not sourceRightK1K2Index because the operation cannot use the right table index:

sourceK1K2Index = getOrCreateDataIndex(source, "Key1", "Key2")
sourceRightK1K2Index = getOrCreateDataIndex(sourceRight, "Key1", "Key2")

resultNaturalJoined = source.naturalJoin(
    sourceRight,
    "Key1, Key2",
    "RhsValue=Value"
)

As-of join

The exact match in an aj or an raj can be accelerated using data indexes, eliminating the task of grouping the source table rows by key. The following example can use the indexes created because they match the exact match column used in the join:

sourceK1Index = getOrCreateDataIndex(source, "Key1")
sourceRightK1Index = getOrCreateDataIndex(sourceRight, "Key1")

resultAsOfJoined = source.aj(
    sourceRight,
    "Key1, Timestamp >= Timestamp",
    "RhsValue=Value, RhsTimestamp=Timestamp"
)

Sort

Sorting operations can be accelerated using both full and partial indexes.

  • A full index matches all sorting columns.
  • A partial index matches a subset of sorting columns.

When a full index is present, the engine only needs to sort the index rows.

A sort operation can only use a partial index if the partial index matches the first sorting column. When a sort operation uses a partial index, the engine sorts the first column using the index and then sorts any remaining columns naturally.

The following sort operation uses the index because it matches the sort columns (ordering does not matter):

sourceK1K2Index = getOrCreateDataIndex(source, "Key1", "Key2")

resultSortedK2K1 = source.sortDescending("Key2", "Key1")

The following sort operation uses the index when sorting the first key column only:

sourceK1Index = getOrCreateDataIndex(source, "Key1")

resultSortedK1K3 = source.sort("Key1", "Key3")

The following sort operation does not use any indexes, as the only partial index (on Key1) does not match the first sort column:

sourceK1Index = getOrCreateDataIndex(source, "Key1")

resultSortedK3K1 = source.sort("Key3", "Key1")

Aggregate

Aggregations without data indexes require the engine to group all similar keys and perform the calculation on subsets of rows. The index allows the engine to skip the grouping step and immediately begin calculating the aggregation. The following aggregation uses the index because it matches the aggregation key columns:

import static io.deephaven.api.agg.Aggregation.*

sourceK1K2Index = getOrCreateDataIndex(source, "Key1", "Key2")

aggregations = [
    AggSum("Sum=Value"),
    AggAvg("Avg=Value"),
    AggStd("Std=Value")
]

resultAgged = source.aggBy(aggregations, "Key1", "Key2")

Persist data indexes with Parquet

Deephaven can save data indexes along with your data when you write a table to Parquet files. By default, it saves all available indexes. You can also choose to save only some indexes. If you try to save an index that doesn't exist yet, Deephaven will create a temporary index just for the saving process. When you load data from a Parquet file, it also loads any saved indexes.

The following example writes a table plus indexes to Parquet and then reads a table plus indexes from the Parquet files.

import static io.deephaven.engine.table.impl.indexer.DataIndexer.*
import io.deephaven.engine.liveness.*
import io.deephaven.parquet.table.ParquetTools
import io.deephaven.parquet.table.ParquetInstructions
import io.deephaven.util.SafeCloseable

source = emptyTable(100_000).update(
    "Timestamp = '2024-04-25T12:00:00Z' + (ii * 1_000_000)",
    "Key1 = ii % 11",
    "Key2 = ii % 47",
    "Key3 = ii % 499",
    "value = ii"
)


// Write to disk and specify two sets of index key columns
writeInstructions = ParquetInstructions.builder()
    .addIndexColumns("Key1")
    .addIndexColumns("Key1", "Key2")
    .build()
ParquetTools.writeTable(source, "/data/indexed.parquet", writeInstructions)

// Load the table and the indexes from disk
diskTable = ParquetTools.readTable("/data/indexed.parquet")

// Show that the table loaded from disk has indexes
println hasDataIndex(diskTable, "Key1")
println hasDataIndex(diskTable, "Key1", "Key2")

// Open a new liveness scope so the indexes can be released after writing
scope = new LivenessScope()

try (SafeCloseable ignored = LivenessScopeStack.open(scope, false)) {
    indexKey1 = getOrCreateDataIndex(source, "Key1")
    indexKey1Key2 = getOrCreateDataIndex(source, "Key1", "Key2")

    // Write to disk - indexes are automatically persisted
    ParquetTools.writeTable(source, "/data/indexed.parquet")
}

// Release the scope so the indexes it manages can be garbage collected
scope.release()

// Load the table and the indexes from disk
diskTableNew = ParquetTools.readTable("/data/indexed.parquet")

// Show that the table loaded from disk has indexes
println hasDataIndex(diskTableNew, "Key1")
println hasDataIndex(diskTableNew, "Key1", "Key2")

Performance

Data indexes improve performance when used in the right context. If used in the wrong context, data indexes can make performance worse. The following list provides guidelines to consider when determining if data indexes will improve performance:

  • A data index is more likely to help performance if the data within an index group is located near (in memory) the other data in the group. This avoids index fragmentation and poor data access patterns.
  • Reusing a data index multiple times is necessary to overcome the cost of initially computing the index.
  • Generally speaking, the higher the key cardinality, the better the performance improvement.
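
To evaluate these guidelines on your own data, you can time an operation before and after an index exists. The following is a minimal, illustrative sketch (the table size, key formula, and sumBy aggregation are arbitrary choices; results depend on your data and hardware). The index is computed eagerly via the table method so the second timing does not include index construction:

import static io.deephaven.engine.table.impl.indexer.DataIndexer.getOrCreateDataIndex

t = emptyTable(10_000_000).update("Key = i % 100", "Value = i")

// Time the aggregation without an index
start = System.nanoTime()
noIndexResult = t.sumBy("Key")
println "Without index: ${(System.nanoTime() - start) / 1e6} ms"

// Create the index and force it to be computed up front
tKeyIndex = getOrCreateDataIndex(t, "Key")
tKeyIndex.table()

// Time the same aggregation with the index available
start = System.nanoTime()
indexResult = t.sumBy("Key")
println "With index: ${(System.nanoTime() - start) / 1e6} ms"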