How efficient are my table selection operations?

How much memory do my select and update operations waste? Are they efficient?

select and update create in-memory columns of data, whereas view, updateView, and lazyUpdate do not.

Deephaven recommends using select and update when calculations are expensive or when the results are needed for multiple downstream operations. These methods can cause problems as data grows, because the materialized columns can consume a lot of memory.
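As a quick illustration of the trade-off, consider the following sketch (the formula and variable names here are arbitrary):

source = timeTable("PT1s")

// update evaluates X once per row and stores the result in memory:
// downstream reads are cheap, but every row costs storage.
stored = source.update("X = sqrt(ii)")

// updateView stores only the formula: no column memory is used,
// but X is recomputed on every read, so repeated reads repeat the work.
computed = source.updateView("X = sqrt(ii)")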

How efficient are your select and update operations? This page gives you a query to run in the Deephaven console to find out.

To measure how efficiently your tables use memory, execute the following Groovy code:

import gnu.trove.TCollections
import gnu.trove.set.hash.TLongHashSet
import gnu.trove.map.hash.TObjectIntHashMap
import gnu.trove.impl.Constants
import io.deephaven.engine.table.impl.sources.SparseArrayColumnSource
import io.deephaven.util.type.TypeUtils

// Per-row sizes, in bytes, for the primitive column types (0 means unknown type).
simpleSizeLookupBuilder = new TObjectIntHashMap<>(Constants.DEFAULT_CAPACITY, Constants.DEFAULT_LOAD_FACTOR, 0)
simpleSizeLookupBuilder.put(Byte.class, 1)
simpleSizeLookupBuilder.put(Boolean.class, 1)
simpleSizeLookupBuilder.put(Character.class, 2)
simpleSizeLookupBuilder.put(Short.class, 2)
simpleSizeLookupBuilder.put(Integer.class, 4)
simpleSizeLookupBuilder.put(Float.class, 4)
simpleSizeLookupBuilder.put(Long.class, 8)
simpleSizeLookupBuilder.put(Double.class, 8)
simpleSizeLookup = TCollections.unmodifiableMap(simpleSizeLookupBuilder)

// Sparse column sources allocate storage in blocks of 1,024 rows, so two row
// keys land in the same block when they share the same value of rowKey >> 10.
accumulateBlocks = { long rowKey, TLongHashSet longSet -> longSet.add(rowKey >> 10) }
getSimpleDataRecordSizeInBytes = { type -> simpleSizeLookup.get(TypeUtils.getBoxedType(type)) }

checkSparsity = { bindingVarEntry, list ->
    def varName = bindingVarEntry.getKey()
    def varValue = bindingVarEntry.getValue()
    if (varValue instanceof Table) {
        // Sum the per-row byte sizes of all sparse-array-backed columns; columns
        // of unknown type fall back to 8 bytes per row (the reference size).
        def s = varValue.getColumnSources().stream()
        s = s.filter({cs -> cs instanceof SparseArrayColumnSource})
        s = s.map({cs -> cs.getType()})
        s = s.mapToInt({cst -> int result = getSimpleDataRecordSizeInBytes(cst); return result == 0 ? 8 : result; })
        colByteSizeSum = s.sum()
        if (colByteSizeSum > 0) {
            // Count the 1024-row blocks actually touched by the table's row keys
            // and compare against the minimum number of blocks that could hold them.
            def longSet = new TLongHashSet()
            long idealBlocks = ((varValue.getRowSet().size() + 1023) / 1024)
            varValue.getRowSet().forAllRowKeys({rowKey -> accumulateBlocks(rowKey, longSet)})
            long actualBlocks = longSet.size()
            double efficiency = (idealBlocks / actualBlocks) * 100
            println varName + ": ideal blocks=" + idealBlocks + ", actual blocks=" + actualBlocks + ", efficiency=" + efficiency + "%"
            list.add([varName, efficiency])
        }
    }
}

// Run the sparsity check against every table bound in the console session,
// then print the results sorted from most to least efficient.
bindingVars = new HashMap(getBinding().getVariables())
resultList = []
bindingVars.entrySet().forEach({e -> checkSparsity(e, resultList)})
resultList.sort({e1, e2 -> -1 * Double.compare(e1.get(1), e2.get(1))})
println resultList

An efficiency of 100% means the selection operations waste no memory: the table's rows occupy the minimum possible number of storage blocks. This is the ideal case.
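To see how the numbers combine, suppose (hypothetically) a table holds 2,048 rows whose row keys are scattered across 16 blocks. The ideal block count is 2048 / 1024 = 2, so the efficiency is (2 / 16) * 100 = 12.5%, meaning most of the allocated block space holds no data.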

Tip: Perfect efficiency is not always possible, especially with complex or refreshing tables. Aim for higher efficiency, but don't worry if you can't reach 100% in every case.

An important detail: this code only checks memory efficiency for columns backed by a SparseArrayColumnSource, so not every column is examined. The simplest way to produce such columns is to create a non-flat refreshing table with in-memory columns. You can force all columns into memory with select. For example:

t1 = timeTable("PT1s").update("X = randomDouble(-1, 1)").where("X > -0.75")

Running the check above shortly after creating t1 prints output like:

t1: ideal blocks=1, actual blocks=1, efficiency=100.0%
[[t1, 100.0]]
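Conversely, an aggressive filter leaves the surviving rows scattered thinly across the parent's blocks. In this hypothetical variant (exact numbers depend on how long the table has been ticking):

t2 = timeTable("PT0.1s").update("X = randomDouble(-1, 1)").where("X > 0.9")

Here t2 keeps roughly 5% of its parent's row keys, but once enough rows accumulate it still touches nearly every 1024-row block of the shared sparse column, so the reported efficiency falls toward 5%.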

Note

A "flat" table has a flat row set. That means the row set increases monotonically from 0 to the number of rows minus one with no gaps.