How to Determine Symbol Table Efficiency

Deephaven uses symbol tables to increase the performance of queries that involve string columns. A symbol table maps each unique value in a string column in a Deephaven table to an ID (a 4-byte signed integer). By using a fixed-width ID for each unique "symbol" in the column, Deephaven can:

  • Save space on disk
  • More efficiently access data

Symbol tables are backed by three files. A *.dat file records which unique ID occurs at each row position in the string column. Deephaven uses that ID to look up an entry in an offset file, which gives the byte address of the actual value in the symbol file. This indirection lets Deephaven look up the string value for a given ID quickly.
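The lookup chain described above (row position to ID, ID to byte offset, offset to string) can be sketched in Python. The byte layouts and values below are illustrative assumptions for demonstration only, not Deephaven's actual on-disk format:

```python
import struct

# In-memory stand-ins for the three files described above.
# .dat file: one 4-byte signed ID per row position.
dat = struct.pack("<3i", 0, 1, 0)  # rows 0..2 hold IDs 0, 1, 0

# Symbol file: the actual string values, null-terminated here for simplicity.
symbols = b"AAPL\x00GOOG\x00"

# Offset file: one 8-byte byte address per ID into the symbol file.
offsets = struct.pack("<2q", 0, 5)  # ID 0 starts at byte 0, ID 1 at byte 5


def lookup(row: int) -> str:
    (sym_id,) = struct.unpack_from("<i", dat, row * 4)       # row -> ID
    (addr,) = struct.unpack_from("<q", offsets, sym_id * 8)  # ID -> offset
    end = symbols.index(b"\x00", addr)                       # offset -> string
    return symbols[addr:end].decode()


print(lookup(1))  # GOOG
```

Because every ID and offset is fixed-width, each step of the lookup is a direct indexed read rather than a scan, and repeated values share a single stored string.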

The Data Import Server (DIS) configuration includes string caching hints that bound the size of the in-memory symbol tables that each partition may use. This protects the DIS heap from managing an unbounded number of strings, which may cause undesirable garbage collection behavior.

Note

For more detail, see Symbol Caching for Importers.

Symbol table efficiency largely depends on the ratio of the symbol table's cardinality to the size of the Deephaven table it is associated with. For instance, a symbol table with 100 unique IDs for a Deephaven table with 100 million rows will be more performant than a symbol table with 100 thousand unique IDs for a Deephaven table with 10 million rows.
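As a rough illustration of that ratio (the numbers below are the hypothetical scenarios from the paragraph above, not measurements):

```python
def symbol_table_ratio(unique_ids: int, table_rows: int) -> float:
    """Ratio of symbol-table cardinality to table size; lower is better."""
    return unique_ids / table_rows


# 100 unique symbols spread over 100 million rows.
low_cardinality = symbol_table_ratio(100, 100_000_000)  # 1e-06

# 100 thousand unique symbols over only 10 million rows.
high_cardinality = symbol_table_ratio(100_000, 10_000_000)  # 0.01

# The first table amortizes each stored string over far more rows,
# so the fixed-width IDs save more space and lookups stay cheaper.
assert low_cardinality < high_cardinality
```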

Symbol tables can negatively affect performance if used in the wrong contexts. For instance, consider a symbol table whose cardinality is greater than the size of the Deephaven table it's associated with. In such a case, the DIS has created multiple IDs for the same symbol. This results in higher disk usage, which can degrade the performance of queries that must read the additional data.

The following Groovy code block audits the symbol table efficiency of a table:

// Audit symbol table efficiency for each String column of a table
symTabEfficiency = { tab ->
    names = []
    symNames = []
    symSizes = []
    values = []

    // Collect the names of all String columns from the table's metadata
    metaData = tab.meta().where("DataType=`java.lang.String`")
    rows = metaData.getRowSet()
    nameColSource = metaData.getColumnSource("Name")
    iter = rows.iterator()
    while (iter.hasNext()) {
        long key = iter.next()
        names.add(nameColSource.get(key))
    }

    // For each String column backed by a symbol table, record the symbol
    // table's size and the number of distinct values in the column
    for (String name : names) {
        def col = tab.getColumnSource(name)
        if (col instanceof io.deephaven.engine.table.impl.sources.regioned.SymbolTableSource) {
            symtab = col.getStaticSymbolTable(tab.getRowSet(), false)
            if (symtab.size() > 0) {
                println "Processing " + name
                symNames.add(name)
                symSizes.add(symtab.size())
                values.add(tab.selectDistinct(name).size())
            }
        }
    }

    // Efficiency is the ratio of distinct values to symbol table size
    return newTable(col("Name", symNames.toArray()), col("Size", symSizes.toArray()), col("Values", values.toArray()))
            .update("Values=(long)Values", "Size=(long)Size", "Efficiency=Values/Size")
}
lastBusinessDate = io.deephaven.time.calendar.Calendars.calendar().pastBusinessDate(1)
x = symTabEfficiency.call(db.liveTable("Namespace", "Table").where("Date=`${lastBusinessDate}`"))

The equivalent query in Python:

from deephaven.calendar import calendar
from deephaven import new_table
from deephaven.column import string_col, long_col
import jpy

SymbolTableSource = jpy.get_type(
    "io.deephaven.engine.table.impl.sources.regioned.SymbolTableSource"
)


def symTabEfficiency(tab):
    names = []
    symNames = []
    symSizes = []
    values = []
    # Collect the names of all String columns from the table's metadata
    metaData = tab.meta_table.where("DataType=`java.lang.String`")
    rows = metaData.j_table.getRowSet()
    nameColSource = metaData.j_table.getColumnSource("Name")
    it = rows.iterator()
    while it.hasNext():
        key = it.next()
        names.append(nameColSource.get(key))

    # For each String column backed by a symbol table, record the symbol
    # table's size and the number of distinct values in the column
    for name in names:
        col = tab.j_table.getColumnSource(name)
        if SymbolTableSource.jclass.isInstance(col):
            col = jpy.cast(col, SymbolTableSource)
            symtab = col.getStaticSymbolTable(tab.j_table.getRowSet(), False)
            if symtab.size() > 0:
                print("Processing %s" % name)
                symNames.append(name)
                symSizes.append(symtab.size())
                values.append(tab.select_distinct(name).size)

    return new_table(
        [
            string_col("Name", symNames),
            long_col("Size", symSizes),
            long_col("Values", values),
        ]
    ).update(["Values=(long)Values", "Size=(long)Size", "Efficiency=Values/Size"])


lastBusinessDate = calendar().pastBusinessDate(1)
x = symTabEfficiency(
    db.live_table("Namespace", "Table").where("Date=`%s`" % lastBusinessDate)
)

The query produces a table with one row per symbol table column. The table has the following columns:

  • Name: the name of the symbol table column.
  • Values: the number of distinct values that exist in that column.
  • Size: the size of the on-disk symbol table.
  • Efficiency: the ratio of Values to Size, which is ideally 1.00 (or 100%) for a single partition. When a table spans multiple partitions, each partition has its own independent symbol table, so the same value may be assigned an ID in several partitions and the achievable efficiency is correspondingly lower.

If the number of Values is close to the size of the input table, a symbol table column may not be appropriate for that data (as is often the case for order IDs).

Consider the screenshot below. The symbol, wiresymbol, and u_symbol columns show an efficiency of 0.03%, which is very low. Adjusting the DIS cache hints to allow sufficient headroom for the roughly 30,000 unique values is likely to improve query performance. Additionally, the ordid column is a symbol table column, but it has over 88 million unique values. Queries and ingestion are likely to be more efficient by changing the schema to include symbolTable="none".

[Screenshot: symbol table efficiency results, one row per symbol table column]

Caution

You cannot change symbolTable to "none" without making historical data unreadable.
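In a table schema, this setting appears as an attribute on the column definition. The fragment below is a hypothetical sketch following Deephaven Enterprise schema conventions; the column name is illustrative, and you should consult your schema's actual column syntax. Given the caution above, this is a choice to make when the schema is first defined, before historical data is written:

```xml
<!-- Hypothetical column definition: disabling the symbol table for a
     high-cardinality String column such as an order ID. -->
<Column name="ordid" dataType="String" symbolTable="none" />
```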