How to Determine Symbol Table Efficiency
Deephaven uses symbol tables to increase the performance of queries that involve string columns. A symbol table maps each unique value in a string column in a Deephaven table to an ID (a 4-byte signed integer). By using a fixed-width ID for each unique "symbol" in the column, Deephaven can:
- Save space on disk
- More efficiently access data
Symbol tables work by storing a `*.dat` file that records which ID (the unique integer for a string value) is at a particular location. Deephaven then uses that ID to look up the symbol offset in an offset file. The offset file gives the byte address of the actual value in the symbol file, so Deephaven can quickly look up the string value for a given ID.
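As a rough sketch of this two-level lookup, the example below resolves an ID to an offset, and the offset to a string. This is *not* Deephaven's actual on-disk format; the length-prefixed layout and file names are assumptions made purely for illustration.

```python
import struct

# Hypothetical illustration of the ID -> offset -> string lookup described
# above. NOT Deephaven's real file format.
# symbol file: length-prefixed UTF-8 strings; offset file: byte address per ID.
symbols = ["AAPL", "GOOG", "MSFT"]

symbol_file = bytearray()
offset_file = []
for s in symbols:
    offset_file.append(len(symbol_file))  # byte address where this symbol starts
    encoded = s.encode("utf-8")
    symbol_file += struct.pack("<i", len(encoded)) + encoded

def lookup(symbol_id: int) -> str:
    """Resolve a 4-byte ID: offset file gives the address, symbol file the bytes."""
    addr = offset_file[symbol_id]
    (length,) = struct.unpack_from("<i", symbol_file, addr)
    return symbol_file[addr + 4 : addr + 4 + length].decode("utf-8")

print(lookup(1))  # GOOG
```

Because every ID is the same fixed width, finding the offset for an ID is a constant-time index into the offset file rather than a scan of variable-length strings.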
The Data Import Server (DIS) configuration includes string caching hints that bound the size of the in-memory symbol tables that each partition may use. This protects the DIS heap from managing an unbounded number of strings, which may cause undesirable garbage collection behavior.
Note
For more detail, see Symbol Caching for Importers.
Symbol table efficiency largely depends on the ratio of the cardinality of the symbol table to the size of the Deephaven table it is associated with. For instance, a symbol table with 100 unique IDs for a Deephaven table with 100 million rows will be more performant than a symbol table with 100 thousand unique IDs for a Deephaven table with 10 million rows.
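The ratio is easy to reason about with concrete numbers. A small sketch (the figures mirror the example above and are illustrative only):

```python
def symbol_table_ratio(unique_ids: int, table_rows: int) -> float:
    """Cardinality of the symbol table relative to the table it backs.
    Lower is better: each ID is amortized over more rows."""
    return unique_ids / table_rows

# 100 unique IDs across 100 million rows: each ID amortized over ~1M rows
print(symbol_table_ratio(100, 100_000_000))    # 1e-06
# 100 thousand unique IDs across 10 million rows: only ~100 rows per ID
print(symbol_table_ratio(100_000, 10_000_000))  # 0.01
```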
Symbol tables can negatively affect performance if used in the wrong contexts. For instance, consider a symbol table whose cardinality is greater than the size of the Deephaven table it's associated with. This happens when the DIS creates multiple IDs for the same symbol, which increases disk usage and can slow down queries that must read the additional data.
The following code blocks (Groovy, then Python) audit the symbol table efficiency of a table:
```groovy
symTabEfficiency = { tab ->
    names = []
    symNames = []
    symSizes = []
    values = []
    metaData = tab.meta().where("DataType=`java.lang.String`")
    rows = metaData.getRowSet()
    nameColSource = metaData.getColumnSource("Name")
    iter = rows.iterator()
    while (iter.hasNext()) {
        long key = iter.next()
        names.add(nameColSource.get(key))
    }
    for (String name : names) {
        def col = tab.getColumnSource(name)
        if (col instanceof io.deephaven.engine.table.impl.sources.regioned.SymbolTableSource) {
            symtab = col.getStaticSymbolTable(tab.getRowSet(), false)
            if (symtab.size() > 0) {
                println "Processing " + name
                symNames.add(name)
                symSizes.add(symtab.size())
                values.add(tab.selectDistinct(name).size())
            }
        }
    }
    return newTable(
            col("Name", symNames.toArray()),
            col("Size", symSizes.toArray()),
            col("Values", values.toArray())
    ).update("Values=(long)Values", "Size=(long)Size", "Efficiency=Values/Size")
}

lastBusinessDate = io.deephaven.time.calendar.Calendars.calendar().pastBusinessDate(1)

x = symTabEfficiency.call(db.liveTable("Namespace", "Table").where("Date=`${lastBusinessDate}`"))
```
```python
from deephaven.calendar import calendar
from deephaven import new_table
from deephaven.column import string_col, long_col
import jpy

SymbolTableSource = jpy.get_type(
    "io.deephaven.engine.table.impl.sources.regioned.SymbolTableSource"
)


def symTabEfficiency(tab):
    names = []
    symNames = []
    symSizes = []
    values = []
    metaData = tab.meta_table.where("DataType=`java.lang.String`")
    rows = metaData.j_table.getRowSet()
    nameColSource = metaData.j_table.getColumnSource("Name")
    it = rows.iterator()
    while it.hasNext():
        key = it.next()
        names.append(nameColSource.get(key))
    for name in names:
        col = tab.j_table.getColumnSource(name)
        if SymbolTableSource.jclass.isInstance(col):
            col = jpy.cast(col, SymbolTableSource)
            symtab = col.getStaticSymbolTable(tab.j_table.getRowSet(), False)
            if symtab.size() > 0:
                print("Processing %s" % name)
                symNames.append(name)
                symSizes.append(symtab.size())
                values.append(tab.select_distinct(name).size)
    return new_table(
        [
            string_col("Name", symNames),
            long_col("Size", symSizes),
            long_col("Values", values),
        ]
    ).update(["Values=(long)Values", "Size=(long)Size", "Efficiency=Values/Size"])


lastBusinessDate = calendar().pastBusinessDate(1)
x = symTabEfficiency(
    db.live_table("Namespace", "Table").where("Date=`%s`" % lastBusinessDate)
)
```
The query produces a table with one row per symbol table column. The table has the following columns:

- `Values`: the number of distinct values that exist for that column.
- `Size`: the size of the on-disk symbol table.
- `Efficiency`: the ratio of `Values` to `Size`, which is ideally 1.00 (or 100%) for a single partition. When a table has multiple partitions, the symbol tables are independent.

You may also see that the number of `Values` is close to the size of the input table, in which case a symbol table column may not be appropriate (as is often the case for order IDs).
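Because each partition keeps its own independent symbol table, the on-disk `Size` can be a multiple of the distinct-value count. A sketch of the arithmetic, assuming the worst case where every partition sees every value:

```python
def efficiency(distinct_values: int, partitions: int, ids_per_partition: int) -> float:
    """Efficiency = Values / Size, where Size sums the per-partition symbol tables."""
    size = partitions * ids_per_partition
    return distinct_values / size

# Single partition, one ID per value: the ideal case
print(efficiency(100, 1, 100))  # 1.0
# Four independent partitions that each saw all 100 values
print(efficiency(100, 4, 100))  # 0.25
```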
Consider the screenshot below. The `symbol`, `wiresymbol`, and `u_symbol` columns show an efficiency of 0.03%, which is very low. Adjusting the DIS cache hints to allow sufficient headroom for the roughly 30,000 unique values is likely to improve query performance. Additionally, the `ordid` column is a symbol table column, but it has over 88 million unique values. Queries and ingestion are likely to be more efficient after changing the schema to include `symbolTable="none"`.
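For example, disabling the symbol table for such a column is done in the table's schema. The column definition below is a hypothetical sketch — the column name and other attributes are made up; only the `symbolTable="none"` attribute is the point:

```xml
<!-- Hypothetical column definition: a high-cardinality order ID column
     with the symbol table disabled via symbolTable="none" -->
<Column name="OrdId" dataType="String" symbolTable="none" />
```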
Caution
You cannot change `symbolTable` to `"none"` without making existing historical data unreadable.