Query operation errors
Deephaven query operations can fail for many reasons. This guide highlights the most common errors and how to debug them.
Missing columns
Query operations in Deephaven often reference columns in one of two ways:
- Columns within formulas: Operations like `where`, `select`, `update`, `view`, and `update_view` use column names within formula strings. When these column names are incorrect, you need to debug formula compilation errors.
- Column names as direct arguments: Other operations take column names directly as arguments. For example, this `last_by` operation uses two columns, `SerialNumber` and `Version`, passed as a list of string arguments:
This script results in the following NoSuchColumnException:
In this case, the column Version used in the last_by is actually named VersionNumber, as shown in the list of columns in the input table (the result of our merge operation). Here is the corrected code:
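As a minimal illustration of this failure mode (plain Python with hypothetical data, not Deephaven's actual table implementation), note that the lookup fails before any rows are processed because the table's schema simply has no entry for the misspelled name:

```python
# Illustrative stand-in for a column-oriented table: a dict mapping
# column name -> list of values. Not Deephaven's implementation.
class NoSuchColumnError(KeyError):
    pass

def last_by(table, keys):
    """Keep the last row for each distinct combination of key columns."""
    missing = [k for k in keys if k not in table]
    if missing:
        raise NoSuchColumnError(
            f"Unknown column names {missing}, available columns: {sorted(table)}"
        )
    n = len(next(iter(table.values())))
    last_row_for_key = {}
    for i in range(n):
        key = tuple(table[k][i] for k in keys)
        last_row_for_key[key] = i  # later rows overwrite earlier ones
    rows = sorted(last_row_for_key.values())
    return {name: [col[i] for i in rows] for name, col in table.items()}

table = {
    "SerialNumber": [1, 1, 2],
    "VersionNumber": [1, 2, 1],
    "Name": ["a", "b", "c"],
}

# The misspelled column name fails, analogous to NoSuchColumnException:
try:
    last_by(table, ["SerialNumber", "Version"])
except NoSuchColumnError as e:
    print(e)

# The corrected column names succeed:
result = last_by(table, ["SerialNumber", "VersionNumber"])
```

Column names are checked against the schema up front, which is why the error names both the unknown column and the available columns.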
Many other operations can have missing columns. For example, a natural join may have a missing column on the right-hand side:
Column names are case-sensitive. This query produces the following error because the correct name is HeapSizeInGB rather than HeapSizeGb:
The following query functions properly:
Similarly, changing the natural join to have an additional match column (in this example, ProcessInfoId) that is not present on the right-hand side produces the following error:
And conversely, adding ScriptLanguage, which is not present on the left-hand side, errors as follows:
Duplicate columns in joins
When using joins, you may encounter errors related to duplicate or conflicting column names in the joined tables. The following example shows what happens when you try to include columns with the same name from both tables:
This produces the following error:
To resolve this, rename the conflicting column using an alias in your join operation:
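The conflict and its resolution can be sketched in plain Python (hypothetical tables, not the Deephaven API). The `Alias=Original` rename spec mirrors Deephaven's `NewColumnName=ExistingColumnName` syntax:

```python
# Illustrative sketch: a join fails when a non-key output column exists on
# both sides, unless the right-hand column is renamed. Not Deephaven's
# implementation; tables are dicts of column name -> list of values.
def natural_join(left, right, on, joins):
    right_index = {right[on][i]: i for i in range(len(right[on]))}
    out = {name: list(col) for name, col in left.items()}
    for spec in joins:
        alias, _, original = spec.partition("=")
        original = original or alias
        if alias in left:
            raise ValueError(
                f"Conflicting column name {alias!r}; "
                f"rename it, e.g. 'Right{alias}={alias}'"
            )
        out[alias] = [right[original][right_index[k]] for k in left[on]]
    return out

left = {"Id": [1, 2], "Timestamp": [10, 20]}
right = {"Id": [1, 2], "Timestamp": [11, 21]}

# Both tables have a Timestamp column, so this raises:
try:
    natural_join(left, right, "Id", ["Timestamp"])
except ValueError as e:
    print(e)

# Renaming the right-hand column resolves the conflict:
joined = natural_join(left, right, "Id", ["RightTimestamp=Timestamp"])
```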
Duplicate row keys
Natural join operations require that a unique right-hand side row exists for each key. If multiple rows exist, the Deephaven engine cannot unambiguously match a right-hand side row to the left-hand side row. If multiple outputs for each row are desired, then use a join operation rather than a natural_join.
The following example shows what happens when you try to use natural join with keys that aren't unique in the right-hand table:
This produces the following error message:
The SerialNumber column alone does not define a unique key for a Persistent Query configuration. In this case, the query with serial number 1733244619742000074 has more than one row in the PersistentQueryConfigurationLogV2 table.
To correct this, we must add a second key column, in this case, VersionNumber, to uniquely identify the version of the query that is referenced by the Persistent Query State log:
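The uniqueness requirement can be sketched in plain Python (an illustrative stand-in, not the Deephaven engine; the second serial number below is hypothetical): building the right-hand lookup refuses duplicate keys, and widening the key makes each entry unique:

```python
# Illustrative sketch of why natural_join needs unique right-hand keys:
# the right table is indexed by key, and a duplicate key is ambiguous.
def build_right_index(right, keys):
    """Map each key tuple to exactly one right-hand row, refusing duplicates."""
    index = {}
    n = len(next(iter(right.values())))
    for i in range(n):
        key = tuple(right[k][i] for k in keys)
        if key in index:
            raise ValueError(
                f"natural_join requires unique right-hand keys; duplicate: {key}"
            )
        index[key] = i
    return index

config_log = {
    "SerialNumber": [1733244619742000074, 1733244619742000074, 1733244619742000075],
    "VersionNumber": [1, 2, 1],
}

# Keying only on SerialNumber fails: the first serial appears twice.
try:
    build_right_index(config_log, ["SerialNumber"])
except ValueError as e:
    print(e)

# Adding VersionNumber makes every key unique.
index = build_right_index(config_log, ["SerialNumber", "VersionNumber"])
```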
This query correctly maps the state to a single version of the Persistent Query configuration, but there are still cases where multiple rows can appear, resulting in a similar error but indicating the version number in the duplicate key:
In particular, when the Persistent Query Controller restarts, it logs all active configurations. To correct this problem, select only the last row from the PersistentQueryConfigurationLogV2 table to retrieve the most recent entry:
Every data set and query is different; adding a last aggregation is therefore not a universal solution. Carefully consider your data and the necessary business logic to ensure that you choose the correct rows.
Resource limitations
Sometimes, a query operation exceeds the resources available. Three kinds of resource limitations are memory, key cardinality, and row set address space.
Out of memory
Each Persistent Query or Code Studio is assigned to a JVM with limited memory available for heap space (where most Java objects are stored) and direct memory (where I/O buffers are often stored). Beyond the JVM's managed memory, when using Python, the Python interpreter also allocates memory for its own objects. When the heap is exhausted, the JVM reports an OutOfMemoryError.
Using a worker with a 1GB heap, the following snippet creates a 150,000,000 row table and uses the select method to create the column X, whose type is long. Since long values are each 8 bytes, this select operation requires 1.2GB of heap storage to store the column.
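A quick back-of-the-envelope calculation (plain Python) reproduces the 1.2GB figure:

```python
# Each long value occupies 8 bytes, so materializing the X column for
# 150,000,000 rows requires more memory than the worker's 1 GB heap.
rows = 150_000_000
bytes_per_long = 8
column_gb = rows * bytes_per_long / 1_000_000_000
print(f"{column_gb:.1f} GB required for the column")  # 1.2 GB
```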
In this particular instance, the Code Studio crashed due to resource exhaustion, so the Exception text must be retrieved from the ProcessEventLog.
A Java heap exception can also be generated from a Groovy console entirely from script code, without using any Deephaven operations. For example:
This produces the following exception:
Python memory allocations are not bound by the JVM heap size, but you may still run into system limits.
For example:
This results in the following Python error:
You may not always receive such a clear error message. If the available machine memory is exceeded, the Linux kernel's out-of-memory (OOM) killer may terminate your worker. On Kubernetes, the runtime terminates pods that exceed their memory limits, and you may need to adjust your additional memory settings.
Join or aggregation cardinality
Deephaven uses hash tables to bucket values in a join or aggregation; these tables have a maximum size of roughly 750 million rows. When the cardinality of the hash keys exceeds this maximum, an error is thrown. For example, this query has a key cardinality of one billion:
This results in the following exception:
You can work around this limitation by using partitioned tables. A partitioned table divides an input table into several sub-tables; in this case, we divide the billion-row table into smaller tables with only 100 million keys each. Often these divisions are done using a hash code and modulus operator. After dividing the table, we create a proxy, execute the original count, and then merge the results. When dividing the table, it is essential that each division contains all of the rows for your aggregation or join keys.
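The partitioning idea can be sketched in plain Python (an illustrative stand-in, not the Deephaven partitioned-table API). The key property is that hashing sends every occurrence of a key to the same partition, so each per-partition aggregate is complete and the partial results can simply be merged:

```python
from collections import Counter

# Illustrative sketch of the partition-by-hash workaround: split the keys
# with hash(key) % num_partitions, aggregate each partition separately,
# then merge the partial results.
def partitioned_count(keys, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for k in keys:
        # Every occurrence of a key lands in the same partition.
        partitions[hash(k) % num_partitions].append(k)
    merged = Counter()
    for part in partitions:
        merged.update(Counter(part))  # each key lives in exactly one partition
    return merged

keys = [i % 10 for i in range(1000)]  # 10 distinct keys, hypothetical data
assert partitioned_count(keys, 4) == Counter(keys)
```

Because the division is by key hash, the merged result equals the single-table aggregation; the same property is what makes the real partitioned-table workaround correct.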
Row set address space
The Deephaven engine represents tables as a set of column sources and a row set that indicates which addresses in each column source are valid. This enables the engine to efficiently handle tables that update. Row keys are represented as signed 64-bit numbers, leaving 63 bits of usable address space. Some Deephaven operations subdivide those 63 bits into regions. For example, a table returned from db.live_table or db.historical_table uses the top 23 bits for the partition and the bottom 40 bits for the row within a partition. This division allows the engine to very quickly identify the correct partition and row when reading data. The ungroup and join operations make similar use of the row set address space to efficiently access rows.
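Under the 23/40-bit split described above, packing and unpacking such a row key can be sketched as follows (illustrative Python, not the engine's actual code):

```python
# Sketch of the partition/row address layout: the top 23 bits select the
# partition, the bottom 40 bits select the row within the partition.
PARTITION_BITS = 23
ROW_BITS = 40  # 23 + 40 = 63 usable bits of a signed 64-bit row key

def to_row_key(partition, row):
    assert 0 <= partition < (1 << PARTITION_BITS)
    assert 0 <= row < (1 << ROW_BITS)
    return (partition << ROW_BITS) | row

def from_row_key(key):
    return key >> ROW_BITS, key & ((1 << ROW_BITS) - 1)

key = to_row_key(5, 123_456)
assert from_row_key(key) == (5, 123_456)
```

Decoding a row key is two bit operations, which is why this layout lets the engine find the right partition and row so quickly.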
The following example uses a series of group_by and ungroup operations to create a Table with a large address space. At each intermediate step, it calls a function to print out the current size of the table, and the last row key with the number of bits consumed.
The number of bits in each intermediate table's address space is as follows:
The ungrouped table contains only 50,000,000 rows but uses 40 bits of address space. During the execution of the join, there are too many row keys to efficiently represent the result, so the engine produces the following exception message:
Insert the flatten of the left side as suggested in the exception message as follows:
Introducing the flatten operation reduces the 40 bits of address space used by the ungrouped table to only the 26 bits used by the flattened table, and the join operation can then efficiently represent the result.
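The 26-bit figure follows directly from the row count: a flat table numbers its rows 0 through size-1, so the bits needed depend only on the row count, not on how earlier operations spread keys across the address space. An illustrative one-line calculation:

```python
# Bits of address space needed for a flat table of the given size,
# whose row keys are simply 0 .. size-1.
def bits_needed(size):
    return (size - 1).bit_length()

print(bits_needed(50_000_000))  # 26, since 2**26 = 67,108,864 >= 50,000,000
```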
Engine assertions
At some points while executing a query, the Deephaven engine can detect invalid internal state; in these cases, it produces an AssertionFailure exception. For example, an update may not both add and modify the same row key, and the number of rows in an aggregation state may not be less than zero. An assertion failure indicates one of two conditions:
- The query violates the assumptions made by the Deephaven engine. In particular, queries may not introduce changes into the system without the knowledge of the Deephaven Update Graph. Common examples of changes that are unknown to the engine include unstable formulas or using array references in a ticking table. An unstable formula may not return the same value for the same inputs (i.e., it is not a pure function). Never use an unstable function, like `now()`, in an `update_view`.
- There is a bug in the Deephaven engine itself.
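In plain Python terms, the difference between a stable (pure) formula and an unstable one can be sketched as:

```python
import random

# The engine may re-evaluate a formula for the same row and assumes it
# gets the same answer each time. A pure function satisfies that
# assumption; an impure one does not.
def stable_formula(x):
    return x * 2            # pure: same input, same output

def unstable_formula(x):
    return random.random()  # impure: re-evaluation yields a different value

assert stable_formula(3) == stable_formula(3)
# unstable_formula(3) != unstable_formula(3) almost surely, which is
# exactly the behavior that breaks the engine's consistency assumptions.
```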
The following query incorrectly uses an unstable formula, which causes undefined behavior in Deephaven's engine.
After running for some period of time, the query fails with an assertion like the following:
The "Missing value in probe" message indicates that, while processing an aggregation, the engine could not find an expected state.
As is often the case with these kinds of problems, the operation that caused the error (here, the incorrect use of update_view to generate a random number) is not the operation that detected it (the count_by bucketed on the Unstable column). Indeed, many errors may not be detected at all, or only by an operation far downstream from the offending one (whether the error is due to the user's query violating engine assumptions or a bug in the Deephaven engine itself). This makes this kind of failure among the most difficult types of errors to debug.
There are some tools available that may help in tracking down the root cause of an assertion failure:
- The `TableUpdateValidator` is a tool that can narrow down which operation is at fault for a failure. The `TableUpdateValidator` can be inserted into a sequence of operations and validates that the update from an operation correctly notifies the downstream operation of all changes. This can be used to detect unstable formulas or engine bugs that result in erroneous previous values.
- Attaching a debugger to the query and setting a breakpoint on an `AssertionFailure`. This requires that the assertion is reliably reproducible, as well as knowledge of the engine internals and Java debugging.
- Generating a heap dump on assertion failures and analyzing the result using JVisualVM or other tools. Analyzing a heap dump requires the same knowledge of engine internals and Java debugging and provides less information, but it can be enabled more easily when the failure recurs only occasionally.
Use a TableUpdateValidator
To narrow down the problem, use a TableUpdateValidator. From Groovy, use the TableUpdateValidator directly. From Python, use the jpy.get_type method to access the underlying Java class, then re-wrap the result in a deephaven.table.Table.
In this case, the update produced the following error:
This exonerates the count method, indicating that the table z is producing inconsistent values for the Unstable column. Although this shows that z is producing incorrect results, y has not actually been identified as the root cause. Setting the JVM argument -DTableUpdateValidator.aggressiveUpdateValidation=true and re-running the query produces the following exception:
It is now clear that the update_view operation in y produces incorrect results for the Unstable column. TableUpdateValidators are useful tools, but they can greatly affect a query's performance and memory usage: a validator records all values from the table it is validating in memory and then compares those values with updated results. This impacts both the query's heap utilization and the update cycle's CPU utilization, so a TableUpdateValidator should be used judiciously.