Editing schemas with Column Tools
The ColumnTools
class provides methods to modify or remove historical column data. Currently, it supports only Deephaven format historical data, as Parquet files are not as conducive to replacing or modifying individual columns.
Note
This method is only available in the Groovy API.
Using ColumnTools
, you can:
- Update the contents of existing historical columns.
- Populate a new column for existing historical partitions.
- Change the datatype of an existing column.
- Remove a column and its data files.
- Rename a column and its data files.
Like other data writing operations (data merge queries, etc.), ColumnTools
methods must be executed from the context of a merge server worker, so the underlying process will have write access to historical data files.
General considerations
When processing updates, ColumnTools
methods check several prerequisites. These are principally:
- All partitions to be updated are using Deephaven format.
- All internal partitions to be updated are writable (see the Table Storage topic for details on writable vs. read-only internal partitions).
- The column being updated or deleted is not a grouping or partitioning column.
- The data type of new values matches the data type of the column being written.
For multi-partition updates, the ColumnTools
class first writes all of the new column files to hidden temporary directories and then, when all the writing is complete, moves the files to their permanent locations. The files' locations and their destinations are persisted to a state file on disk when all the writing is complete. This state file is also used to prevent multiple concurrent column changes to a table. See the troubleshooting section for steps to recover or cancel an update that was interrupted or failed after creating its state file.
Caution
No other write operations should occur on a historical table concurrently with ColumnTools
modifications. This mainly means merge operations. ColumnTools
changes should only be run at a time when they will not risk interacting with scheduled or ad-hoc merges of the table in question.
ColumnTools
updates historical data and table schema, but does not update intraday data. If there is also unmerged intraday data, this data may be unreadable after changing a column data type. ColumnTools
changes to tables that have intraday data should be done while related ingesters/tailers/loggers are offline. In some cases it will be necessary to delete and reimport intraday data after modifying a table with ColumnTools
.
After modifying a table with ColumnTools
, workers may have out-of-date details about the table. A refresh using db.reloadData()
may be sufficient, but, in some cases, workers may need to be restarted to pick up new data and schema details.
Initialize the class
To use ColumnTools
, the implementation class must be imported and a new instance instantiated. In Legacy Groovy:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db);
This is a basic instantiation. It creates a ColumnTools
class that uses the logger, fatal error reporter, and database from the worker environment.
A more complex instantiation would be something like:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10);
This example passes two extra arguments.
- The first is a
com.illumon.util.progress.StatusCallback
. This allows theColumnTools
methods to send progress updates as they write columns. In this case, though, since null is passed, a non-reporting status callback is created internally. - The second argument is more interesting as it defines the number of threads allocated for concurrent write operations. To achieve maximum throughput, set this number equal to the number of internal partitions in the table being updated. This applies to column write operations. Other operations, such as deleting column files, happen on a single thread since they do not require much time.
Using Python requires a few extra steps:
- The logger that is automatically available in Groovy must be explicitly created in Python.
jpy.get_type()
is used to import classes.None
is used instead ofnull
.
ColumnToolsImpl = jpy.get_type("com.illumon.iris.importers.ColumnToolsImpl")
Level = jpy.get_type("com.fishlib.io.log.LogLevel").INFO
Logger = jpy.get_type("com.fishlib.io.logger.NullLoggerImpl")
log = Logger(Level)
fatalErrorReporter = jpy.get_type(
"com.fishlib.util.process.ProcessEnvironment"
).getGlobalFatalErrorReporter()
columnTools = ColumnToolsImpl(log, fatalErrorReporter, db, None, 10)
Update the contents of existing columns
Updating the contents of a column and populating a new column are handled the same way with ColumnTools
. The difference lies in the scenarios. When a new column is added to a table with existing historical data, the column will be null for partitions that have already been written. The updateColumn
method can be used to fill the column for these historical partitions. Similarly, if a column contains data that needs to be updated or corrected, the updateColumn
method can be used for that purpose as well.
updateColumn
has three user-facing forms:
.updateColumn(namespace, tableName, partitioningColumnValue, columnName, updateStatement)
This form allows for specifying a namespace, table name, column name, and the partitioning column value (typically Date) for which an update should be run. The last argument is a Deephaven updateView
expression executed to provide the new values for the column.
This example updates the Sample table in the NS1 namespace to set the Value2 column to the result of Value1 times 2:
columnTools.updateColumn("NS1","Sample","Value2","Value2=Value1*2");
When the partitioningColumnValue
argument is omitted, all historical partitions that exist for this table are updated.
.updateColumn(namespace, tableName, columnName, updateStatement)
When the logic to produce values for the column being updated is more complex than what can be done in a simple updateView
, use the third form of updateColumn
that takes a sourceTable
instead of an updateStatement
:
.getSourceTable(namespace, tableName, partitioningColumnValue)
.updateColumn(namespace, tableName, partitioningColumnValue, columnName, sourceTable)
To use this form, first call getSourceTable
to retrieve a version of the table with internal partitioning information. A table can then be created from this one, using any capabilities within the Deephaven query language (joins, function calls, etc.), and is passed back to updateColumn
to provide the source for the new values.
The sourceTable passed into updateColumn
must meet several requirements relative to the original provided by getSourceTable
:
- The internal partition column (named
X_Internal_Partition_X
by default) must be preserved. - The row count must be the same.
- The number of internal partitions must be the same.
- The number of rows per internal partition must be the same.
- The data cannot have been re-sorted.
- It must contain the column being updated.
- The data type of the column being updated must be unchanged.
Most of these requirements are checked by the updateColumn
method call, and an exception is thrown when one of them is found not to be met.
Here is a simple Legacy example of getting a source table:
import com.illumon.iris.importers.ColumnToolsImpl
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10)
st = columnTools.getSourceTable("LearnDeephaven","StockQuotes","2017-08-25")
This example returns a source table which can then be updated with complex table operations such as joins to populate columns with data to be used in subsequent updateColumn
calls.
To update an entire historical table with logic that requires the use of getSourceTable
, it is necessary to write a loop that iterates over selectDistinct(<partitioningColumn>)
values from the table, gets a source table, operates on it, and then calls updateColumn
.
Here is an exmaple outline of a looping source table updateColumn
call:
import com.illumon.iris.importers.ColumnToolsImpl
// This example is partial, and assumes an AdjPrice column has been added to the schema
String namespace = "LearnDeephaven"
String tableName = "StockTrades"
String columnToUpdate = "AdjPrice"
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10)
dates = db.t(namespace, tableName).selectDistinct("Date").getColumn(0).getDirect()
for (String date in dates) {
Table st = columnTools.getSourceTable("LearnDeephaven","StockQuotes",date)
// Do some table operations on st to calculate AdjPrice values
// Update a column in the current date parition from the modified version in st
columnTools.updateColumn(
namespace,
tableName,
date,
columnToUpdate,
st)
}
Change the data type of a column
The changeColumnType
method is intended for cases where existing data can be converted into a new type using a simple conversion expression (e.g., converting int
to long
, or long
to int
when all values are small enough for int
). If more complex table operations are needed, you may need to add a new column, populate it using the sourceTable
form of updateColumn
, and then remove the old column with deleteColumn
.
Internally, when changing a column's data type, ColumnTools
iterates through the existing historical partitions and writes temporary column files in the new format. Once all files have been written, they are moved to replace the old files, and the schema is updated to reflect the new data type.
.changeColumnType(namespace, tableName, columnName, newDataType, updateStatement)
For example:
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db,
null,
10);
columnTools.changeColumnType("NS1","Sample","Value2",int.class,"Value2=(int)Value2");
- This example changes the Value2 column of the Sample table in the NS1 namespace to the int primitive type.
- The fourth argument (
newDataType
) is a Java class of the new type. - The fifth argument (
updateStatement
) is similar to theupdateStatement
argument used inupdateColumn
. This expression must provide a value that matches the data type to which the column is being changed. In some cases, such as this one, a simple cast will work. In others, more complex expressions may be needed - for instance, to convert a String value to a numeric or DateTime value.
Delete a column
The deleteColumn
method removes column data files from existing historical partitions and deletes the column from the schema. This process is simpler and faster than changing a column's data type, and it follows a different order of operations. Since the column will be removed regardless, the method first removes the column from the schema and then iterates through the existing historical partitions to delete the column files. This method call will be rejected if there is already a columnUpdate
running for the table.
Note that if this method is executed on a column that does not exist in the schema, it will still run and attempt to remove any data files on disk associated with the column. This allows the method to be easily rerun if it is interrupted and also enables it to clean up column files left over after a column has been previously removed from the schema using other methods.
.deleteColumn(namespace, tableName, columnName)
The following example deletes the Value2 column from the Sample table in the NS1 namespace, and also deletes all existing historical column files for this column.
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db);
columnTools.deleteColumn("NS1","Sample","Value2");
Rename a column
The renameColumn
method changes a column name in the schema and for existing historical column files on disk.
.renameColumn(namespace, tableName, columnName, newColumnName)
Like deleteColumn
, this method updates the schema first and then modifies column files, since this should be a much faster process than writing new files. In case data is accessed from a table while not all files have yet been renamed, partitions that still have the old names for column files will return null values. As it is for other ColumnTools
operations, though, it is strongly recommended to have no other processes attempting to read data from the table while renaming a column.
The new column name must be unique to the table and a valid Deephaven column name. If it is not, an IllegalArgumentException
is thrown with details of the failure in validating the new column name.
renameColumn
will also rename Logger/Listener columns and system inputs, and ImportSource columns. Note, however, that if an ImportSource column being renamed includes a formula, that formula will need to be updated if it includes any references to the old column name.
Note
Because renaming a column also renames corresponding Logger/Listener entries, log files created with the pre-rename version of the schema will not be replayable or loadable (e.g. with readBin
) with the post-rename schema. To allow support for older log files, manually add a versioned Listener section that includes mappings from pre to post rename columns.
The following example renames the Value2 column and its corresponding historical column data files to SomeNewColumnName
.
import com.illumon.iris.importers.ColumnToolsImpl;
columnTools = new ColumnToolsImpl(
log,
com.fishlib.util.process.ProcessEnvironment.getGlobalFatalErrorReporter(),
db);
columnTools.renameColumn("NS1","Sample","Value2","SomeNewColumnName");
Troubleshooting and recovery steps
Most operations through ColumnTools
will validate state and data before making changes. Problems such as invalid column names and most types of malformed input in a sourceTable
will throw exceptions that provide clear details about why the request could not be processed.
An IllegalStateException
, with little additional detail about what caused the exception, may occur if ColumnTools
operations are attempted from a query worker instead of a merge worker, because query workers do not have write access to column files.
One erroneous scenario which cannot be detected automatically is that if a sourceTable
reorders rows for an updateColumn
(e.g., via a sort or aggregation), then the updateColumn
will run, but the updated column will have incorrect values. The system ignores the other columns in the sourceTable
which differ from the table on disk, therefore the newly constructed table will have mismatched values.
When processing an updateColumn
for all partitions, or changeColumnType
, the execution proceeds in these phases:
- Write a state file
<table_name>_ColumnUpdate.map
to the root of the historical namespace typically/db/Systems/<namespace>/
. - Write new column files for all partitions to hidden temporary directories.
- Update the state file with a map of temporary and permanent directories for the table.
- Iterate through all partitions from the map in the state file and replace original column files with the new ones from the hidden temporary directories.
- If changing column type: update the schema with the new type.
- Delete the state file.
All columnUpdate
forms use the state file on disk to prevent multiple concurrent ColumnTools
updates to a table. renameColumn
and deleteColumn
do not create a state file, but will reject a request if a state file is found to exist when they start.
In the case where the process is interrupted in step 2, nothing has yet changed for the permanent table, and a fresh execution of the update/change statement can start over from the beginning. In the case where the process is interrupted in step 4, some files may have already been replaced, and the overall change should be resumed at step 4 to complete the update.
The restartFinalization
method is used to restart steps 4 - 6, or, if the state file is empty, to remove it so a new update can be started.
restartFinalization(namespace, tableName)
Other ColumnTools
change methods, like renameColumn
and deleteColumn
, are designed to be idempotent in case they are interrupted while making file system changes. When they are re-run after interruption, renameColumn
attempts to find and rename any remaining column files with the old name, and deleteColumn
attempts to find and delete any remaining column files for the column that has been deleted.
In addition to failure events, which result in thrown exceptions, the ColumnTools
methods log a fair amount of step detail as Info messages. These can be viewed in the ProcessEventLog, or by enabling Info logging in the Log panel of a console.
In the most severe failure scenarios, a fatal error may occur in the column writing code; this will kill the worker. More detail can be obtained about such a failure by querying the ProcessEventLog
for the worker in question and sorting by Timestamp (since the fatal error reporter events will likely be out of order relative to regular worker events in the log). Events of interest in this case will typically be Errors, and will be very near the max Timestamp values.