Intraday Binary Log format configuration

The binary log format configuration is used to generate loggers and listeners for the production and consumption of streaming data. The configuration is contained inside the schema as a LogFormat XML element.

High-level overview

In simple cases, most of the relevant details for configuration can be inferred from the schema itself.

<Table namespace="MyNamespace" name="MyTableName" storageType="NestedPartitionedOnDisk">
  <Partitions keyFormula="${autobalance_single}" />
  <Column name="Date" dataType="String" columnType="Partitioning" />
  <Column name="Host" dataType="String" />
  <Column name="Value" dataType="int" />
  <LogFormat version="1" />
</Table>

In the example above, the user is responsible for setting the version attribute. This is the minimum configuration needed to create a Deephaven LogFormat listener. The listener will recognize that, in version 1, both Host and Value will be present in the binary log files.

Deephaven can also auto-generate a logger according to a given LogFormat version by adding one or more Logger blocks with the required class attribute:

<Table namespace="MyNamespace" name="MyTableName" storageType="NestedPartitionedOnDisk">
  <Partitions keyFormula="${autobalance_single}" />
  <Column name="Date" dataType="String" columnType="Partitioning" />
  <Column name="Host" dataType="String" />
  <Column name="Value" dataType="int" />
  <LogFormat version="1">
      <Logger class="com.example.MyLogger" />
  </LogFormat>
</Table>

It is sometimes useful to generate more than one Logger for a given version. For instance, consider a scenario where the LogFormat is used in two different contexts: one where the logger aggregates values from multiple hosts, and another where it operates from a single host. As the example below shows, this can be achieved by adding a second Logger block with a Param element:

<Table namespace="MyNamespace" name="MyTableName" storageType="NestedPartitionedOnDisk">
  <Partitions keyFormula="${autobalance_single}" />
  <Column name="Date" dataType="String" columnType="Partitioning" />
  <Column name="Host" dataType="String" />
  <Column name="Value" dataType="int" />
  <LogFormat version="1">
      <!-- this logger can specify a different Host for every row -->
      <Logger class="com.example.MyLogger" />
      <!-- this logger must specify a single Host at logger construction time -->
      <Logger class="com.example.MyLoggerSingleHost">
          <Param columnName="Host" constant="true" />
      </Logger>
  </LogFormat>
</Table>

LogFormat element

The LogFormat element specifies the high-level details for the binary log format.

Attribute	Meaning	Default	Notes
`version`	The format of the generated log, which must match the format used in the listener.	-	Required
`maxHeaderSize`	The maximum size for a header entry.	4 KiB

LogFormat/Encoding elements

The Encoding elements specify the details for the binary log format on a column-by-column basis. You usually do not need to specify an Encoding element for every column, because the data types written to the log default to the data types in the schema definition. The exception is blob columns, which require a codec. To provide additional control over logger generation, the following attributes are available:

Attribute	Meaning	Default	Notes
`columnName`	The name of the column the attributes apply to.	-	Required
`type`	The type of the column.	`normal`	Must be one of `normal`, `ignore`, `deleted`, `tailer_tx_time`, `dis_rx_time`, or `row_size`.
`renamedFrom`	If this column (that exists in the schema) was renamed, the old name that should be used for function arguments in the log files.	-
`precision`	The precision of the timestamp written to the log.	`nanos`	May be `seconds`, `millis`, `micros` or `nanos`. Nanos is preferred for newly defined binary logs, but existing logs may use millis or micros.
`objectCodec`	The name of the `ObjectCodec` that should be used for this column. This is required for object columns that are not String or temporal types.	-
`objectCodecArguments`	The arguments for the `ObjectCodec` that should be used for this column.	-
`encoding`	The encoding to use for string values.	`ISO_8859_1`	Inherited from the schema's column if not specified.
`dataType`	If the column was deleted, then the `dataType` of the column as it previously existed.	-	Required for `type="deleted"` columns, invalid otherwise.
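
The effect of the precision attribute can be illustrated with a small conversion sketch. It is not the generated logger's code; the class and method names are hypothetical, and it only shows how much of a nanosecond timestamp survives at each precision value listed above.

```java
import java.util.concurrent.TimeUnit;

final class PrecisionSketch {
    // Converts an epoch-nanosecond timestamp to the unit named by a precision value.
    // Assumption: sub-unit digits are simply truncated, as TimeUnit does.
    static long fromNanos(final long epochNanos, final String precision) {
        switch (precision) {
            case "seconds":
                return TimeUnit.NANOSECONDS.toSeconds(epochNanos);
            case "millis":
                return TimeUnit.NANOSECONDS.toMillis(epochNanos);
            case "micros":
                return TimeUnit.NANOSECONDS.toMicros(epochNanos);
            case "nanos":
                return epochNanos;
            default:
                throw new IllegalArgumentException("Unknown precision: " + precision);
        }
    }

    public static void main(final String[] args) {
        final long epochNanos = 1_700_000_000_123_456_789L;
        for (final String precision : new String[] {"seconds", "millis", "micros", "nanos"}) {
            System.out.println(precision + " -> " + fromNanos(epochNanos, precision));
        }
    }
}
```

A log written with millis precision cannot recover the micros or nanos digits later, which is one reason to prefer nanos for newly defined binary logs.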

LogFormat/Logger elements

The Logger elements specify the high-level details relevant for code-generated loggers.

Attribute	Meaning	Default	Notes
`type`		`buffered`	Must be one of `buffered` or `encoders`.
`class`	The name of the output class.		Required
`includeRowFlags`	Include row flags in the log method definition. If not included, all rows are logged with the single row flag.	false
`maxEntrySize`	The maximum size for a single entry.	1 MiB	Must be at most configuration property `BinaryStoreMaxEntrySize`. The default is inherited as the configuration property `BinaryStoreMaxEntrySize`.
`argumentOrder`	Whether generated methods will use the column order specified in the `Table` element or the column order specified in the `Logger` element (followed by the remaining columns in the `LogFormat` and `Table` element).	`schema`	Must be one of `schema` or `logger`.

The following attributes are only relevant when type="buffered":

Attribute	Meaning	Default	Notes
`interface`	The name of an interface to be implemented by the generated class.	-
`threadSafe`	If the generated logger will be thread-safe.	true	It is only safe to set this to false when the caller will be using an external synchronization mechanism, or only calling the logger from a single thread.
`columnPartitionArgument`	The name of the argument for passing in the column partition to write to.	-
`timePartitionColumn`	The name of the Column used for generating the column partition to write to.	-	The `columnPartitionArgument` and `timePartitionColumn` attributes are mutually exclusive. If either of these attributes is specified, then the logger uses dynamic partitions. If neither of these attributes is specified, then the logger does not manage column partitions, leaving that up to the buffer writer used to initialize it. Also see the discussion on time zones if you're doing this.
`bufferSize`	The buffer size each buffered logger will maintain.	2 MiB	Must be at least 2x `maxEntrySize`. Default inherited as 2x the configuration property `BinaryStoreMaxEntrySize`.
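
The size constraints in the two tables above can be expressed as a simple check. This is a hypothetical helper for reasoning about a configuration, not part of Deephaven; it assumes BinaryStoreMaxEntrySize is 1 MiB, matching the documented default for maxEntrySize.

```java
final class LoggerSizeCheckSketch {
    static final long MIB = 1024L * 1024L;

    // Throws if the sizes violate the two documented constraints:
    // maxEntrySize <= BinaryStoreMaxEntrySize, and bufferSize >= 2 x maxEntrySize.
    static void validate(final long maxEntrySize, final long bufferSize, final long binaryStoreMaxEntrySize) {
        if (maxEntrySize > binaryStoreMaxEntrySize) {
            throw new IllegalArgumentException(
                    "maxEntrySize " + maxEntrySize + " exceeds BinaryStoreMaxEntrySize " + binaryStoreMaxEntrySize);
        }
        if (bufferSize < 2 * maxEntrySize) {
            throw new IllegalArgumentException(
                    "bufferSize " + bufferSize + " is less than twice maxEntrySize " + maxEntrySize);
        }
    }

    public static void main(final String[] args) {
        // The documented defaults (1 MiB entries, 2 MiB buffer) satisfy both constraints.
        validate(MIB, 2 * MIB, MIB);
        try {
            // A 1 MiB buffer is too small for 1 MiB entries.
            validate(MIB, MIB, MIB);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```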

Time zones and generated column partition values

When the logger generates column partition values for the Partitioning column from the timePartitionColumn attribute, the time zone of the worker or Java process is taken into account. It's important to ensure that the time zone matches the use case for the data.

If no time zone is specified, then the system's default time zone is used. On most Linux distributions you can find this with ls -l /etc/localtime.
Java's java.util.TimeZone documentation provides further information on how Java processes determine their time zones. For example, specifying the JVM parameter -Duser.timezone=America/Los_Angeles makes the process use the Los Angeles time zone to generate column partition values.

If the time zone doesn't line up with user expectations, you may find that data isn't in the partitions you expect.

For instance, if you're operating on New York time but your system uses the UTC zone (GMT), data will be written into the next date's partition as soon as it's midnight in UTC, four or five hours before it's midnight in New York depending on the time of year.
It's even possible for the same time to get written into multiple column partition dates if different workers specify their own user.timezone properties.
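
The New York versus UTC scenario above can be reproduced with plain java.time. This is a minimal sketch, not the logger's implementation; the class name is hypothetical and the yyyy-MM-dd partition format is an assumption made for illustration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

final class PartitionZoneSketch {
    // Formats an instant as a date-style column partition value in the given zone.
    // Assumption: partitions are named yyyy-MM-dd.
    static String partitionFor(final Instant instant, final ZoneId zone) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(zone).format(instant);
    }

    public static void main(final String[] args) {
        // 00:30 UTC on 15 January is still 19:30 on 14 January in New York (UTC-5 in winter).
        final Instant instant = Instant.parse("2024-01-15T00:30:00Z");
        System.out.println(partitionFor(instant, ZoneId.of("UTC")));              // 2024-01-15
        System.out.println(partitionFor(instant, ZoneId.of("America/New_York"))); // 2024-01-14
    }
}
```

The same instant lands in two different date partitions depending only on the zone, which is the situation that arises when workers run with different user.timezone settings.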

LogFormat/Logger/Param elements

The Param elements specify the details for the code-generated logger on a column-by-column basis.

Attribute	Meaning	Default	Notes
`columnName`	The name of the column the attributes apply to.	-	Required
`constant`	The column is a constant value.	false
`inputType`	The type of input parameter to the `log` method for this column.	-	This is only valid for temporal column types, with possible values `long`, `java.time.Instant`, `java.time.ZonedDateTime`, `com.illumon.iris.db.tables.utils.DBDateTime`, or `DateTime`.
`source`	The `ObjectInput` that this column is derived from.	-
`precision`	The precision of a long timestamp argument to the `log` method.	`nanos`	May be `seconds`, `millis`, `micros`, or `nanos`.
`maxLoggedSize`	The maximum size, in bytes, that can be written to the log file for this column.	-	This is valid for Strings and Blob columns that use a codec. When present, if an encoded value exceeds the limit, the generated logger throws an `IOException`.
`stringStrategy`	The strategy used to encode String values. This is only valid for String columns. Must be one of `bytes` or `encoder`.	`encoder`
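
The maxLoggedSize behavior described above can be pictured with the following sketch. It is an illustration only: the helper name is hypothetical, and the assumption is that the limit is compared against the encoded byte length, so the same String may pass under one encoding and fail under another.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MaxLoggedSizeSketch {
    // Encodes the value and rejects it if the encoded form exceeds the limit (in bytes).
    static byte[] encodeChecked(final String value, final Charset charset, final int maxLoggedSize)
            throws IOException {
        final byte[] encoded = value.getBytes(charset);
        if (encoded.length > maxLoggedSize) {
            throw new IOException("Encoded value is " + encoded.length + " bytes; limit is " + maxLoggedSize);
        }
        return encoded;
    }

    public static void main(final String[] args) throws IOException {
        final String value = "caf\u00e9"; // four characters, the last one non-ASCII
        // ISO_8859_1 uses one byte per character here, so the value fits in 4 bytes.
        System.out.println(encodeChecked(value, StandardCharsets.ISO_8859_1, 4).length); // 4
        try {
            // UTF_8 needs two bytes for the last character, 5 bytes in total.
            encodeChecked(value, StandardCharsets.UTF_8, 4);
        } catch (IOException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```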

LogFormat/Logger/ObjectInput element

You may also provide ObjectInput child elements that define parameters to the log method. This makes it simpler to log many fields from a single object. The ObjectInput element supports the following attributes:

Attribute	Meaning	Default	Notes
`name`	The name of the `log` method parameter, referenced in the Param `source` attribute.	-	Required
`class`	The type of the object, which must be available to the factory when generating the log.	-	Required
`mixinClass`	An additional type that can be used to provide annotations for determining the correct getters.	-
`nullable`	The object passed to `log` may be null, in which case the log method fills in all columns derived from this object with null values.	false

LogFormat/ImportState element

Listeners can have an additional child element of ImportState, which has the following attribute:

Attribute	Meaning	Notes
`class`	The class name of the import state object.	Required

LogFormat/ImportState/Column element

The ImportState element contains child elements named Column. These columns are passed to the import state onNewRow call.

Attribute	Meaning	Notes
`name`	The column name	Required

Listeners

For each log format there must be an unambiguous listener. If an old-style LoggerListener or Listener block is present with the desired logFormat, that block is used. If no old-style block is present, then a suitable new style block is used if it has the desired format. In contrast to IntradayLoggerFactory-generated loggers, the version attribute is required. This eliminates any ambiguity in which the Listener or LogFormat element should be used to process a file. When the input log format is zero, the old or new style block with the highest logFormat or version is used. To preserve legacy log formats, you may create a LogFormat block without any Logger elements which allows you to read old formats without the need to generate a logger.
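
The selection rules above can be summarized in a short sketch. It is a simplified model, not Deephaven's implementation: the class name is hypothetical, each block is represented only by a label keyed by its logFormat or version, and preferring the old-style block when both styles share the highest number is an assumption.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

final class ListenerSelectionSketch {
    // oldStyle is keyed by logFormat, newStyle by LogFormat version.
    static String select(final int inputFormat,
            final NavigableMap<Integer, String> oldStyle,
            final NavigableMap<Integer, String> newStyle) {
        if (inputFormat == 0) {
            // Format zero: use whichever block has the highest logFormat or version.
            final Integer oldMax = oldStyle.isEmpty() ? null : oldStyle.lastKey();
            final Integer newMax = newStyle.isEmpty() ? null : newStyle.lastKey();
            if (oldMax == null && newMax == null) {
                throw new IllegalStateException("No listener blocks defined");
            }
            if (newMax == null || (oldMax != null && oldMax >= newMax)) {
                return oldStyle.get(oldMax);
            }
            return newStyle.get(newMax);
        }
        // An old-style block with the desired format takes precedence.
        if (oldStyle.containsKey(inputFormat)) {
            return oldStyle.get(inputFormat);
        }
        if (newStyle.containsKey(inputFormat)) {
            return newStyle.get(inputFormat);
        }
        throw new IllegalStateException("No listener for log format " + inputFormat);
    }

    public static void main(final String[] args) {
        final NavigableMap<Integer, String> oldStyle = new TreeMap<>();
        oldStyle.put(1, "old-style logFormat 1");
        final NavigableMap<Integer, String> newStyle = new TreeMap<>();
        newStyle.put(1, "LogFormat version 1");
        newStyle.put(2, "LogFormat version 2");
        System.out.println(select(1, oldStyle, newStyle)); // old-style logFormat 1
        System.out.println(select(2, oldStyle, newStyle)); // LogFormat version 2
        System.out.println(select(0, oldStyle, newStyle)); // LogFormat version 2
    }
}
```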

Schema evolution

It is often necessary to update previous versions of LogFormat elements when a schema is updated. Consider the following schema:

<Table namespace="MyNamespace" name="MyTable" storageType="NestedPartitionedOnDisk">
    <Partitions keyFormula="${autobalance_single}"/>

    <Column name="Date" dataType="String" columnType="Partitioning" />
    <Column name="Col1" dataType="int" />
    <Column name="Col2" dataType="long" />

    <LogFormat version="1">
        <Logger class="com.example.binlog.gen.MyTableLogger" />
    </LogFormat>
</Table>

If we update the schema by deleting Col2 and adding Col3, we need to update log format version 1 if we need to read binary logs written with that format:

<Table namespace="MyNamespace" name="MyTable" storageType="NestedPartitionedOnDisk">
    <Partitions keyFormula="${autobalance_single}"/>

    <Column name="Date" dataType="String" columnType="Partitioning" />
    <Column name="Col1" dataType="int" />
    <Column name="Col3" dataType="double" />

    <LogFormat version="1">
        <!-- Col2 was deleted from the schema in version 2 but it is still relevant for this version's binary log format. -->
        <Column name="Col2" type="deleted" dataType="long" />
        <!-- Col3 was added to the schema in version 2 and we need to make sure this version does not inherit it. -->
        <Column name="Col3" type="ignore" />
    </LogFormat>
    <LogFormat version="2">
        <Logger class="com.example.binlog.gen.MyTableLogger" />
    </LogFormat>
</Table>
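
One way to reason about the version 1 block is to compute which columns that version expects in its binary logs: start from the current schema, drop the ignored columns, and add back the deleted ones. The sketch below models only that bookkeeping. The class name is hypothetical, the Partitioning column is left out, and the resulting column order is an assumption rather than a statement about the file layout.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class LogFormatColumnsSketch {
    // Returns column name -> data type for the columns a given log format version expects.
    static Map<String, String> columnsInLog(final Map<String, String> schemaColumns,
            final Map<String, String> deleted, final Set<String> ignored) {
        final Map<String, String> result = new LinkedHashMap<>();
        for (final Map.Entry<String, String> column : schemaColumns.entrySet()) {
            if (!ignored.contains(column.getKey())) {
                result.put(column.getKey(), column.getValue());
            }
        }
        // Deleted columns no longer exist in the schema, so their types come from the log format.
        result.putAll(deleted);
        return result;
    }

    public static void main(final String[] args) {
        // Current schema after the update (non-partitioning columns only).
        final Map<String, String> schema = new LinkedHashMap<>();
        schema.put("Col1", "int");
        schema.put("Col3", "double");
        // Version 1 overrides: Col2 was a long in the original schema, Col3 did not exist yet.
        final Map<String, String> deleted = new LinkedHashMap<>();
        deleted.put("Col2", "long");
        final Set<String> ignored = new HashSet<>(Arrays.asList("Col3"));
        System.out.println(columnsInLog(schema, deleted, ignored)); // {Col1=int, Col2=long}
    }
}
```

The result matches the columns of the original schema, which is what a listener needs in order to read logs written before the update.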

Examples

This simplified example is from the Deephaven DbInternal.ProcessEventLog schema. The logger requires a "Date" input to determine the partition to write to, and uses constant values for many fields that do not change per worker. These fields are written once to the header. Only the remaining columns are necessary to pass into each log method call. The Timestamp column has additional attributes to determine the type of input to the log method and the precision of the method input and log output.

<Table name="ProcessEventLog" namespace="DbInternal" storageType="NestedPartitionedOnDisk">
  <Partitions keyFormula="${autobalance_by_first_grouping_column}" />
  <Column name="Date" dataType="String" columnType="Partitioning" />
  <Column name="Timestamp" dataType="DateTime" />
  <Column name="Host" dataType="String" columnType="Grouping" />
  <Column name="Level" dataType="String" />
  <Column name="Process" dataType="String" />
  <Column name="ProcessInfoId" dataType="String" />
  <Column name="AuthenticatedUser" dataType="String" />
  <Column name="EffectiveUser" dataType="String" />
  <Column name="LogEntry" dataType="String" symbolTable="None" encoding="UTF_8" />
  <LogFormat version="2">
    <Encoding columnName="Timestamp" precision="millis" />
    <Logger class="io.deephaven.enterprise.binlog.internal.gen.PelLogger" columnPartitionArgument="Date">
      <Param columnName="Host" constant="true" />
      <Param columnName="Process" constant="true" />
      <Param columnName="ProcessInfoId" constant="true" />
      <Param columnName="AuthenticatedUser" constant="true" />
      <Param columnName="EffectiveUser" constant="true" />
      <Param columnName="Timestamp" inputType="long" precision="millis" />
      <Param columnName="LogEntry" stringStrategy="ENCODER" />
    </Logger>
  </LogFormat>
</Table>

The generated log method has the following signature:

public void log(final String Date, final long Timestamp, final String Level, final String LogEntry) throws IOException;

The generated static of method for construction:

public static PelLogger of(final boolean flushOnLog, final MultiPartitionWriter writer) throws IOException;

The generated static header method to help with writer construction:

public static ByteBuffer header(final String Host, final String Process, final String ProcessInfoId, final String AuthenticatedUser, final String EffectiveUser);

Here is an example from the persistent query state log that uses ObjectInputs. Some fields are derived from the "config" parameter, which uses a mixin for annotations. The PersistentQueryState object supplies most values. ControllerHost and Timestamp are passed directly as arguments to the log method.

<Table name="PersistentQueryStateLog" namespace="DbInternal" storageType="NestedPartitionedOnDisk">
  <Partitions keyFormula="${autobalance_single}" />
  <Column name="Date" dataType="String" columnType="Partitioning" />
  <Column name="Owner" dataType="String" />
  <Column name="Name" dataType="String" />
  <Column name="Timestamp" dataType="DateTime" />
  <Column name="Status" dataType="String" />
  <Column name="ControllerHost" dataType="String" />
  <Column name="DispatcherHost" dataType="String" />
  <Column name="ServerHost" dataType="String" />
  <Column name="WorkerName" dataType="String" />
  <Column name="ProcessInfoId" dataType="String" />
  <Column name="WorkerPort" dataType="int" />
  <Column name="LastAuthenticatedUser" dataType="String" />
  <Column name="LastEffectiveUser" dataType="String" />
  <Column name="SerialNumber" dataType="long" columnType="Grouping" />
  <Column name="VersionNumber" dataType="long" />
  <Column name="TypeSpecificState" dataType="String" />
  <Column name="ExceptionMessage" dataType="String" />
  <Column name="ExceptionStackTrace" dataType="String" />
  <Column name="EngineVersion" dataType="String" />
  <Column name="WorkerKind" dataType="String" />
  <Column name="ScriptLoaderState" dataType="String" />
  <Column name="DispatcherPort" dataType="int" />
  <Column name="ReplicaSlot" dataType="int" />
  <Column name="StatusDetails" dataType="String" />
    <LogFormat version="6">
      <Encoding columnName="Timestamp" precision="millis" />
      <Logger class="io.deephaven.enterprise.binlog.internal.gen.PersistentQueryStateLoggerImpl" interface="io.deephaven.enterprise.controller.logger.PersistentQueryStateLogger" timePartitionColumn="Timestamp">
        <ObjectInput name="state" class="com.illumon.iris.controller.PersistentQueryState" nullable="true" />
        <ObjectInput name="config" class="com.illumon.iris.controller.PersistentQueryConfiguration" mixinClass="com.illumon.iris.controller.PersistentQueryConfigurationMixinForStateLog" />
        <Param columnName="Owner" source="config" />
        <Param columnName="Name" source="config" />
        <Param columnName="Timestamp" inputType="com.illumon.iris.db.tables.utils.DBDateTime" />
        <Param columnName="Status" source="state" />
        <Param columnName="DispatcherHost" source="state" />
        <Param columnName="ServerHost" source="state" />
        <Param columnName="WorkerName" source="state" />
        <Param columnName="ProcessInfoId" source="state" />
        <Param columnName="WorkerPort" source="state" />
        <Param columnName="LastAuthenticatedUser" source="state" />
        <Param columnName="LastEffectiveUser" source="state" />
        <Param columnName="SerialNumber" source="config" />
        <Param columnName="VersionNumber" source="config" />
        <Param columnName="TypeSpecificState" source="state" />
        <Param columnName="ExceptionMessage" source="state" />
        <Param columnName="ExceptionStackTrace" source="state" />
        <Param columnName="EngineVersion" source="state" />
        <Param columnName="WorkerKind" source="config" />
        <Param columnName="ScriptLoaderState" source="state" />
        <Param columnName="DispatcherPort" source="state" />
        <Param columnName="ReplicaSlot" source="state" />
        <Param columnName="StatusDetails" source="state" />
      </Logger>
      <ImportState class="com.illumon.iris.db.tables.dataimport.logtailer.ImportStateRowCounter" />
    </LogFormat>
</Table>

The use of ObjectInputs simplifies calling the log method, which only requires four parameters:

public void log(final PersistentQueryState state, final PersistentQueryConfiguration config, final DBDateTime Timestamp, final String ControllerHost) throws IOException;

The generated static of method for construction takes an additional DateTimeFormatter due to the timePartitionColumn:

public static PersistentQueryStateLoggerImpl of(final boolean flushOnLog, final MultiPartitionWriter writer, final DateTimeFormatter partitionFormatter) throws IOException;

Because there are no constant fields, the header method takes no arguments:

public static ByteBuffer header();

ObjectInput Search Rules and Annotations

If a column is derived from an ObjectInput, then the factory automatically selects the most appropriate method or field from the source object. The io.deephaven.enterprise.binlog.annotations.LogColumn annotation can be added to a field or method to indicate that the named column should be derived from that field or method.

For a method to be eligible for matching, it must be public, and have no parameters. Fields must be public. Priority is given to:

1. Annotated methods or fields. You may only have one annotated method or field for a given name.
2. Methods with the same name as the field.
3. Methods that are named "get" followed by the field name. For booleans, methods that are named "is" or "has" followed by the field name.

If more than one item from the highest priority category matches, then the result is ambiguous and the code must be fixed. If the class is ambiguous or the default matching rules do not meet your requirements, then you should use an annotation in the class or mix-in to unambiguously define the field or method to use.

An annotation can be provided as follows, which logs the ScriptLoaderState column using the result of the getScriptLoaderStateJson method.

@LogColumn(name="ScriptLoaderState")
public String getScriptLoaderStateJson() {
    return scriptLoaderStateJson;
}

Instead of annotating the input type (e.g., because it is a third party type, or different loggers map its getters differently, or multiple input types share the same pattern), you may create an abstract class or interface with the @LogColumn annotations.
The generator scans the mixin type for annotations, associates column names to method or field names, and applies them as if the input type itself was annotated. The mixin definition does not even require a dependency on the input type.
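
The mixin idea can be sketched with plain reflection. This is not the generator's code: the Col annotation is a hypothetical stand-in for LogColumn so that the example has no Deephaven dependency, and ThirdPartyConfig and ConfigMixin are invented types. The sketch reads column names from the mixin and resolves the methods of the same name on the input type.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

final class MixinScanSketch {
    // Hypothetical stand-in for the LogColumn annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Col {
        String name();
    }

    // Stand-in for a third-party type that cannot be annotated directly.
    static final class ThirdPartyConfig {
        public String getScriptLoaderStateJson() { return "{}"; }
        public long getSerial() { return 42L; }
    }

    // The mixin declares only annotated accessors; it does not reference ThirdPartyConfig.
    interface ConfigMixin {
        @Col(name = "ScriptLoaderState")
        String getScriptLoaderStateJson();

        @Col(name = "SerialNumber")
        long getSerial();
    }

    // Maps each annotated column name to the same-named no-argument method on the input type.
    static Map<String, Method> resolve(final Class<?> inputType, final Class<?> mixinType) {
        final Map<String, Method> byColumn = new TreeMap<>();
        for (final Method mixinMethod : mixinType.getMethods()) {
            final Col col = mixinMethod.getAnnotation(Col.class);
            if (col == null) {
                continue;
            }
            try {
                byColumn.put(col.name(), inputType.getMethod(mixinMethod.getName()));
            } catch (NoSuchMethodException e) {
                throw new IllegalStateException("Input type has no method " + mixinMethod.getName(), e);
            }
        }
        return byColumn;
    }

    public static void main(final String[] args) {
        for (final Map.Entry<String, Method> e : resolve(ThirdPartyConfig.class, ConfigMixin.class).entrySet()) {
            System.out.println(e.getKey() + " <- " + e.getValue().getName());
        }
    }
}
```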

Casing

Casing is ignored when determining accessor candidates, so the accessor below is automatically processed as an accessor candidate for column "Price", i.e. without a LogColumn annotation.

public double price() {
    return price;
}

Also, if both a Price and price accessor are present, logger generation fails and the ambiguity is reported, requiring you to e.g. specify a LogColumn annotation, which leads to a safer result. For example:

@Deprecated
public double Price() {
    return price;
}

@LogColumn(name="Price")
public double price() {
    return Double.max(0, price);
}
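
The priority order and the case-insensitive matching described above can be modeled with a few lines of reflection. This is a simplified sketch, not the factory's implementation: it considers methods only (not fields), the Col annotation is a hypothetical stand-in for LogColumn, and Quote and AmbiguousQuote are invented input types.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AccessorMatchSketch {
    // Hypothetical stand-in for the LogColumn annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Col {
        String name();
    }

    static final class Quote {
        public double price() { return 101.5; }
        public boolean isActive() { return true; }
        @Col(name = "Venue")
        public String exchangeCode() { return "XNYS"; }
    }

    static final class AmbiguousQuote {
        public double Price() { return 1.0; }
        public double price() { return 1.0; }
    }

    // Picks the accessor for a column: annotated first, then same name, then get/is/has prefix.
    static Method find(final Class<?> type, final String column) {
        final List<Method> annotated = new ArrayList<>();
        final List<Method> sameName = new ArrayList<>();
        final List<Method> prefixed = new ArrayList<>();
        for (final Method m : type.getMethods()) { // getMethods() returns public methods only
            if (m.getParameterCount() != 0 || m.getDeclaringClass() == Object.class) {
                continue;
            }
            final Col col = m.getAnnotation(Col.class);
            final String name = m.getName();
            final boolean isBoolean = m.getReturnType() == boolean.class || m.getReturnType() == Boolean.class;
            if (col != null && col.name().equals(column)) {
                annotated.add(m);
            } else if (name.equalsIgnoreCase(column)) {
                sameName.add(m);
            } else if (name.equalsIgnoreCase("get" + column)
                    || (isBoolean && (name.equalsIgnoreCase("is" + column) || name.equalsIgnoreCase("has" + column)))) {
                prefixed.add(m);
            }
        }
        // The first non-empty tier decides; more than one candidate in it is an error.
        for (final List<Method> tier : Arrays.asList(annotated, sameName, prefixed)) {
            if (tier.size() == 1) {
                return tier.get(0);
            }
            if (tier.size() > 1) {
                throw new IllegalStateException("Ambiguous accessor for column " + column + ": " + tier);
            }
        }
        throw new IllegalArgumentException("No accessor found for column " + column);
    }

    public static void main(final String[] args) {
        System.out.println(find(Quote.class, "Price").getName());  // price
        System.out.println(find(Quote.class, "Active").getName()); // isActive
        System.out.println(find(Quote.class, "Venue").getName());  // exchangeCode
    }
}
```

Calling find(AmbiguousQuote.class, "Price") throws, because both Price() and price() fall into the same tier; adding an annotation to one of them moves it to a higher tier and resolves the ambiguity.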

Generating a Logger

You may generate a logger from the schema using the dhctl tool's logger subcommand. In this example, the logger for the internal PersistentQueryStateLog is written to ~/code/project/src/main/java/io/deephaven/binlog/internal/gen/V2PersistentQueryStateLogger.java:

dhctl logger generate --directory ~/code/project/src/main/java --table-name DbInternal.PersistentQueryStateLog

When a type="buffered" logger defines an interface attribute and that interface is on the class path, logger generation validates that the generated logger implements that interface. The --interface-validation argument allows the caller to configure interface validation.

The generated LogFormat logger has very few Deephaven dependencies, and can operate in a Java 8 or higher environment (the Deephaven server requires Java 17 or newer LTS versions). To use the logger, you should include the "support" and "channels" dependencies from Deephaven's "iris" group.

Example Usage

package io.deephaven.enterprise.binlog.test;

import io.deephaven.enterprise.binlog.channels.RollingFileWriterConfig;
import io.deephaven.enterprise.binlog.test.gen.V2Java8LoggerExample1;
import io.deephaven.enterprise.binlog.writer.SinglePartitionWriter;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;

public class ExampleLogger {
    public static void main(final String[] args) throws IOException {
        final String internalPartition = "Java8Test-" + System.currentTimeMillis();
        // Paths.get is used instead of Path.of so that the example also compiles on Java 8
        final Path outputDirectory = Paths.get(args[0]);
        final String columnPartition = "ColumnPart";
        final boolean flushOnLog = false;

        // This example creates one file per hour in the outputDirectory, using a Column partition of "ColumnPart" and
        // an internal partition derived from the current time. You must pass in all columns that are defined as a constant.
        try (
                final SinglePartitionWriter writer = RollingFileWriterConfig
                        .builder(V2Java8LoggerExample1.loggerInfo())
                        .header(V2Java8LoggerExample1.header("Lima", 123))
                        .directory(outputDirectory)
                        .internalPartition(internalPartition)
                        .build()
                        .columnPartition(columnPartition);
                final V2Java8LoggerExample1 logger = V2Java8LoggerExample1.of(flushOnLog, writer)) {
            // Log a couple rows of data
            logger.log("Golf", "Alpha", 1, 2.0, 3.3f, (short) 4, (byte) 5,
                    Instant.ofEpochMilli(System.currentTimeMillis()),
                    8, '9', false);
            logger.log("Golf2", "Alpha2", -1, -2.0, -3.3f, (short) 42, (byte) 52,
                    Instant.ofEpochMilli(System.currentTimeMillis()),
                    88, '0', true);

            // Flush any outstanding data to disk (in this case, both rows will be written at the same time)
            logger.flush();

            // try block will close the logger, no rows may be written after this point
        }
    }
}

Binary log format