Intraday Binary Log format

The binary store format is a row-oriented log file for consumption by the Data Import Server (DIS). All multi-byte numerical values are stored in network byte order (big-endian), which is also the default for Java ByteBuffers. Unless another size is specified, integers are 4 bytes.

  • The file begins with the 4-byte magic number 0xDB1AA1DB. This specific value is used to distinguish this format from an older version where the first 4 bytes directly represented the number of columns as a big-endian integer.
  • The next four bytes are the version number. The current version of the file format is 3.
  • The next four bytes are the remaining header size (excluding the magic number, version, and this size field itself, but including the header data and digest).
  • The header contains records. Each type of record may occur only once. Some types of records are unconditionally required, including the Column Definition Record. A record begins with an integer containing the size of the record (excluding the size and type) and another integer containing the type of record.
  • The header ends with an Adler-32 digest of the header (less the digest itself).
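
The header layout above can be sketched as follows. This is a minimal illustration rather than the DIS implementation: the function names build_header and parse_header are hypothetical, and the sketch assumes the digest covers every byte from the magic number up to (but not including) the digest, which the text does not state explicitly.

```python
import struct
import zlib

MAGIC = 0xDB1AA1DB
VERSION = 3

def build_header(records):
    """records: list of (record_type, payload_bytes) pairs."""
    body = b"".join(
        # The record size excludes the size and type fields themselves.
        struct.pack(">ii", len(payload), rtype) + payload
        for rtype, payload in records
    )
    # The remaining header size covers the records plus the 4-byte digest.
    prefix = struct.pack(">IiI", MAGIC, VERSION, len(body) + 4)
    # ASSUMPTION: the digest covers all bytes that precede it.
    digest = zlib.adler32(prefix + body)
    return prefix + body + struct.pack(">I", digest)

def parse_header(data):
    magic, version, remaining = struct.unpack_from(">IiI", data, 0)
    if magic != MAGIC:
        raise ValueError("magic number not found")
    end = 12 + remaining
    (stored,) = struct.unpack_from(">I", data, end - 4)
    if stored != zlib.adler32(data[: end - 4]):
        raise ValueError("header digest mismatch")
    records, pos = [], 12
    while pos < end - 4:
        size, rtype = struct.unpack_from(">ii", data, pos)
        records.append((rtype, data[pos + 8 : pos + 8 + size]))
        pos += 8 + size
    return version, records
```

A reader written this way can skip record types it does not recognize, because each record carries its own size.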

Column Definition Record

A column definition record describes the columns stored in this file. The type of a column definition record is 1 (an integer). The column definition record begins with an integer containing the number of columns in this record, which is also the number of columns in the file.

For each column, we write the following:

  • Name of the column as a UTF-8 string (encoded as described in the Data Encoding section below: the length of the name as an integer, followed by the bytes of the name)
  • The type of the column as an integer.
    • Boolean - 1
    • Byte - 2
    • Char - 3
    • Short - 4
    • Int - 5
    • Long - 6
    • Float - 7
    • Double - 8
    • Blob - 9
    • EnhancedString - 10
    • Enum - 11
  • An integer containing the size of this type's metadata.
  • The type-specific metadata. No metadata is required for primitive types (char, double, float, int, long, short) or blobs. EnhancedString and Enum require the following metadata:
    • EnhancedString: The EncodingInfo used for this column, represented as a string that is typically the name of a character encoding (e.g., "UTF-8"). Currently supported values are "ISO-8859-1" and "UTF-8". This string is itself prefixed by an integer length, and the four bytes of that length are included in the metadata size from the previous bullet. For example, if UTF-8 is used, the metadata size is 9, and the metadata consists of a four-byte integer value of 5 (the length of the string "UTF-8") followed by the string itself.
    • Enum: An integer count of string values, followed by a string specifying the character encoding for the enum string values (e.g., "UTF-8"), itself encoded as an integer length followed by the UTF-8 bytes of the encoding name. This is followed by count strings, each represented as described for EnhancedStrings in the Data Encoding section.
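
As a worked illustration of the EnhancedString example above, the following sketch encodes a single column entry. The helper names encode_string and encode_column and the column name "Sym" are hypothetical, and the sketch assumes that string lengths count bytes rather than characters.

```python
import struct

ENHANCED_STRING = 10  # type code from the list above

def encode_string(text, charset="UTF-8"):
    # Integer length (assumed to be a byte count) followed by the bytes.
    raw = text.encode(charset)
    return struct.pack(">i", len(raw)) + raw

def encode_column(name, type_code, metadata=b""):
    return (
        encode_string(name)                 # column name
        + struct.pack(">i", type_code)      # column type
        + struct.pack(">i", len(metadata))  # size of the metadata
        + metadata                          # type-specific metadata
    )

# Metadata for a UTF-8 EnhancedString: int 5 followed by "UTF-8" (9 bytes).
utf8_metadata = encode_string("UTF-8")
column = encode_column("Sym", ENHANCED_STRING, utf8_metadata)
```

Here len(utf8_metadata) is 9, matching the metadata size given in the example.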

Constant Columns Record

A constant columns record contains values for columns that are not stored later in row records but instead take on the same value for each row in the file. The type of a constant columns record is 2 (an integer). The constant columns record begins with the number of constant columns. For each constant column, we write the following:

  • String name (integer length followed by UTF-8 bytes)
  • The constant value, stored according to the rules in the Data Encoding section.

Application Version Number

An application version number record indicates which application-level protocol is used for this log file. The type of an application version number record is 3 (an integer). The application version number record contains a single integer. When not present, the application version number is treated as if it were "0" and the most recent application listener will be used to read the file. For any other application version number, the listener must match the stored version.

Log Data

The bulk of the binary store file is row-oriented log data. The log data consists of records, each of which begins with an integer size (which excludes the size itself but includes the flags and all following data). The next byte is a flags byte. The low-order bits of the flags byte mark sets of records that should be applied as a unit:

  • The lowest-order bit (1) indicates the start of a set of records.
  • The second-lowest-order bit (2) indicates the end of a set of records.
  • A standalone record has both flags set (3).
  • Non-row records have the third bit set (4).

Each non-row record's payload (following the flags byte) begins with an integer type and an integer version, as detailed in the Non-Row Records section below.
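
A minimal sketch of how a reader might split the log data into records and interpret the flags byte. The names split_records and describe_flags are hypothetical, and the sketch performs no digest validation.

```python
import struct

START_OF_SET = 0x01
END_OF_SET = 0x02
NON_ROW = 0x04

def split_records(data):
    """Yield (flags, rest_of_record) for each record in the log data."""
    pos = 0
    while pos < len(data):
        # The size excludes its own 4 bytes but includes the flags byte.
        (size,) = struct.unpack_from(">i", data, pos)
        flags = data[pos + 4]
        yield flags, data[pos + 5 : pos + 4 + size]
        pos += 4 + size

def describe_flags(flags):
    return {
        "starts_set": bool(flags & START_OF_SET),
        "ends_set": bool(flags & END_OF_SET),
        "standalone": (flags & 0x03) == 0x03,
        "non_row": bool(flags & NON_ROW),
    }
```

A reader applying sets as a unit would buffer records from one with starts_set until one with ends_set before applying them.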

Row records

The row record begins with a presence map. The presence map is a bitmap, represented as a series of bytes. The number of bits in the presence map is equal to the total number of column definitions in the header less the number of constant columns. For example, if a table has 10 columns, including 1 partitioning column (meaning 9 columns are defined in the header's Column Definition Record), and the header also contains 2 constant columns, then the presence map has 9 - 2 = 7 bits. The presence map's size in bytes is the number of bits divided by 8, rounded up to the nearest whole byte.

Each of the included columns is assigned a sequential index starting at zero, according to its relative order in the Column Definition Record. The bit for column i is in byte i / 8 (integer division), at bit position i % 8 (where bit 0 is the least significant bit of the byte). If a field is not null, the corresponding bit in the presence map is set. If a field is null, the corresponding bit is not set.
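
The presence map arithmetic can be sketched as follows. The function names are hypothetical, and None stands in for a null field.

```python
def presence_map_size(total_columns, constant_columns):
    # Number of bits, rounded up to a whole number of bytes.
    bits = total_columns - constant_columns
    return (bits + 7) // 8

def build_presence_map(values):
    """values: non-constant column values in definition order; None is null."""
    pmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            pmap[i // 8] |= 1 << (i % 8)  # bit 0 is the least significant
    return bytes(pmap)

def is_present(pmap, i):
    return bool(pmap[i // 8] & (1 << (i % 8)))
```

For the example in the text, presence_map_size(9, 2) gives 1 byte for the 7 bits.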

For each column that is marked as not null in the presence map, the data is encoded in the same relative order as in the Column Definition Record. The encoding mechanism is described in the Data Encoding section.

Each row record ends with an Adler-32 digest of the row (including the size, flags, and data, but excluding the digest itself).
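
A sketch of framing and checking a row record with Python's zlib.adler32. The names frame_row and verify_row are hypothetical. The sketch reads "all following data" in the size description as including the trailing 4-byte digest, and assumes the digest is stored as a 4-byte big-endian integer.

```python
import struct
import zlib

def frame_row(flags, presence_map, encoded_values):
    body = bytes([flags]) + presence_map + encoded_values
    # ASSUMPTION: the size field counts the 4-byte digest as well.
    head = struct.pack(">i", len(body) + 4) + body
    # The digest covers the size, flags, and data, but not itself.
    return head + struct.pack(">I", zlib.adler32(head))

def verify_row(record):
    (stored,) = struct.unpack(">I", record[-4:])
    return stored == zlib.adler32(record[:-4])

# A standalone row (flags 3) with one non-null int column holding 42.
row = frame_row(3, b"\x01", struct.pack(">i", 42))
```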

Data Encoding

The byte, char, double, float, short, int, and long primitives are written in network byte order (the default for Java ByteBuffers). Other values are encoded as follows:

  • Boolean - 1 Byte: 0 for false, 1 for true
  • EnhancedString - Integer length; followed by the String's bytes.
  • Blob - Integer length; followed by a stream of bytes. Blobs (Binary Large OBjects) are interpreted by the consumer. Generally, when both the logger and the listener are Java processes, Java serialization is used. The listener must be able to produce a Java object from the representation (at a minimum a byte array can be produced).
  • Enum - An integer value, referencing the Enum mapping table from the Column Definition header.
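
The encodings above might be sketched like this. The function names are hypothetical. The sketch assumes the length prefix counts bytes, and it leaves out char because the text does not state its width.

```python
import struct

# Big-endian struct formats for fixed-width primitives (char omitted).
PRIMITIVE_FORMATS = {
    "byte": ">b", "short": ">h", "int": ">i",
    "long": ">q", "float": ">f", "double": ">d",
}

def encode_primitive(kind, value):
    return struct.pack(PRIMITIVE_FORMATS[kind], value)

def encode_boolean(value):
    return b"\x01" if value else b"\x00"

def encode_blob(raw):
    # Integer length followed by the raw bytes.
    return struct.pack(">i", len(raw)) + raw

def encode_enhanced_string(text, charset="UTF-8"):
    # The charset is the one named in the column's metadata.
    return encode_blob(text.encode(charset))

def encode_enum(index):
    # An integer referencing the enum table in the column definition.
    return struct.pack(">i", index)
```

Note that the same string can have different lengths under the two supported encodings: a character such as "é" takes two bytes in UTF-8 and one in ISO-8859-1.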

Note

Null values may not be encoded in row data; instead, their presence map entry must be unset.

For constant columns, nulls may be required. In that case, byte, char, double, float, short, int, and long values use QueryConstants.NULL_BYTE, QueryConstants.NULL_CHAR, QueryConstants.NULL_DOUBLE, QueryConstants.NULL_FLOAT, QueryConstants.NULL_SHORT, QueryConstants.NULL_INT, and QueryConstants.NULL_LONG, respectively. (These QueryConstants are implementation-specific sentinel values. For example, QueryConstants.NULL_INT might be Integer.MIN_VALUE or another specific value such as -1; the implementation should define or reference the exact values.)

For the other types, null is encoded as follows:

  • Boolean - QueryConstants.NULL_BYTE
  • EnhancedString - The integer length field is set to QueryConstants.NULL_INT. No subsequent bytes for the string content are written.
  • Blob - The integer length field is set to QueryConstants.NULL_INT. No subsequent bytes for the blob content are written.
  • Enum - QueryConstants.NULL_INT

Non-Row Records

The format of all non-row records begins with a type and version. Additional data, if any, depends on the values of these fields.

  • int - Record Type
  • int - Version of the record type

Known Record Types

  • 20000 - Command Record

Command Record format

  • int - Command Id
  • buffer - Additional data for the Command Id

Known Command Ids

  • 10000 - DELETE_PARTITION

Additional Data by Command Id

  • DELETE_PARTITION
    • <none>
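
A sketch of the payload of a command record, meaning the bytes that follow the flags byte. The function names are hypothetical. The record version is left as a parameter because the text does not list valid versions, and the sketch covers neither the enclosing size and flags nor any trailing digest.

```python
import struct

COMMAND_RECORD = 20000
DELETE_PARTITION = 10000

def encode_command_payload(command_id, version, extra=b""):
    # Record type, record version, command id, then any additional data.
    return struct.pack(">iii", COMMAND_RECORD, version, command_id) + extra

def decode_non_row_payload(payload):
    rtype, version = struct.unpack_from(">ii", payload, 0)
    if rtype != COMMAND_RECORD:
        # Unknown record type: hand back the undecoded remainder.
        return {"type": rtype, "version": version, "raw": payload[8:]}
    (command_id,) = struct.unpack_from(">i", payload, 8)
    return {
        "type": rtype,
        "version": version,
        "command_id": command_id,
        "extra": payload[12:],
    }
```

For DELETE_PARTITION the extra buffer is empty, so the payload is exactly three integers (12 bytes).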