Intraday Binary Log format
The binary store format is a row-oriented log file for consumption by the data import server (DIS). All values are stored in network byte order (the default for Java byte buffers). Unless another size is specified, integers are 4 bytes.
Header
- The file begins with a 4-byte magic number, 0xDB1AA1DB. The original format stored the number of columns as a big-endian integer in the first value, so the magic number has its high bit set to distinguish it from a column count.
- The next four bytes are the version number. The current version of the file format is 3.
- The next four bytes are the remaining header size. This size excludes the magic number, version, and size field itself, but includes the header data and the digest.
- The header contains records. Each type of record may occur only once. Some types of records are unconditionally required, including the Column Definition Record. A record begins with an integer containing the size of the record (excluding the size and type) and another integer containing the type of record.
- The header ends with an Adler-32 digest of the header (excluding the digest itself).
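The header layout above can be sketched as a small reader. This is an illustrative sketch only: the class and method names (`HeaderSketch`, `read`, `adler32`) are hypothetical, and it assumes the digest covers only the record bytes that follow the size field, which the description above does not state explicitly.

```java
import java.nio.ByteBuffer;
import java.util.zip.Adler32;

// Hypothetical helper, not part of any published API.
final class HeaderSketch {
    static final int MAGIC = 0xDB1AA1DB; // negative as a signed int: high bit set

    static final class Header {
        final int version;
        final byte[] records;   // raw header records, not yet parsed
        final int storedDigest; // Adler-32 value read from the file

        Header(int version, byte[] records, int storedDigest) {
            this.version = version;
            this.records = records;
            this.storedDigest = storedDigest;
        }
    }

    // Reads magic, version, remaining size, record bytes, and digest.
    // ByteBuffer is big-endian (network byte order) by default.
    static Header read(ByteBuffer buf) {
        int magic = buf.getInt();
        if (magic != MAGIC) {
            throw new IllegalArgumentException("not a binary log file");
        }
        int version = buf.getInt();
        int remaining = buf.getInt(); // header records plus 4-byte digest
        if (remaining < 4) {
            throw new IllegalArgumentException("header too short");
        }
        byte[] records = new byte[remaining - 4];
        buf.get(records);
        int digest = buf.getInt();
        return new Header(version, records, digest);
    }

    // ASSUMPTION: the digest is computed over the record bytes only.
    static int adler32(byte[] bytes) {
        Adler32 adler = new Adler32();
        adler.update(bytes, 0, bytes.length);
        return (int) adler.getValue();
    }
}
```

A caller would compare `adler32(header.records)` with `header.storedDigest` before parsing the individual records, adjusting the digested range if the real format covers more of the header.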
Column Definition Record
A column definition record describes the columns stored in this file. The type of a column definition record is "1". The column definition record begins with an integer value containing the number of columns in this record (and therefore in the file).
For each column, we write the following:
- Name of the column as a UTF-8 string (encoded as described in the Data Encoding section below: the length of the name as an integer, followed by the bytes of the name).
- The type of the column as an integer:
  - Boolean - 1
  - Byte - 2
  - Char - 3
  - Short - 4
  - Int - 5
  - Long - 6
  - Float - 7
  - Double - 8
  - Blob - 9
  - EnhancedString - 10
  - Enum - 11
- An integer containing the size of this type's metadata.
- The type-specific metadata. No metadata is required for primitive types (char, double, float, int, long, short) or blobs. EnhancedString and Enum require the following metadata:
  - EnhancedString: The EncodingInfo used for this column, represented as the string name from the enum. Currently supported values are "ISO-8859-1" and "UTF-8". This string is also prefixed by an int size, and the four bytes of that size are included in the metadata size from the previous bullet. For example, if UTF-8 is used, the metadata size is 9, and the metadata consists of a four-byte int value of 5 (the size of the string "UTF-8") followed by the string itself.
  - Enum: An integer count of string values, followed by a string containing the encoding (integer length, followed by UTF-8 bytes), followed by count strings represented as described for EnhancedStrings in the Data Encoding section.
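The EnhancedString metadata arithmetic (a metadata size of 9 for "UTF-8") can be checked with a short sketch. The class and method names here are hypothetical; only the byte layout comes from the description above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the metadata layout for an EnhancedString column.
final class ColumnMetadataSketch {
    // Integer length followed by the UTF-8 bytes of the string.
    static byte[] lengthPrefixed(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + bytes.length)
                .putInt(bytes.length)
                .put(bytes)
                .array();
    }

    // Metadata-size int followed by the metadata itself (the length-prefixed encoding name).
    static byte[] enhancedStringMetadata(String encodingName) {
        byte[] meta = lengthPrefixed(encodingName);
        return ByteBuffer.allocate(4 + meta.length)
                .putInt(meta.length) // 4 (length prefix) + 5 ("UTF-8") = 9
                .put(meta)
                .array();
    }
}
```

For "UTF-8" this produces 13 bytes in total: the int 9, the int 5, and the five bytes of the name.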
Constant Columns Record
A constant columns record contains values for columns that are not stored later in row records but take on the same value for every row in the file. The type of a constant columns record is "2". The constant columns record begins with the number of constant columns. For each constant column, we write the following:
- String name (integer length followed by UTF-8 bytes)
- The constant value, stored according to the data encoding rules in the Log Data section.
Application Version Number
An application version number record indicates which application-level protocol is used for this log file. The type of an application version number record is "3". The application version number record contains a single integer. When not present, the application version number is treated as if it were "0" and the most recent application listener will be used to read the file. For any other application version number, the listener must match the stored version.
Log Data
The bulk of the binary store file is row-oriented log data. The log data consists of records, each of which begins with an integer size (which excludes the size itself but includes the flags and all following data). The next byte is a flags byte. The low-order bits of the flags byte mark sets of records that should be applied as a unit:
- The lowest-order bit (1) indicates the start of a set of records.
- The second-lowest-order bit (2) indicates the end of a set of records.
- A standalone record has both flags set (3).
- Non-row records have the third bit set (4). Each non-row record contains an integer type followed by data for that type of record.
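The flag bits can be tested with simple masks. The following is a minimal sketch; the class and constant names are hypothetical, and only the bit values (1, 2, 4) come from the description above.

```java
// Hypothetical helper for interpreting the flags byte of a log record.
final class RecordFlags {
    static final int START_OF_SET = 1; // lowest-order bit
    static final int END_OF_SET = 2;   // second-lowest-order bit
    static final int NON_ROW = 4;      // third bit

    static boolean startsSet(byte flags) {
        return (flags & START_OF_SET) != 0;
    }

    static boolean endsSet(byte flags) {
        return (flags & END_OF_SET) != 0;
    }

    // A standalone record both starts and ends its own set.
    static boolean isStandalone(byte flags) {
        return startsSet(flags) && endsSet(flags);
    }

    static boolean isNonRow(byte flags) {
        return (flags & NON_ROW) != 0;
    }
}
```

A reader applying records as a unit could buffer records from one with `startsSet` until one with `endsSet` arrives, then apply the whole set.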
Row records
The row record begins with a presence map. The presence map is a bitmap, represented as a series of bytes. The number of bits in the presence map is equal to the total number of column definitions in the header less the number of constant columns. For example, if a table has 10 columns, including 1 partitioning column, then the header will contain 9 column definitions. If the header also contains 2 constant value columns, then there will be 7 bits in the presence map. The presence map's size in bytes is equal to the number of bits divided by 8, rounded up to the nearest byte. Each of the included columns is assigned a sequential index starting at zero, according to its relative order in the Column Definition Record. The bit in the presence map for column i is contained within byte i / 8 (integer division), and the corresponding bit is i % 8. If a field is not null, then the corresponding bit in the presence map is set. If a field is null, then the corresponding bit in the presence map is not set.
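The presence map arithmetic can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that bit 0 of each byte is the least significant bit, which the text above does not state explicitly.

```java
// Hypothetical helper for presence map indexing.
final class PresenceMapSketch {
    // Number of bytes needed for the given number of non-constant columns.
    static int sizeInBytes(int columnCount) {
        return (columnCount + 7) / 8; // divide by 8, rounding up
    }

    // ASSUMPTION: bit (i % 8) is counted from the least significant bit.
    static boolean isPresent(byte[] map, int i) {
        return (map[i / 8] & (1 << (i % 8))) != 0;
    }

    // Marks column i as non-null.
    static void setPresent(byte[] map, int i) {
        map[i / 8] |= (byte) (1 << (i % 8));
    }
}
```

With the example above (9 column definitions, 2 of them constant), `sizeInBytes(7)` gives a one-byte presence map.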
For each column that is marked as not null in the presence map, the data is encoded in the same relative order as in the Column Definition Record. The encoding mechanism is described in the Data Encoding section.
Each row record ends with an Adler-32 digest of the row (including the size, flags, and data, but excluding the digest itself).
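Framing and checking a row record might look like the sketch below. The names are hypothetical, and the sketch assumes that the leading size field counts the trailing 4-byte digest, since the size is described as including "all following data".

```java
import java.nio.ByteBuffer;
import java.util.zip.Adler32;

// Hypothetical helper for framing a row record and checking its digest.
final class RowFrameSketch {
    // body = presence map followed by the encoded column values.
    static byte[] frame(byte flags, byte[] body) {
        int size = 1 + body.length + 4; // flags + body + digest (ASSUMPTION: digest is counted)
        ByteBuffer buf = ByteBuffer.allocate(4 + size);
        buf.putInt(size).put(flags).put(body);
        Adler32 adler = new Adler32();
        adler.update(buf.array(), 0, buf.position()); // size, flags, and data
        buf.putInt((int) adler.getValue());
        return buf.array();
    }

    // Recomputes the digest over everything except the last four bytes.
    static boolean verify(byte[] record) {
        int digestOffset = record.length - 4;
        Adler32 adler = new Adler32();
        adler.update(record, 0, digestOffset);
        int stored = ByteBuffer.wrap(record).getInt(digestOffset);
        return stored == (int) adler.getValue();
    }
}
```

A reader would typically call `verify` before decoding the presence map, and treat a mismatch as a corrupt or truncated record.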
Data Encoding
The byte, char, double, float, short, int, and long primitives are written in network byte order (the default for Java ByteBuffers). Other values are encoded as follows:
- Boolean - 1 byte: 0 for false, 1 for true.
- EnhancedString - Integer length, followed by the String's bytes.
- Blob - Integer length, followed by a stream of bytes. Blobs (Binary Large OBjects) are interpreted by the consumer. Generally, when both the logger and the listener are Java processes, Java serialization is used. The listener must be able to produce a Java object from the representation (at a minimum, a byte array can be produced).
- Enum - An integer value, referencing the Enum mapping table from the Column Definition header.
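The non-primitive encodings above can be read back with a few relative `ByteBuffer` reads. This is a sketch with hypothetical names; it assumes the integer length is a byte count and that the caller supplies the charset from the column's metadata. Null handling is omitted because nulls do not appear in row data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical reader for non-null, non-primitive values in a row record.
final class ValueReaderSketch {
    // 1 byte: 0 for false, 1 for true.
    static boolean readBoolean(ByteBuffer buf) {
        return buf.get() != 0;
    }

    // Integer length followed by the string's bytes in the column's encoding.
    static String readEnhancedString(ByteBuffer buf, Charset charset) {
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return new String(bytes, charset);
    }

    // Integer length followed by opaque bytes; interpretation is left to the consumer.
    static byte[] readBlob(ByteBuffer buf) {
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return bytes;
    }

    // Enum values are an int index into the mapping table from the column definition.
    static String readEnum(ByteBuffer buf, String[] mapping) {
        return mapping[buf.getInt()];
    }
}
```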
Note
Null values may not be encoded in row data; their presence map entry must be unset.
For constant columns, nulls may be required. In that case, the byte, char, double, float, short, int, and long types use the QueryConstants.NULL_BYTE, QueryConstants.NULL_CHAR, QueryConstants.NULL_DOUBLE, QueryConstants.NULL_FLOAT, QueryConstants.NULL_SHORT, QueryConstants.NULL_INT, and QueryConstants.NULL_LONG values, respectively.
For complex types, NULL is encoded as:
- Boolean - QueryConstants.NULL_BYTE
- EnhancedString - Length of QueryConstants.NULL_INT
- Blob - Length of QueryConstants.NULL_INT
- Enum - QueryConstants.NULL_INT
Non-Row Records
The format of all non-row records begins with a type and version. Additional data, if any, depends on the values of these fields.
- int - Record Type
- int - Version of the record type
Known Record Types
- 20000 - Command Record
Command Record format
- int - Command Id
- buffer - Additional data for the Command Id
Known Command Ids
- 10000 - DELETE_PARTITION
Additional Data by Command Id
DELETE_PARTITION
<none>
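Dispatching on a non-row record could be sketched as below. The class and method names are hypothetical; only the numeric codes (20000 and 10000) and the field order come from the lists above. The buffer is assumed to be positioned just after the flags byte.

```java
import java.nio.ByteBuffer;

// Hypothetical dispatcher for the payload of a non-row record.
final class NonRowRecordSketch {
    static final int COMMAND_RECORD = 20000;
    static final int DELETE_PARTITION = 10000;

    // Returns a human-readable description of the record.
    static String describe(ByteBuffer payload) {
        int type = payload.getInt();    // record type
        int version = payload.getInt(); // version of the record type
        if (type != COMMAND_RECORD) {
            return "unknown record type " + type;
        }
        int commandId = payload.getInt();
        if (commandId == DELETE_PARTITION) {
            // DELETE_PARTITION carries no additional data.
            return "DELETE_PARTITION (record version " + version + ")";
        }
        return "unknown command " + commandId;
    }
}
```

Returning a description rather than throwing on unknown codes is a choice made for this sketch; a real reader may need to reject or skip records it does not recognize.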