Version: Python

Kafka basic terminology

This guide defines the basic Kafka terms and concepts needed to understand how Kafka and Deephaven work together. If you're interested in higher-level discussions of the power of integrating Kafka with Deephaven, see our guide, Kafka introduction.

Kafka is a distributed event streaming platform that lets you read, write, store, and process events (also called records). In the sections that follow, we identify the Kafka fields you'll need to set as you write queries using the consume method.

Topics

Individual Kafka feeds are called "topics" and are identified by a topic name. Messages in a topic typically follow a uniform schema, but they don’t have to (more on this below).

  • Producers write data to topics.
  • Consumers read data from topics.
  • Within a partition, data is always read in sequential order.
  • Every record received from Kafka contains exactly five fields:
    • partition
    • offset
    • timestamp
    • key
    • value

Partition

A partition is a non-negative integer used as a way to "shard" (or partition) topics. Subscribers can opt to listen to only a subset of messages from a topic by selecting individual partitions.

A topic may have a single partition or many. The partition for a message is selected by the publisher when the data is written to the Kafka stream.

When choosing a partition, it is important to consider how to balance the load so that the producer and consumer can be scaled easily:

  1. Consistent hashing: the partition may be derived from a hash of the key at publishing time. For example, if the key is a string security symbol (e.g., "MSFT"), the partition could be calculated from the first letter of the symbol: mapping "A" through "Z" to partitions 0 through 25 assigns "M", and therefore "MSFT", to partition 12.
  2. Randomly: the partition may be assigned at random, in order to balance the load most efficiently.
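The two strategies above can be sketched in plain Python (no Kafka client involved). The helper names and the 26-partition count are illustrative, not part of any Kafka API:

```python
import hashlib
import random

NUM_PARTITIONS = 26  # hypothetical: one partition per letter A-Z


def partition_by_key(key: str) -> int:
    """Consistent hashing: the same key always lands on the same partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def partition_by_first_letter(key: str) -> int:
    """The 'MSFT' example: map the first letter A-Z to partitions 0-25."""
    return ord(key[0].upper()) - ord("A")


def partition_randomly() -> int:
    """Random assignment: spreads load but gives up per-key ordering."""
    return random.randrange(NUM_PARTITIONS)


print(partition_by_first_letter("MSFT"))  # "M" -> partition 12
print(partition_by_key("MSFT") == partition_by_key("MSFT"))  # stable: True
```

Consistent hashing keeps all messages for one key on one partition (preserving their relative order); random assignment balances load at the cost of that guarantee.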

Kafka guarantees stable ordering of messages in the same partition. Stable ordering per partition means that the first messages written to a partition are the first messages read from that partition as well. Consider the following example:

  1. Message A is written to topic test partition 0.
  2. Message B is written to topic test partition 1.
  3. Message C is written to topic test partition 0.

When reading topic test partition 0, message A comes before C, and consumers can take advantage of that knowledge.

This matters for consumers that must preserve ordering when processing messages, such as those handling stock market prices or order updates.
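The per-partition ordering guarantee can be sketched with a toy in-memory log (plain Python, not a Kafka client; the data structures here are illustrative):

```python
from collections import defaultdict

# partition number -> ordered list of messages for a single topic
topic = defaultdict(list)


def produce(partition: int, message: str) -> int:
    """Append a message to a partition and return its offset."""
    topic[partition].append(message)
    return len(topic[partition]) - 1


produce(0, "A")  # message A -> partition 0
produce(1, "B")  # message B -> partition 1
produce(0, "C")  # message C -> partition 0

# Reading partition 0 always yields A before C. Ordering BETWEEN
# partitions 0 and 1 is not guaranteed by Kafka.
print(topic[0])  # ['A', 'C']
```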

Offset

An offset is an integer, starting at zero and increasing by one with each new message, that uniquely identifies every message ever produced in a given partition.

When Kafka consumers subscribe to a topic, they specify the offset at which to start listening.

A consumer can start at:

  • a defined offset value.
  • the oldest available value.
  • the current offset, in which case the consumer will receive only newly-produced messages.
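The three starting positions above can be sketched against a toy in-memory partition log (plain Python, not a Kafka client):

```python
# One partition's log: the message at index i has offset i.
log = ["msg-0", "msg-1", "msg-2", "msg-3"]


def consume_from(offset: int) -> list:
    """Read every message at or after the given offset."""
    return log[offset:]


earliest = consume_from(0)        # the oldest available value
specific = consume_from(2)        # a defined offset value
latest = consume_from(len(log))   # current offset: only newly-produced messages

print(earliest)  # all four messages
print(specific)  # ['msg-2', 'msg-3']
print(latest)    # [] until something new is produced
```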

Timestamp

A timestamp value is inserted by the publisher at publication time (i.e., when the message is written to the topic).

Key

  • Keys are a sequence of bytes, without a predefined length, that are used as identifiers. A key can be any type of byte payload. For example:

    • a string, such as "MSFT".
    • a complex JSON string with multiple fields.
    • a binary encoded double-precision floating-point number in IEEE 754 representation (8 bytes).
    • a binary encoded 32-bit integer (4 bytes).
  • Kafka may need to hash the key to produce a partition value. When computing the hash, the key is treated as an opaque sequence of bytes.

  • A key can be empty, in which case partition assignment is effectively random.

  • A key can comprise multiple separate parts (composite).

  • Keys that are identical are written to the same partition.
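The key encodings listed above can be produced with Python's standard library; `struct` handles the binary packings (big-endian formats shown here, chosen for illustration):

```python
import json
import struct

string_key = "MSFT".encode("utf-8")                       # simple string key
json_key = json.dumps({"sym": "MSFT", "ex": "NASDAQ"}).encode("utf-8")
double_key = struct.pack(">d", 3.14)                      # IEEE 754 double, 8 bytes
int_key = struct.pack(">i", 42)                           # 32-bit integer, 4 bytes

# Kafka treats all of these as opaque byte sequences when hashing.
print(len(double_key), len(int_key))  # 8 4
```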

Value

Values are a sequence of bytes, without a predefined length. Keys map to values.

  • Values can be any type of byte payload. These can be, for example:

    • a simple string, such as "Deephaven".
    • a complex JSON string with multiple fields.
    • a binary encoded double-precision floating-point number in IEEE 754 representation (8 bytes).
    • a binary encoded 32-bit integer (4 bytes).
  • Values may be empty.

  • Most commonly, to represent multidimensional data associated with a key, the value will be a composite that represents multiple fields. For example, for a topic of weatherReports we might have:

    • Key set to the location, such as "New York, NY, USA".
    • Value set to contain the temperature, humidity, and cloud cover (such as 68º F, 80%, and Overcast).
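The hypothetical weatherReports example above, expressed as key/value byte payloads (a string key and a composite JSON value; field names are illustrative):

```python
import json

key = "New York, NY, USA".encode("utf-8")
value = json.dumps({
    "temperature_f": 68,
    "humidity_pct": 80,
    "cloud_cover": "Overcast",
}).encode("utf-8")

# A consumer decodes the value bytes back into structured data.
report = json.loads(value.decode("utf-8"))
print(report["temperature_f"])  # 68
```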

Format

Given a particular topic, it is important for producers (data writers) and consumers (data readers) to agree on a format for what is being published (written) and subscribed (read). Commonly used formats are:

  • JSON - ASCII, human-readable and simple to handle.
  • Avro - Binary, more efficient, and more involved to handle.

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is text-based, both human- and machine-readable, and widely supported across many environments.

The Apache Kafka consumer library provides ways to encode keys and values with JSON as basic types like string, short, and integer, as well as ways to encode more complex multi-field or structured types.
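Both basic-type and structured payloads round-trip through JSON the same way. A minimal sketch using Python's standard `json` module (the serializer names below are illustrative, not the Apache Kafka library's classes):

```python
import json


def serialize(obj) -> bytes:
    """Encode any JSON-compatible Python object as UTF-8 bytes."""
    return json.dumps(obj).encode("utf-8")


def deserialize(data: bytes):
    """Decode UTF-8 JSON bytes back into a Python object."""
    return json.loads(data.decode("utf-8"))


# Basic types and a structured multi-field type all round-trip intact.
for payload in ["MSFT", 42, 3.14, {"sym": "MSFT", "px": 415.1, "qty": 100}]:
    assert deserialize(serialize(payload)) == payload
```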

Avro

Avro is a data serialization system with a compact binary format that is well suited to high-volume usage. It is not human-readable, and separate libraries are required to encode and decode it. The Apache Kafka consumer library provides ways to encode keys and values as basic types like int, double, and string with Avro, as well as ways to encode more complex multi-field or structured types.
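Avro records are described by a schema, which is itself declared in JSON. As a sketch, a schema for the weatherReports value discussed earlier might look like the following (record and field names are hypothetical):

```json
{
  "type": "record",
  "name": "WeatherReport",
  "namespace": "com.example",
  "fields": [
    {"name": "temperature_f", "type": "int"},
    {"name": "humidity_pct", "type": "int"},
    {"name": "cloud_cover", "type": "string"}
  ]
}
```

Producers and consumers share this schema (often via a schema registry), which is what lets the compact binary payload be decoded without embedding field names in every message.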