TailInitializationFilter

TailInitializationFilter reduces the input size for downstream operations by limiting initialization to only the most recent rows from each partition. This is particularly useful when working with large historical datasets where you're primarily interested in the tail of the data.

The filter is designed to work with add-only source tables where each contiguous range of row keys represents a partition. Each partition must be sorted by timestamp, with the most recent timestamp at the end.

Once initialized, the filter passes through all new rows. Rows that have already passed the filter are not removed or modified.
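
The initialize-then-pass-through behavior can be illustrated with a toy model in plain Python. This is only a sketch of the semantics described above, not the Deephaven implementation; the class `TailFilterModel` and its methods are hypothetical names introduced for illustration, and a row-count tail is assumed for simplicity.

```python
class TailFilterModel:
    """Toy model (hypothetical): filter once at initialization, then pass new rows through."""

    def __init__(self, initial_rows, row_count):
        # At initialization, keep only the last `row_count` rows of the input.
        start = max(len(initial_rows) - row_count, 0)
        self.rows = list(initial_rows[start:])

    def add(self, new_rows):
        # After initialization, every appended row passes through unfiltered,
        # and rows already in the result are never removed or modified.
        self.rows.extend(new_rows)


model = TailFilterModel([1, 2, 3, 4, 5], row_count=2)
rows_after_init = list(model.rows)   # only the tail survives initialization
model.add([6, 7, 8])
rows_after_add = list(model.rows)    # result now holds more than row_count rows
```

Note that in this model the result grows past `row_count` once new rows arrive, because the limit applies only to initialization.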

Syntax

TailInitializationFilter.mostRecent(table, timestampName, period)
TailInitializationFilter.mostRecent(table, timestampName, nanos)
TailInitializationFilter.mostRecentRows(table, rowCount)

Parameters

mostRecent (period)

| Parameter | Type | Description |
| --- | --- | --- |
| table | Table | The source table to filter. Must be add-only with partitions sorted by timestamp. |
| timestampName | String | The name of the timestamp column used to determine recency. |
| period | String | The time period string specifying how far back from the last row to include rows. The period is parsed using DateTimeUtils.parseDurationNanos(). Examples: "PT1H" (1 hour), "PT30M" (30 minutes), "PT10S" (10 seconds). |
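
To make the relationship between period strings and nanosecond intervals concrete, here is a minimal Python sketch that converts a small subset of ISO-8601 durations (whole hours, minutes, and seconds only) to nanoseconds. It is not DateTimeUtils.parseDurationNanos() and does not claim to match its full grammar; the helper name `duration_to_nanos` is hypothetical.

```python
import re

# Simplified pattern (an assumption): only "PT" followed by optional whole
# hours, minutes, and seconds, e.g. "PT1H", "PT30M", "PT1H30M10S".
_DURATION = re.compile(r"PT(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+)S)?")

NANOS_PER_SECOND = 1_000_000_000


def duration_to_nanos(period: str) -> int:
    """Convert a simple ISO-8601 duration string to nanoseconds."""
    match = _DURATION.fullmatch(period)
    if match is None or period == "PT":
        raise ValueError(f"unsupported duration: {period!r}")
    hours = int(match.group("h") or 0)
    minutes = int(match.group("m") or 0)
    seconds = int(match.group("s") or 0)
    return ((hours * 60 + minutes) * 60 + seconds) * NANOS_PER_SECOND


one_hour = duration_to_nanos("PT1H")
thirty_minutes = duration_to_nanos("PT30M")
ten_seconds = duration_to_nanos("PT10S")
```

Under this sketch, "PT10S" corresponds to 10,000,000,000 nanoseconds, which is the kind of value the nanosecond-based overload takes directly.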

mostRecent (nanos)

| Parameter | Type | Description |
| --- | --- | --- |
| table | Table | The source table to filter. Must be add-only with partitions sorted by timestamp. |
| timestampName | String | The name of the timestamp column used to determine recency. |
| nanos | long | The interval in nanoseconds between the last row in a partition and rows that match the filter. |

mostRecentRows

| Parameter | Type | Description |
| --- | --- | --- |
| table | Table | The source table to filter. Must be add-only. |
| rowCount | long | The number of rows to include per partition. |

Returns

A table containing only the most recent rows from each partition in the source table.

How it works

For each partition, the filter uses the last row's timestamp as the reference point. It subtracts the specified period from this timestamp and performs a binary search to identify rows within that time window.
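
The per-partition search described above can be sketched in plain Python with the standard `bisect` module. This is a model of the idea under stated assumptions, not the Deephaven implementation: timestamps are plain integers in nanoseconds, each partition is a sorted list, and a row exactly at the threshold is treated as inside the window (the boundary behavior is an assumption). The function names are hypothetical.

```python
from bisect import bisect_left

NANOS_PER_SECOND = 1_000_000_000


def tail_start_index(timestamps, window_nanos):
    """Index of the first row within `window_nanos` of the partition's last row."""
    if not timestamps:
        return 0
    # The last row's timestamp is the reference point for the whole partition.
    threshold = timestamps[-1] - window_nanos
    # Binary search is valid only because the partition is sorted by timestamp.
    return bisect_left(timestamps, threshold)


def most_recent(partitions, window_nanos):
    """Apply the tail search independently to each partition."""
    return [ts[tail_start_index(ts, window_nanos):] for ts in partitions]


S = NANOS_PER_SECOND
partition_a = [0 * S, 5 * S, 10 * S, 15 * S, 20 * S]
partition_b = [100 * S, 101 * S]
tails = most_recent([partition_a, partition_b], window_nanos=10 * S)
```

Each partition gets its own reference timestamp, so `partition_b` is kept whole even though its rows are far later than those in `partition_a`.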

The filter makes these assumptions:

  • The source table is add-only (no modifications, shifts, or removals).
  • Each contiguous range of row keys is a partition. If this assumption is violated, the TailInitializationFilter becomes inefficient because it must examine each range of row keys independently. Furthermore, because the timestamp of the last row in each range is used as that range's reference timestamp, fewer rows are likely to be filtered out before being passed to downstream operations. One common way to violate this assumption is to filter the table before the TailInitializationFilter is applied.
  • Each partition is sorted by timestamp.
  • Null timestamps are not permitted.
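
To see why filtering upstream undermines the contiguous-range assumption, the following Python sketch groups row keys into contiguous ranges before and after a filter. It only models row keys as integers; the helper `contiguous_ranges` and the example filter are hypothetical and do not reflect Deephaven internals.

```python
def contiguous_ranges(row_keys):
    """Group sorted integer row keys into inclusive (first, last) ranges."""
    ranges = []
    for key in row_keys:
        if ranges and key == ranges[-1][1] + 1:
            # Key extends the current range.
            ranges[-1] = (ranges[-1][0], key)
        else:
            # Gap in the keys: start a new range.
            ranges.append((key, key))
    return ranges


# One partition occupying row keys 0..7.
unfiltered_keys = list(range(8))
# Hypothetical upstream filter that happens to drop keys 0, 3, and 6.
filtered_keys = [k for k in unfiltered_keys if k % 3 != 0]

ranges_before = contiguous_ranges(unfiltered_keys)
ranges_after = contiguous_ranges(filtered_keys)
```

In this model the single range becomes three, each of which would be treated as its own partition with its own reference timestamp, so more work is done and fewer rows are excluded.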

Warning

If you suspect you might not have full access to a given table, consider comparing initialization time with and without the TailInitializationFilter. Implicit ACL application may reduce the operation's effectiveness.

Examples

Filter by time period

This example uses a time table and keeps only rows whose timestamp is within 10 seconds of the most recent row in the table.

Filter by time in nanoseconds

This example keeps rows from the last 5 seconds (5,000,000,000 nanoseconds).

Filter by row count

The mostRecentRows() method keeps a specified number of rows from the end of each partition.
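
As a rough model of row-count filtering in plain Python (not the Deephaven API), each partition can be represented as a list and trimmed to its last `row_count` entries. The function name `most_recent_rows` is a hypothetical stand-in, and partitions shorter than `row_count` are assumed to be kept whole.

```python
def most_recent_rows(partitions, row_count):
    """Keep at most the last `row_count` rows of each partition."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    # max(..., 0) keeps short partitions whole and avoids the rows[-0:] pitfall,
    # where a negative-zero slice would return the entire list.
    return [rows[max(len(rows) - row_count, 0):] for rows in partitions]


partitions = [[1, 2, 3, 4], [10]]
last_two = most_recent_rows(partitions, 2)
none_kept = most_recent_rows(partitions, 0)
```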