TailInitializationFilter
TailInitializationFilter reduces the input size for downstream operations by limiting initialization to only the most recent rows from each partition. This is particularly useful when working with large historical datasets where you're primarily interested in the tail of the data.
The filter is designed to work with add-only source tables where each contiguous range of row keys represents a partition. Each partition must be sorted by timestamp, with the most recent timestamp at the end.
Once initialized, the filter passes through all new rows. Rows that have already passed the filter are not removed or modified.
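The initialize-then-pass-through behavior can be modeled in plain Python. This is a minimal sketch for a single partition, not the library's implementation: the class name `TailFilterModel`, its `append` method, and the inclusive window boundary are all assumptions made for illustration.

```python
from bisect import bisect_left


class TailFilterModel:
    """Hypothetical single-partition model of the filter's lifecycle."""

    def __init__(self, initial_timestamps, window_nanos):
        # Initialization: keep only the tail of the existing (sorted) rows.
        if initial_timestamps:
            threshold = initial_timestamps[-1] - window_nanos
            start = bisect_left(initial_timestamps, threshold)
        else:
            start = 0
        self.rows = list(initial_timestamps[start:])

    def append(self, new_timestamps):
        # After initialization every new row passes through unfiltered,
        # and rows already kept are never removed or modified.
        self.rows.extend(new_timestamps)


model = TailFilterModel([10, 20, 30, 40], window_nanos=15)
model.append([1000, 1010])
print(model.rows)
```

Note that the rows kept at initialization stay in the result even after much newer rows arrive; the window is applied only once.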
Syntax
Parameters
mostRecent (period)
| Parameter | Type | Description |
|---|---|---|
| table | Table | The source table to filter. Must be add-only with partitions sorted by timestamp. |
| timestampName | String | The name of the timestamp column used to determine recency. |
| period | String | A time period string specifying how far back from the last row of each partition to include rows. |
mostRecent (nanos)
| Parameter | Type | Description |
|---|---|---|
| table | Table | The source table to filter. Must be add-only with partitions sorted by timestamp. |
| timestampName | String | The name of the timestamp column used to determine recency. |
| nanos | long | The interval in nanoseconds between the last row in a partition and rows that match the filter. |
mostRecentRows
| Parameter | Type | Description |
|---|---|---|
| table | Table | The source table to filter. Must be add-only. |
| rowCount | long | The number of rows to include per partition. |
Returns
A table containing only the most recent values from each partition in the source table.
How it works
For each partition, the filter uses the last row's timestamp as the reference point. It subtracts the specified period from this timestamp and performs a binary search to identify rows within that time window.
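The lookup described above can be sketched in plain Python with the standard `bisect` module. This is an illustration under stated assumptions rather than the filter's actual code: timestamps are modeled as integer nanoseconds, the helper name `tail_start` is hypothetical, and rows exactly on the threshold are assumed to be included.

```python
from bisect import bisect_left


def tail_start(timestamps, window_nanos):
    """Return the index of the first row inside the tail window.

    `timestamps` must be sorted ascending, as the filter requires
    of each partition.
    """
    if not timestamps:
        return 0
    # The last row's timestamp is the reference point.
    threshold = timestamps[-1] - window_nanos
    # Binary search for the first timestamp >= threshold.
    return bisect_left(timestamps, threshold)


partition = [100, 200, 300, 400, 500]
start = tail_start(partition, window_nanos=150)
print(partition[start:])
```

Because each partition is sorted, the search costs O(log n) per partition instead of a scan over every historical row, which is where the initialization savings come from.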
The filter makes these assumptions:
- The source table is add-only (no modifications, shifts, or removals).
- Each contiguous range of row keys is a partition. If this assumption is violated, the TailInitializationFilter itself is inefficient, as it must independently examine each range of row keys. Furthermore, because the end of each range is used as the timestamp, fewer rows are likely to be filtered before being passed to downstream operations. One common way to violate this assumption is to filter the table before the TailInitializationFilter is applied.
- Each partition is sorted by timestamp.
- Null timestamps are not permitted.
Warning
If you suspect you might not have full access to a given table, consider comparing initialization time with and without the TailInitializationFilter. Implicit ACL application may reduce the operation's effectiveness.
Examples
Filter by time period
This example uses a time table and filters to show only rows from the last 10 seconds:
This filters to show only rows where the timestamp is within 10 seconds of the most recent row in the table.
Filter by time in nanoseconds
This example filters to show rows from the last 5 seconds (5 billion nanoseconds):
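The nanosecond argument can be derived from a duration rather than typed by hand. A minimal sketch in plain Python, independent of the filter itself; the variable names are illustrative:

```python
from datetime import timedelta

# Express the window as a duration, then convert it to whole nanoseconds.
window = timedelta(seconds=5)
# timedelta // timedelta yields an exact integer count of microseconds.
nanos = (window // timedelta(microseconds=1)) * 1_000
print(nanos)
```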
Filter by row count
The mostRecentRows() method filters to show a specified number of rows from the end of each partition:
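As a plain-Python illustration of the row-count rule (not the library call itself), the sketch below models each partition as an inclusive contiguous range of row keys and keeps the last `row_count` keys of each. The function name `tail_rows` and the tuple representation are assumptions made for this example.

```python
def tail_rows(ranges, row_count):
    """Keep at most `row_count` trailing row keys from each contiguous range.

    `ranges` is a list of (first_key, last_key) tuples, both inclusive.
    """
    kept = []
    if row_count <= 0:
        return kept
    for first, last in ranges:
        # Never start before the beginning of the partition.
        start = max(first, last - row_count + 1)
        kept.append((start, last))
    return kept


# Two partitions: one with 10 rows, one with only 3.
print(tail_rows([(0, 9), (100, 102)], row_count=5))
```

A partition shorter than the requested count is returned whole, and no timestamp column is needed because only row positions matter.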