Table iterators
This guide will show you how to iterate over table data in Python queries. Deephaven offers several built-in methods on tables to efficiently iterate over table data via native Python objects. These methods return generators, which are efficient for iterating over large data sets, as they minimize copies of data and only load data into memory when needed. Additionally, these methods handle locking to ensure that all data from an iteration is from a consistent table snapshot.
Native methods
Deephaven offers the following table methods to iterate over table data:
- iter_dict
- iter_tuple
- iter_chunk_dict
- iter_chunk_tuple
Each of these methods returns a generator that yields Python data structures containing table data. Generators minimize copies of data and load data into memory only when it is needed. A generator can be consumed only once; to iterate over the table again, call the method again to get a new generator.
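The single-pass behavior is a property of Python generators in general. The sketch below illustrates it with a hypothetical stand-in generator, `row_stream`, used in place of a real table iterator so that it runs without a Deephaven server:

```python
def row_stream():
    """Hypothetical stand-in for a table iterator; yields one dict per row."""
    for i in range(3):
        yield {"X": i, "Y": i % 2 == 0}


rows = row_stream()
first_pass = [row["X"] for row in rows]   # consumes the generator
second_pass = [row["X"] for row in rows]  # nothing left to yield

print(first_pass)
print(second_pass)

# Calling the function again creates a fresh generator.
third_pass = [row["X"] for row in row_stream()]
print(third_pass)
```

The second pass produces no rows because the generator is already exhausted. With a table, the equivalent fix is to call the iteration method again rather than reuse the old generator.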
One row at a time
iter_dict and iter_tuple each return a generator that yields one row at a time. iter_dict yields a dictionary with column names as keys, while iter_tuple yields a named tuple with column names as attributes.
The following example iterates over a table one row at a time and prints its values:
from deephaven import empty_table

source = empty_table(10).update(["X = randomInt(0, 10)", "Y = randomBool()"])

# Each row is a dictionary keyed by column name
for row in source.iter_dict():
    x = row["X"]
    y = row["Y"]
    print(f"X: {x}\tY: {y}")

# Each row is a named tuple with one attribute per column
for row in source.iter_tuple():
    x = row.X
    y = row.Y
    print(f"X: {x}\tY: {y}")

# Named tuples can also be unpacked positionally
for x, y in source.iter_tuple():
    print(f"X: {x}\tY: {y}")
One chunk of rows at a time
iter_chunk_dict and iter_chunk_tuple each return a generator that yields one chunk of rows at a time. The chunk size is set with the chunk_size argument. iter_chunk_dict yields a dictionary with column names as keys, while iter_chunk_tuple yields a named tuple with column names as attributes. In both cases, each value holds that column's data for the whole chunk rather than a single row's value.
The following example iterates over a table one chunk of rows at a time and prints its values:
from deephaven import empty_table

source = empty_table(10).update(["X = randomInt(0, 10)", "Y = randomBool()"])

# Each chunk is a dictionary keyed by column name
for chunk in source.iter_chunk_dict(chunk_size=5):
    x = chunk["X"]
    y = chunk["Y"]
    print(f"X: {x}\tY: {y}")

# Each chunk is a named tuple with one attribute per column
for chunk in source.iter_chunk_tuple(chunk_size=5):
    x = chunk.X
    y = chunk.Y
    print(f"X: {x}\tY: {y}")

# Chunks can also be unpacked positionally
for x, y in source.iter_chunk_tuple(chunk_size=5):
    print(f"X: {x}\tY: {y}")
If the chunk size is not specified, the default is 2048 rows. The following example does not specify a chunk size. The table it iterates over is only 6 rows long, so there is only one chunk.
from deephaven import empty_table

source = empty_table(6).update(["X = i", "Y = randomBool()", "Z = ii"])

for x, z in source.iter_chunk_tuple(["X", "Z"]):
    print(f"X: {x}\tZ: {z}")
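The relationship between table size, chunk size, and the number of chunks is simple ceiling division. The helper below, `expected_chunks`, is a hypothetical name used for illustration only; it is plain arithmetic and not part of the Deephaven API:

```python
import math


def expected_chunks(n_rows: int, chunk_size: int = 2048) -> int:
    """Number of chunks needed to cover n_rows (plain arithmetic, not a Deephaven call)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(n_rows / chunk_size)


print(expected_chunks(6))      # 6 rows, default chunk size
print(expected_chunks(10, 5))  # 10 rows, chunk_size=5
print(expected_chunks(5000))   # 5000 rows, default chunk size
```

This matches the examples above: the 6-row table fits in one default-sized chunk, while the 10-row table with a chunk size of 5 is split into two. The final chunk may hold fewer rows than the chunk size.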
Omit columns
All four methods let you iterate over a subset of a table's columns. The following example iterates over only the X and Z columns in the source table:
from deephaven import empty_table

source = empty_table(10).update(
    ["X = randomInt(0, 10)", "Y = randomBool()", "Z = randomDouble(0, 100)"]
)

for row in source.iter_dict(["X", "Z"]):
    x = row["X"]
    z = row["Z"]
    print(f"X: {x}\tZ: {z}")

for row in source.iter_tuple(["X", "Z"]):
    x = row.X
    z = row.Z
    print(f"X: {x}\tZ: {z}")

for x, z in source.iter_tuple(["X", "Z"]):
    print(f"X: {x}\tZ: {z}")

for chunk in source.iter_chunk_dict(["X", "Z"], chunk_size=5):
    x = chunk["X"]
    z = chunk["Z"]
    print(f"X: {x}\tZ: {z}")

for chunk in source.iter_chunk_tuple(["X", "Z"], chunk_size=5):
    x = chunk.X
    z = chunk.Z
    print(f"X: {x}\tZ: {z}")

for x, z in source.iter_chunk_tuple(["X", "Z"], chunk_size=5):
    print(f"X: {x}\tZ: {z}")
Schema ordering
Iteration code tolerates changes in column order when it unpacks values by name inside the loop, as follows:
from deephaven import empty_table

source = empty_table(5).update(["X = i", "Y = ii"])

for row in source.iter_tuple():
    x = row.X
    y = row.Y
    print(x, y)
However, unpacking the values in the for statement itself is not tolerant of schema ordering changes:
from deephaven import empty_table

source = empty_table(5).update(["X = i", "Y = ii"])

for x, y in source.iter_tuple():
    print(x, y)
When unpacking in the for statement, there are two ways to keep iteration tolerant of schema ordering changes:
- Use view to limit the table to the desired columns before iteration.
- Specify columns in the iteration call, as shown in the previous examples.
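The hazard with positional unpacking can be reproduced with plain named tuples. In the sketch below, `RowBefore` and `RowAfter` are hypothetical types standing in for what a tuple iterator might yield before and after the column order changes; no Deephaven calls are involved:

```python
from collections import namedtuple

# Hypothetical row types standing in for what a tuple iterator might yield
# before and after a schema change that swaps the column order.
RowBefore = namedtuple("RowBefore", ["X", "Y"])
RowAfter = namedtuple("RowAfter", ["Y", "X"])

before = RowBefore(X=1, Y=100)
after = RowAfter(Y=100, X=1)

# Attribute access follows the column name, so the order does not matter.
by_name_before = (before.X, before.Y)
by_name_after = (after.X, after.Y)

# Positional unpacking follows the column order, so the values swap silently.
x_before, y_before = before
x_after, y_after = after

print(by_name_before, by_name_after)
print((x_before, y_before), (x_after, y_after))
```

Access by name returns the same pair in both cases, while positional unpacking silently assigns the Y value to x once the order changes. Fixing the column list before or during iteration removes that dependency on the table's schema order.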
Performance considerations
Both the row-based and chunk-based methods are efficient when iterating over table data. Consider the following when choosing between available methods:
- Dicts are slightly slower than tuples, but they provide more flexibility.
- Both the chunked and non-chunked methods copy data from a table into Python one chunk at a time, so both are efficient. The performance difference between the two is minimal.
- All of these methods automatically handle locking so that the iterations happen over a consistent view of the table.
- The row-based methods also allow you to choose a chunk size. This chunk size is the number of rows copied from the Deephaven table into Python at a time. The default chunk size is 2048 rows.
- The chunk-based iterators are slightly more performant than row-based, but require more complex code.
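To make the relationship between row-based and chunk-based iteration concrete, the following is a conceptual sketch of row iteration layered on top of chunked copies. It is not Deephaven's implementation: `iter_chunks` and `iter_rows` are hypothetical functions, and plain Python lists stand in for table columns.

```python
from typing import Dict, Iterator, List


def iter_chunks(columns: Dict[str, List], chunk_size: int) -> Iterator[Dict[str, List]]:
    """Yield column slices of at most chunk_size rows (stand-in for a chunked copy)."""
    n_rows = len(next(iter(columns.values()), []))
    for start in range(0, n_rows, chunk_size):
        yield {name: values[start:start + chunk_size] for name, values in columns.items()}


def iter_rows(columns: Dict[str, List], chunk_size: int = 2048) -> Iterator[Dict[str, object]]:
    """Yield one row at a time while still copying data a chunk at a time."""
    for chunk in iter_chunks(columns, chunk_size):
        names = list(chunk)
        for values in zip(*(chunk[name] for name in names)):
            yield dict(zip(names, values))


data = {"X": [0, 1, 2, 3, 4], "Y": [True, False, True, False, True]}
chunks = list(iter_chunks(data, chunk_size=2))
rows = list(iter_rows(data, chunk_size=2))
print(len(chunks), len(rows))
```

In this sketch the row iterator does the same chunked copying as the chunk iterator and then adds a per-row step on top, which is one way to picture why the chunk-based methods can be slightly faster while the row-based methods are simpler to use.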