Data types in Deephaven and Python
This guide discusses data types in Deephaven and Python. Proper management of data types in queries leads to cleaner, faster, and more reusable code.
Python data types
Python has quite a few built-in data types. Let's explore some of the more common ones using Python's built-in type function.
def print_type(obj):
    print(f"{obj}: {type(obj)}.")
print_type(3) # Integer
print_type(3.14) # Float
print_type(3 + 3j) # Complex number
print_type(True) # Boolean
print_type("Hello world!") # String
print_type([1, 2, 3]) # List
print_type((1, 2, 3)) # Tuple
print_type({"a": 1, "b": 2, "c": 3}) # Dict
- Log
This only covers eight built-in data types. There are many more built-in types, and far more once modules are considered. An important detail in the code above is that each printout shows the type as a class. These built-in data types are all classes with their own properties and methods. If you've written Python code, there's a good chance you've created your own class. These classes, including the built-in types, are all objects.
If you want a complete set of information about any object in Python, the built-in help function can tell you a whole lot more.
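For example, here is a minimal sketch (the output is lengthy, so it isn't reproduced here):

# Print detailed documentation for the built-in int class.
help(int)

# help also works on instances and their methods.
help("Hello world!".upper)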
NumPy data types
NumPy is known mostly for its n-dimensional array data structure and its library of routines for processing those arrays. But did you know it also has its own data types? These data types are very similar to Java primitives, C++ data types, and the data types built into CPUs and GPUs.
Let's take a look at these NumPy data types.
import numpy as np
byte_arr = np.array([1], dtype=np.byte)
short_arr = np.array([1], dtype=np.short)
int_arr = np.array([1], dtype=np.intc)
long_arr = np.array([1], dtype=np.int_)
float_arr = np.array([1], dtype=np.single)
double_arr = np.array([1], dtype=np.double)
print_type(byte_arr[0]) # byte (1 byte)
print_type(short_arr[0]) # short (2 bytes)
print_type(int_arr[0]) # int (4 bytes)
print_type(long_arr[0]) # long (8 bytes)
print_type(float_arr[0]) # float (4 bytes)
print_type(double_arr[0]) # double (8 bytes)
- Log
NumPy uses these data types because they map directly onto the data types native to the CPU. Most programming languages use these native types because calculations on them are efficient.
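As a quick check, you can ask NumPy for the width of each of these types directly. This is a minimal sketch; the exact width of np.int_ in particular varies by platform:

import numpy as np

# Each fixed-width NumPy type has a known size, matching a machine-level type.
for dtype in (np.byte, np.short, np.intc, np.int_, np.single, np.double):
    print(np.dtype(dtype).name, np.dtype(dtype).itemsize, "byte(s)")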
Deephaven data types
Deephaven tables, like Python and NumPy, have data types. These data types apply to columns (and, by extension, the cells they contain) in tables. This can be illustrated by creating a new table with new_table and examining the column types with meta_table.
from deephaven.column import int_col, double_col, string_col
from deephaven import new_table
my_table = new_table(
    [
        int_col("IntColumn", [1, 2, 3]),
        double_col("DoubleColumn", [1.0, 2.0, 3.0]),
        string_col("StringColumn", ["A", "B", "C"]),
    ]
)
my_table_metadata = my_table.meta_table
- my_table
- my_table_metadata
new_table creates columns of specific types. In this case, my_table has three columns of type int, double, and String, respectively. An important detail is that these columns store their values as Java types. In the case of IntColumn and DoubleColumn, the types are the Java primitives int and double, while StringColumn is of type java.lang.String. This is because the Deephaven query engine is written largely in Java.
Memory footprint in Python
Deephaven tables can also hold arbitrary Python objects and Java objects. These objects (and columns containing them) are both slower and more memory intensive to use than Java primitives and strings. They are rarely used in high-performance real-time queries due to these drawbacks. For high-performance cases, use primitive columns.
As mentioned in the first section of this guide, all types in Python are objects. Python's built-in numeric types like int and float don't line up with their Java primitive equivalents. For example, a Java primitive int takes 4 bytes of memory. A Python int takes much more:
import sys
def print_size_of(my_object):
    print(f"The value {my_object} takes {sys.getsizeof(my_object)} bytes of memory.")
print_size_of(3)
- Log
An int in Python takes 28 bytes by default (the exact number is hardware-dependent, but this is the norm). The number 3 is small enough to be stored in only 2 bits, and storing it in a Java primitive int would take just 4 bytes. Remember, a Python int is an arbitrary-precision integer, which carries extra memory overhead. That makes working with integers in Python a breeze, but it comes at the cost of performance. Python will always use at least this amount of memory to store an integer, and if the number is sufficiently large, Python will allocate even more.
my_new_int = (1 << 30) - 1
my_new_bigger_int = 1 << 30
my_float = 1.2
my_bool = True
print_size_of(my_new_int)
print_size_of(my_new_bigger_int)
print_size_of(my_float)
print_size_of(my_bool)
- Log
As you can see, the memory footprint of my_new_int (a much larger number) is the same as that of 3. It's not until we reach my_new_bigger_int, which is only 1 more than my_new_int, that the required memory increases. This principle holds true for other scalar types like floats. Unfortunately, this Pythonic behavior doesn't translate well to Deephaven tables.
Memory footprint with NumPy
NumPy data types aren't arbitrary Python objects like the types above; they are fixed-size, native values. This can be shown with the nbytes attribute, which reports the total number of bytes consumed by an array's elements.
print(byte_arr.nbytes)
print(short_arr.nbytes)
print(int_arr.nbytes)
print(long_arr.nbytes)
print(float_arr.nbytes)
print(double_arr.nbytes)
- Log
We can see that the number of bytes required to store each of these single-element arrays is smaller than that of the equivalent Python object.
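To put the two side by side, here is a small sketch. The sizes in the comments are typical, not guaranteed, and vary by platform and Python version:

import sys
import numpy as np

int_arr = np.array([1], dtype=np.intc)  # same single-element array as above

# One small Python int object vs. one element of a fixed-width NumPy array.
print(sys.getsizeof(3))  # typically 28 bytes
print(int_arr.itemsize)  # typically 4 bytes per element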
Memory footprint in Deephaven
Deephaven tables don't know how to infer Python object types unless they are explicitly told how. Let's look at an example.
from deephaven import empty_table
def multiply_and_subtract(x, y):
    return x * y - (x + y)

my_empty_table = empty_table(5)
my_table = my_empty_table.update(
    ["X = i", "Y = 2 * i", "Z = multiply_and_subtract(X, Y)"]
)
my_column_types = my_table.meta_table
- my_table
- my_column_types
Looking at my_column_types, we can see that X and Y are int columns, but Z is an org.jpy.PyObject column. What is an org.jpy.PyObject, and why is this the case?
jpy is the Python-Java bridge used by the query engine to perform the necessary bi-directional translations between Python and Java. An org.jpy.PyObject is its generic wrapper for a Python value the engine knows little about. Like other generic objects, it is safe and very flexible, but it is also slow and carries a large memory overhead. That's not good for high-performance real-time queries or queries on big data.
When Deephaven's query engine sees the query string Z = multiply_and_subtract(X, Y), it knows that X and Y are int columns. However, it knows little to nothing about what multiply_and_subtract actually does to them. So, to avoid making unsafe assumptions, it returns a column of the generic type org.jpy.PyObject. If you explicitly typecast the output of the function in the query string, the column Z will instead be the type you want.
my_table = empty_table(5).update(
    ["X = i", "Y = 2 * i", "Z = (int)multiply_and_subtract(X, Y)"]
)
my_column_types = my_table.meta_table
- my_table
- my_column_types
Since we told the query engine what type to return, we get our expected result.
This pattern of explicitly casting the output of Python functions in query strings is common. It ensures that your data will be of the type you want. This has some nice benefits:
- Explicit typecasts in query strings can make queries easier to understand.
- You have a high level of control over data in queries, and thus, the amount of memory required.
However, it has a minor drawback. Queries are rarely as simple as the example above. In real applications of Deephaven, a single table operation can contain dozens of query strings. When a large number of adjacent query strings have explicit typecasts, the table operation can look unsightly. So, is there an alternative?
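Before answering that question, here is a hypothetical sketch of the problem. The helper functions func_one, func_two, and func_three are invented purely for illustration; the point is how the casts pile up across adjacent query strings:

from deephaven import empty_table

# Hypothetical helpers, defined only to illustrate the pattern.
def func_one(x, y):
    return x + y

def func_two(x, y):
    return x * 1.5 + y

def func_three(x, y):
    return x - y

# Every query string that calls a Python function carries its own explicit cast.
cluttered_table = empty_table(5).update(
    [
        "X = i",
        "Y = 2 * i",
        "Z = 3 * i",
        "A = (int)func_one(X, Y)",
        "B = (double)func_two(X, Z)",
        "C = (long)func_three(Y, Z)",
    ]
)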
Python type hints
Python type hints allow you to declare what data types a function expects as input and what it returns. Deephaven Python queries can take full advantage of these type hints in table operations. Let's revisit the previous example, but add a type hint to the output of the function multiply_and_subtract.
def multiply_and_subtract(x, y) -> int:
    return x * y - (x + y)

my_table = my_empty_table.update(
    ["X = i", "Y = 2 * i", "Z = multiply_and_subtract(X, Y)"]
)
my_column_types = my_table.meta_table
- my_table
- my_column_types
Now the Z column isn't just some org.jpy.PyObject column, but a column of Java primitives. Except, it's not an int column. It's a long column! Why is that?
We have to circle back to the earlier section on memory in Python. A Python int is an arbitrary-precision integer that can hold very large numbers. A Java primitive int is a 32-bit integer that can only store values up to a magnitude of roughly 2 billion. A Java long, on the other hand, can store enormously larger values, which is much closer to the behavior of a Python int. Thus, Java long is the closest type match for a Python int.
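To see the difference in range concretely, here is a small sketch that uses NumPy's fixed-width integer limits as stand-ins for Java's int and long:

import numpy as np

print(np.iinfo(np.int32).max)  # 2147483647: the largest value a Java int can hold
print(np.iinfo(np.int64).max)  # 9223372036854775807: the largest value a Java long can hold
print(10**30)                  # a Python int can go far beyond either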
Use NumPy
We saw earlier that NumPy's data types take up much less memory than Python objects. They also line up nicely with the Java primitive types used by Deephaven columns. So, let's use a NumPy data type in our type hint instead.
import numpy as np
def multiply_and_subtract(x, y) -> np.intc:
    return x * y - (x + y)

my_table = my_empty_table.update(
    ["X = i", "Y = 2 * i", "Z = multiply_and_subtract(X, Y)"]
)
my_column_types = my_table.meta_table
- my_table
- my_column_types
Optional return values
Python's typing module not only enables type hints for a single return type, but also allows functions to have optional return values. Deephaven's Python API supports this feature, enabling table operations to use Python functions that can produce null values. The following example shows how to use Optional in a type hint for a function that can return either a 64-bit integer or None.
from typing import Optional
import numpy as np
def myfunc(value) -> Optional[np.int64]:
    return None if value % 2 == 1 else value * 2
my_table_with_nulls = my_empty_table.update(["X = i", "Y = myfunc(X)"])
- my_table_with_nulls
Array columns
It's common for queries to produce tables with columns that contain vector data. These columns typically contain arrays of data stored as Java primitive arrays. Using type hints from typing and numpy.typing, Python function results can be seamlessly transformed into Java primitive arrays.
from numpy import typing as npt
import numpy as np
import typing
def array_from_cols_typing(x, y) -> typing.List[np.int32]:
    return [x, y]

def array_from_cols_np_typing(x, y) -> npt.NDArray[np.int32]:
    return np.array([x, y], dtype=np.int32)

my_array_table = my_empty_table.update(
    [
        "X = i",
        "Y = 2 * i",
        "ArrFromList = array_from_cols_typing(X, Y)",
        "ArrFromNumPy = array_from_cols_np_typing(X, Y)",
    ]
)
my_array_table_types = my_array_table.meta_table
- my_array_table
- my_array_table_types
Java arrays, like NumPy arrays, can only hold a single type of data. If an operation produces a list that contains more than one data type, your type hint should specify the widest type in the list. For instance, if an operation produces a list containing both integers and double-precision values, the type hint should specify that the list contains double-precision values.
from typing import List
import numpy as np
def arr_multiple_types(x) -> List[np.double]:
    return [x, x + 0.1]
array_table_two = my_empty_table.update(["X = i", "Array = arr_multiple_types(X)"])
array_table_two_types = array_table_two.meta_table
- array_table_two
- array_table_two_types
Array columns also support the Optional annotation:
from deephaven import empty_table
from numpy import typing as npt
from typing import Optional
import numpy as np
def array_func(col) -> Optional[npt.NDArray[np.double]]:
    if col % 2 == 0:
        return None
    else:
        return np.array([col, col * 1.1], dtype=np.double)
array_table_with_nulls = my_empty_table.update(["X = i", "Y = array_func(X)"])
- array_table_with_nulls
Key takeaways
- Queries run faster and use less memory with known data types instead of generic objects.
- Type hints allow Deephaven to automatically determine the return type of Python functions.
- NumPy offers a much larger variety of data types than Python's built-ins, and they line up more closely with Java primitives.
- Python functions that produce arrays should use typing and numpy.typing type hints.