
Take advantage of Pandas 2.0 support

AI prompt: two pandas standing on colorful plastic cubes, isolated on a dark blue background
Jianfeng Mao
Built-in, highly efficient PyArrow

Pandas 2.0 was released in April 2023, after three years of development. It offers a range of new features, such as enhanced extension array support, PyArrow-backed DataFrames, and non-nanosecond datetime resolution.

The defining feature of Pandas 2.0 is its PyArrow support, which sits alongside its long-running support for NumPy. PyArrow is a Python library (built on top of Apache Arrow) that provides an interface for handling large datasets using Arrow memory structures, as well as tools for serialization, compression, and integration with other data technologies such as Apache Spark, Apache Parquet, and Deephaven.

The new Pandas 2.0 DataFrame PyArrow backend offers greater flexibility, reduces memory consumption, and increases interoperability with technologies like Deephaven.

Benchmarks comparing `mean` and `replace` operations on the PyArrow and NumPy backends showed a manyfold performance improvement with PyArrow.

Deephaven committed to supporting Arrow very early: the engine uses Arrow extensively, and Deephaven's APIs provide extensions and efficient conversions between Arrow data and Deephaven data. A user can export a Deephaven table to a PyArrow table, or import one from a PyArrow table, in both Deephaven's Python server and client APIs.

Deephaven has also long supported converting between Pandas DataFrames and Deephaven tables. That conversion has relied on NumPy arrays. With the release of Pandas 2.0, Deephaven now offers an option to use the PyArrow backend when converting between Deephaven and Pandas.

Benefits of using the PyArrow backend option:

  1. The conversion takes advantage of the built-in, highly efficient support of Arrow in the engine, with fewer boundary crossings between the JVM and Python and less memory consumption.
  2. Null value mapping between Arrow and Deephaven requires no special handling, unlike with the NumPy option.
  3. A future enhancement, direct buffer sharing between Python and the JVM, will make the conversion even more efficient.

Deephaven still supports Pandas 1.x. We encourage upgrading to Pandas 2.0, but users have plenty of time to plan for such upgrades based on their specific situation.

To learn more about how this upgrade may affect your queries, always feel free to reach out on Slack.