# Use SciPy
This guide will show you how to use SciPy in your Python queries in Deephaven.
Refer to our guide How to install Python packages for instructions on installing this package in Deephaven.
SciPy is an open-source scientific computing library for Python. It contains a large assortment of sub-packages with methods that can perform tasks in the following areas:
- Optimization
- Linear algebra
- Calculus
- Interpolation
- Transforms
- Signal processing
- Differential equations
- Multi-dimensional array processing
The above list is not comprehensive; SciPy's many sub-packages cover an even broader range of topics (see the Appendix at the bottom of this guide for a brief description of each). SciPy was built to operate on NumPy arrays, which we'll be using extensively in this guide. Check out How to use NumPy to learn more about how they work.
## Examples
The examples below first demonstrate SciPy in a traditional Python setting, then extend each example to Deephaven. SciPy is a natural pair for `deephaven.learn`, so some of the corresponding Deephaven queries use it. For more information on the `learn` submodule, check out How to use deephaven.learn.
### Describe data
In this example, we'll use `scipy.stats.describe` to give basic insight into some data. The data we'll be describing is an `ndarray` with integer values from 0 to 99.
```python
from scipy.stats import describe
import numpy as np

source = np.arange(100)
result = describe(source)

print(result)
```
To get these insights on data in a Deephaven table, we'll convert the table to a NumPy array using `deephaven.pandas.to_pandas` and the resulting DataFrame's `values` attribute. We can then use the method just as before.
```python
from deephaven import empty_table, pandas
from scipy.stats import describe
import numpy as np

source = empty_table(100).update(formulas=["X = i"])
result = describe(np.squeeze(pandas.to_pandas(source).values))

print(result)
```
### Compute the Airy function
In this example, we'll use `scipy.special.airy` to compute the Airy functions of some values. In this case, we'll calculate them for values ranging from -20 to 0 in increments of 0.2.
```python
from scipy.special import airy
import numpy as np

x = np.linspace(-20, 0, 101)
ai, aip, bi, bip = airy(x)
```
Let's extend this code to use Deephaven tables!
```python
from deephaven import empty_table
from scipy.special import airy
import numpy as np

source = empty_table(101).update(formulas=["X = -20 + 0.2 * i"])

# scipy.special.airy returns (Ai, Ai', Bi, Bi'); take one component per column
def compute_airy(x, idx) -> np.double:
    return airy(x)[idx]

result = source.update(formulas=[
    "Ai = compute_airy(X, 0)", "Aip = compute_airy(X, 1)",
    "Bi = compute_airy(X, 2)", "Bip = compute_airy(X, 3)",
])
```
### Filter a noisy signal
In this example, we'll generate a sine wave and add noise to it. Then, we'll apply filters to reduce the noise.
```python
from scipy.signal import medfilt, savgol_filter
import numpy as np

x = np.linspace(0, 2 * np.pi, 101)
noisy_sine_wave = np.sin(x) + np.random.normal(0, 0.25, 101)

# Median filter (default kernel size of 3) and Savitzky-Golay filter (window 27, order 3)
median_filtered = medfilt(noisy_sine_wave)
savgol_filtered = savgol_filter(noisy_sine_wave, 27, 3)
```
The above example can be extended to use `deephaven.learn` in a couple of ways.
```python
from deephaven.plot.figure import Figure
from deephaven.learn import gather
from deephaven import empty_table
from deephaven import learn
from scipy.signal import medfilt, savgol_filter
import numpy as np


def create_noisy_sine_wave(x) -> np.double:
    return np.sin(x) + np.random.normal(0, 0.25)


source = empty_table(101).update(
    formulas=[
        "X = i * (2 * Math.PI) / 100",
        "Noisy_Sine_Wave = create_noisy_sine_wave(X)",
    ]
)


# Model functions: each takes the gathered NumPy array and returns the filtered signal
def apply_median_filter(signal):
    return medfilt(signal)


def apply_savgol_filter(signal):
    return savgol_filter(signal, 27, 3)


# Gather table data into a 1-D NumPy array of doubles
def table_to_numpy(rows, cols):
    return np.squeeze(gather.table_to_numpy_2d(rows, cols, np_type=np.double))


# Scatter the model output back into a table column, one row at a time
def numpy_to_table(data, idx):
    return data[idx]


median_filtered = learn.learn(
    table=source,
    model_func=apply_median_filter,
    inputs=[learn.Input("Noisy_Sine_Wave", table_to_numpy)],
    outputs=[learn.Output("Median_Filtered", numpy_to_table, "double")],
    batch_size=source.size,
)

savitzky_golay_filtered = learn.learn(
    table=source,
    model_func=apply_savgol_filter,
    inputs=[learn.Input("Noisy_Sine_Wave", table_to_numpy)],
    outputs=[learn.Output("Savitzky_Golay_Filtered", numpy_to_table, "double")],
    batch_size=source.size,
)

plt = (
    Figure()
    .plot_xy(series_name="Noisy Signal", t=source, x="X", y="Noisy_Sine_Wave")
    .plot_xy(
        series_name="Median Filtered", t=median_filtered, x="X", y="Median_Filtered"
    )
    .plot_xy(
        series_name="Savitzky-Golay Filtered",
        t=savitzky_golay_filtered,
        x="X",
        y="Savitzky_Golay_Filtered",
    )
    .show()
)
```
### Compute the nearest neighbor of every point in a set
In this example, we'll use a k-d tree to compute the nearest neighbor of every point in a set. SciPy's implementation is `scipy.spatial.KDTree`. We construct a two-dimensional data set with a few points and find the index of each point's nearest neighbor in the set.
```python
from scipy.spatial import KDTree
import numpy as np

# Ten random 2-D points with integer coordinates
x = np.random.randint(-10, 10, 10)
y = np.random.randint(-10, 10, 10)
data = np.vstack((x, y)).T

print(data)

tree = KDTree(data)
# k=2 because each point's closest match is itself; the second result is its nearest neighbor
distances, indices = tree.query(data, k=2)
nearest_neighbor_indices = indices[:, 1]

print(nearest_neighbor_indices)
```
Computing the nearest-neighbor indices can also be done using `deephaven.learn`.
```python
from deephaven.pandas import to_table
from deephaven.learn import gather
from deephaven import learn
from scipy.spatial import KDTree
import pandas as pd
import numpy as np

indices = np.arange(10)
x = np.random.randint(-10, 10, 10)
y = np.random.randint(-10, 10, 10)

source = to_table(pd.DataFrame({"Index": indices, "X": x, "Y": y}))


# Query the k-d tree for each point's nearest neighbor (other than itself)
def calculate_nearest_neighbor_indices(data):
    tree = KDTree(data)
    distances, indices = tree.query(data, k=2)
    return indices[:, 1]


# Gather the X and Y columns into a 2-D NumPy array of integers
def table_to_numpy(rows, cols):
    return gather.table_to_numpy_2d(rows, cols, np_type=np.int_)


# Scatter the model output back into a table column, one row at a time
def numpy_to_table(data, idx):
    return data[idx]


result = learn.learn(
    table=source,
    model_func=calculate_nearest_neighbor_indices,
    inputs=[learn.Input(["X", "Y"], table_to_numpy)],
    outputs=[learn.Output("Nearest_Neighbor_Index", numpy_to_table, "int")],
    batch_size=source.size,
)
```
### Compute distances of various metrics between two vectors

In this example, we create two three-dimensional vectors, then compute the distance between them using several metrics supported by `scipy.spatial.distance`.
```python
from scipy.spatial import distance
import numpy as np

x = np.array([1, 3, -1])
y = np.array([-2, 0, 4])


def print_distances(x, y):
    print("x: " + str(x))
    print("y: " + str(y))
    print("Bray-Curtis distance between x and y: " + str(distance.braycurtis(x, y)))
    print("Chebyshev distance between x and y: " + str(distance.chebyshev(x, y)))
    print("Cosine distance between x and y: " + str(distance.cosine(x, y)))
    print("Euclidean distance between x and y: " + str(distance.euclidean(x, y)))
    print("City block distance between x and y: " + str(distance.cityblock(x, y)))


print_distances(x, y)
```
We can once again use `deephaven.learn` to calculate these distances.
```python
from deephaven import new_table, learn
from deephaven.column import int_col
from deephaven.learn import gather
from scipy.spatial import distance
import numpy as np

source = new_table([int_col("X", [1, 3, -1]), int_col("Y", [-2, 0, 4])])


# Gather the X and Y columns into a 2-D NumPy array of integers
def table_to_numpy(rows, cols):
    return np.squeeze(gather.table_to_numpy_2d(rows, cols, np_type=np.intc))


# No output columns are produced here, so this scatter function is never called
def numpy_to_table(data, idx):
    return data[idx]


def print_distances(data):
    x = data[:, 0]
    y = data[:, 1]
    print("x: " + str(x))
    print("y: " + str(y))
    print("Bray-Curtis distance between x and y: " + str(distance.braycurtis(x, y)))
    print("Chebyshev distance between x and y: " + str(distance.chebyshev(x, y)))
    print("Cosine distance between x and y: " + str(distance.cosine(x, y)))
    print("Euclidean distance between x and y: " + str(distance.euclidean(x, y)))
    print("City block distance between x and y: " + str(distance.cityblock(x, y)))


learn.learn(
    table=source,
    model_func=print_distances,
    inputs=[learn.Input(["X", "Y"], table_to_numpy)],
    outputs=None,
    batch_size=source.size,
)
```
## Appendix: Sub-packages of SciPy
Here is a comprehensive list of the sub-packages available in SciPy, and a brief description of what each one does:
| Sub-package name | Description |
| --- | --- |
| `cluster` | Clustering routines |
| `constants` | A collection of constants and conversion utilities |
| `fft` | Fourier transform methods |
| `fftpack` | Deprecated legacy Fourier transform methods |
| `integrate` | Numerical integration algorithms |
| `interpolate` | Tools for data interpolation |
| `io` | Data I/O |
| `linalg` | Linear algebra routines |
| `misc` | Extra miscellaneous utilities |
| `ndimage` | Image processing library |
| `odr` | Orthogonal distance regression library |
| `optimize` | Numerical optimization routines |
| `signal` | Signal processing routines |
| `sparse` | Tools for sparse matrix creation, handling, and processing |
| `spatial` | Spatial structure processing algorithms |
| `special` | Special functions |
| `stats` | Statistical function library |
| `weave` | Deprecated legacy C/C++ code writing tool |
Each is described below. We skip `fftpack` and `weave` because they are deprecated.
### scipy.cluster

`scipy.cluster` is split into two further sub-modules:

- `scipy.cluster.vq` supports vector quantization and K-means algorithms (see the sketch after this list).
- `scipy.cluster.hierarchy` provides hierarchical and agglomerative clustering.
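For example, a minimal sketch of K-means clustering with `scipy.cluster.vq` might look like the following (the data and cluster count here are arbitrary illustrations, not taken from the guide's examples):

```python
from scipy.cluster.vq import whiten, kmeans, vq
import numpy as np

rng = np.random.default_rng(42)
# Two loose clusters of 2-D points
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Normalize each feature, find 2 centroids, and assign each point to one
normalized = whiten(points)
centroids, distortion = kmeans(normalized, 2)
labels, distances = vq(normalized, centroids)

print(centroids)
print(labels[:10])
```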
### scipy.constants

`scipy.constants` provides an extensive set of mathematical and physical constants, including (but not limited to) pi, the golden ratio, the Planck constant, subatomic particle masses, SI prefixes, and binary prefixes. The full list is far too long to include here.
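A few of those constants, accessed directly as module attributes (a brief sketch, not an exhaustive tour):

```python
from scipy import constants

print(constants.pi)              # 3.14159...
print(constants.golden)          # The golden ratio
print(constants.speed_of_light)  # In meters per second
print(constants.Planck)          # The Planck constant, in joule-seconds

# Conversion helpers are included as well
print(constants.convert_temperature(100.0, "Celsius", "Kelvin"))
```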
### scipy.fft

`scipy.fft` contains methods that apply the forward and inverse discrete Fourier, sine, cosine, and Hankel transforms to input data in N dimensions.
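For instance, a minimal sketch of a one-dimensional discrete Fourier transform (the 5 Hz sample signal is an arbitrary illustration):

```python
from scipy.fft import fft, fftfreq
import numpy as np

# One second of a 5 Hz sine wave sampled at 100 Hz
n, dt = 100, 0.01
t = np.arange(n) * dt
signal = np.sin(2 * np.pi * 5 * t)

spectrum = fft(signal)
freqs = fftfreq(n, dt)

# The strongest positive-frequency component should be near 5 Hz
print(freqs[np.argmax(np.abs(spectrum[: n // 2]))])
```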
### scipy.integrate

`scipy.integrate` contains methods for calculating integrals and solving differential equations.
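As a small sketch, here's a definite integral computed with `scipy.integrate.quad` and an initial value problem solved with `scipy.integrate.solve_ivp` (the integrand and ODE are arbitrary examples):

```python
from scipy.integrate import quad, solve_ivp
import numpy as np

# Integrate sin(x) from 0 to pi; the exact answer is 2
value, abs_error = quad(np.sin, 0, np.pi)
print(value, abs_error)

# Solve dy/dt = -2y with y(0) = 1 on the interval [0, 5]
solution = solve_ivp(lambda t, y: -2 * y, t_span=(0, 5), y0=[1.0])
print(solution.y[0, -1])  # Approximately exp(-10)
```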
### scipy.interpolate

`scipy.interpolate` is a sub-package for data interpolation. It provides univariate and multivariate interpolation techniques as well as spline interpolators.
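For example, a minimal sketch of cubic spline interpolation over coarse samples of a sine wave (the sample points are arbitrary):

```python
from scipy.interpolate import CubicSpline
import numpy as np

# Coarse samples of sin(x)
x = np.linspace(0, 10, 11)
y = np.sin(x)

# Build a cubic spline and evaluate it on a finer grid
spline = CubicSpline(x, y)
x_fine = np.linspace(0, 10, 101)
print(spline(x_fine)[:5])
```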
### scipy.io

`scipy.io` provides utilities to read and write data in various file formats, including MATLAB, Fortran, WAV, and others.
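For instance, a minimal sketch that writes a NumPy array to a MATLAB `.mat` file and reads it back (the file name `example.mat` is an arbitrary choice for illustration):

```python
from scipy import io as sio
import numpy as np

# Save a dictionary of arrays to MATLAB format, then load it back
sio.savemat("example.mat", {"data": np.arange(10)})
contents = sio.loadmat("example.mat")

print(contents["data"])
```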
### scipy.linalg

`scipy.linalg` is a library of linear algebra functions. You can use this sub-module to solve eigenvalue problems, perform matrix decompositions, solve matrix equations, and construct special matrices, among other things. Matrices constructed with this sub-package are of type `numpy.ndarray`.
This sub-module contains two further sub-packages for low-level routines: `scipy.linalg.blas` and `scipy.linalg.lapack`. These provide low-level functions from the Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) libraries; Cython implementations of these low-level libraries exist as well.

Additionally, `scipy.linalg.interpolative` provides routines for interpolative matrix decomposition.
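As a brief sketch of the higher-level routines, here's a small linear system solved and its eigenvalues computed (the matrix and right-hand side are arbitrary):

```python
from scipy import linalg
import numpy as np

a = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Solve a @ x = b
x = linalg.solve(a, b)
print(x)

# Eigenvalues and eigenvectors of a
eigenvalues, eigenvectors = linalg.eig(a)
print(eigenvalues)
```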
### scipy.misc

The miscellaneous routines sub-module, `scipy.misc`, is home to some example data sets and two derivative calculators.
### scipy.ndimage

`scipy.ndimage` contains a library of methods for image processing, including filters, interpolation routines, measurements, and morphology routines.
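As a brief sketch, here's a Gaussian smoothing filter applied to a small random "image" (the array size and sigma are arbitrary illustrations):

```python
from scipy import ndimage
import numpy as np

rng = np.random.default_rng(42)
image = rng.random((8, 8))  # A small noisy "image"

# Smooth the image with a Gaussian filter
smoothed = ndimage.gaussian_filter(image, sigma=1)
print(smoothed.shape)
```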
### scipy.odr

`scipy.odr` contains routines for performing orthogonal distance regression (ODR) on input data. ODR extends least-squares fitting by accounting for errors in both the independent and dependent variables, which generally yields more accurate fitted curves.
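For instance, a minimal sketch of fitting a straight line with ODR (the synthetic data and starting guess `beta0` are arbitrary):

```python
from scipy import odr
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)


# The model function takes the parameter vector first, then the data
def linear(beta, x):
    return beta[0] * x + beta[1]


fit = odr.ODR(odr.RealData(x, y), odr.Model(linear), beta0=[1.0, 0.0]).run()
print(fit.beta)  # Approximately [2.0, 1.0]
```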
### scipy.optimize

The `scipy.optimize` sub-module provides univariate and multivariate optimization routines. These routines minimize or maximize objective functions and can handle constraints on those functions. Solvers for both linear and nonlinear problems are supported.

There are underlying C functions for four different root finders that can be accessed via Cython.
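For example, a short sketch that minimizes the Rosenbrock function with `scipy.optimize.minimize` (the starting point is an arbitrary choice):

```python
from scipy.optimize import minimize


# The Rosenbrock function has a global minimum at (1, 1)
def rosenbrock(v):
    x, y = v
    return (1 - x) ** 2 + 100 * (y - x**2) ** 2


result = minimize(rosenbrock, x0=[0.0, 0.0])
print(result.x)  # Approximately [1.0, 1.0]
```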
### scipy.signal

`scipy.signal` is SciPy's signal processing sub-module. It supports convolutions, B-splines, filters, filter design routines, continuous-time and discrete-time linear systems, waveform generators, window functions, wavelet generators, maxima/minima finders, and spectral analyzers.
### scipy.sparse

`scipy.sparse` contains functions for building and identifying sparse matrices in various formats. Additionally, it contains a sparse matrix class for each individual sparse matrix format.

This sub-module contains two sub-packages: `scipy.sparse.csgraph` and `scipy.sparse.linalg`. The first is a library of compressed sparse graph routines, and the second is a library of sparse linear algebra routines.
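For instance, a minimal sketch that builds a sparse matrix in COO format and converts it to CSR (the indices and values are arbitrary):

```python
from scipy import sparse
import numpy as np

# Nonzero entries given as values plus their (row, column) coordinates
rows = np.array([0, 1, 2, 2])
cols = np.array([0, 2, 1, 2])
values = np.array([1.0, 3.0, 4.0, 5.0])

matrix = sparse.coo_matrix((values, (rows, cols)), shape=(3, 3))
csr = matrix.tocsr()  # CSR is better suited for arithmetic and slicing

print(csr.toarray())
```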
### scipy.spatial

`scipy.spatial` contains spatial algorithms and data structures. Features include nearest-neighbor query routines, triangulation, convex hulls, Voronoi diagrams, and plotting helpers.

One sub-package of `scipy.spatial` is `scipy.spatial.distance`, which provides utilities for computing distance matrices, checking their validity, and computing distance functions on numeric and boolean vectors.

Another sub-package within `scipy.spatial`, `scipy.spatial.transform`, provides functions for spatial transformations, including rotations, spherical linear interpolation of rotations, and rotation splines.
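The examples earlier in this guide already cover `KDTree` and `distance`, so here is a brief sketch of `scipy.spatial.transform.Rotation` instead (the angle and vector are arbitrary illustrations):

```python
from scipy.spatial.transform import Rotation

# A 90-degree rotation about the z-axis
rotation = Rotation.from_euler("z", 90, degrees=True)

# Rotating the x unit vector yields (approximately) the y unit vector
print(rotation.apply([1.0, 0.0, 0.0]))
```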
### scipy.special

SciPy's special functions are defined in `scipy.special` and are exposed to Cython via `scipy.special.cython_special`. Special functions include (but are not limited to) Airy, Bessel, Struve, and Legendre functions, as well as statistical distributions.
### scipy.stats

The `scipy.stats` sub-module contains probability distributions, summary statistics, frequency statistics, and functions for calculating correlations between distributions, among other things. Methods exist for continuous, discrete, and multivariate distributions. There are also summary methods that can describe data with little prior knowledge.

There are two sub-packages within `scipy.stats`: `scipy.stats.mstats` and `scipy.stats.qmc`. The former calculates statistics on masked arrays, and the latter provides quasi-Monte Carlo functionality.
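As a small sketch of those two sub-packages, here's a masked-array summary with `mstats` and a Sobol sample with `qmc` (the data, sentinel value, and sample size are arbitrary illustrations):

```python
from scipy.stats import mstats, qmc
import numpy.ma as ma

# Summary statistics that ignore the masked (missing) value
data = ma.masked_values([1.0, 2.0, -999.0, 4.0], -999.0)
print(mstats.describe(data))

# Eight quasi-random points in two dimensions from a scrambled Sobol sequence
sampler = qmc.Sobol(d=2, scramble=True, seed=42)
print(sampler.random(n=8))
```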