var_by
var_by returns the variance for each group. Null values are ignored.
caution
Applying this aggregation to a column for which the variance cannot be computed results in an error. For example, the variance is not defined for a column of string values.
Syntax
table.var_by(by: List[str]=[])
Parameters
| Parameter | Type | Description |
|---|---|---|
| by (optional) | List[str] | The column(s) by which to group data. |
Returns
A new table containing the variance for each group.
How to calculate variance
- Find the mean of the data set. Add all data values and divide by the sample size n.
- Find the squared difference from the mean for each data value. Subtract the mean from each data value and square the result.
- Find the sum of all the squared differences. The sum of squares is all the squared differences added together.
- Calculate the variance. For a sample set of data, variance is the sum of squares divided by n - 1, one less than the number of data points. The formula for variance for a sample set of data is:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Here x_i is each data value, \bar{x} is the mean, and n is the sample size.
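The steps above can be sketched in plain Python. This is a minimal illustration, not Deephaven's implementation: it assumes the sample formula (dividing by n - 1), uses None to stand in for null values, and returns NaN when fewer than two values remain, which is a choice made for this sketch only. The function name sample_variance is hypothetical.

```python
import math


def sample_variance(values):
    """Sample variance of `values`, skipping None entries (standing in for nulls)."""
    data = [v for v in values if v is not None]
    n = len(data)
    if n < 2:
        # Sketch-only choice: the sample variance is undefined for fewer than two points.
        return math.nan
    mean = sum(data) / n                              # step 1: mean
    squared_diffs = [(v - mean) ** 2 for v in data]   # step 2: squared differences
    sum_of_squares = sum(squared_diffs)               # step 3: sum of squares
    return sum_of_squares / (n - 1)                   # step 4: divide by n - 1


# The Number column used in the examples on this page.
numbers = [55, 76, 20, 130, 230, 50, 73, 137, 214]
print(sample_variance(numbers))
```

For the values 1, 2, 3, 4 the mean is 2.5, the sum of squares is 5, and the sketch returns 5 / 3.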
Examples
In this example, var_by returns the variance of the whole table. Because the variance cannot be computed for the string columns X and Y, these columns are dropped before applying var_by.
from deephaven import new_table
from deephaven.column import string_col, int_col
source = new_table([
    string_col("X", ["A", "B", "A", "C", "B", "A", "B", "B", "C"]),
    string_col("Y", ["M", "N", "O", "N", "P", "M", "O", "P", "M"]),
    int_col("Number", [55, 76, 20, 130, 230, 50, 73, 137, 214]),
])
result = source.drop_columns(cols=["X", "Y"]).var_by()
In this example, var_by returns the variance, as grouped by X. Because the variance cannot be computed for the string column Y, this column is dropped before applying var_by.
from deephaven import new_table
from deephaven.column import string_col, int_col
source = new_table([
    string_col("X", ["A", "B", "A", "C", "B", "A", "B", "B", "C"]),
    string_col("Y", ["M", "N", "O", "N", "P", "M", "O", "P", "M"]),
    int_col("Number", [55, 76, 20, 130, 230, 50, 73, 137, 214]),
])
result = source.drop_columns(cols=["Y"]).var_by(by=["X"])
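To make the grouping concrete, the same per-group calculation can be reproduced outside Deephaven with the Python standard library. This is only a cross-check sketch: it assumes var_by computes the sample variance (n - 1 in the denominator), which is what statistics.variance does, and the names groups and group_variance are illustrative.

```python
from collections import defaultdict
from statistics import variance

# Same X and Number values as the source table above.
x = ["A", "B", "A", "C", "B", "A", "B", "B", "C"]
number = [55, 76, 20, 130, 230, 50, 73, 137, 214]

# Collect the Number values that belong to each X key.
groups = defaultdict(list)
for key, value in zip(x, number):
    groups[key].append(value)

# statistics.variance is the sample variance (divides by n - 1).
group_variance = {key: variance(vals) for key, vals in groups.items()}

for key, var in group_variance.items():
    print(key, var)
```

Group C, for example, holds 130 and 214. Their mean is 172, each value differs from it by 42, and the sample variance is (42² + 42²) / 1 = 3528.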
In this example, var_by returns the variance, as grouped by X and Y.
from deephaven import new_table
from deephaven.column import string_col, int_col
source = new_table([
    string_col("X", ["A", "B", "A", "C", "B", "A", "B", "B", "C"]),
    string_col("Y", ["M", "N", "O", "N", "P", "M", "O", "P", "M"]),
    int_col("Number", [55, 76, 20, 130, 230, 50, 73, 137, 214]),
])
result = source.var_by(by=["X", "Y"])