Skip to main content

How to perform dedicated aggregations for groups

This guide will show you how to compute summary information on groups of data using dedicated data aggregations.

Often when working with data, you will want to break the data into subgroups and then perform calculations on the grouped data. For example, a large multi-national corporation may want to know their average employee salary by country, or a teacher might want to calculate grade information for groups of students or in certain subject areas.

The process of breaking a table into subgroups and then performing a single type of calculation on the subgroups is known as "dedicated aggregation." The term comes from most operations creating a summary of data within a group (aggregation) and from a single type of operation being computed at once (dedicated).

Why use dedicated aggregations?#

Deephaven provides many dedicated aggregations, such as maxBy and minBy. These aggregations are good options if only one type of aggregation is needed. If more than one type of aggregation is needed or if you have a custom aggregation, combined aggregations are a more efficient and more flexible solution.

Syntax#

The general syntax follows:

result = source.firstBy(columnNames...)

The columnNames parameter determines the column(s) by which to group data.

  • NULL uses the whole table as a single group
  • "X" will output the desired value for each group in column X.
  • "X", "Y" will output the desired value for each group designated from the X and Y columns.

What aggregations are available?#

Each dedicated aggregator performs one calculation at a time:

  • firstBy - First row of each group.
  • lastBy - Last row of each group.
  • headBy - First n rows of each group.
  • tailBy - Last n rows of each group.
  • countBy - Number of rows in each group.
  • sumBy - Sum of each group.
  • avgBy - Average (mean) of each group.
  • stdBy - Standard deviation of each group.
  • varBy - Variance of each group.
  • medianBy - Median of each group.
  • minBy - Minimum value of each group.
  • maxBy - Maximum value of each group.

Dedicated aggregators#

In the following examples, we have test results in various subjects for some students. We want to summarize this information to see if students perform better in one class or another.

from deephaven.TableTools import newTable, stringCol, intCol, doubleColfrom deephaven import ComboAggregateFactory as caf
source = newTable(    stringCol("Name", "James", "James", "James", "Lauren", "Lauren", "Lauren", "Zoey", "Zoey", "Zoey"),    stringCol("Subject", "Math", "Science", "Art", "Math", "Science", "Art", "Math", "Science", "Art"),    intCol("Number", 95, 100, 90, 72, 78, 92, 100, 98, 96),)

firstBy and lastBy#

In this example, we want to know the first and the last test results for each student. To achieve this, we can use firstBy to return the first test value and lastBy to return the last test value. The results are grouped by Name.

first = source.firstBy("Name")last = source.lastBy("Name")

headBy and tailBy#

In this example, we want to know the first two and the last two test results for each student. To achieve this, we can use headBy to return the first n test values and tailBy to return the last n test value. The results are grouped by Name.

head = source.headBy(2, "Name")tail = source.tailBy(2, "Name")

countBy#

In this example, we want to know the number of tests each student completed. countBy returns the number of rows in the table as grouped by Name and stores that in a new column, NumTests.

count = source.countBy("NumTests", "Name")

Summary statistics aggregators#

In the following examples, we start with the same source table containing students' test results as used above.

caution

Applying these aggregations to a column where the average cannot be computed will result in an error. For example, the average is not defined for a column of string values. For more information on removing columns from a table, see dropColumns. The syntax for using dropColumns is:

result = source.dropColumns(droppedColumnNames...).sumBy(columnNames...)

sumBy#

In this example, sumBy calculates the total sum of test scores for each Name. Because a sum cannot be computed for the string column Subject, this column is dropped before applying sumBy.

sum = source.dropColumns("Subject").sumBy("Name")

avgBy#

In this example, avgBy calculates the average (mean) of test scores for each Name. Because an average cannot be computed for the string column Subject, this column is dropped before applying avgBy.

mean = source.dropColumns("Subject").avgBy("Name")

stdBy#

In this example, stdBy calculates the standard deviation of test scores for each Name. Because a standard deviation cannot be computed for the string column Subject, this column is dropped before applying stdBy.

stdDev = source.dropColumns("Subject").stdBy("Name")

varBy#

In this example, varBy calculates the variance of test scores for each Name. Because a variance cannot be computed for the string column Subject, this column is dropped before applying varBy.

var = source.dropColumns("Subject").varBy("Name")

medianBy#

In this example, medianBy calculates the median of test scores for each Name. Because a median cannot be computed for the string column Subject, this column is dropped before applying medianBy.

median = source.dropColumns("Subject").medianBy("Name")

minBy#

In this example, minBy calculates the minimum of test scores for each Name. Because a minimum cannot be computed for the string column Subject, this column is dropped before applying minBy.

minimum = source.dropColumns("Subject").minBy("Name")

maxBy#

In this example, maxBy calculates the maximum of test scores for each Name. Because a maximum cannot be computed for the string column Subject, this column is dropped before applying maxBy .

maximum = source.dropColumns("Subject").maxBy("Name")

Related documentation#