Histogram Plot

A histogram plot is a data visualization technique commonly used in statistics and data analysis to visualize the distribution of a single continuous variable. It consists of a series of contiguous, non-overlapping bars that provide a visual summary of the frequency or density of data points within predefined intervals or “bins.” The number of bins strongly affects the result: too many bins produce a jagged, disconnected plot, while too few hide the shape of the distribution.
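To make the idea of bins concrete, the following is a minimal plain-Python sketch of equal-width binning. It is an illustration only, not how dx.histogram is implemented; the helper name bin_counts and the sample values are hypothetical.

```python
def bin_counts(values, nbins):
    """Count how many values fall into each of `nbins` equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / nbins or 1.0
    counts = [0] * nbins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), nbins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(nbins + 1)]
    return edges, counts


# Made-up sample values, chosen so the bin edges are exact.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = bin_counts(sample, nbins=4)
print(edges)   # bin boundaries
print(counts)  # one count per bin; the counts sum to len(sample)
```

Each bar in a histogram corresponds to one entry of counts, drawn between two neighboring entries of edges.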

Histograms are appropriate when the data contain a continuous variable of interest. If that variable depends on an additional categorical variable, layered histograms created with the by argument may be appropriate.

What are histograms useful for?

  • Data distribution analysis: Histograms are a valuable tool to gain insights into the distribution of a dataset, making it easier to understand the central tendencies, spread, and skewness of the data.
  • Identifying outliers: Histograms help in detecting outliers or anomalies in a dataset by highlighting data points that fall outside the typical distribution.
  • Density estimation: Histograms can serve as the basis for density estimation methods, helping to model and understand underlying data distributions, which is crucial in statistical analysis and machine learning.

Examples

A basic histogram

Visualize the distribution of a single variable by passing the column name to either the x or the y argument.

import deephaven.plot.express as dx
iris = dx.data.iris()

# subset to get specific species
setosa = iris.where("Species == `setosa`")

# control the plot orientation using `x` or `y`
hist_plot_x = dx.histogram(setosa, x="SepalLength")
hist_plot_y = dx.histogram(setosa, y="SepalLength")

Control the number of bins by setting nbins to the desired count.

import deephaven.plot.express as dx
iris = dx.data.iris()

# subset to get specific species
virginica = iris.where("Species == `virginica`")

# too many bins will produce jagged, disconnected histograms
hist_20_bins = dx.histogram(virginica, x="SepalLength", nbins=20)

# too few bins will mask distributional information
hist_3_bins = dx.histogram(virginica, x="SepalLength", nbins=3)

# play with the `nbins` parameter to get a good visualization
hist_8_bins = dx.histogram(virginica, x="SepalLength", nbins=8)
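When experimenting with nbins, a rule of thumb can provide a starting point. The sketch below uses Sturges' rule, a general statistical heuristic that is not part of the plotting library; the helper name sturges_bins is hypothetical, and the row count of 50 is an assumption about the size of the subset being plotted.

```python
import math


def sturges_bins(n_rows):
    """Suggest a bin count with Sturges' rule: ceil(log2(n)) + 1."""
    if n_rows < 1:
        raise ValueError("need at least one observation")
    return math.ceil(math.log2(n_rows)) + 1


# Assumption: the table being plotted holds 50 rows.
suggested = sturges_bins(50)
print(suggested)
```

The suggested value can then be passed as nbins and adjusted by eye. Sturges' rule tends to suit roughly bell-shaped data and may undersmooth or oversmooth other distributions.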

Bin and aggregate on different columns

When both x and y are specified, one column is binned and the other is aggregated within each bin. If the plot orientation is vertical ("v"), the x column is binned and the y column is aggregated. If the orientation is horizontal ("h"), the roles are swapped.

import deephaven.plot.express as dx
iris = dx.data.iris()

# subset to get specific species
setosa = iris.where("Species == `setosa`")

# The default orientation is "v" (vertical) and the default aggregation function is "sum"
hist_v = dx.histogram(setosa, x="SepalLength", y="SepalWidth")

# Control the plot orientation using orientation
hist_h = dx.histogram(setosa, x="SepalLength", y="SepalWidth", orientation="h")

# Control the aggregation function using histfunc
hist_avg = dx.histogram(setosa, x="SepalLength", y="SepalWidth", histfunc="avg")
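The binning and aggregation described above can be sketched in plain Python for the vertical case: x values decide the bin, and the paired y values are combined within each bin. This is an illustration under simple assumptions (equal-width bins, only three of the listed histfunc options), not the library's implementation; the helper name binned_aggregate and the data are hypothetical.

```python
def binned_aggregate(xs, ys, nbins, histfunc="sum"):
    """Bin `xs` into equal-width bins and aggregate the paired `ys` per bin."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / nbins or 1.0
    groups = [[] for _ in range(nbins)]
    for x, y in zip(xs, ys):
        # The x value picks the bin; the y value is what gets aggregated.
        idx = min(int((x - lo) / width), nbins - 1)
        groups[idx].append(y)
    if histfunc == "sum":
        return [sum(g) for g in groups]
    if histfunc == "avg":
        # An empty bin has no average; None marks it in this sketch.
        return [sum(g) / len(g) if g else None for g in groups]
    if histfunc == "count":
        return [len(g) for g in groups]
    raise ValueError(f"unsupported histfunc in this sketch: {histfunc}")


# Made-up paired values.
xs = [0, 1, 2, 3, 4, 5, 6, 8]
ys = [10, 20, 30, 40, 50, 60, 70, 80]
sums = binned_aggregate(xs, ys, nbins=2, histfunc="sum")
avgs = binned_aggregate(xs, ys, nbins=2, histfunc="avg")
print(sums, avgs)
```

For a horizontal orientation the same logic applies with the roles of the two columns exchanged.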

Distributions of several groups

Histograms can also be used to compare the distributional properties of different groups of data, though they may be a little harder to read than box plots or violin plots. Pass the name of the grouping column(s) to the by argument.

import deephaven.plot.express as dx
iris = dx.data.iris()

# by default, the bars for each group are drawn side-by-side within each bin
stacked_hist = dx.histogram(iris, x="SepalLength", by="Species")

# or, each bin may be overlaid with the others
overlay_hist = dx.histogram(iris, x="SepalLength", by="Species", barmode="overlay")
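To see what a grouped histogram summarizes, the sketch below counts values per group and per bin in plain Python. It assumes that all groups share one set of bin edges, which is a simplification for illustration and not a statement about the library's internals; the helper name grouped_bin_counts and the rows are hypothetical.

```python
from collections import defaultdict


def grouped_bin_counts(rows, nbins):
    """Count (group, value) rows per group, using shared equal-width bins."""
    values = [v for _, v in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0
    counts = defaultdict(lambda: [0] * nbins)
    for group, v in rows:
        idx = min(int((v - lo) / width), nbins - 1)
        counts[group][idx] += 1
    return dict(counts)


# Made-up (group, value) rows.
rows = [("a", 0), ("a", 1), ("a", 5), ("b", 4), ("b", 6), ("b", 8)]
per_group = grouped_bin_counts(rows, nbins=2)

# When bars are stacked, the height of each stack is the per-bin total.
stacked_heights = [sum(col) for col in zip(*per_group.values())]
print(per_group, stacked_heights)
```

Side-by-side and overlaid bars show the per-group counts directly, while stacked bars add them up bin by bin.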

API Reference

Returns a histogram.

Returns: DeephavenFigure. A DeephavenFigure that contains the histogram.

Parameters (name, type, default, description):

  • table (PartitionedTable | Table | DataFrame): A table to pull data from.
  • x (str | list[str] | None, default None): A column name or list of columns that contain x-axis values. Column values must be numeric. If x is specified, the bars are drawn vertically by default.
  • y (str | list[str] | None, default None): A column name or list of columns that contain y-axis values. Column values must be numeric. If only y is specified, the bars are drawn horizontally by default.
  • by (str | list[str] | None, default None): A column or list of columns that contain values to plot the figure traces by. All values or combinations of values map to a unique design. The variable by_vars specifies which design elements are used. This is overridden if any specialized design variables such as color are specified.
  • by_vars (str | list[str], default 'color'): A string or list of strings that contain design elements to plot by. Can contain color. If associated maps or sequences are specified, they are used to map by column values to designs. Otherwise, default values are used.
  • color (str | list[str] | None, default None): A column or list of columns that contain color values. The value is used for a plot by on color. See color_discrete_map for additional behaviors.
  • pattern_shape (str | list[str] | None, default None): A column or list of columns that contain pattern shape values. The value is used for a plot by on pattern shape. See pattern_shape_map for additional behaviors.
  • labels (dict[str, str] | None, default None): A dictionary of labels mapping columns to new labels.
  • color_discrete_sequence (list[str] | None, default None): A list of colors to sequentially apply to the series. The colors loop, so if there are more series than colors, colors will be reused.
  • color_discrete_map (dict[str | tuple[str], str] | None, default None): If dict, the keys should be strings of the column values (or a tuple of combinations of column values) which map to colors.
  • pattern_shape_sequence (list[str] | None, default None): A list of patterns to sequentially apply to the series. The patterns loop, so if there are more series than patterns, patterns will be reused.
  • pattern_shape_map (dict[str | tuple[str], str] | None, default None): If dict, the keys should be strings of the column values (or a tuple of combinations of column values) which map to patterns.
  • marginal (str | None, default None): The type of marginal plot: histogram, violin, rug, or box.
  • opacity (float | None, default None): Opacity to apply to all markers. 0 is completely transparent and 1 is completely opaque.
  • orientation (Literal['v', 'h'] | None, default None): The orientation of the bars. If 'v', the bars are vertical. If 'h', the bars are horizontal. Defaults to 'v' if x is specified. Defaults to 'h' if only y is specified.
  • barmode (str, default 'group'): If 'relative', bars are stacked. If 'overlay', bars are drawn on top of each other. If 'group', bars are drawn next to each other.
  • barnorm (str, default None): If 'fraction', the value of the bar is divided by all bars at that location. If 'percentage', the result is the same but multiplied by 100.
  • histnorm (str, default None): If 'probability', the value at this bin is divided out of the total of all bins in this column. If 'percent', the result is the same as 'probability' but multiplied by 100. If 'density', the value is divided by the width of the bin. If 'probability density', the value is divided out of the total of all bins in this column and the width of the bin.
  • log_x (bool, default False): A boolean that specifies whether the x-axis is a log axis.
  • log_y (bool, default False): A boolean that specifies whether the y-axis is a log axis.
  • range_x (list[int] | None, default None): A list of two numbers that specify the range of the x-axis.
  • range_y (list[int] | None, default None): A list of two numbers that specify the range of the y-axis.
  • range_bins (list[int], default None): A list of two numbers that specify the range of data that is used.
  • histfunc (str, default None): The function to use when aggregating within bins. One of 'abs_sum', 'avg', 'count', 'count_distinct', 'max', 'median', 'min', 'std', 'sum', or 'var'. Defaults to 'count' if only one of x or y is specified and 'sum' if both are.
  • cumulative (bool, default False): If True, values are cumulative.
  • nbins (int, default 10): The number of bins to use.
  • text_auto (bool | str, default False): If True, display the value at each bar. If a string, specifies a plotly texttemplate.
  • title (str | None, default None): The title of the chart.
  • template (str | None, default None): The template for the chart.
  • unsafe_update_figure (Callable, default <function default_callback>): An update function that takes a plotly figure as an argument and optionally returns a plotly figure. If a figure is not returned, the plotly figure passed will be assumed to be the return value. Used to add any custom changes to the underlying plotly figure. Note that the existing data traces should not be removed. Modifying traces in a way that breaks data mappings may lead to unexpected behavior.
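The histnorm and cumulative descriptions above amount to simple arithmetic on the per-bin values. The following plain-Python sketch works through that arithmetic as the descriptions state it; it is an illustration only, assumes equal-width bins and a non-zero total, and the helper name apply_histnorm and the counts are hypothetical.

```python
from itertools import accumulate


def apply_histnorm(counts, bin_width, histnorm=None):
    """Normalize per-bin values following the histnorm descriptions."""
    total = sum(counts)  # assumed to be non-zero in this sketch
    if histnorm is None:
        return list(counts)
    if histnorm == "probability":
        return [c / total for c in counts]
    if histnorm == "percent":
        return [100 * c / total for c in counts]
    if histnorm == "density":
        return [c / bin_width for c in counts]
    if histnorm == "probability density":
        return [c / (total * bin_width) for c in counts]
    raise ValueError(f"unknown histnorm: {histnorm}")


# Made-up per-bin counts with a bin width of 2.0.
counts = [3, 5, 1, 1]
probabilities = apply_histnorm(counts, bin_width=2.0, histnorm="probability")
densities = apply_histnorm(counts, bin_width=2.0, histnorm="probability density")

# A cumulative histogram shows running totals instead of per-bin values.
running_totals = list(accumulate(counts))
print(probabilities, densities, running_totals)
```

With 'probability' the bar heights sum to 1, and with 'probability density' the bar areas (height times bin width) sum to 1.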