Histogram Plot

A histogram plot is a data visualization technique commonly used in statistics and data analysis to visualize the distribution of a single continuous variable. It consists of a series of contiguous, non-overlapping bars that provide a visual summary of the frequency or density of data points within predefined intervals or “bins.” The number of bins matters: too many bins produce a jagged plot, while too few hide the shape of the distribution.

Histograms are appropriate when the data contain a continuous variable of interest. If the variable of interest also depends on a categorical variable, layered histograms created with the by argument may be appropriate.

What are histograms useful for?

Data distribution analysis: Histograms are a valuable tool to gain insights into the distribution of a dataset, making it easier to understand the central tendencies, spread, and skewness of the data.
Identifying outliers: Histograms help in detecting outliers or anomalies in a dataset by highlighting data points that fall outside the typical distribution.
Density estimation: Histograms can serve as the basis for density estimation methods, helping to model and understand underlying data distributions, which is crucial in statistical analysis and machine learning.

Examples

A basic histogram

Visualize the distribution of a single variable by passing the column name to either the x or the y argument.

import deephaven.plot.express as dx
iris = dx.data.iris()

# subset to get specific species
setosa = iris.where("Species == `setosa`")

# control the plot orientation using `x` or `y`
hist_plot_x = dx.histogram(setosa, x="SepalLength")
hist_plot_y = dx.histogram(setosa, y="SepalLength")

Control the number of bins, and therefore the width of each bin, by setting nbins to the desired number of bins.

import deephaven.plot.express as dx
iris = dx.data.iris()

# subset to get specific species
virginica = iris.where("Species == `virginica`")

# too many bins will produce jagged, disconnected histograms
hist_20_bins = dx.histogram(virginica, x="SepalLength", nbins=20)

# too few bins will mask distributional information
hist_3_bins = dx.histogram(virginica, x="SepalLength", nbins=3)

# play with the `nbins` parameter to get a good visualization
hist_8_bins = dx.histogram(virginica, x="SepalLength", nbins=8)
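
To see why the bin count changes the picture, the binning step can be sketched in plain Python. This is only an illustration, assuming equal-width bins that span the minimum to the maximum of the data; it is not how the library computes bins internally. The function bin_counts and the sample values are hypothetical.

```python
def bin_counts(values, nbins):
    """Count how many values fall into each of `nbins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    if width == 0:
        # All values are identical: fall back to a nominal width.
        width = 1.0
    counts = [0] * nbins
    for v in values:
        # The maximum value is placed in the last bin instead of a new one.
        idx = min(int((v - lo) / width), nbins - 1)
        counts[idx] += 1
    return counts


# Hypothetical sample values, not taken from the iris data set.
sample = [4.9, 5.6, 5.8, 5.8, 6.0, 6.3, 6.3, 6.4, 6.5, 6.7, 6.9, 7.2, 7.7, 7.9]

# Compare a coarse, a moderate, and a fine binning of the same data.
for nbins in (3, 8, 20):
    print(nbins, bin_counts(sample, nbins))
```

With 14 values and 20 bins, at least six bins must be empty, which is what appears as gaps in a histogram with too many bins.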

Bin and aggregate on different columns

If the plot orientation is vertical ("v"), the x column is binned and the y column is aggregated within each bin. If the plot orientation is horizontal ("h"), the roles are swapped: the y column is binned and the x column is aggregated.

import deephaven.plot.express as dx
iris = dx.data.iris()

# subset to get specific species
setosa = iris.where("Species == `setosa`")

# The default orientation is "v" (vertical) and the default aggregation function is "sum"
hist_v = dx.histogram(setosa, x="SepalLength", y="SepalWidth")

# Control the plot orientation using orientation
hist_h = dx.histogram(setosa, x="SepalLength", y="SepalWidth", orientation="h")

# Control the aggregation function using histfunc
hist_avg = dx.histogram(setosa, x="SepalLength", y="SepalWidth", histfunc="avg")
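
The idea of binning one column and aggregating another can be sketched in plain Python. This is a minimal illustration under stated assumptions (equal-width bins, only three aggregation functions, and None reported for the average of an empty bin), not the library's implementation. The function binned_aggregate is hypothetical.

```python
def binned_aggregate(xs, ys, nbins, histfunc="sum"):
    """Bin `xs` into equal-width bins and aggregate the paired `ys` per bin."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / nbins
    if width == 0:
        # All x values are identical: fall back to a nominal width.
        width = 1.0
    groups = [[] for _ in range(nbins)]
    for x, y in zip(xs, ys):
        # The bin is chosen from x; the value collected is the paired y.
        idx = min(int((x - lo) / width), nbins - 1)
        groups[idx].append(y)
    if histfunc == "sum":
        return [sum(g) for g in groups]
    if histfunc == "avg":
        # Assumption: an empty bin has no average, so None is reported.
        return [sum(g) / len(g) if g else None for g in groups]
    if histfunc == "count":
        return [len(g) for g in groups]
    raise ValueError(f"unsupported histfunc: {histfunc}")


# Hypothetical paired values: xs are binned, ys are aggregated.
xs = [0, 1, 2, 3, 4]
ys = [10, 20, 30, 40, 50]
print(binned_aggregate(xs, ys, nbins=2, histfunc="sum"))
print(binned_aggregate(xs, ys, nbins=2, histfunc="avg"))
```

For a horizontal plot the same logic applies with the two columns swapped.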

Distributions of several groups

Histograms can also be used to compare the distributional properties of different groups of data, though they may be a little harder to read than box plots or violin plots. Pass the name of the grouping column(s) to the by argument.

import deephaven.plot.express as dx
iris = dx.data.iris()

# by default (barmode="group"), the bars for each group are drawn side by side within each bin
stacked_hist = dx.histogram(iris, x="SepalLength", by="Species")

# or, each bin may be overlaid with the others
overlay_hist = dx.histogram(iris, x="SepalLength", by="Species", barmode="overlay")
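
Comparing groups is only meaningful when every group is counted against the same bin edges. The sketch below illustrates that idea in plain Python. It assumes the edges are derived from the combined data of all groups, which is an assumption made for this example and not a statement about the library's internals. The function grouped_bin_counts and the sample rows are hypothetical.

```python
from collections import defaultdict


def grouped_bin_counts(rows, nbins):
    """Count values per group using bin edges shared by every group.

    `rows` is a list of (group, value) pairs.
    """
    values = [v for _, v in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    if width == 0:
        # All values are identical: fall back to a nominal width.
        width = 1.0
    counts = defaultdict(lambda: [0] * nbins)
    for group, v in rows:
        # Every group uses the same `lo` and `width`, so bins line up.
        idx = min(int((v - lo) / width), nbins - 1)
        counts[group][idx] += 1
    return dict(counts)


# Hypothetical (group, value) pairs.
rows = [("a", 0), ("a", 1), ("b", 3), ("b", 4), ("a", 4)]
per_group = grouped_bin_counts(rows, nbins=2)
print(per_group)

# Summing the groups bin by bin gives the height of a stacked bar.
stacked = [sum(col) for col in zip(*per_group.values())]
print(stacked)
```

Whether the per-group bars are then drawn side by side, on top of each other, or stacked is purely a display choice made after counting.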

API Reference

Returns a histogram

Returns: DeephavenFigure (a DeephavenFigure that contains the histogram)

Parameters	Type	Default	Description
table	PartitionedTable \| Table \| DataFrame		A table to pull data from.
x	str \| list[str] \| None	None	A column name or list of columns that contain x-axis values. Column values must be numeric. If x is specified, the bars are drawn vertically by default.
y	str \| list[str] \| None	None	A column name or list of columns that contain y-axis values. Column values must be numeric. If only y is specified, the bars are drawn horizontally by default.
by	str \| list[str] \| None	None	A column or list of columns that contain values to plot the figure traces by. All values or combination of values map to a unique design. The variable by_vars specifies which design elements are used. This is overridden if any specialized design variables such as color are specified.
by_vars	str \| list[str]	'color'	A string or list of strings that contain design elements to plot by. Can contain color. If associated maps or sequences are specified, they are used to map by column values to designs. Otherwise, default values are used.
filter_by	str \| list[str] \| bool \| None	None	A column or list of columns that contain values to filter the chart by. If a boolean is passed and the table is partitioned, all partition key columns used to create the partitions are used. If no filters are specified, all partitions are shown on the chart.
required_filter_by	str \| list[str] \| bool \| None	None	A column or list of columns that contain values to filter the chart by. Values set in input filters or linkers for the relevant columns determine the exact values to display. If a boolean is passed and the table is partitioned, all partition key columns used to create the partitions are used. All required input filters or linkers must be set for the chart to display any data.
color	str \| list[str] \| None	None	A column or list of columns that contain color values. The value is used for a plot by on color. See color_discrete_map for additional behaviors.
pattern_shape	str \| list[str] \| None	None	A column or list of columns that contain pattern shape values. The value is used for a plot by on pattern shape. See pattern_shape_map for additional behaviors.
labels	dict[str, str] \| None	None	A dictionary of labels mapping columns to new labels.
color_discrete_sequence	list[str] \| None	None	A list of colors to sequentially apply to the series. The colors loop, so if there are more series than colors, colors will be reused.
color_discrete_map	dict[str \| tuple[str], str] \| None	None	If dict, the keys should be strings of the column values (or a tuple of combinations of column values) which map to colors.
pattern_shape_sequence	list[str] \| None	None	A list of patterns to sequentially apply to the series. The patterns loop, so if there are more series than patterns, patterns will be reused.
pattern_shape_map	dict[str \| tuple[str], str] \| None	None	If dict, the keys should be strings of the column values (or a tuple of combinations of column values) which map to patterns.
marginal	str \| None	None	The type of marginal plot. One of 'histogram', 'violin', 'rug', or 'box'.
opacity	float \| None	None	Opacity to apply to all markers. 0 is completely transparent and 1 is completely opaque.
orientation	Literal['v', 'h'] \| None	None	The orientation of the bars. If 'v', the bars are vertical. If 'h', the bars are horizontal. Defaults to 'v' if x is specified. Defaults to 'h' if only y is specified.
barmode	str	'group'	If 'relative', bars are stacked. If 'overlay', bars are drawn on top of each other. If 'group', bars are drawn next to each other.
barnorm	str	None	If 'fraction', the value of the bar is divided by all bars at that location. If 'percentage', the result is the same but multiplied by 100.
histnorm	str	None	If 'probability', the value at this bin is divided by the total of all bins in this column. If 'percent', the result is the same as 'probability' but multiplied by 100. If 'density', the value is divided by the width of the bin. If 'probability density', the value is divided by both the total of all bins in this column and the width of the bin.
log_x	bool	False	A boolean that specifies if the corresponding axis is a log axis or not.
log_y	bool	False	A boolean that specifies if the corresponding axis is a log axis or not.
range_x	list[int] \| None	None	A list of two numbers that specify the range of the x-axis.
range_y	list[int] \| None	None	A list of two numbers that specify the range of the y-axis.
range_bins	list[int]	None	A list of two numbers that specify the range of data that is used.
histfunc	str	None	The function to use when aggregating within bins. One of 'abs_sum', 'avg', 'count', 'count_distinct', 'max', 'median', 'min', 'std', 'sum', or 'var'. Defaults to 'count' if only one of x or y is specified and 'sum' if both are.
cumulative	bool	False	If True, values are cumulative.
nbins	int	10	The number of bins to use.
text_auto	bool \| str	False	If True, display the value at each bar. If a string, specifies a plotly texttemplate.
title	str \| None	None	The title of the chart.
template	str \| None	None	The template for the chart.
unsafe_update_figure	Callable	<function default_callback>	An update function that takes a plotly figure as an argument and optionally returns a plotly figure. If a figure is not returned, the plotly figure passed will be assumed to be the return value. Used to add any custom changes to the underlying plotly figure. Note that the existing data traces should not be removed. This may lead to unexpected behavior if traces are modified in a way that break data mappings.
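
The histnorm options described in the table above amount to simple arithmetic on the raw bin values. The following sketch spells that arithmetic out in plain Python for a single column with bins of equal width. It is an illustration of the descriptions in the table, not the library's implementation, and the function normalize and the sample counts are hypothetical.

```python
def normalize(counts, bin_width, histnorm=None):
    """Apply a histnorm-style normalization to a list of raw bin values."""
    if histnorm is None:
        return list(counts)
    if histnorm == "density":
        # Divide each value by the width of its bin.
        return [c / bin_width for c in counts]
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize bins whose total is zero")
    if histnorm == "probability":
        # Divide each value by the total of all bins.
        return [c / total for c in counts]
    if histnorm == "percent":
        # Same as 'probability', multiplied by 100.
        return [100 * c / total for c in counts]
    if histnorm == "probability density":
        # Divide by both the total of all bins and the bin width.
        return [c / (total * bin_width) for c in counts]
    raise ValueError(f"unsupported histnorm: {histnorm}")


# Hypothetical raw counts for four bins, each 0.5 units wide.
counts = [1, 3, 4, 2]
for mode in (None, "probability", "percent", "density", "probability density"):
    print(mode, normalize(counts, bin_width=0.5, histnorm=mode))
```

With 'probability' the bar heights sum to 1, while with 'probability density' it is the bar areas (height times bin width) that sum to 1.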