Histogram Plot
A histogram plot is a data visualization technique commonly used in statistics and data analysis to visualize the distribution of a single continuous variable. It consists of a series of contiguous, non-overlapping bars that provide a visual summary of the frequency or density of data points within predefined intervals or “bins.” The number of bins significantly impacts the visualization: too few bins hide the shape of the distribution, while too many produce a jagged plot with gaps.
Histograms are appropriate when the data contain a continuous variable of interest. If there is an additional categorical variable that the variable of interest depends on, layered histograms may be appropriate using the `by` argument.
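To make the idea of bins concrete, the following is a minimal pure-Python sketch of equal-width binning. It is an illustration only, not how `dx.histogram` is implemented; the helper `bin_counts` and the sample values are hypothetical, and the sketch assumes the values are not all identical.

```python
def bin_counts(values, nbins):
    """Count how many values fall into each of `nbins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins  # assumes hi > lo
    counts = [0] * nbins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin
        idx = min(int((v - lo) / width), nbins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(nbins + 1)]
    return edges, counts


# Hypothetical measurements, not taken from the iris dataset
sample = [4.0, 4.25, 4.5, 4.75, 5.0, 5.0, 5.25, 5.25, 5.5, 6.0]
edges, counts = bin_counts(sample, nbins=4)
print(edges)   # [4.0, 4.5, 5.0, 5.5, 6.0]
print(counts)  # [2, 2, 4, 2]
```

Each count becomes the height of one bar, and each pair of adjacent edges gives that bar's position and width.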
What are histograms useful for?
- Data distribution analysis: Histograms are a valuable tool to gain insights into the distribution of a dataset, making it easier to understand the central tendencies, spread, and skewness of the data.
- Identifying outliers: Histograms help in detecting outliers or anomalies in a dataset by highlighting data points that fall outside the typical distribution.
- Density estimation: Histograms can serve as the basis for density estimation methods, helping to model and understand underlying data distributions, which is crucial in statistical analysis and machine learning.
Examples
A basic histogram
Visualize the distribution of a single variable by passing the column name to the `x` or `y` argument.
import deephaven.plot.express as dx
iris = dx.data.iris()
# subset to get specific species
setosa = iris.where("Species == `setosa`")
# control the plot orientation using `x` or `y`
hist_plot_x = dx.histogram(setosa, x="SepalLength")
hist_plot_y = dx.histogram(setosa, y="SepalLength")
Modify the number of bins by setting `nbins` equal to the desired number of bins.
import deephaven.plot.express as dx
iris = dx.data.iris()
# subset to get specific species
virginica = iris.where("Species == `virginica`")
# too many bins will produce jagged, disconnected histograms
hist_20_bins = dx.histogram(virginica, x="SepalLength", nbins=20)
# too few bins will mask distributional information
hist_3_bins = dx.histogram(virginica, x="SepalLength", nbins=3)
# play with the `nbins` parameter to get a good visualization
hist_8_bins = dx.histogram(virginica, x="SepalLength", nbins=8)
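The effect of the bin count can also be seen numerically. The sketch below is plain Python, independent of Deephaven; `equal_width_counts` and the sample data are hypothetical. It bins the same values with three different bin counts and reports how many bins end up empty.

```python
def equal_width_counts(values, nbins):
    """Return the number of values in each of `nbins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins  # assumes hi > lo
    counts = [0] * nbins
    for v in values:
        counts[min(int((v - lo) / width), nbins - 1)] += 1
    return counts


# Hypothetical measurements
sample = [4.0, 4.25, 4.5, 4.75, 5.0, 5.0, 5.25, 5.25, 5.5, 6.0]
summary = {n: equal_width_counts(sample, n) for n in (2, 4, 16)}

for n, counts in summary.items():
    print(n, counts, "empty bins:", counts.count(0))
# 2 bins  -> [4, 6], no empty bins, but little shape is visible
# 4 bins  -> [2, 2, 4, 2], no empty bins
# 16 bins -> 8 of the 16 bins are empty, so the bars look disconnected
```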
Bin and aggregate on different columns
If the plot orientation is vertical (`"v"`), the `x` column is binned and the `y` column is aggregated. The operations are flipped if the plot orientation is horizontal.
import deephaven.plot.express as dx
iris = dx.data.iris()
# subset to get specific species
setosa = iris.where("Species == `setosa`")
# The default orientation is "v" (vertical) and the default aggregation function is "sum"
hist_v = dx.histogram(setosa, x="SepalLength", y="SepalWidth")
# Control the plot orientation using `orientation`
hist_h = dx.histogram(setosa, x="SepalLength", y="SepalWidth", orientation="h")
# Control the aggregation function using `histfunc`
hist_avg = dx.histogram(setosa, x="SepalLength", y="SepalWidth", histfunc="avg")
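What "bin on one column, aggregate the other" means can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `binned_aggregate` is a hypothetical helper, it supports only a few of the `histfunc` names, and reporting empty bins as `None` is a choice made for this sketch.

```python
from statistics import mean


def binned_aggregate(xs, ys, nbins, histfunc="sum"):
    """Bin on `xs`, then aggregate the `ys` that land in each bin."""
    funcs = {"sum": sum, "avg": mean, "count": len, "min": min, "max": max}
    agg = funcs[histfunc]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / nbins  # assumes hi > lo
    buckets = [[] for _ in range(nbins)]
    for x, y in zip(xs, ys):
        buckets[min(int((x - lo) / width), nbins - 1)].append(y)
    # Empty bins have nothing to aggregate; this sketch reports None for them
    return [agg(b) if b else None for b in buckets]


# Hypothetical paired measurements
xs = [4.0, 4.25, 4.5, 5.0, 5.5, 6.0]
ys = [3.0, 3.5, 2.5, 3.5, 4.0, 3.0]

print(binned_aggregate(xs, ys, nbins=2, histfunc="sum"))  # [9.0, 10.5]
print(binned_aggregate(xs, ys, nbins=2, histfunc="avg"))  # [3.0, 3.5]
```

For a horizontal orientation the roles swap: the `ys` would be binned and the `xs` aggregated.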
Distributions of several groups
Histograms can also be used to compare the distributional properties of different groups of data, though they may be a little harder to read than box plots or violin plots. Pass the name of the grouping column(s) to the `by` argument.
import deephaven.plot.express as dx
iris = dx.data.iris()
# by default (barmode="group"), the bars for each group are drawn side by side within each bin
stacked_hist = dx.histogram(iris, x="SepalLength", by="Species")
# or, each bin may be overlaid with the others
overlay_hist = dx.histogram(iris, x="SepalLength", by="Species", barmode="overlay")
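Conceptually, plotting by a group column yields one set of bin counts per group. The sketch below shows this in plain Python. It assumes all groups share the same bin edges (computed over the whole dataset), which is an assumption of this illustration rather than a documented guarantee; `grouped_counts` and the rows are hypothetical.

```python
from collections import defaultdict


def grouped_counts(rows, nbins):
    """`rows` is a list of (group, value) pairs; returns {group: counts}."""
    values = [v for _, v in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins  # shared edges across groups; assumes hi > lo
    counts = defaultdict(lambda: [0] * nbins)
    for group, v in rows:
        counts[group][min(int((v - lo) / width), nbins - 1)] += 1
    return dict(counts)


# Hypothetical grouped measurements
rows = [("a", 4.0), ("a", 4.5), ("a", 5.0),
        ("b", 5.0), ("b", 5.5), ("b", 6.0), ("b", 6.0)]
per_group = grouped_counts(rows, nbins=2)
print(per_group)  # {'a': [2, 1], 'b': [0, 4]}

# With stacked bars, the total bar height per bin is the element-wise sum
stacked_totals = [sum(col) for col in zip(*per_group.values())]
print(stacked_totals)  # [2, 5]
```

Grouped and overlaid bars draw the same per-group counts; only the placement of the bars differs.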
API Reference
Returns a histogram
Returns: DeephavenFigure
A DeephavenFigure that contains the histogram
Parameters | Type | Default | Description |
---|---|---|---|
table | PartitionedTable | Table | DataFrame | | A table to pull data from. |
x | str | list[str] | None | None | A column name or list of columns that contain x-axis values. Column values must be numeric. If x is specified, the bars are drawn vertically by default. |
y | str | list[str] | None | None | A column name or list of columns that contain y-axis values. Column values must be numeric. If only y is specified, the bars are drawn horizontally by default. |
by | str | list[str] | None | None | A column or list of columns that contain values to plot the figure traces by. All values or combinations of values map to a unique design. The variable by_vars specifies which design elements are used. This is overridden if any specialized design variables such as color are specified. |
by_vars | str | list[str] | 'color' | A string or list of strings that contain design elements to plot by. Can contain color. If associated maps or sequences are specified, they are used to map by column values to designs. Otherwise, default values are used. |
color | str | list[str] | None | None | A column or list of columns that contain color values. The value is used for a plot by on color. See color_discrete_map for additional behaviors. |
pattern_shape | str | list[str] | None | None | A column or list of columns that contain pattern shape values. The value is used for a plot by on pattern shape. See pattern_shape_map for additional behaviors. |
labels | dict[str, str] | None | None | A dictionary of labels mapping columns to new labels. |
color_discrete_sequence | list[str] | None | None | A list of colors to sequentially apply to the series. The colors loop, so if there are more series than colors, colors will be reused. |
color_discrete_map | dict[str | tuple[str], str] | None | None | If dict, the keys should be strings of the column values (or a tuple of combinations of column values) which map to colors. |
pattern_shape_sequence | list[str] | None | None | A list of patterns to sequentially apply to the series. The patterns loop, so if there are more series than patterns, patterns will be reused. |
pattern_shape_map | dict[str | tuple[str], str] | None | None | If dict, the keys should be strings of the column values (or a tuple of combinations of column values) which map to patterns. |
marginal | str | None | None | The type of marginal plot to draw. One of 'histogram', 'violin', 'rug', or 'box'. |
opacity | float | None | None | Opacity to apply to all markers. 0 is completely transparent and 1 is completely opaque. |
orientation | Literal['v', 'h'] | None | None | The orientation of the bars. If 'v', the bars are vertical. If 'h', the bars are horizontal. Defaults to 'v' if x is specified. Defaults to 'h' if only y is specified. |
barmode | str | 'group' | If 'relative', bars are stacked. If 'overlay', bars are drawn on top of each other. If 'group', bars are drawn next to each other. |
barnorm | str | None | If 'fraction', the value of the bar is divided by the sum of all bars at that location. If 'percentage', the result is the same but multiplied by 100. |
histnorm | str | None | If 'probability', the value at this bin is divided by the total of all bins in this column. If 'percent', the result is the same as 'probability' but multiplied by 100. If 'density', the value is divided by the width of the bin. If 'probability density', the value is divided by both the total of all bins in this column and the width of the bin. |
log_x | bool | False | A boolean that specifies if the corresponding axis is a log axis or not. |
log_y | bool | False | A boolean that specifies if the corresponding axis is a log axis or not. |
range_x | list[int] | None | None | A list of two numbers that specify the range of the x-axis. |
range_y | list[int] | None | None | A list of two numbers that specify the range of the y-axis. |
range_bins | list[int] | None | A list of two numbers that specify the range of data that is used. |
histfunc | str | None | The function to use when aggregating within bins. One of 'abs_sum', 'avg', 'count', 'count_distinct', 'max', 'median', 'min', 'std', 'sum', or 'var'. Defaults to 'count' if only one of x or y is specified and 'sum' if both are. |
cumulative | bool | False | If True, values are cumulative. |
nbins | int | 10 | The number of bins to use. |
text_auto | bool | str | False | If True, display the value at each bar. If a string, specifies a plotly texttemplate. |
title | str | None | None | The title of the chart. |
template | str | None | None | The template for the chart. |
unsafe_update_figure | Callable | <function default_callback> | An update function that takes a plotly figure as an argument and optionally returns a plotly figure. If a figure is not returned, the plotly figure passed will be assumed to be the return value. Used to add any custom changes to the underlying plotly figure. Note that the existing data traces should not be removed. Modifying traces in a way that breaks data mappings may lead to unexpected behavior. |
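The histnorm and cumulative descriptions in the table can be read as simple arithmetic on raw bin counts. The following plain-Python sketch follows those descriptions literally. It is not the library's code; `normalize`, the counts, and the bin width are hypothetical, and equal-width bins are assumed.

```python
from itertools import accumulate


def normalize(counts, bin_width, histnorm=None):
    """Apply a histnorm mode to raw bin counts (equal-width bins assumed)."""
    total = sum(counts)
    if histnorm is None:
        return list(counts)
    if histnorm == "probability":
        return [c / total for c in counts]
    if histnorm == "percent":
        return [100 * c / total for c in counts]
    if histnorm == "density":
        return [c / bin_width for c in counts]
    if histnorm == "probability density":
        return [c / (total * bin_width) for c in counts]
    raise ValueError(f"unknown histnorm: {histnorm!r}")


# Hypothetical raw counts for four bins, each 0.5 units wide
counts = [2, 2, 4, 2]
width = 0.5

print(normalize(counts, width, "probability"))          # [0.2, 0.2, 0.4, 0.2]
print(normalize(counts, width, "percent"))              # [20.0, 20.0, 40.0, 20.0]
print(normalize(counts, width, "density"))              # [4.0, 4.0, 8.0, 4.0]
print(normalize(counts, width, "probability density"))  # [0.4, 0.4, 0.8, 0.4]

# cumulative=True corresponds to a running total over the bins
running = list(accumulate(counts))
print(running)  # [2, 4, 8, 10]
```

With 'probability density', the bar areas (height times bin width) sum to 1, which is what makes the result comparable to a probability density function.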