Normalizing Histograms
Question:
Hi, I am plotting three histograms that have different total frequencies, and I want to normalize them so that the total frequencies are the same.
As you can see from the picture, the three sets have different total frequencies. I want to normalize them so that they all have the same total, while preserving the proportion of the frequency at each value on the x-axis.
Here’s the code I am using to plot the histograms:
setA = [22.972972972972972, 0.0, 0.0, 27.5, 25.0, 18.64406779661017, 8.88888888888889, 20.512820512820515, 11.11111111111111, 15.151515151515152, 17.741935483870968, 13.333333333333334, 16.923076923076923, 12.820512820512821, 27.77777777777778, 4.0, 0.0, 15.625, 14.814814814814815, 7.142857142857143, 15.384615384615385, 14.545454545454545, 38.095238095238095, 17.647058823529413, 21.951219512195124, 21.428571428571427, 32.432432432432435, 10.526315789473685, 36.8421052631579, 13.114754098360656, 17.91044776119403, 12.64367816091954, 16.0, 22.727272727272727, 18.181818181818183, 9.523809523809524, 17.105263157894736, 11.904761904761905, 20.58823529411765, 10.714285714285714, 15.686274509803921, 27.5, 16.129032258064516, 21.333333333333332, 40.90909090909091, 11.904761904761905, 13.157894736842104]
setB = [1.492537313432836, 3.5714285714285716, 17.94871794871795, 11.363636363636363, 13.513513513513514, 14.285714285714286, 15.686274509803921, 17.94871794871795, 9.090909090909092, 41.07142857142857, 10.714285714285714, 25.0, 20.0, 40.0, 13.333333333333334, 13.793103448275861, 3.5714285714285716, 17.073170731707318, 25.675675675675677, 15.625, 17.46031746031746, 8.333333333333334, 18.64406779661017, 14.285714285714286, 0.0, 6.0606060606060606, 6.976744186046512, 18.181818181818183, 26.785714285714285, 22.80701754385965, 6.666666666666667, 12.5]
setC = [13.846153846153847, 23.076923076923077, 25.0, 10.714285714285714, 16.666666666666668, 9.75609756097561, 10.0, 10.0, 17.857142857142858, 20.0, 9.75609756097561, 26.470588235294116, 12.5, 13.333333333333334, 4.3478260869565215, 5.882352941176471, 14.545454545454545, 13.333333333333334, 8.571428571428571, 11.764705882352942, 0.0]
import matplotlib.pyplot as plt
plt.figure('sets')
n, bins, patches = plt.hist(setA, 20, alpha=0.40 , label = 'setA')
n, bins, patches = plt.hist(setB, 20, alpha=0.40 , label = 'setB')
n, bins, patches = plt.hist(setC, 20, alpha=0.40 , label = 'setC')
plt.xlabel('Set')
plt.ylabel('Frequency')
plt.title('Different Sets that need to be normalised')
plt.legend()
plt.grid(True)
plt.show()
As a bonus, since my aim is to compare the distributions of the three sets, is there a better way to visualise the histograms so that they can be compared graphically?
Answers:
You could normalise the histograms using the density=True option (called normed=True in older matplotlib versions; normed was deprecated and later removed). Each histogram is then scaled so that its area is 1.
You could also make the plot look a bit tidier by using the same fixed bins for all three histograms (using the bins option to hist: bins = np.arange(0,48,2), for example).
Try this:
import numpy as np
...
mybins = np.arange(0,48,2)
n, bins, patches = plt.hist(setA, bins=mybins, alpha=0.40, label='setA', density=True)
n, bins, patches = plt.hist(setB, bins=mybins, alpha=0.40, label='setB', density=True)
n, bins, patches = plt.hist(setC, bins=mybins, alpha=0.40, label='setC', density=True)
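To see what this area normalisation does, here is a minimal sketch using np.histogram, which applies the same density scaling. The two small samples and the helper name density_area are made up for illustration; they are not the sets from the question.

```python
import numpy as np

# Small illustrative samples of different sizes (not the sets from the question).
small = [1.0, 3.0, 3.5, 7.0]
large = [0.5, 2.0, 2.5, 4.0, 4.5, 6.0, 6.5, 7.5]
edges = np.arange(0, 10, 2)  # shared bin edges: 0, 2, 4, 6, 8


def density_area(values, edges):
    """Return the area under a density-normalised histogram."""
    heights, edges = np.histogram(values, bins=edges, density=True)
    # Area = sum of (bar height * bar width) over all bins.
    return float(np.sum(heights * np.diff(edges)))


area_small = density_area(small, edges)
area_large = density_area(large, edges)
print(area_small, area_large)
```

Both areas come out as 1 (up to floating-point error) even though the samples have different sizes, which is why the normalised histograms become comparable.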
Another option is to plot all three histograms in one call to plt.hist, in which case you can use the stacked=True option, which can further clean up your plot.
Note: this method normalises the three histograms together, so the total integral of the stack is 1. It does not make the three histograms add up to the same value.
n, bins, patches = plt.hist([setA,setB,setC], bins=mybins,
                            label=['setA','setB','setC'],
                            density=True, stacked=True)
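The note above can be checked with a little arithmetic. The sketch below reproduces, with plain NumPy, the scaling that I understand the stacked and normalised call to apply: every layer is divided by the pooled number of observations rather than by its own. The function name stacked_density_areas and the sample data are assumptions for illustration.

```python
import numpy as np


def stacked_density_areas(datasets, edges):
    """Area each dataset contributes when the whole stack integrates to 1."""
    widths = np.diff(edges)
    counts = [np.histogram(d, bins=edges)[0] for d in datasets]
    total = sum(c.sum() for c in counts)
    # Each layer is scaled by the pooled count, not by its own count.
    return [float(np.sum(c / (total * widths) * widths)) for c in counts]


small = [1.0, 3.0, 3.5, 7.0]                        # 4 values
large = [0.5, 2.0, 2.5, 4.0, 4.5, 6.0, 6.5, 7.5]    # 8 values
edges = np.arange(0, 10, 2)

areas = stacked_density_areas([small, large], edges)
print(areas)
```

The larger sample ends up with twice the area of the smaller one, so the layers still reflect the different sample sizes; only the stack as a whole has area 1.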
Or, finally, if a stacked histogram is not to your taste, you can plot the bars next to each other by again plotting all three histograms in one call, but removing the stacked=True option from the line above:
n, bins, patches = plt.hist([setA,setB,setC], bins=mybins,
                            label=['setA','setB','setC'],
                            density=True)
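If the goal is for each set's bar heights (rather than its area) to sum to the same total, one alternative is the weights argument of hist: give every observation a weight of 1/len(set). This is a sketch of that idea, not part of the answer above; the sample data are made up, and the Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Illustrative samples of different sizes (not the sets from the question).
small = np.array([1.0, 3.0, 3.5, 7.0])
large = np.array([0.5, 2.0, 2.5, 4.0, 4.5, 6.0, 6.5, 7.5])
edges = np.arange(0, 10, 2)

datasets = [small, large]
# One weight per observation: 1/len(set), so each set's bars sum to 1.
weights = [np.full(len(d), 1.0 / len(d)) for d in datasets]

fig, ax = plt.subplots()
heights, _, _ = ax.hist(datasets, bins=edges, weights=weights,
                        label=["small", "large"])
ax.set_ylabel("Fraction of set")
ax.legend()

# Total bar height per set; each should be 1 when all data fall inside the bins.
totals = [float(np.sum(h)) for h in heights]
print(totals)
```

Each bar then shows the fraction of its own set that falls in the bin, so the proportions within a set are preserved while the totals match. Drop the matplotlib.use("Agg") line and call plt.show() to view the figure interactively.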
As discussed in the comments, when using stacked=True, the density (formerly normed) option just means that the three histograms together integrate to 1, so they are not normalised in the same way as in the other methods.
To counter this, we can use np.histogram, and plot the results using plt.bar.
For example, using the same data sets from above:
mybins = np.arange(0,48,2)
nA,binsA = np.histogram(setA,bins=mybins,density=True)
nB,binsB = np.histogram(setB,bins=mybins,density=True)
nC,binsC = np.histogram(setC,bins=mybins,density=True)
# Each of these integrates to 1, so divide by 3
# so that the stacked histogram also integrates to 1.
nA/=3.
nB/=3.
nC/=3.
# Use bottom= to set where the bars should begin;
# align='edge' puts the left side of each bar on its bin edge.
plt.bar(binsA[:-1],nA,width=2,align='edge',color='b',label='setA')
plt.bar(binsB[:-1],nB,width=2,align='edge',color='g',label='setB',bottom=nA)
plt.bar(binsC[:-1],nC,width=2,align='edge',color='r',label='setC',bottom=nA+nB)
I personally like this function:
from typing import Optional

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


def get_histogram(array: np.ndarray,
                  xlabel: str,
                  ylabel: str,
                  title: str,
                  dpi: int = 200,  # dots per inch
                  facecolor: str = 'white',
                  bins: Optional[int] = None,
                  show: bool = False,
                  tight_layout: bool = False,
                  linestyle: Optional[str] = '--',
                  alpha: float = 0.75,
                  edgecolor: str = "black",
                  stat: str = 'count',
                  color: Optional[str] = None,
                  ):
    """Plot a histogram of a 1-D array with seaborn.histplot."""
    # - check it's of size (N,)
    array = np.asarray(array)
    assert array.ndim == 1
    # - choose the number of bins
    n: int = array.shape[0]
    if bins is None:
        # square-root rule; the original called a helper that is not shown:
        # get_num_bins(n, option='square_root')
        bins = int(np.ceil(np.sqrt(n)))
    print(f'using this number of {bins=} and data size is {n=}')
    # - plot
    fig = plt.figure(dpi=dpi)
    fig.patch.set_facecolor(facecolor)
    sns.histplot(array, stat=stat, bins=bins, color=color, alpha=alpha, edgecolor=edgecolor)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    if linestyle:
        plt.grid(linestyle=linestyle)
    if tight_layout:
        plt.tight_layout()
    if show:
        plt.show()
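The function above derives the bin count from the sample size using a square-root rule. A minimal sketch of such a helper follows; the name num_bins_for and the extra "sturges" option are assumptions for illustration, since the helper used in the answer is not shown.

```python
import math


def num_bins_for(n: int, option: str = "square_root") -> int:
    """Hypothetical helper: choose a bin count from the sample size n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if option == "square_root":
        # Square-root rule: k = ceil(sqrt(n))
        return math.ceil(math.sqrt(n))
    if option == "sturges":
        # Sturges' rule: k = ceil(log2(n)) + 1
        return math.ceil(math.log2(n)) + 1
    raise ValueError(f"unknown option: {option!r}")


bins_sqrt = num_bins_for(47)
bins_sturges = num_bins_for(21, "sturges")
print(bins_sqrt, bins_sturges)
```

When comparing several sets, it is usually better to compute one set of bin edges and reuse it for all of them, rather than letting each set pick its own count.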
- This can be accomplished with seaborn.histplot, or seaborn.displot with kind='hist'. seaborn is a high-level API for matplotlib.
- See "Figure-level vs. axes-level functions" in the seaborn documentation: histplot is axes-level, displot is figure-level.
- There are three primary parameters that will be of interest in relation to the question.
  - common_norm: If True and using a normalized statistic, the normalization will apply over the full dataset. Otherwise, normalize each histogram independently.
  - multiple: {'layer', 'dodge', 'stack', 'fill'} – how the multiple groups of data are presented.
  - stat: Aggregate statistic to compute in each bin; the dependent axis is labelled according to the selected stat.
    - 'probability': normalize such that bar heights sum to 1
    - 'density': normalize such that the total area of the histogram equals 1
    - There’s also 'count', 'frequency', & 'percent'
- Tested in python 3.11.2, pandas 2.0.0, matplotlib 3.7.1, seaborn 0.12.2
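To make the stat options listed above concrete, here is a sketch that computes each statistic by hand with NumPy for a single sample, following the definitions as I understand them from the seaborn documentation. The sample values and the stats dictionary are illustrative only.

```python
import numpy as np

values = np.array([0.5, 2.0, 2.5, 4.0, 4.5, 6.0, 6.5, 7.5])
edges = np.arange(0, 10, 2)            # bin edges 0, 2, 4, 6, 8
counts, _ = np.histogram(values, bins=edges)
widths = np.diff(edges)

stats = {
    "count": counts.astype(float),                  # observations per bin
    "frequency": counts / widths,                   # count divided by bin width
    "probability": counts / counts.sum(),           # bar heights sum to 1
    "percent": 100 * counts / counts.sum(),         # bar heights sum to 100
    "density": counts / (counts.sum() * widths),    # total area equals 1
}

for name, heights in stats.items():
    print(name, heights)
```

With equal-width bins, 'probability' and 'density' give the same shape and differ only by the constant bin width, so either works for comparing sets drawn on the same bins.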
Imports and Sample Data
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# using the sample sets from the OP
data = {'A': setA, 'B': setB, 'C': setC}
# set some custom bins to compare against the other answer
bins=np.arange(0, 48, 2)
Plots
fig, ax = plt.subplots(figsize=(6.4, 4.3))
sns.histplot(data=data, stat='density', common_norm=True, multiple='dodge', bins=np.arange(0, 48, 2), ax=ax)
g = sns.displot(data=data, kind='hist', stat='density', common_norm=True, multiple='stack', bins=np.arange(0, 48, 2), height=4, aspect=1.25)