Normalizing Histograms

Question:

Hi, I am plotting three histograms that have different total frequencies, and I want to normalize them so that the total frequencies are the same.

[Image: overlaid histograms of setA, setB and setC]

As you can see from the picture, the three sets have different total frequencies. I want to normalize them so that they have the same total frequency, while preserving the proportion of the frequency at each value on the x-axis.

Here’s the code I am using to plot the histograms:

setA = [22.972972972972972, 0.0, 0.0, 27.5, 25.0, 18.64406779661017, 8.88888888888889, 20.512820512820515, 11.11111111111111, 15.151515151515152, 17.741935483870968, 13.333333333333334, 16.923076923076923, 12.820512820512821, 27.77777777777778, 4.0, 0.0, 15.625, 14.814814814814815, 7.142857142857143, 15.384615384615385, 14.545454545454545, 38.095238095238095, 17.647058823529413, 21.951219512195124, 21.428571428571427, 32.432432432432435, 10.526315789473685, 36.8421052631579, 13.114754098360656, 17.91044776119403, 12.64367816091954, 16.0, 22.727272727272727, 18.181818181818183, 9.523809523809524, 17.105263157894736, 11.904761904761905, 20.58823529411765, 10.714285714285714, 15.686274509803921, 27.5, 16.129032258064516, 21.333333333333332, 40.90909090909091, 11.904761904761905, 13.157894736842104]
setB = [1.492537313432836, 3.5714285714285716, 17.94871794871795, 11.363636363636363, 13.513513513513514, 14.285714285714286, 15.686274509803921, 17.94871794871795, 9.090909090909092, 41.07142857142857, 10.714285714285714, 25.0, 20.0, 40.0, 13.333333333333334, 13.793103448275861, 3.5714285714285716, 17.073170731707318, 25.675675675675677, 15.625, 17.46031746031746, 8.333333333333334, 18.64406779661017, 14.285714285714286, 0.0, 6.0606060606060606, 6.976744186046512, 18.181818181818183, 26.785714285714285, 22.80701754385965, 6.666666666666667, 12.5]
setC = [13.846153846153847, 23.076923076923077, 25.0, 10.714285714285714, 16.666666666666668, 9.75609756097561, 10.0, 10.0, 17.857142857142858, 20.0, 9.75609756097561, 26.470588235294116, 12.5, 13.333333333333334, 4.3478260869565215, 5.882352941176471, 14.545454545454545, 13.333333333333334, 8.571428571428571, 11.764705882352942, 0.0]

import matplotlib.pyplot as plt

plt.figure('sets')
n, bins, patches = plt.hist(setA, 20, alpha=0.40, label='setA')
n, bins, patches = plt.hist(setB, 20, alpha=0.40, label='setB')
n, bins, patches = plt.hist(setC, 20, alpha=0.40, label='setC')
plt.xlabel('Set')
plt.ylabel('Frequency')
plt.title('Different Sets that need to be normalised')

plt.legend()
plt.grid(True)
plt.show()

As a bonus: since my aim is to compare the distributions of the three sets, is there a better kind of plot I can use to compare them graphically?

Asked By: piccolo

Answers:

You could normalise the histograms using the density=True option (called normed=True in older Matplotlib releases; that spelling has since been removed). Each histogram is then scaled so that its area is 1.

You could also make the plot look a bit tidier by using the same fixed bins for all three histograms (using the bins option to hist: bins = np.arange(0,48,2), for example).

Try this:

import numpy as np

...

mybins = np.arange(0,48,2)

# density=True is the current spelling of the older normed=True option
n, bins, patches = plt.hist(setA, bins=mybins, alpha=0.40, label='setA', density=True)
n, bins, patches = plt.hist(setB, bins=mybins, alpha=0.40, label='setB', density=True)
n, bins, patches = plt.hist(setC, bins=mybins, alpha=0.40, label='setC', density=True)

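As a quick numerical check of what this normalisation does, here is a minimal sketch that runs np.histogram on setC from the question with the same bins. It assumes a NumPy version that accepts density=True; the names area and height_sum are only illustrative.

```python
import numpy as np

# setC from the question
setC = [13.846153846153847, 23.076923076923077, 25.0, 10.714285714285714,
        16.666666666666668, 9.75609756097561, 10.0, 10.0, 17.857142857142858,
        20.0, 9.75609756097561, 26.470588235294116, 12.5, 13.333333333333334,
        4.3478260869565215, 5.882352941176471, 14.545454545454545,
        13.333333333333334, 8.571428571428571, 11.764705882352942, 0.0]

mybins = np.arange(0, 48, 2)

# density=True divides each count by (number of samples * bin width)
heights, edges = np.histogram(setC, bins=mybins, density=True)

area = np.sum(heights * np.diff(edges))  # integral of the histogram
height_sum = heights.sum()               # plain sum of the bar heights

print(f"area = {area:.3f}, sum of heights = {height_sum:.3f}")
```

With bins of width 2 the area comes out as 1 while the bar heights add up to 0.5, so "normalised" here refers to the area, not to the sum of the heights.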

Another option is to plot all three histograms in one call to plt.hist, in which case you can use the stacked=True option, which can further clean up your plot.

Note: this method normalises the three histograms together, so that the total integral of the stack is 1. It does not give each of the three histograms the same total.

# density=True is the current spelling of the older normed=True option
n, bins, patches = plt.hist([setA, setB, setC], bins=mybins,
                            label=['setA', 'setB', 'setC'],
                            density=True, stacked=True)


Or, finally, if a stacked histogram is not to your taste, you can plot the bars next to each other, by again plotting all three histograms in one call, but removing the stacked=True option from the line above:

# density=True is the current spelling of the older normed=True option
n, bins, patches = plt.hist([setA, setB, setC], bins=mybins,
                            label=['setA', 'setB', 'setC'],
                            density=True)


As discussed in the comments, when stacked=True is used, the normed/density option only guarantees that the three histograms together integrate to 1, so each one may not be normalised in the same way as in the other methods.
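To make the difference concrete, here is a small NumPy sketch of a pooled normalisation, which is how I understand the stacked case to behave: every count is divided by the total number of samples across all sets. The sample sizes 47, 32 and 21 match the sets in the question, but the values are synthetic and the variable names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins with the same sizes as setA, setB and setC (assumption)
samples = {"setA": rng.uniform(0, 40, 47),
           "setB": rng.uniform(0, 40, 32),
           "setC": rng.uniform(0, 40, 21)}

mybins = np.arange(0, 48, 2)
widths = np.diff(mybins)
total = sum(len(values) for values in samples.values())

areas = {}
for name, values in samples.items():
    counts, _ = np.histogram(values, bins=mybins)
    pooled = counts / (total * widths)   # normalised over all sets together
    areas[name] = float(np.sum(pooled * widths))

print(areas)
```

Each set's area is its share of the pooled samples (roughly 0.47, 0.32 and 0.21 here), so only the stack as a whole integrates to 1.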

To counter this, we can use np.histogram, and plot the results using plt.bar.

For example, using the same data sets from above:

mybins = np.arange(0,48,2)

# density=True replaces the normed=True keyword of older NumPy releases
nA, binsA = np.histogram(setA, bins=mybins, density=True)
nB, binsB = np.histogram(setB, bins=mybins, density=True)
nC, binsC = np.histogram(setC, bins=mybins, density=True)

# Each normalised histogram integrates to 1 (sum of height * bin width),
# so divide each by 3 so that the stacked histogram also integrates to 1.
nA /= 3.
nB /= 3.
nC /= 3.

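# Use bottom= to set where the bars should begin.
# align='edge' puts the left side of each bar on the left edge of its bin
# (recent Matplotlib versions centre bars on x by default).
plt.bar(binsA[:-1], nA, width=2, align='edge', color='b', label='setA')
plt.bar(binsB[:-1], nB, width=2, align='edge', color='g', label='setB', bottom=nA)
plt.bar(binsC[:-1], nC, width=2, align='edge', color='r', label='setC', bottom=nA+nB)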

Answered By: tmdavison

I personally like this function:

from typing import Optional

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


def get_histogram(array: np.ndarray,
                  xlabel: str,
                  ylabel: str,
                  title: str,
                  dpi: int = 200,  # dots per inch
                  facecolor: str = 'white',
                  bins: Optional[int] = None,
                  show: bool = False,
                  tight_layout: bool = False,
                  linestyle: Optional[str] = '--',
                  alpha: float = 0.75,
                  edgecolor: str = 'black',
                  stat: str = 'count',
                  color: Optional[str] = None,
                  ):
    """Plot a histogram of a 1-D array with seaborn; `stat` selects the normalization."""
    # - check the input is one-dimensional, i.e. of shape (N,)
    array = np.asarray(array)
    assert array.ndim == 1
    # - choose a number of bins if none was given
    n: int = array.shape[0]
    if bins is None:
        # The original called a helper, get_num_bins(n, option='square_root'),
        # that is not shown here; the square-root rule below is an assumed stand-in.
        bins = int(np.ceil(np.sqrt(n)))
    print(f'using this number of {bins=} and data size is {n=}')
    # - create the figure
    fig = plt.figure(dpi=dpi)
    fig.patch.set_facecolor(facecolor)

    # stat='density' or stat='probability' gives a normalized histogram
    sns.histplot(array, bins=bins, stat=stat, color=color, alpha=alpha, edgecolor=edgecolor)

    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    if linestyle:
        plt.grid(linestyle=linestyle)
    if tight_layout:
        plt.tight_layout()
    if show:
        plt.show()

Answered By: Charlie Parker

  • This can be accomplished with seaborn.histplot, or with seaborn.displot using kind='hist'.
  • There are three primary parameters that will be of interest in relation to the question.
    • common_norm: If True and using a normalized statistic, the normalization will apply over the full dataset. Otherwise, normalize each histogram independently.
    • multiple: {'layer', 'dodge', 'stack', 'fill'} – how the multiple groups of data are presented.
    • stat: the aggregate statistic to compute in each bin; the dependent axis is labelled according to the selected stat
      • 'probability': normalize such that bar heights sum to 1
      • 'density': normalize such that the total area of the histogram equals 1
      • There’s also 'count', 'frequency', & 'percent'
  • Tested in python 3.11.2, pandas 2.0.0, matplotlib 3.7.1, seaborn 0.12.2
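To make the stat options concrete, the sketch below computes each statistic by hand with NumPy for a single group. The formulas are my reading of the descriptions above, and the tiny values list is hypothetical, chosen only so the numbers are easy to follow.

```python
import numpy as np

# Hypothetical toy sample, not taken from the question
values = [1.0, 3.0, 3.5, 7.0, 9.0]
bins = np.arange(0, 12, 2)             # edges 0, 2, 4, 6, 8, 10

counts, edges = np.histogram(values, bins=bins)
widths = np.diff(edges)
n = counts.sum()

stats = {
    "count": counts,                    # raw number of observations per bin
    "frequency": counts / widths,       # observations per unit of x
    "probability": counts / n,          # bar heights sum to 1
    "percent": 100 * counts / n,        # bar heights sum to 100
    "density": counts / (n * widths),   # total area equals 1
}

for name, heights in stats.items():
    print(name, heights)
```

Because 'probability' and 'density' both divide by the group's own sample size, either one puts sets of different sizes on a comparable scale, provided each set is normalised separately.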

Imports and Sample Data

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# using the sample sets from the OP
data = {'A': setA, 'B': setB, 'C': setC}

# set some custom bins to compare against the other answer
bins=np.arange(0, 48, 2)

Plots

fig, ax = plt.subplots(figsize=(6.4, 4.3))
sns.histplot(data=data, stat='density', common_norm=True, multiple='dodge', bins=np.arange(0, 48, 2), ax=ax)

g = sns.displot(data=data, kind='hist', stat='density', common_norm=True, multiple='stack', bins=np.arange(0, 48, 2), height=4, aspect=1.25)


Answered By: Trenton McKinney