Normalizing Histograms

Question

Hi, I am plotting three histograms that have different total frequencies, but I want to normalize them so that the total frequencies are the same.

As you can see from the picture, the three sets have different total frequencies. I want to normalize them so that they have the same total frequency while preserving the proportion of the frequency at each value on the x-axis.

Here’s the code I am using to plot the histograms:

import matplotlib.pyplot as plt

setA = [22.972972972972972, 0.0, 0.0, 27.5, 25.0, 18.64406779661017, 8.88888888888889, 20.512820512820515, 11.11111111111111, 15.151515151515152, 17.741935483870968, 13.333333333333334, 16.923076923076923, 12.820512820512821, 27.77777777777778, 4.0, 0.0, 15.625, 14.814814814814815, 7.142857142857143, 15.384615384615385, 14.545454545454545, 38.095238095238095, 17.647058823529413, 21.951219512195124, 21.428571428571427, 32.432432432432435, 10.526315789473685, 36.8421052631579, 13.114754098360656, 17.91044776119403, 12.64367816091954, 16.0, 22.727272727272727, 18.181818181818183, 9.523809523809524, 17.105263157894736, 11.904761904761905, 20.58823529411765, 10.714285714285714, 15.686274509803921, 27.5, 16.129032258064516, 21.333333333333332, 40.90909090909091, 11.904761904761905, 13.157894736842104]
setB = [1.492537313432836, 3.5714285714285716, 17.94871794871795, 11.363636363636363, 13.513513513513514, 14.285714285714286, 15.686274509803921, 17.94871794871795, 9.090909090909092, 41.07142857142857, 10.714285714285714, 25.0, 20.0, 40.0, 13.333333333333334, 13.793103448275861, 3.5714285714285716, 17.073170731707318, 25.675675675675677, 15.625, 17.46031746031746, 8.333333333333334, 18.64406779661017, 14.285714285714286, 0.0, 6.0606060606060606, 6.976744186046512, 18.181818181818183, 26.785714285714285, 22.80701754385965, 6.666666666666667, 12.5]
setC = [13.846153846153847, 23.076923076923077, 25.0, 10.714285714285714, 16.666666666666668, 9.75609756097561, 10.0, 10.0, 17.857142857142858, 20.0, 9.75609756097561, 26.470588235294116, 12.5, 13.333333333333334, 4.3478260869565215, 5.882352941176471, 14.545454545454545, 13.333333333333334, 8.571428571428571, 11.764705882352942, 0.0]

plt.figure('sets')
n, bins, patches = plt.hist(setA, 20, alpha=0.40, label='setA')
n, bins, patches = plt.hist(setB, 20, alpha=0.40, label='setB')
n, bins, patches = plt.hist(setC, 20, alpha=0.40, label='setC')
plt.xlabel('Set')
plt.ylabel('Frequency')
plt.title('Different Sets that need to be normalised')

plt.legend()
plt.grid(True)
plt.show()

As a plus, because my aim is to compare the distributions of the three sets, is there a better kind of plot I can use to compare them graphically?

Asked By: piccolo

Answer 1

You could normalise the histograms using the density=True option (older Matplotlib releases called it normed=True, a keyword that has since been removed). This means that the area of each histogram will integrate to 1.

You could also make the plot look a bit tidier by using the same fixed bins for all three histograms (using the bins option to hist: bins = np.arange(0,48,2), for example).
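To see why shared bins help, here is a small NumPy sketch. The sample values are made-up stand-ins, not the question's data. When each set is binned separately with bins=20, the edges are derived from that set's own minimum and maximum, so the bars of different sets do not line up; explicit edges avoid that.

```python
import numpy as np

# Made-up samples with different ranges (assumption: stand-ins for setA and setB)
rng = np.random.default_rng(0)
sample_a = rng.uniform(0, 41, size=47)
sample_b = rng.uniform(1, 30, size=32)

# bins=20 means "20 equal-width bins between this sample's min and max"
edges_a = np.histogram_bin_edges(sample_a, bins=20)
edges_b = np.histogram_bin_edges(sample_b, bins=20)
print("bin widths:", np.diff(edges_a)[0], np.diff(edges_b)[0])

# Explicit edges are identical for every set: 0, 2, ..., 46 (23 bins of width 2)
shared_edges = np.arange(0, 48, 2)
counts_a, _ = np.histogram(sample_a, bins=shared_edges)
counts_b, _ = np.histogram(sample_b, bins=shared_edges)
print("counts per set:", counts_a.sum(), counts_b.sum())
```

With shared edges, bar i of one set covers exactly the same x-range as bar i of every other set, which is what makes the overlaid histograms comparable.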

Try this:

import numpy as np

...

mybins = np.arange(0,48,2)

# density=True replaces the normed=True keyword of older Matplotlib releases
n, bins, patches = plt.hist(setA, bins=mybins, alpha=0.40, label='setA', density=True)
n, bins, patches = plt.hist(setB, bins=mybins, alpha=0.40, label='setB', density=True)
n, bins, patches = plt.hist(setC, bins=mybins, alpha=0.40, label='setC', density=True)
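It may help to check what this kind of normalisation actually guarantees. The sketch below uses np.histogram on made-up samples of the same sizes as the three sets (an assumption for illustration) and shows that the area of each histogram is 1, while the plain sum of the bar heights depends on the bin width.

```python
import numpy as np

# Made-up samples with the same sizes as setA, setB and setC (assumption)
rng = np.random.default_rng(0)
samples = {
    "A": rng.uniform(0, 46, size=47),
    "B": rng.uniform(0, 46, size=32),
    "C": rng.uniform(0, 46, size=21),
}
bins = np.arange(0, 48, 2)  # same edges as mybins above, bin width 2

areas = {}
height_sums = {}
for name, values in samples.items():
    heights, edges = np.histogram(values, bins=bins, density=True)
    # Integral of the histogram: height times width, summed over all bins
    areas[name] = float(np.sum(heights * np.diff(edges)))
    # Plain sum of the bar heights, which is area / bin width here
    height_sums[name] = float(heights.sum())
    print(f"{name}: area={areas[name]:.3f}, sum of heights={height_sums[name]:.3f}")
```

Because every bin has width 2, each set has an area of 1 and a height sum of 0.5, regardless of how many values the set contains.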

Another option is to plot all three histograms in one call to plt.hist, in which case you can use the stacked=True option, which can further clean up your plot.

Note: this method normalises the three histograms together, so the total integral of the stack is 1. It does not make each of the three histograms add up to the same value.

n, bins, patches = plt.hist([setA, setB, setC], bins=mybins,
                            label=['setA', 'setB', 'setC'],
                            density=True, stacked=True)

Or, finally, if a stacked histogram is not to your taste, you can plot the bars next to each other, by again plotting all three histograms in one call, but removing the stacked=True option from the line above:

n, bins, patches = plt.hist([setA, setB, setC], bins=mybins,
                            label=['setA', 'setB', 'setC'],
                            density=True)

As discussed in comments, when using stacked=True, the density (formerly normed) option just means that the three histograms together integrate to 1, so they may not be normalised in the same way as in the other methods.

To counter this, we can use np.histogram, and plot the results using plt.bar.

For example, using the same data sets from above:

mybins = np.arange(0,48,2)

# density=True replaces the normed=True keyword of older NumPy releases
nA, binsA = np.histogram(setA, bins=mybins, density=True)
nB, binsB = np.histogram(setB, bins=mybins, density=True)
nC, binsC = np.histogram(setC, bins=mybins, density=True)

# Since each of these integrates to 1, let's divide by 3
# so that the stacked histogram also integrates to 1.
nA /= 3.
nB /= 3.
nC /= 3.

# Use bottom= to set where the bars should begin;
# align='edge' places the left side of each bar on its bin edge
plt.bar(binsA[:-1], nA, width=2, align='edge', color='b', label='setA')
plt.bar(binsB[:-1], nB, width=2, align='edge', color='g', label='setB', bottom=nA)
plt.bar(binsC[:-1], nC, width=2, align='edge', color='r', label='setC', bottom=nA+nB)
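If the goal is for the bar heights of each set to sum to the same total (rather than for the area to be 1), one common approach is to pass per-sample weights of 1/len(set) to plt.hist. The sketch below uses made-up data and a hypothetical helper named equal_total_weights; it is one possible way to do this, not part of the answer above.

```python
import numpy as np
import matplotlib.pyplot as plt


def equal_total_weights(datasets):
    """Return one weight array per dataset; each array sums to 1."""
    return [np.full(len(d), 1.0 / len(d)) for d in datasets]


# Made-up samples with the same sizes as setA, setB and setC (assumption)
rng = np.random.default_rng(1)
datasets = [rng.uniform(0, 46, size=n) for n in (47, 32, 21)]
bins = np.arange(0, 48, 2)

fig, ax = plt.subplots()
# Each value contributes 1/len(set), so every set's bars sum to 1
heights, edges, _ = ax.hist(datasets, bins=bins,
                            weights=equal_total_weights(datasets),
                            label=['setA', 'setB', 'setC'])
ax.set_xlabel('Set')
ax.set_ylabel('Fraction of set')
ax.legend()

print(np.asarray(heights).sum(axis=1))  # total bar height per set
# Call plt.show() to display the figure.
```

Unlike density=True, this keeps the y-axis readable as "fraction of the set in this bin", and the result does not depend on the bin width.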

Answered By: tmdavison

Answer 2

I personally like this function:

from typing import Optional

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


def get_histogram(array: np.ndarray,
                  xlabel: str,
                  ylabel: str,
                  title: str,
                  dpi: int = 200,  # dots per inch
                  facecolor: str = 'white',
                  bins: Optional[int] = None,
                  show: bool = False,
                  tight_layout: bool = False,
                  linestyle: Optional[str] = '--',
                  alpha: float = 0.75,
                  edgecolor: str = 'black',
                  stat: str = 'count',
                  color: Optional[str] = None,
                  ):
    """Plot a histogram of a 1-D array.

    Pass stat='density' or stat='probability' to get a normalised histogram.
    """
    # - check it's of size (N,)
    array = np.asarray(array)
    assert array.ndim == 1
    # -
    n: int = array.shape[0]
    if bins is None:
        # Assumption: the original get_num_bins(n, option='square_root') helper
        # is not shown, so the square-root rule is used here in its place.
        bins = int(np.ceil(np.sqrt(n)))
    print(f'using this number of {bins=} and data size is {n=}')
    # -
    fig = plt.figure(dpi=dpi)
    fig.patch.set_facecolor(facecolor)

    ax = sns.histplot(array, stat=stat, bins=bins, color=color,
                      alpha=alpha, edgecolor=edgecolor)

    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    if linestyle:
        plt.grid(linestyle=linestyle)
    if tight_layout:
        plt.tight_layout()
    if show:
        plt.show()
    return ax

Answered By: Charlie Parker

Answer 3

This can be accomplished with seaborn.histplot, or seaborn.displot with kind='hist'.
- seaborn is a high-level API for matplotlib
- histplot is an axes-level function, while displot is a figure-level function
There are three primary parameters that will be of interest in relation to the question.
- common_norm: If True and using a normalized statistic, the normalization will apply over the full dataset. Otherwise, normalize each histogram independently.
- multiple: {'layer', 'dodge', 'stack', 'fill'} – how the multiple groups of data are presented.
- stat: Aggregate statistic to compute in each bin; the dependent axis is labelled according to the selected stat
  - 'probability': normalize such that bar heights sum to 1
  - 'density': normalize such that the total area of the histogram equals 1
  - There are also 'count', 'frequency', and 'percent'
Tested in python 3.11.2, pandas 2.0.0, matplotlib 3.7.1, seaborn 0.12.2

Imports and Sample Data

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# using the sample sets from the OP
data = {'A': setA, 'B': setB, 'C': setC}

# set some custom bins to compare against the other answer
bins=np.arange(0, 48, 2)

Plots

fig, ax = plt.subplots(figsize=(6.4, 4.3))
sns.histplot(data=data, stat='density', common_norm=True, multiple='dodge', bins=np.arange(0, 48, 2), ax=ax)

g = sns.displot(data=data, kind='hist', stat='density', common_norm=True, multiple='stack', bins=np.arange(0, 48, 2), height=4, aspect=1.25)
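Note that common_norm=True normalises over the pooled data, so the three groups keep their different totals. To give every group the same total, as the question asks, common_norm=False normalises each group on its own. The NumPy sketch below only illustrates that arithmetic for stat='probability' on made-up data; it is not seaborn's implementation.

```python
import numpy as np

# Made-up samples with the same sizes as setA, setB and setC (assumption)
rng = np.random.default_rng(2)
groups = {
    "A": rng.uniform(0, 46, size=47),
    "B": rng.uniform(0, 46, size=32),
    "C": rng.uniform(0, 46, size=21),
}
bins = np.arange(0, 48, 2)

# Raw counts per bin for each group
counts = {name: np.histogram(values, bins=bins)[0] for name, values in groups.items()}
pooled_total = sum(c.sum() for c in counts.values())

# common_norm=True: divide by the pooled count, so all bars together sum to 1
common = {name: c / pooled_total for name, c in counts.items()}

# common_norm=False: divide by the group's own count, so each group sums to 1
independent = {name: c / c.sum() for name, c in counts.items()}

for name in groups:
    print(name, round(float(common[name].sum()), 2), round(float(independent[name].sum()), 2))
```

With seaborn, the second behaviour corresponds to a call such as sns.histplot(data=data, stat='probability', common_norm=False, multiple='dodge', bins=bins, ax=ax).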

Answered By: Trenton McKinney
