Is there a parameter in matplotlib/pandas to have the Y axis of a histogram as percentage?
Question:
I would like to compare two histograms by having the Y axis show the percentage of each column from the overall dataset size instead of an absolute value. Is that possible? I am using Pandas and matplotlib.
Thanks
Answers:
The density=True option (normed=True for matplotlib < 2.2.0) returns a histogram for which np.sum(pdf * np.diff(bins)) equals 1. If you want the sum of the histogram to be 1, you can use NumPy's histogram() and normalize the results yourself.
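import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(30)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Left: density=True scales the bars so that their total area is 1
ax[0].hist(x, density=True, color='grey')
# Right: divide the counts by their sum so that the bar heights sum to 1
hist, bins = np.histogram(x)
ax[1].bar(bins[:-1], hist / hist.sum(), width=np.diff(bins), align='edge', color='grey')
ax[0].set_title('density=True')
ax[1].set_title('hist = hist / hist.sum()')
plt.show()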
Btw: Strange plotting glitch at the first bin of the left plot.
Pandas plotting can accept any extra keyword arguments from the respective matplotlib function. So for completeness from the comments of others here, this is how one would do it:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100,2), columns=list('AB'))
df.hist(density=True)
Also, for direct comparison this may be a good way as well:
df.plot(kind='hist', density=True, bins=20, stacked=False, alpha=.5)
Looks like @CarstenKönig found the right way:
df.hist(bins=20, weights=np.ones_like(df[df.columns[0]]) * 100. / len(df))
You can simplify the weighting using np.ones_like():
df["ColumnName"].plot.hist(weights = np.ones_like(df.index) / len(df.index))
- np.ones_like() is okay with the df.index structure
- len(df.index) is faster for large DataFrames
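Building on the weights idea, here is a minimal sketch of how two columns could be compared on a percent-formatted axis. The random data, the shared bin edges from np.histogram_bin_edges, and the use of matplotlib.ticker.PercentFormatter are my own assumptions for illustration, not part of the answers above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.ticker import PercentFormatter

# Hypothetical data: two columns of 100 standard-normal samples
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 2)), columns=list("AB"))

# Shared bin edges over both columns so the histograms are directly comparable
edges = np.histogram_bin_edges(df.to_numpy(), bins=20)

fig, ax = plt.subplots()
totals = {}
for col in df.columns:
    # Each row contributes 1/N, so every bar height is a fraction of the column
    weights = np.ones(len(df)) / len(df)
    heights, _, _ = ax.hist(df[col], bins=edges, weights=weights, alpha=0.5, label=col)
    totals[col] = heights.sum()

# Show fractions in [0, 1] as percentages on the y axis
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))
ax.set_ylabel("share of rows")
ax.legend()
print(totals)
```

Because the weights are fractions rather than percentages, the formatter does the conversion for display only, and the bar heights of each column still sum to 1.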
I know this answer comes six years late, but for anyone using density=True (the replacement for normed=True): it may not do what you want. It normalizes the whole distribution so that the total area of the bins is 1. So if your bins are narrower than 1, you can expect bar heights greater than 1 on the y-axis. If you want the bar heights bounded to [0, 1], you have to calculate them yourself.
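To make the difference concrete, here is a small NumPy-only sketch. The uniform sample on [0, 0.5], the seed, and the variable names are assumptions chosen so that every bin is narrower than 1.

```python
import numpy as np

# Hypothetical sample: 1000 values on [0, 0.5], so each of the 10 bins is about 0.05 wide
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 0.5, size=1000)

counts, edges = np.histogram(x, bins=10)
density, _ = np.histogram(x, bins=edges, density=True)

widths = np.diff(edges)
area = np.sum(density * widths)    # density=True normalizes the area to 1
fractions = counts / counts.sum()  # manual normalization: heights sum to 1

print("area under density bars:", area)
print("tallest density bar:", density.max())
print("sum of fractions:", fractions.sum())
print("tallest fraction bar:", fractions.max())
```

With bins this narrow the density values exceed 1, while the manually computed fractions stay within [0, 1].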
I see this is an old question but it shows up on top for some searches, so I think as of 2021 seaborn would be an easy way to do this.
You can do something like this:
import seaborn as sns
sns.histplot(df, stat="probability")
For categorical data, a bar plot of normalized value counts can serve the same purpose:
tweets_df['label'].value_counts(normalize=True).plot(figsize=(12,12), kind='bar')
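As a rough illustration of what value_counts(normalize=True) returns before it is plotted, here is a sketch with a made-up label column. The labels Series is a hypothetical stand-in for tweets_df['label'].

```python
import pandas as pd

# Hypothetical labels standing in for a categorical column such as tweets_df['label']
labels = pd.Series(["pos", "neg", "pos", "neutral", "pos"], name="label")

# normalize=True returns each category's share of all rows instead of raw counts
shares = labels.value_counts(normalize=True)
percent = shares * 100
print(percent)

# percent.plot(kind="bar") would then draw one bar per category, in percent
```

Multiplying by 100 before plotting puts percentages directly on the y axis, with no tick formatter needed.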