Seaborn Normalized Bar Chart
Question:
I have a dataframe with two columns containing True and False and one column containing genders: Males and Females.
I’m trying to count the number of True for each column for each gender but normalized by the number of each gender.
What I did so far is to normalize my data against the whole dataframe df_up. But how do I normalize each separately against the number of each gender?
percentage = lambda x: sum(x) / len(df_up)
ax6 = sns.barplot(x="value", y="variable", hue="Gender", data=melted_fan, estimator=percentage, ci=None, palette=palette)
Answers:
I am guessing this is what you did:
import seaborn as sns
import numpy as np
import pandas as pd
df = pd.DataFrame({'Gender': np.random.choice(["Female", "Male"], 100),
                   'star_wars_fan': np.random.choice([True, False], 100),
                   'star_trek_fan': np.random.choice([True, False], 100)
                   })
# Count the True values per gender, then reshape to long format
melted_fan = df.groupby('Gender').sum().reset_index().melt(id_vars="Gender")
melted_fan
   Gender       variable  value
0  Female  star_wars_fan   29.0
1    Male  star_wars_fan   16.0
2  Female  star_trek_fan   26.0
3    Male  star_trek_fan   29.0
sns.barplot(x="value", y="variable", hue="Gender",
data=melted_fan, ci=None)
Unfortunately, sns.barplot splits the data into subgroups and applies the estimator to each subgroup on its own, so the estimator never sees the totals it would need for normalization. An easier way is to calculate the percentage before plotting:
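# Share of each gender within every variable, in percent.
# transform('sum') returns the group total aligned to the original rows.
melted_fan['perc'] = 100 * melted_fan['value'] / melted_fan.groupby('variable')['value'].transform('sum')
sns.barplot(x="perc", y="variable", hue="Gender",
            data=melted_fan, ci=None)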
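The percentage above is the share of each gender within a variable. If the goal is instead the share of True within each gender, as the question asks, a minimal sketch could look like the following. The toy data is made up for illustration, and the names frac and melted_frac are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 4 females and 2 males (column names follow the answer above).
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Female", "Male", "Male"],
    "star_wars_fan": [True, True, False, False, True, False],
    "star_trek_fan": [True, False, False, False, True, True],
})

# The mean of a boolean column is the fraction of True values,
# so a per-gender mean is already normalized by the size of each gender.
frac = df.groupby("Gender")[["star_wars_fan", "star_trek_fan"]].mean()

# Long format with one row per (Gender, variable), expressed in percent.
melted_frac = frac.reset_index().melt(id_vars="Gender", value_name="perc")
melted_frac["perc"] *= 100
print(melted_frac)
```

The resulting melted_frac frame can then be passed to sns.barplot with x="perc", y="variable" and hue="Gender", without a custom estimator.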
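This kind of barplot could be constructed via pandas plotting. The mean of a boolean column within each gender is the fraction of True values, so it is already normalized by the number of each gender: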
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import pandas as pd
import numpy as np
N = 1000
# np.bool was removed in NumPy 1.24; the builtin bool works as the dtype
df = pd.DataFrame({'Star Wars': np.random.randint(0, 2, N, dtype=bool),
                   'Star Trek': np.random.randint(0, 2, N, dtype=bool),
                   'Gender': np.random.choice(['Male', 'Female'], N, p=[0.6, 0.4])
                   })
ax = df.groupby(['Gender'])[['Star Wars', 'Star Trek']].agg('mean').transpose().plot(kind='barh')
ax.xaxis.set_major_formatter(PercentFormatter(1))
plt.show()
The easiest way I found is the pd.crosstab function, which can calculate the fractions itself through its normalize argument. Then you can easily make a stacked barplot.
Something like this:
# normalize="index" turns each row into fractions that sum to 1
cross_tab_prop = pd.crosstab(index=data['release_year'],
                             columns=data['type'],
                             normalize="index")
cross_tab_prop.plot(kind='bar',
                    stacked=True,
                    colormap='tab10',
                    figsize=(10, 6))
It’s very well explained here:
https://towardsdatascience.com/100-stacked-charts-in-python-6ca3e1962d2b
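Applied to the gender data from the question, the crosstab idea might be sketched like this. The toy data and the name props are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data: 4 females (3 fans) and 2 males (1 fan).
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Female", "Male", "Male"],
    "star_wars_fan": [True, True, True, False, True, False],
})

# normalize="index" divides every row by its row total,
# so the shares of False and True sum to 1 within each gender.
props = pd.crosstab(index=df["Gender"],
                    columns=df["star_wars_fan"],
                    normalize="index")
print(props)
```

Calling props.plot(kind="bar", stacked=True) on this table should then give one 100% stacked bar per gender.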