Plotting categorical data with pandas and matplotlib
Question:
I have a data frame with categorical data:
colour direction
1 red up
2 blue up
3 green down
4 red left
5 red right
6 yellow down
7 blue down
I want to generate some graphs, like pie charts and histograms based on the categories. Is it possible without creating dummy numeric variables? Something like
df.plot(kind='hist')
Answers:
like this :
df.groupby('colour').size().plot(kind='bar')
You might find useful mosaic
plot from statsmodels. Which can also give statistical highlighting for the variances.
from statsmodels.graphics.mosaicplot import mosaic
plt.rcParams['font.size'] = 16.0
mosaic(df, ['direction', 'colour']);
But beware of the 0 sized cell – they will cause problems with labels.
See this answer for details
You could also use countplot
from seaborn
. This package builds on pandas
to create a high level plotting interface. It gives you good styling and correct axis labels for free.
import pandas as pd
import seaborn as sns
sns.set()
df = pd.DataFrame({'colour': ['red', 'blue', 'green', 'red', 'red', 'yellow', 'blue'],
'direction': ['up', 'up', 'down', 'left', 'right', 'down', 'down']})
sns.countplot(df['colour'], color='gray')
It also supports coloring the bars in the right color with a little trick
sns.countplot(df['colour'],
palette={color: color for color in df['colour'].unique()})
To plot multiple categorical features as bar charts on the same plot, I would suggest:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(
{
"colour": ["red", "blue", "green", "red", "red", "yellow", "blue"],
"direction": ["up", "up", "down", "left", "right", "down", "down"],
}
)
categorical_features = ["colour", "direction"]
fig, ax = plt.subplots(1, len(categorical_features))
for i, categorical_feature in enumerate(df[categorical_features]):
df[categorical_feature].value_counts().plot("bar", ax=ax[i]).set_title(categorical_feature)
fig.show()
Using plotly
import plotly.express as px
px.bar(df["colour"].value_counts())
Roman’s answer is very helpful and correct but in latest version you also need to specify kind as the parameter’s order can change.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(
{
"colour": ["red", "blue", "green", "red", "red", "yellow", "blue"],
"direction": ["up", "up", "down", "left", "right", "down", "down"],
}
)
categorical_features = ["colour", "direction"]
fig, ax = plt.subplots(1, len(categorical_features))
for i, categorical_feature in enumerate(df[categorical_features]):
df[categorical_feature].value_counts().plot(kind="bar", ax=ax[i]).set_title(categorical_feature)
fig.show()
Pandas.Series.plot.pie
https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.pie.html
We can do a little better than that without straying from the built-in functionality.
People love to hate on pie charts, but they have the same benefit as a mosaic/tree; they help keep proportion-to-the-whole interpretable.
kwargs = dict(
startangle = 90,
colormap = 'Pastel2',
fontsize = 13,
explode = (0.1,0.1,0.1),
figsize = (60,5),
autopct = '%1.1f%%',
title = 'Chemotherapy Stratification'
)
df['treatment_chemo'].value_counts().plot.pie(**kwargs)
I have a data frame with categorical data:
colour direction
1 red up
2 blue up
3 green down
4 red left
5 red right
6 yellow down
7 blue down
I want to generate some graphs, like pie charts and histograms based on the categories. Is it possible without creating dummy numeric variables? Something like
df.plot(kind='hist')
like this :
df.groupby('colour').size().plot(kind='bar')
You might find useful mosaic
plot from statsmodels. Which can also give statistical highlighting for the variances.
from statsmodels.graphics.mosaicplot import mosaic
plt.rcParams['font.size'] = 16.0
mosaic(df, ['direction', 'colour']);
But beware of the 0 sized cell – they will cause problems with labels.
See this answer for details
You could also use countplot
from seaborn
. This package builds on pandas
to create a high level plotting interface. It gives you good styling and correct axis labels for free.
import pandas as pd
import seaborn as sns
sns.set()
df = pd.DataFrame({'colour': ['red', 'blue', 'green', 'red', 'red', 'yellow', 'blue'],
'direction': ['up', 'up', 'down', 'left', 'right', 'down', 'down']})
sns.countplot(df['colour'], color='gray')
It also supports coloring the bars in the right color with a little trick
sns.countplot(df['colour'],
palette={color: color for color in df['colour'].unique()})
To plot multiple categorical features as bar charts on the same plot, I would suggest:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(
{
"colour": ["red", "blue", "green", "red", "red", "yellow", "blue"],
"direction": ["up", "up", "down", "left", "right", "down", "down"],
}
)
categorical_features = ["colour", "direction"]
fig, ax = plt.subplots(1, len(categorical_features))
for i, categorical_feature in enumerate(df[categorical_features]):
df[categorical_feature].value_counts().plot("bar", ax=ax[i]).set_title(categorical_feature)
fig.show()
Using plotly
import plotly.express as px
px.bar(df["colour"].value_counts())
Roman’s answer is very helpful and correct but in latest version you also need to specify kind as the parameter’s order can change.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(
{
"colour": ["red", "blue", "green", "red", "red", "yellow", "blue"],
"direction": ["up", "up", "down", "left", "right", "down", "down"],
}
)
categorical_features = ["colour", "direction"]
fig, ax = plt.subplots(1, len(categorical_features))
for i, categorical_feature in enumerate(df[categorical_features]):
df[categorical_feature].value_counts().plot(kind="bar", ax=ax[i]).set_title(categorical_feature)
fig.show()
Pandas.Series.plot.pie
https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.pie.html
We can do a little better than that without straying from the built-in functionality.
People love to hate on pie charts, but they have the same benefit as a mosaic/tree; they help keep proportion-to-the-whole interpretable.
kwargs = dict(
startangle = 90,
colormap = 'Pastel2',
fontsize = 13,
explode = (0.1,0.1,0.1),
figsize = (60,5),
autopct = '%1.1f%%',
title = 'Chemotherapy Stratification'
)
df['treatment_chemo'].value_counts().plot.pie(**kwargs)