Drop unused categories when using groupby on a categorical variable in pandas
Question:
As per Categorical Data – Operations, by default groupby will show “unused” categories:
In [118]: cats = pd.Categorical(["a","b","b","b","c","c","c"], categories=["a","b","c","d"])
In [119]: df = pd.DataFrame({"cats":cats,"values":[1,2,2,2,3,4,5]})
In [120]: df.groupby("cats").mean()
Out[120]:
      values
cats
a        1.0
b        2.0
c        4.0
d        NaN
How can I obtain the result with the “unused” categories dropped? For example:
      values
cats
a        1.0
b        2.0
c        4.0
Answers:
Option 1: remove_unused_categories
df.groupby(df['cats'].cat.remove_unused_categories()).mean()
      values
cats
a        1.0
b        2.0
c        4.0
You can also make the assignment first and then group by:
df.assign(cats=df['cats'].cat.remove_unused_categories()).groupby('cats').mean()
Or,
df['cats'] = df['cats'].cat.remove_unused_categories()
df.groupby('cats').mean()
      values
cats
a        1.0
b        2.0
c        4.0
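As a quick check of Option 1, here is a minimal sketch that rebuilds the frame from the question and confirms that the unused category "d" no longer appears in the grouped index. Selecting the "values" column explicitly and passing observed=False are additions of mine, not part of the original answer: the first keeps the leftover cats column out of the aggregation, the second keeps the result independent of the pandas version's default. The names trimmed and result are illustrative.

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b", "b", "c", "c", "c"],
                      categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# Drop categories that never occur in the data ("d" here).
trimmed = df["cats"].cat.remove_unused_categories()

# Group by the trimmed Series and aggregate only the numeric column.
result = df.groupby(trimmed, observed=False)["values"].mean()

print(list(trimmed.cat.categories))  # ['a', 'b', 'c']
print(result)
```

Note that df itself is unchanged here: df["cats"] still carries all four categories, because remove_unused_categories returns a new Series.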
Option 2: convert to str with astype
df.groupby(df['cats'].astype(str)).mean()
      values
cats
a        1.0
b        2.0
c        4.0
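One thing to keep in mind with the str conversion is that the group keys stop being categorical, so any custom category order is replaced by plain string sorting. The sketch below uses made-up category labels (low, medium, high, extreme) purely for illustration and compares the two options side by side; the "values" column is selected explicitly as a precaution.

```python
import pandas as pd

sizes = pd.Categorical(["low", "high", "medium", "low"],
                       categories=["low", "medium", "high", "extreme"])
df = pd.DataFrame({"cats": sizes, "values": [1, 2, 3, 5]})

# Option 2: keys become plain strings and are sorted alphabetically.
by_str = df.groupby(df["cats"].astype(str))["values"].mean()

# Option 1: keys stay categorical and keep the declared category order.
by_cat = df.groupby(df["cats"].cat.remove_unused_categories(),
                    observed=False)["values"].mean()

print(list(by_str.index))  # ['high', 'low', 'medium']
print(list(by_cat.index))  # ['low', 'medium', 'high']
```

If the order of the groups matters downstream (plots, reports), Option 1 is probably the safer choice.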
Just chain with dropna, like so:
df.groupby("cats").mean().dropna()
      values
cats
a        1.0
b        2.0
c        4.0
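A caveat with the dropna approach: it filters on NaN in the result, not on whether a category was actually used. The sketch below is my own illustration (the data is invented, and observed=False is passed explicitly so the behaviour does not depend on the pandas default). It shows two ways this can differ from dropping unused categories: a used category whose values are all missing is dropped too, and with sum the unused category yields 0 rather than NaN, so nothing is dropped.

```python
import numpy as np
import pandas as pd

cats = pd.Categorical(["a", "b", "b", "c"], categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1.0, 2.0, 2.0, np.nan]})

# "c" is a used category, but its only value is NaN, so its mean is NaN as well.
means = df.groupby("cats", observed=False).mean().dropna()

# Empty and all-NaN groups sum to 0, so dropna has nothing to remove.
sums = df.groupby("cats", observed=False).sum().dropna()

print(list(means.index))  # ['a', 'b']
print(list(sums.index))   # ['a', 'b', 'c', 'd']
```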
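If you want to remove unused categories from all categorical columns, you can:
def remove_unused_categories(df: pd.DataFrame) -> None:
    # Modifies df itself: each categorical column is reassigned without its unused categories.
    for c in df.columns:
        if isinstance(df[c].dtype, pd.CategoricalDtype):
            # Recent pandas versions no longer accept inplace=True here, so assign the result back.
            df[c] = df[c].cat.remove_unused_categories()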
Then, before calling groupby, call:
remove_unused_categories(df_with_empty_cat)
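If you would rather not modify the original frame, a non-mutating variant is also possible. This is only a sketch: the helper name without_unused_categories and the sample columns are hypothetical, and it assumes that select_dtypes(include="category") picks up every column you care about.

```python
import pandas as pd

def without_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose categorical columns keep only used categories."""
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].cat.remove_unused_categories()
    return out

# Hypothetical sample frame with two categorical columns.
df_with_empty_cat = pd.DataFrame({
    "cats": pd.Categorical(["a", "b", "c"], categories=["a", "b", "c", "d"]),
    "other": pd.Categorical(["x", "x", "y"], categories=["x", "y", "z"]),
    "values": [1, 2, 3],
})
cleaned = without_unused_categories(df_with_empty_cat)

print(list(cleaned["cats"].cat.categories))            # ['a', 'b', 'c']
print(list(cleaned["other"].cat.categories))           # ['x', 'y']
print(list(df_with_empty_cat["cats"].cat.categories))  # ['a', 'b', 'c', 'd']
```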
Since pandas 0.23, you can specify observed=True in the groupby call to achieve the desired behavior.
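For completeness, here is a small sketch of the observed flag applied to the frame from the question. Both values are passed explicitly so the comparison does not rely on whichever default your pandas version uses; the variable names are illustrative.

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b", "b", "c", "c", "c"],
                      categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# Only categories that actually occur in the data become groups.
observed_only = df.groupby("cats", observed=True).mean()

# Every declared category becomes a group, including the empty "d".
all_categories = df.groupby("cats", observed=False).mean()

print(list(observed_only.index))   # ['a', 'b', 'c']
print(list(all_categories.index))  # ['a', 'b', 'c', 'd']
```

Unlike Option 1, this leaves the column's categories untouched, so "d" is still available for later operations on df.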