Groupby/aggregation shows groups which were supposed to be filtered out before


I have a pandas DataFrame with a column Size, on which I filter first and then group by and count records per group. The result also contains rows for the groups which were filtered out before, but with a count of 0:

    (
        df[df["Size"].isin(("XXS", "XS", "S", "M", "L", "XL", "XXL"))]
        .groupby("Size")
        .agg(count=("OID", "count"))
        .sort_values("count", ascending=False)
    )

The result DataFrame is shown in the figure below. In my understanding of the groupby function, the groups which were filtered out (I double checked, they are really no longer in the DataFrame) should not occur in the aggregated DataFrame. Even copying and resetting the index before grouping does not change the output.

Unfortunately, I was not able to reproduce the issue with a simple example dataframe, so I assume that there is something strange happening. Does anybody have an idea why this could happen?

Result DataFrame:

[Figure: result DataFrame, with the filtered-out sizes listed at a count of 0]

Asked By: Yannic



    (
        df[df["Size"].isin(["XXS", "XS", "S", "M", "L", "XL", "XXL"])]
        .groupby("Size")
        .agg(count=("OID", "count"))
        .sort_values("count", ascending=False)
    )

isin(["XXS", "XS", "S", "M", "L", "XL", "XXL"])
Answered By: Coco Kuang

Sometimes it helps to wait a weekend and think about it again on Monday.
The behavior is caused by the categorical dtype of the Size column:

>>> df.dtypes

Size                     category

>>> df["Size"].unique()

['S', 'M', 'L', 'XL', 'XXL', 'XS', 'XXS']
Categories (80, object): ['100 CM', '105 CM', '24', '25', ..., 'XS/S', 'XXL', 'XXS', 'XXS/XS']
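
The output above shows why: unique() returns only the seven remaining sizes, but the dtype still carries all 80 categories. With observed=False (the default in older pandas releases), groupby emits one group per category, so the unused ones appear with a count of 0. Below is a minimal sketch of the effect and two possible ways around it. The toy data is made up for illustration; only the column names Size and OID come from the question.

```python
import pandas as pd

# Hypothetical toy data, not the original DataFrame.
df = pd.DataFrame(
    {
        "OID": [1, 2, 3, 4, 5],
        "Size": pd.Categorical(["S", "M", "M", "24", "100 CM"]),
    }
)

filtered = df[df["Size"].isin(["S", "M"])]

# Filtering drops the rows, but the dtype still lists all four categories.
print(filtered["Size"].cat.categories.tolist())

# observed=False reports every category, so unused ones get a count of 0.
with_zeros = filtered.groupby("Size", observed=False).agg(count=("OID", "count"))
print(with_zeros)

# Option 1: only report categories that actually occur in the data.
observed_only = filtered.groupby("Size", observed=True).agg(count=("OID", "count"))
print(observed_only)

# Option 2: drop the unused categories before grouping.
cleaned = filtered.assign(Size=filtered["Size"].cat.remove_unused_categories())
dropped = cleaned.groupby("Size", observed=False).agg(count=("OID", "count"))
print(dropped)
```

Which option fits better probably depends on whether the unused categories are still needed later: observed=True leaves the column's dtype untouched, while remove_unused_categories() changes it for everything downstream.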
Answered By: Yannic