Groupby/aggregation shows groups which were supposed to be filtered out before
Question:
I have a pandas DataFrame with a column Size
, on which I filter first and then group by and count records per group. The result contains also rows for the groups which were filtered out before, but with a count of 0:
(
df[df["Size"].isin(("XXS", "XS", "S", "M", "L", "XL", "XXL"))]
.groupby("Size")
.agg(
count=("OID", "count"),
)
.sort_values("count", ascending=False)
)
The result DataFrame is shown in the figure below. In my understanding of the groupby function, the groups which were filtered out (I double checked, they are really not anymore in the dataframe) should no longer occur in the aggregated dataframe. Even copying and resetting the index before grouping by does not change the output.
Unfortunately, I was not able to reproduce the issue with a simple example dataframe, so I assume that there is something strange happening. Does anybody have an idea why this could happen?
Result dataframe:
Answers:
df[df["Size"].isin(["XXS", "XS", "S", "M", "L", "XL", "XXL"])]
.groupby("Size")
.agg(
count=("OID", "count"),
)
.sort_values("count", ascending=False)
====================================================
isin(["XXS", "XS", "S", "M", "L", "XL", "XXL"])
Sometimes it helps to wait a weekend and think about on Monday again:
The behavior occurred due to categorical datatype of Size
column:
>>> df.dtypes
Size category
...
>>> df["Size"].unique()
['S', 'M', 'L', 'XL', 'XXL', 'XS', 'XXS']
Categories (80, object): ['100 CM', '105 CM', '24', '25', ..., 'XS/S', 'XXL', 'XXS', 'XXS/XS']
I have a pandas DataFrame with a column Size
, on which I filter first and then group by and count records per group. The result contains also rows for the groups which were filtered out before, but with a count of 0:
(
df[df["Size"].isin(("XXS", "XS", "S", "M", "L", "XL", "XXL"))]
.groupby("Size")
.agg(
count=("OID", "count"),
)
.sort_values("count", ascending=False)
)
The result DataFrame is shown in the figure below. In my understanding of the groupby function, the groups which were filtered out (I double checked, they are really not anymore in the dataframe) should no longer occur in the aggregated dataframe. Even copying and resetting the index before grouping by does not change the output.
Unfortunately, I was not able to reproduce the issue with a simple example dataframe, so I assume that there is something strange happening. Does anybody have an idea why this could happen?
Result dataframe:
df[df["Size"].isin(["XXS", "XS", "S", "M", "L", "XL", "XXL"])]
.groupby("Size")
.agg(
count=("OID", "count"),
)
.sort_values("count", ascending=False)
====================================================
isin(["XXS", "XS", "S", "M", "L", "XL", "XXL"])
Sometimes it helps to wait a weekend and think about on Monday again:
The behavior occurred due to categorical datatype of Size
column:
>>> df.dtypes
Size category
...
>>> df["Size"].unique()
['S', 'M', 'L', 'XL', 'XXL', 'XS', 'XXS']
Categories (80, object): ['100 CM', '105 CM', '24', '25', ..., 'XS/S', 'XXL', 'XXS', 'XXS/XS']