Python – Grouping by multiple columns (Categorical dtype vs numeric dtype)- why are the results so different?

Question:

I have a question about grouping pandas DataFrames by multiple columns. I am looking at some data for a TV show and trying to ensure that no season has two contestants with the same name.

Series Name
1 David
1 Edward
1 Jasmine
2 Lea
2 Jonathan
2 Louise

I want a unique count for groupings of Series + Name, which works well when the Series contains a numeric data type. I can do:

df.groupby(['Series','Name'])['Name'].count()

and get

Series Name Count
1 David 1
1 Edward 1
1 Jasmine 1
2 Lea 1
2 Jonathan 1
2 Louise 1

However, if series is set to a categorical data type then

df.groupby(['Series','Name'])['Name'].count()

returns the following table

Series Name Count
1 David 1
2 David 0
1 Edward 1
2 Edward 0
1 Jasmine 1
2 Jasmine 0
1 Jonathan 0
2 Jonathan 1
1 Lea 0
2 Lea 1
1 Louise 0
2 Louise 1

Panda groups every possible combination of series and names and then sorts alphanumerically. I don’t understand why. Any help would be most appreciated.

Asked By: user8340400

||

Answers:

Actually, this is the default behaviour of Pandas when grouping categorical value, it adds missing categories (checkout this thread).
To group only the observed categories on the dataframe you can use:

df.groupby(['Series','Name'],observed=True)['Name'].count()
Answered By: anas laaroussi