Python – Grouping by multiple columns (Categorical dtype vs numeric dtype)- why are the results so different?

Question

I have a question about grouping pandas DataFrames by multiple columns. I am looking at some data for a TV show and trying to ensure that no season has two contestants with the same name.

Series	Name
1	David
1	Edward
1	Jasmine
2	Lea
2	Jonathan
2	Louise

I want a unique count for groupings of Series + Name, which works well when the Series contains a numeric data type. I can do:

df.groupby(['Series','Name'])['Name'].count()

and get

Series	Name	Count
1	David	1
1	Edward	1
1	Jasmine	1
2	Lea	1
2	Jonathan	1
2	Louise	1

However, if series is set to a categorical data type then

df.groupby(['Series','Name'])['Name'].count()

returns the following table

Series	Name	Count
1	David	1
2	David	0
1	Edward	1
2	Edward	0
1	Jasmine	1
2	Jasmine	0
1	Jonathan	0
2	Jonathan	1
1	Lea	0
2	Lea	1
1	Louise	0
2	Louise	1

Panda groups every possible combination of series and names and then sorts alphanumerically. I don’t understand why. Any help would be most appreciated.

Asked By: user8340400

||

Source

Answer 1

Actually, this is the default behaviour of Pandas when grouping categorical value, it adds missing categories (checkout this thread).
To group only the observed categories on the dataframe you can use:

df.groupby(['Series','Name'],observed=True)['Name'].count()

Answered By: anas laaroussi

Python – Grouping by multiple columns (Categorical dtype vs numeric dtype)- why are the results so different?

Question:

Answers: