Pandas groupby with categories with redundant nan
Question:
I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. But it insists that, when grouping by multiple categories, every combination of categories must be accounted for.
I sometimes use categories even when there’s a low density of common strings, simply because those strings are long and it saves memory / improves performance. Sometimes there are thousands of categories in each column. When grouping by 3 columns, pandas
forces us to hold results for 1000^3 groups.
My question: is there a convenient way to use groupby
with categories while avoiding this untoward behaviour? I’m not looking for any of these solutions:
 Recreating all the functionality via
numpy
.  Continually converting to strings/codes before
groupby
, reverting to categories later.  Making a tuple column from group columns, then group by the tuple column.
I’m hoping there’s a way to modify just this particular pandas
idiosyncrasy. A simple example is below. Instead of 4 categories I want in the output, I end up with 12.
import pandas as pd
group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
['A', 'B', 'D', 61.34],
['B', 'A', 'C', 514.5],
['B', 'A', 'A', 765.4],
['A', 'B', 'D', 765.4]],
columns=(group_cols+['Value']))
for col in group_cols:
df[col] = df[col].astype('category')
df.groupby(group_cols, as_index=False).sum()
Group1 Group2 Group3 Value
# A A A NaN
# A A C NaN
# A A D NaN
# A B A NaN
# A B C 54.34
# A B D 826.74
# B A A 765.40
# B A C 514.50
# B A D NaN
# B B A NaN
# B B C NaN
# B B D NaN
Bounty update
The issue is poorly addressed by pandas development team (cf github.com/pandasdev/pandas/issues/17594). Therefore, I am looking for responses that address any of the following:
 Why, with reference to pandas source code, is categorical data treated differently in groupby operations?
 Why would the current implementation be preferred? I appreciate this is subjective, but I am struggling to find any answer to this question. Current behaviour is prohibitive in many situations without cumbersome, potentially expensive, workarounds.
 Is there a clean solution to override pandas treatment of categorical data in groupby operations? Note the 3 nogo routes (dropping down to numpy; conversions to/from codes; creating and grouping by tuple columns). I would prefer a solution that is “pandascompliant” to minimise / avoid loss of other pandas categorical functionality.
 A response from pandas development team to support and clarify existing treatment. Also, why should considering all category combinations not be configurable as a Boolean parameter?
Bounty update #2
To be clear, I’m not expecting answers to all of the above 4 questions. The main question I am asking is whether it’s possible, or advisable, to overwrite pandas
library methods so that categories are treated in a way that facilitates groupby
/ set_index
operations.
Answers:
I was able to get a solution that should work really well. I’ll edit my post with a better explanation. But in the mean time, does this work well for you?
import pandas as pd
group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
['A', 'B', 'D', 61.34],
['B', 'A', 'C', 514.5],
['B', 'A', 'A', 765.4],
['A', 'B', 'D', 765.4]],
columns=(group_cols+['Value']))
for col in group_cols:
df[col] = df[col].astype('category')
result = df.groupby([df[col].values.codes for col in group_cols]).sum()
result = result.reset_index()
level_to_column_name = {f"level_{i}":col for i,col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)
for col in group_cols:
result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
result
So the answer to this felt more like a proper programming than a normal Pandas question. Under the hood, all categorical series are just a bunch of numbers that index into a name of categories. I did a groupby on these underlying numbers because they don’t have the same problem as categorical columns. After doing this I had to rename the columns. I then used the from_codes constructor to create efficiently turn the list of integers back into a categorical column.
Group1 Group2 Group3 Value
A B C 54.34
A B D 826.74
B A A 765.40
B A C 514.50
So I understand that this isn’t exactly your answer but I’ve made my solution into a little function for people that have this problem in the future.
def categorical_groupby(df,group_cols,agg_fuction="sum"):
"Does a groupby on a number of categorical columns"
result = df.groupby([df[col].values.codes for col in group_cols]).agg(agg_fuction)
result = result.reset_index()
level_to_column_name = {f"level_{i}":col for i,col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)
for col in group_cols:
result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
return result
call it like this:
df.pipe(categorical_groupby,group_cols)
I found the behavior similar to what’s documented in the operations section of Categorical Data.
In particular, similar to
In [121]: cats2 = pd.Categorical(["a","a","b","b"], categories=["a","b","c"]) In [122]: df2 = pd.DataFrame({"cats":cats2,"B":["c","d","c","d"], "values":[1,2,3,4]}) In [123]: df2.groupby(["cats","B"]).mean() Out[123]: values cats B a c 1.0 d 2.0 b c 3.0 d 4.0 c c NaN d NaN
Some other words describing the related behavior in Series
and groupby
. There is also a pivot table example in the end of the section.
Apart from Series.min(), Series.max() and Series.mode(), the following
operations are possible with categorical data:Series methods like Series.value_counts() will use all categories,
even if some categories are not present in the data:Groupby will also show “unused” categories:
The words and the example are cited from Categorical Data.
There is a lot of questions to be answered here.
Let’s start by understanding what a ‘category’ is…
Definition of Categorical dtype
Quoting from pandas docs for “Categorical Data”:
Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.
There are two points I want to focus on here:

The definition of categoricals as a statistical variable:
basically, this means we have to look at them from a statistical point of view, not the “regular” programming one. i.e. they are not ‘enumerates’. Statistical categorical variables has specific operations and usecases, you can read more about them in wikipedia.
I’ll talk more about this after the second point. 
Categories are levels in R:
We can understand more about categoricals if we read aboutR
levels and factors.
I don’t know much about R, but I found this source simple and sufficient. Quoting an interesting example from it:When a factor is first created, all of its levels are stored along with the factor, and if subsets of the factor are extracted, they will retain all of the original levels. This can create problems when constructing model matrices and may or may not be useful when displaying the data using, say, the table function. As an example, consider a random sample from the letters vector, which is part of the base R distribution. > lets = sample(letters,size=100,replace=TRUE) > lets = factor(lets) > table(lets[1:5]) a b c d e f g h i j k l m n o p q r s t u v w x y z 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 Even though only five of the levels were actually represented, the table function shows the frequencies for all of the levels of the original factors. To change this, we can simply use another call to factor > table(factor(lets[1:5])) a k q s z 1 1 1 1 1
Basically this tells us that displaying/using all the categories even if they are not needed is not that uncommon. And actually, it’s the default behavior!
This is due to the usual usecases of categorical variables in statistics. Almost in all the cases you do care about all the categories even if they are not used. Take for example the pandas function cut.
I hope by this point that you understood why this behavior exists in pandas.
GroupBy on Categorical Variables
As of why does groupby
consider all the combinations of categories: I can’t say for sure, but my best guess based on a quick review of the source code (and the github issue you mentioned), is that they consider the groupby
on categorical variables an interaction between them. Hence, It should consider all the pairs/tuples (like a Cartesian product). AFAIK, this helps a lot when you are trying to do something like ANOVA.
This also means that in this context you can’t think of it in the usual SQLlike terminology.
Solutions?
Ok, but what if you don’t want this behavior?
To the best of my knowledge, and taking into account that I spent the last night tracing this in pandas source code, you can’t “disable” it. It’s hard coded in every critical step.
However, because of the way groupby
works, the actual “expanding” doesn’t happen until it’s needed. For example, when calling sum
over the groups or trying to print them.
Hence, you can do any of the following to get only the needed groups:
df.groupby(group_cols).indices
#{('A', 'B', 'C'): array([0]),
# ('A', 'B', 'D'): array([1, 4]),
# ('B', 'A', 'A'): array([3]),
# ('B', 'A', 'C'): array([2])}
df.groupby(group_cols).groups
#{('A', 'B', 'C'): Int64Index([0], dtype='int64'),
# ('A', 'B', 'D'): Int64Index([1, 4], dtype='int64'),
# ('B', 'A', 'A'): Int64Index([3], dtype='int64'),
# ('B', 'A', 'C'): Int64Index([2], dtype='int64')}
# an example
for g in df.groupby(group_cols).groups:
print(g, grt.get_group(g).sum()[0])
#('A', 'B', 'C') 54.34
#('A', 'B', 'D') 826.74
#('B', 'A', 'A') 765.4
#('B', 'A', 'C') 514.5
I know this is a nogo for you, but I’m 99% sure that there is no direct way to do this.
I agree that there should be a boolean variable to disable this behavior and use the “regular” SQLlike one.
Since Pandas 0.23.0, the groupby
method can now take a parameter observed
which fixes this issue if it is set to True
(False by default).
Below is the exact same code as in the question with just observed=True
added :
import pandas as pd
group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
['A', 'B', 'D', 61.34],
['B', 'A', 'C', 514.5],
['B', 'A', 'A', 765.4],
['A', 'B', 'D', 765.4]],
columns=(group_cols+['Value']))
for col in group_cols:
df[col] = df[col].astype('category')
df.groupby(group_cols, as_index=False, observed=True).sum()
I found this post while debugging something similar. Very good post, and I really like the inclusion of boundary conditions!
Here’s the code that accomplishes the initial goal:
r = df.groupby(group_cols, as_index=False).agg({'Value': 'sum'})
r.columns = ['_'.join(col).strip('_') for col in r.columns]
The downside of this solution is that it results in a hierarchical column index that you may want to flatten (especially if you have multiple statistics). I included flattening of column index in the code above.
I don’t know why instance methods:
df.groupby(group_cols).sum()
df.groupby(group_cols).mean()
df.groupby(group_cols).stdev()
use all unique combinations of categorical variables, while the .agg() method:
df.groupby(group_cols).agg(['count', 'sum', 'mean', 'std'])
ignores the unused level combinations of the groups. That seems inconsistent. Just happy that we can use the .agg() method and not have to worry about a Cartesian combination explosion.
Also, I think it is very common to have a much lower unique cardinality count vs. the Cartesian product. Think of all the cases where data has columns like “State”, “County”, ‘Zip”… these are all nested variables and many data sets out there have variables that have a high degree of nesting.
In our case the difference between Cartesian product of the grouping variables and the naturally occurring combinations is over 1000x (and the starting data set is over 1,000,000 rows).
Consequently, I would have voted for making observed=True the default behavior.
The main advantages of using categorical dtype are:
 Memory efficiency. The data is stored as integer codes, which are smaller in size than strings, the category type requires less memory to store the same amount of data compared to object type or int type data.
 Faster processing. Categorical data operations such as group by are generally faster than equivalent operations on object or int type data because they can be performed on the integer codes, which are more efficient to work with than strings.
The cons are:
 group by output: the output of the groupby is very messy. a lof of Nan are generated depending on your categories values.
 the same problem applies to the filtering.
 concatenation issue with category type: the category type is linked to a dictionary of values so when you concatenate or merge you will have trouble and the loss of the category dtype.
You could have more indepth information from this article: https://medium.com/gitconnected/pandascategorytypeprosandcons1bcac1bdea71