How to use groupby transform across multiple columns


I have a big dataframe, and I’m grouping by one to n columns, and want to apply a function on these groups across two columns (e.g. foo and bar).

Here’s an example dataframe:

import numpy as np
import pandas as pd

foo_function = lambda x: np.sum(x.a+x.b)

df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],
                   'd':['z','z','z','o','o','o']})

# works with apply, but I want transform:
df.groupby(['c', 'd'])[['a','b']].apply(foo_function)
# transform doesn't work!
df.groupby(['c', 'd'])[['a','b']].transform(foo_function)
TypeError: cannot concatenate a non-NDFrame object

But transform apparently isn’t able to combine multiple columns together because it looks at each column separately (unlike apply). What is the next best alternative in terms of speed / elegance? e.g. I could use apply and then create df['new_col'] by using pd.match, but that would necessitate matching over sometimes multiple groupby columns (col1 and col2) which seems really hacky / would take a fair amount of code.

–> Is there a function that is like groupby().transform that can use functions that work over multiple columns? If this doesn’t exist, what’s the best hack?

Asked By: Hillary Sanders



Circa Pandas version 0.18, it appears the original answer (below) no longer works.

Instead, if you need to do a groupby computation across multiple columns, do the multi-column computation first, and then the groupby:

df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],
                   'd':['z','z','z','o','o','o']})
df['e'] = df['a'] + df['b']
df['e'] = (df.groupby(['c', 'd'])['e'].transform('sum'))


   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22
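
The same idea extends to computations that mix row-level values with a group aggregate. Here is a minimal sketch, assuming the example frame above (the `d` values are read off the printed output) and a hypothetical result column named `ratio`: the group total of `b` is broadcast with `transform('sum')`, and the multi-column arithmetic is then done on whole columns.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# Group total of b, repeated on every row of its group
b_total = df.groupby(['c', 'd'])['b'].transform('sum')

# Row-level expression divided by the group-level aggregate
# ('ratio' is a hypothetical column name chosen for this sketch)
df['ratio'] = (df['a'] + df['b']) / b_total
print(df)
```

Because only a single column goes through `transform`, this should avoid the per-column limitation described in the question.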

Original answer:

The error message:

TypeError: cannot concatenate a non-NDFrame object

suggests that in order to concatenate, the foo_function should return an NDFrame (such as a Series or DataFrame). If you return a Series, then:

In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
    a   b
0  12  12
1  12  12
2  12  12
3   8   8
4  22  22
5  22  22
Answered By: unutbu

The way I read the question, you want to be able to do something arbitrary with the individual values from both columns. You just need to make sure to return a dataframe of the same size as the one you get passed in. I think the best way is to just make a new column, like this:

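df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],
                   'd':['z','z','z','o','o','o']})
df['e'] = 0

def f(x):
    y = (x['a'] + x['b']) / x['b'].sum()
    return pd.DataFrame({'e':y,'a':x['a'],'b':x['b']})

df.groupby(['c', 'd']).transform(f)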

    a   b   e
0   1   1   0.333333
1   2   2   0.666667
2   3   3   1.000000
3   4   4   2.000000
4   5   5   0.909091
5   6   6   1.090909

If you have a very complicated dataframe, you can pick your columns (e.g. df.groupby(['c'])[['a','b','e']].transform(f))

This sure looks very inelegant to me, but it’s still much faster than apply on large datasets.

Another alternative is to use set_index to capture all the columns you need and then pass just one column to transform.
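
A minimal sketch of that set_index idea, assuming the same example frame: `b` is moved into the index so that it travels with column `a`, and only `a` is handed to transform. The helper name `ratio` and the intermediate name `indexed` are hypothetical, picked for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# Append b to the index; c and d stay as ordinary columns for grouping
indexed = df.set_index('b', append=True)

def ratio(s):
    # s is one group's 'a' column; its index carries the matching b values
    b = s.index.get_level_values('b').to_numpy()
    return (s + b) / b.sum()

res = indexed.groupby(['c', 'd'])['a'].transform(ratio)

# transform keeps the original row order, so positional assignment is safe
df['e'] = res.to_numpy()
print(df)
```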

Answered By: Victor Chubukov

The following workaround allows you to transform with similar transform syntax, using .groupby and .apply instead.

This way you don't have to break the multi-column computation apart and fragment the processing steps.

df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],
                   'd':['z','z','z','o','o','o']})

group = ['c', 'd']
df['result'] = df.groupby(group).apply(
        # your typical transform function here
        lambda df: (df.a + df.b)/df.b.sum()
    ).reset_index(group, drop=True)

    a   b   c   d   result
0   1   1   q   z   0.333333
1   2   2   q   z   0.666667
2   3   3   q   z   1.000000
3   4   4   q   o   2.000000
4   5   5   w   o   0.909091
5   6   6   w   o   1.090909
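
When the function returns one value per group rather than one per row (as with the sum in the question), the matching over several groupby columns that the question worries about can be left to a join on the group keys. A rough sketch under that assumption, with the hypothetical names `totals` and `out`:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

group = ['c', 'd']

# One scalar per group, indexed by the (c, d) key pair
totals = (df.groupby(group)[['a', 'b']]
            .apply(lambda g: (g['a'] + g['b']).sum())
            .rename('total'))

# Look the scalar up for every row through its group keys
out = df.join(totals, on=group)
print(out)
```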
Answered By: LSM