Custom aggregation that acts on more than one column in pandas

Question:

Note that this question is not about whether pandas can apply functions to more than one column during aggregation. Here is an example:

The data frame:

A x y
foo 0 0
foo 1 1
foo 2 2
foo 3 3
bar 0 2
bar 2 3
bar 4 4
bar 6 5

I want to group this table by column A and compute the linear regression y=k*x+b on each group. So we want to achieve this:

A k b
foo 1.0 0.0
bar 0.5 2.0

I tried grouping by column A and using the aggregate method:

def f(col):
    pass

grouped = table.groupby('A')
grouped.aggregate(f)

However, I found that this method splits the table into Series and feeds each Series to the function f separately, so f cannot access two columns at the same time.

So, how can I write such an "aggregation" function that acts on multiple columns in a split-apply-combine style?

Asked By: core_exe


Answers:

If you need to process multiple columns together, use GroupBy.apply:

def f(x):
    # x is the sub-DataFrame for one group, so all of its columns are available
    print(x)

grouped = table.groupby('A').apply(f)
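
As a minimal sketch of how this could look for the regression in the question: the function below receives one group as a DataFrame and fits a line through its x and y columns. It swaps in numpy.polyfit for the fit, and the name fit_line and the explicit [['x', 'y']] column selection are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# The data frame from the question
table = pd.DataFrame({
    "A": ["foo"] * 4 + ["bar"] * 4,
    "x": [0, 1, 2, 3, 0, 2, 4, 6],
    "y": [0, 1, 2, 3, 2, 3, 4, 5],
})

def fit_line(group):
    # group is a DataFrame holding the x and y columns of a single group
    # np.polyfit with deg=1 returns the coefficients as (slope, intercept)
    k, b = np.polyfit(group["x"], group["y"], deg=1)
    return pd.Series({"k": k, "b": b})

# Selecting the value columns explicitly keeps the grouping column
# out of the frame passed to fit_line
result = table.groupby("A")[["x", "y"]].apply(fit_line)
print(result)
```

Here result is indexed by A and has the columns k and b; calling result.reset_index() would turn A back into a regular column.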
Answered By: jezrael

Use groupby.apply with scipy.stats.linregress:

import pandas as pd
from scipy.stats import linregress

out = (df.groupby('A', as_index=False)
         .apply(lambda g: pd.Series(linregress(g['x'], g['y'])[:2],
                                    index=['k', 'b']))
       )

NB: the first two values returned by linregress are the slope and the intercept, i.e. your k and b.
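
To make that concrete, here is a small sketch that runs linregress on the "bar" rows from the question and compares positional access with the named slope and intercept fields. The variable names are only illustrative.

```python
from scipy.stats import linregress

# x and y values of the "bar" group from the question
res = linregress([0, 2, 4, 6], [2, 3, 4, 5])

# Positional access: the first two fields are the slope and the intercept
k, b = res[:2]

# The same values are available by name
print(k == res.slope, b == res.intercept)
print(k, b)
```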

Output:

     A    k    b
0  bar  0.5  2.0
1  foo  1.0  0.0

Solution with custom function:

import pandas as pd
from scipy.stats import linregress

def f(x):
    # x is the sub-DataFrame for one group
    t = linregress(x['x'], x['y'])
    return pd.Series({'k': t.slope, 'b': t.intercept})

df = df.groupby('A', as_index=False).apply(f)
print(df)
     A    k    b
0  bar  0.5  2.0
1  foo  1.0  0.0
Answered By: mozway