Python pandas equivalent to R groupby mutate
Question:
So in R when I have a data frame consisting of say 4 columns, call it df
and I want to compute the ratio by sum product of a group, I can it in such a way:
// generate data
df = data.frame(a=c(1,1,0,1,0),b=c(1,0,0,1,0),c=c(10,5,1,5,10),d=c(3,1,2,1,2));
| a b c d |
| 1 1 10 3 |
| 1 0 5 1 |
| 0 0 1 2 |
| 1 1 5 1 |
| 0 0 10 2 |
// compute sum product ratio
df = df%>% group_by(a,b) %>%
mutate(
ratio=c/sum(c*d)
);
| a b c d ratio |
| 1 1 10 3 0.286 |
| 1 1 5 1 0.143 |
| 1 0 5 1 1 |
| 0 0 1 2 0.045 |
| 0 0 10 2 0.454 |
But in python I need to resort to loops.
I know there should be a more elegant way than raw loops in python, anyone got any ideas?
Answers:
According to this thread on pandas github we can use the transform()
method to replicate the combination of dplyr::groupby()
and dplyr::mutate()
. For this example, it would look as follows:
df = pd.DataFrame(
dict(
a=(1 , 1, 0, 1, 0 ),
b=(1 , 0, 0, 1, 0 ),
c=(10, 5, 1, 5, 10),
d=(3 , 1, 2, 1, 2 ),
)
).assign(
prod_c_d = lambda x: x['c'] * x['d'],
ratio = lambda x: x['c'] / (x.groupby(['a','b']).transform('sum')['prod_c_d'])
)
This example uses pandas method chaining. For more information on how to use method chaining to replicate dplyr
workflows see this blogpost.
The method using apply()
and groupby()
does not work for me because it does not seem to be adaptable. For example, it does not work if we delete g.c/
from the lambda expression.
df['ratio'] = df.groupby(['a','b'], group_keys=False)
.apply(lambda g: (g.c * g.d).sum() )
It’s pretty easy to translate your R code into python with datar
:
>>> from datar.all import f, c, tibble, sum, group_by, mutate
[2021-06-24 13:32:29][datar][WARNING] Builtin name "sum" has been overriden by datar.
>>>
>>> df = tibble(a=c(1,1,0,1,0),b=c(1,0,0,1,0),c=c(10,5,1,5,10),d=c(3,1,2,1,2))
>>> df
a b c d
<int64> <int64> <int64> <int64>
0 1 1 10 3
1 1 0 5 1
2 0 0 1 2
3 1 1 5 1
4 0 0 10 2
>>> df >> group_by(f.a, f.b) >> mutate(ratio=f.c/sum(f.c*f.d))
a b c d ratio
<int64> <int64> <int64> <int64> <float64>
0 1 1 10 3 0.285714
1 1 0 5 1 1.000000
2 0 0 1 2 0.045455
3 1 1 5 1 0.142857
4 0 0 10 2 0.454545
[Groups: a, b (n=3)]
Disclaimer: I am the author of the datar
package.
So in R when I have a data frame consisting of say 4 columns, call it df
and I want to compute the ratio by sum product of a group, I can it in such a way:
// generate data
df = data.frame(a=c(1,1,0,1,0),b=c(1,0,0,1,0),c=c(10,5,1,5,10),d=c(3,1,2,1,2));
| a b c d |
| 1 1 10 3 |
| 1 0 5 1 |
| 0 0 1 2 |
| 1 1 5 1 |
| 0 0 10 2 |
// compute sum product ratio
df = df%>% group_by(a,b) %>%
mutate(
ratio=c/sum(c*d)
);
| a b c d ratio |
| 1 1 10 3 0.286 |
| 1 1 5 1 0.143 |
| 1 0 5 1 1 |
| 0 0 1 2 0.045 |
| 0 0 10 2 0.454 |
But in python I need to resort to loops.
I know there should be a more elegant way than raw loops in python, anyone got any ideas?
According to this thread on pandas github we can use the transform()
method to replicate the combination of dplyr::groupby()
and dplyr::mutate()
. For this example, it would look as follows:
df = pd.DataFrame(
dict(
a=(1 , 1, 0, 1, 0 ),
b=(1 , 0, 0, 1, 0 ),
c=(10, 5, 1, 5, 10),
d=(3 , 1, 2, 1, 2 ),
)
).assign(
prod_c_d = lambda x: x['c'] * x['d'],
ratio = lambda x: x['c'] / (x.groupby(['a','b']).transform('sum')['prod_c_d'])
)
This example uses pandas method chaining. For more information on how to use method chaining to replicate dplyr
workflows see this blogpost.
The method using apply()
and groupby()
does not work for me because it does not seem to be adaptable. For example, it does not work if we delete g.c/
from the lambda expression.
df['ratio'] = df.groupby(['a','b'], group_keys=False)
.apply(lambda g: (g.c * g.d).sum() )
It’s pretty easy to translate your R code into python with datar
:
>>> from datar.all import f, c, tibble, sum, group_by, mutate
[2021-06-24 13:32:29][datar][WARNING] Builtin name "sum" has been overriden by datar.
>>>
>>> df = tibble(a=c(1,1,0,1,0),b=c(1,0,0,1,0),c=c(10,5,1,5,10),d=c(3,1,2,1,2))
>>> df
a b c d
<int64> <int64> <int64> <int64>
0 1 1 10 3
1 1 0 5 1
2 0 0 1 2
3 1 1 5 1
4 0 0 10 2
>>> df >> group_by(f.a, f.b) >> mutate(ratio=f.c/sum(f.c*f.d))
a b c d ratio
<int64> <int64> <int64> <int64> <float64>
0 1 1 10 3 0.285714
1 1 0 5 1 1.000000
2 0 0 1 2 0.045455
3 1 1 5 1 0.142857
4 0 0 10 2 0.454545
[Groups: a, b (n=3)]
Disclaimer: I am the author of the datar
package.