python/pandas equivalent to dplyr 1.0.0 summarize(across())

Question:

In R, I find the following very useful when dealing with many variables:

library(dplyr)
dat <- group_by(mtcars, cyl) 
summarize(dat, across(c('mpg','disp'), sum), across(c('drat','wt','qsec'), mean))
# A tibble: 3 x 5
    cyl  disp    hp  drat    wt
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     4 1156.   909  4.07  2.29
2     6 1283.   856  3.59  3.12
3     8 4943.  2929  3.23  4.00

Or even better, selecting with pseudo-regex

summarize(dat, across(ends_with('p'), sum), across(ends_with('t'), mean))

In pandas, the equivalent seems to pass variables one-by-one into a dictionary, eg from this gist:

group_agg = df.groupby("group1").agg({
  "var1" : ["mean"], 
  "var2" : ["sum"], 
  "var3" : ["mean"]
  })

Is there a less verbose way to do this operation in pandas, or with some other package?

Asked By: Kyouma

||

Answers:

Assuming you want this for a wide dataframe with many aggregation target columns, a dictionary comprehension could mimic the regex:

group_agg = df.groupby('group1').agg(
    {var: 'sum' if var[-1]=='p' else 'mean' for var in cols})

In general you can build dict-comprehension with other conditions such as dtypes, str in string, count of regex patterns, etc.

Answered By: RichieV

For the first scenario, pandas concat suffices :

dat = df.groupby("cyl")

pd.concat([dat[["mpg", "disp"]].sum(), 
           dat[["drat", "wt", "qsec"]].mean()], 
          axis=1)

For the regex/string processing part, verbose is unavoidable :

cols_p = [col for col in df.columns if col.endswith("p")]
cols_t = [col for col in df.columns if col.endswith("t")]
pd.concat((dat[cols_p].sum(), dat[cols_t].mean()), axis=1)
       disp    hp      drat        wt
cyl
4    1156.5   909  4.070909  2.285727
6    1283.2   856  3.585714  3.117143
8    4943.4  2929  3.229286  3.999214

Another option, which IMO is simpler, is to build the dictionary before aggregating -inspired by @RichieV:

aggs = {key:"mean" 
        if key.endswith("t") 
        else "sum" 
        for key in df 
        if key.endswith(('p', 't'))}

dat.groupby('cyl').agg(aggs)

       disp    hp      drat        wt
cyl
4    1156.5   909  4.070909  2.285727
6    1283.2   856  3.585714  3.117143
8    4943.4  2929  3.229286  3.999214
Answered By: sammywemmy

Using datar, it is very easy to transition your R code to python code without learning APIs of pandas:

>>> from datar import f
>>> from datar.dplyr import summarize, group_by, across, ends_with
>>> from datar.base import sum, mean
>>> from datar.datasets import mtcars
>>> 
>>> dat = group_by(mtcars, f.cyl)
>>> dat >> summarize(across(ends_with('p'), sum), across(ends_with('t'), mean))
   cyl    disp    hp      drat        wt
0    4  1156.5   909  4.070909  2.285727
1    6  1283.2   856  3.585714  3.117143
2    8  4943.4  2929  3.229286  3.999214

I am the author of the package. Feel free to submit issues if you have any questions.

Answered By: Panwen Wang
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.