how to apply a function based on a condition to individual column in df.groupby.agg()

Question:

I code mostly in R but I want to improve my python skills so now I do my analysis in python. When working with a data frame in R I would constantly make a new summarized dataframe with group_by and summarize. I am trying to do this in python with df.groupby().agg() However I’m finding the process difficult. For example, in R I could apply functions and conditions to individual columns in summarize(). I could also build off a new column just created in the previous line of code inside of summarize. I find this actually really cool what one can do in R.

Anyways how can I apply a function to an individual column like mean() where a certain condition is met in another column? The following code shows how to do this in R and my attempt in python.

library(dplyr)

df %>% group_by(user_id) %>%
       summarize(avg_parking_duration = mean(duration[type == "parking"]),
                 avg_reservation_duration =  mean(duration[type == "reservation"])
)

The code above would group the data frame of parking transactions by user_id. It would then calculate the average parking duration from the column duration where the column type == "parking". Likewise the same is done for the duration of transactions of the type "reservation".

all that I have seen online for python is something like this

df.groupby('user_id').agg(
        avg_parking_duration = ('duration','mean')
    )

or


df.groupby('user_id').agg(
        avg_parking_duration = ('duration',lambda x: x.mean())
    )

I don’t know unfortunately how to apply a condition then specifically to duration based on another column type. How do you go about this?

Please note I do not want to filter out first the data frame by type parkings or reservations before the agg() because I would like to create new multiple columns with different functions inside of agg() and I would need the filtered out information.

dataset

data = {'user_id': [u1, u1,u1,u1, u2, u2,u2,u2], 'type': ['parking', 'parking', 'reservation', 'reservation','parking', 'parking', 'reservation', 'reservation'],'duration': [43,60,20,23,143,290,25,15]}
pd.DataFrame.from_dict(data)
Asked By: Nick

||

Answers:

Please note I do not want to filter out first the data frame by type parkings or reservations before the agg()

It is problematic. Because GroupBy.agg working with each column separately, so cannot filter type and get mean of duration together.

Solution with helper column park and used for mean or hacky filtering original DataFrame per groups – both same ouput:

Solution with helper columns before groupby.agg:

df = (df.assign(park = df["duration"].where(df["type"] == 'parking'),
                res = df["duration"].where(df["type"] == 'reservation'))
       .groupby('user_id').agg(avg_parking_duration = ('park','mean'),
                               avg_reservation_duration =('res','mean')
      ))
print (df)
         avg_parking_duration  avg_reservation_duration
user_id                                                
1                        51.5                      21.5
2                       216.5                      20.0

Hacky way:

#necessary unique values in index
df = df.reset_index(drop=True)

f = lambda x: x.where(df.loc[x.index, "type"] == 'parking').mean()
f1 = lambda x: x.where(df.loc[x.index, "type"] == 'reservation').mean()

df = (df.groupby('user_id').agg(avg_parking_duration = ('duration',f),
                               avg_reservation_duration = ('duration',f1)
      ))
print (df)
         avg_parking_duration  avg_reservation_duration
user_id                                                
1                        51.5                      21.5
2                       216.5                      20.0
Answered By: jezrael
Categories: questions Tags: , , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.