Groupby.mean() if condition is true

Question:

I have the following dataframe (the current output of the code below):

   index  user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0      0     1                      1      clothes            NaN                1.0                      6.0
1      1     1                      1  electronics            2.0                2.0                      6.0
2      2     1                     15    furniture            NaN               15.0                      6.0
3      3     2                     15    furniture            NaN               15.0                     15.0
4      4     2                     15    furniture            NaN               15.0                     15.0

Per user, I want to combine shipping_cost with default_shipping_cost and calculate the mean of the combined values, but only if at least one shipping_cost is given for that user.

Explanation:

  • user_1: shipping_cost is given at least once, so the mean can be calculated
  • user_2: no shipping_cost is given, so the result should be NaN

Code:

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option('display.width', 1000)

df = pd.DataFrame(
    {
        'user': [1, 1, 1, 2, 2],
        'default_shipping_cost': [1, 1, 15, 15, 15],
        'category': ['clothes', 'electronics', 'furniture', 'furniture', 'furniture'],
        'shipping_cost': [None, 2, None, None, None]
    }
)
df.reset_index(inplace=True)
df['shipping_coalesce'] = df.shipping_cost.combine_first(df.default_shipping_cost)

dfg_user = df.groupby(['user'])
df['estimated_shipping_cost'] = dfg_user['shipping_coalesce'].transform("mean")
print(df)

Expected output:

   index  user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0      0     1                      1      clothes            NaN                1.0                      6.0
1      1     1                      1  electronics            2.0                2.0                      6.0
2      2     1                     15    furniture            NaN               15.0                      6.0
3      3     2                     15    furniture            NaN               15.0                      NaN
4      4     2                     15    furniture            NaN               15.0                      NaN
Asked By: MafMal

Answers:

Add an extra condition with transform('any') and where:

df['estimated_shipping_cost'] = (dfg_user['shipping_coalesce'].transform('mean')
                                .where(dfg_user['shipping_cost'].transform('any'))
                                )

Output:

   index  user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0      0     1                      1      clothes            NaN                1.0                      6.0
1      1     1                      1  electronics            2.0                2.0                      6.0
2      2     1                     15    furniture            NaN               15.0                      6.0
3      3     2                     15    furniture            NaN               15.0                      NaN
4      4     2                     15    furniture            NaN               15.0                      NaN

Intermediate:

dfg_user['shipping_cost'].transform('any')

0     True
1     True
2     True
3    False
4    False
Name: shipping_cost, dtype: bool
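
One caveat worth checking: 'any' tests truthiness, so a group whose only given shipping_cost is 0 would presumably be treated like a group with no value at all. The following is a minimal sketch of a variant that tests for presence with notna() instead. The extra user 3 with a cost of 0 and the names has_cost and est are hypothetical additions for illustration, not part of the question's data.

```python
import pandas as pd

# Data from the question plus a hypothetical user 3 whose only cost is 0
df = pd.DataFrame(
    {
        "user": [1, 1, 1, 2, 2, 3],
        "default_shipping_cost": [1, 1, 15, 15, 15, 4],
        "shipping_cost": [None, 2, None, None, None, 0],
    }
)
df["shipping_coalesce"] = df["shipping_cost"].combine_first(df["default_shipping_cost"])

# True for every row of a user that has at least one non-missing shipping_cost
has_cost = df["shipping_cost"].notna().groupby(df["user"]).transform("any")

# Per-user mean, kept only where the user has a given shipping_cost
est = df.groupby("user")["shipping_coalesce"].transform("mean").where(has_cost)
print(est)
```

If a shipping_cost of 0 can never occur in the data, both versions should give the same result.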
Answered By: mozway

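Try selecting the users with at least one shipping_cost first, then computing the mean only for their rows: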

valid_users = df.loc[df["shipping_cost"].notna(), "user"].unique()

df["estimated_shipping_cost"] = (
    df[df["user"].isin(valid_users)]
    .groupby("user")["shipping_coalesce"]
    .transform("mean")
)

print(df)

Prints:

   user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0     1                      1      clothes            NaN                1.0                      6.0
1     1                      1  electronics            2.0                2.0                      6.0
2     1                     15    furniture            NaN               15.0                      6.0
3     2                     15    furniture            NaN               15.0                      NaN
4     2                     15    furniture            NaN               15.0                      NaN
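
This works because column assignment aligns on the index: the filtered result only carries the row labels of the valid users, and rows without a matching label are left as NaN. A small sketch of that behaviour, using a made-up three-row frame (the names value, partial and user_mean are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"user": [1, 1, 2], "value": [1.0, 3.0, 5.0]})

# Mean computed only for the rows of user 1; the result keeps row labels 0 and 1
partial = df[df["user"] == 1].groupby("user")["value"].transform("mean")

# Assignment aligns on the index, so row 2 (user 2) has no match and becomes NaN
df["user_mean"] = partial
print(df)
```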
Answered By: Andrej Kesely

Using pandas' aggregate function; the explanation is in the code comments:

import numpy as np

# Aggregate function: True if every shipping_cost in the group is NaN
def all_missing(a):
    return a.isna().all()

# Apply it per user, to the 'shipping_cost' column only
result = df[['user', 'shipping_cost']].groupby('user').aggregate(all_missing)[['shipping_cost']]
# Rename the column to make its meaning clearer
result = result.rename(columns={'shipping_cost': 'shipping_cost_check'})
# Join the per-user check back onto the original rows
df = df.join(result, on='user')
# Replace estimated_shipping_cost with NaN where the check above returned True
df.loc[df['shipping_cost_check'], 'estimated_shipping_cost'] = np.nan
# Drop the helper column
df = df.drop('shipping_cost_check', axis=1)

Output:

   index  user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0      0     1                      1      clothes            NaN                1.0                      6.0
1      1     1                      1  electronics            2.0                2.0                      6.0
2      2     1                     15    furniture            NaN               15.0                      6.0
3      3     2                     15    furniture            NaN               15.0                      NaN
4      4     2                     15    furniture            NaN               15.0                      NaN
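
The join and drop steps can probably be avoided by mapping the per-user check back onto the rows. A minimal sketch under that assumption, rebuilding the question's data so it runs on its own (the name all_missing is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "user": [1, 1, 1, 2, 2],
        "default_shipping_cost": [1, 1, 15, 15, 15],
        "shipping_cost": [None, 2, None, None, None],
    }
)
df["shipping_coalesce"] = df["shipping_cost"].combine_first(df["default_shipping_cost"])
df["estimated_shipping_cost"] = df.groupby("user")["shipping_coalesce"].transform("mean")

# One boolean per user: True if all of that user's shipping_cost values are NaN
all_missing = df["shipping_cost"].isna().groupby(df["user"]).all()

# Map the per-user flag onto each row and blank out the estimate where it is True
df.loc[df["user"].map(all_missing), "estimated_shipping_cost"] = np.nan
print(df)
```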
Answered By: Gia Huynh