Groupby.mean() if condition is true
Question:
I have the following dataframe:
   index  user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0      0     1                      1      clothes            NaN                1.0                      6.0
1      1     1                      1  electronics            2.0                2.0                      6.0
2      2     1                     15    furniture            NaN               15.0                      6.0
3      3     2                     15    furniture            NaN               15.0                     15.0
4      4     2                     15    furniture            NaN               15.0                     15.0
Per user, I want to combine shipping_cost with default_shipping_cost and calculate the mean of the combined values, but only if at least one shipping_cost is given for that user.
Explanation:
- user 1: shipping_cost is given (at least once), so we can calculate the mean
- user 2: there is no shipping_cost at all, so I would like to get NaN
Code:
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option('display.width', 1000)
df = pd.DataFrame(
    {
        'user': [1, 1, 1, 2, 2],
        'default_shipping_cost': [1, 1, 15, 15, 15],
        'category': ['clothes', 'electronics', 'furniture', 'furniture', 'furniture'],
        'shipping_cost': [None, 2, None, None, None]
    }
)
df.reset_index(inplace=True)
df['shipping_coalesce'] = df.shipping_cost.combine_first(df.default_shipping_cost)
dfg_user = df.groupby(['user'])
df['estimated_shipping_cost'] = dfg_user['shipping_coalesce'].transform("mean")
print(df)
Expected output:
   index  user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0      0     1                      1      clothes            NaN                1.0                      6.0
1      1     1                      1  electronics            2.0                2.0                      6.0
2      2     1                     15    furniture            NaN               15.0                      6.0
3      3     2                     15    furniture            NaN               15.0                      NaN
4      4     2                     15    furniture            NaN               15.0                      NaN
Answers:
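Add an extra condition with transform('any') and where. transform('any') skips NaN, so it is False for a user whose shipping_cost values are all missing, and where then replaces that user's mean with NaN. Note that a shipping_cost of 0 is also falsy, so this assumes real costs are non-zero: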
df['estimated_shipping_cost'] = (dfg_user['shipping_coalesce'].transform('mean')
                                 .where(dfg_user['shipping_cost'].transform('any'))
                                )
Output:
   index  user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0      0     1                      1      clothes            NaN                1.0                      6.0
1      1     1                      1  electronics            2.0                2.0                      6.0
2      2     1                     15    furniture            NaN               15.0                      6.0
3      3     2                     15    furniture            NaN               15.0                      NaN
4      4     2                     15    furniture            NaN               15.0                      NaN
Intermediate:
dfg_user['shipping_cost'].transform('any')
0 True
1 True
2 True
3 False
4 False
Name: shipping_cost, dtype: bool
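If a shipping_cost of 0 can legitimately occur, one possible variant is to test for presence with notna() instead of truthiness. This is only a sketch: user 3 and its zero cost are made-up data for illustration, and has_cost is a hypothetical helper name, not part of the question.

```python
import pandas as pd

# Illustrative data (assumption): user 3 has an explicit shipping_cost of 0.
df = pd.DataFrame(
    {
        "user": [1, 1, 2, 2, 3, 3],
        "default_shipping_cost": [1, 15, 15, 15, 4, 4],
        "shipping_cost": [2.0, None, None, None, 0.0, None],
    }
)
df["shipping_coalesce"] = df["shipping_cost"].combine_first(df["default_shipping_cost"])

# True for every row of a user that has at least one non-missing shipping_cost.
# notna() marks 0.0 as present, so a zero cost still counts as "given".
has_cost = df["shipping_cost"].notna().groupby(df["user"]).transform("any")

# Keep the per-user mean only where the user has a given shipping_cost.
df["estimated_shipping_cost"] = (
    df.groupby("user")["shipping_coalesce"].transform("mean").where(has_cost)
)
print(df)
```

Here user 2 still ends up with NaN, while user 3 keeps its mean even though its only given cost is 0.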
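Try filtering to the users that have at least one shipping_cost before grouping. The assignment aligns on the index, so the rows of the other users are left as NaN: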
valid_users = df.loc[df["shipping_cost"].notna(), "user"].unique()
df["estimated_shipping_cost"] = (
    df[df["user"].isin(valid_users)]
    .groupby("user")["shipping_coalesce"]
    .transform("mean")
)
print(df)
Prints:
   user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0     1                      1      clothes            NaN                1.0                      6.0
1     1                      1  electronics            2.0                2.0                      6.0
2     1                     15    furniture            NaN               15.0                      6.0
3     2                     15    furniture            NaN               15.0                      NaN
4     2                     15    furniture            NaN               15.0                      NaN
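The filtered approach relies on index alignment during column assignment. A minimal sketch of that behaviour on a toy frame (the cost and mean_cost columns and the partial variable are invented for this illustration):

```python
import pandas as pd

df = pd.DataFrame({"user": [1, 1, 2], "cost": [1.0, 3.0, 5.0]})

# Compute the group mean only for the rows of user 1. The result keeps
# the original index labels 0 and 1 and has no entry for label 2.
partial = df[df["user"] == 1].groupby("user")["cost"].transform("mean")

# Assignment aligns on the index, so the row with label 2 receives NaN.
df["mean_cost"] = partial
print(df)
```

This is why no explicit NaN assignment is needed for the users that were filtered out.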
Using pandas' aggregate function; the explanation is in the code comments. This starts from the question's df, where estimated_shipping_cost already holds the per-user mean:
import numpy as np
# aggregate function: True if every shipping_cost of a user is NaN
def testFunc(a):
    return a.isnull().values.all()
# apply that aggregate function to the 'shipping_cost' column only
result = df[['user', 'shipping_cost']].groupby(['user']).aggregate(testFunc)[['shipping_cost']]
# rename to make it cleaner
result = result.rename(columns={"shipping_cost": "shipping_cost_check"})
# join the per-user result back onto df
df = df.join(result, on='user')
# replace estimated_shipping_cost with NaN where the check above returned True
df.loc[df['shipping_cost_check'], 'estimated_shipping_cost'] = np.nan
# drop the extra column
df = df.drop('shipping_cost_check', axis=1)
Output:
   index  user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0      0     1                      1      clothes            NaN                1.0                      6.0
1      1     1                      1  electronics            2.0                2.0                      6.0
2      2     1                     15    furniture            NaN               15.0                      6.0
3      3     2                     15    furniture            NaN               15.0                      NaN
4      4     2                     15    furniture            NaN               15.0                      NaN
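The same "all missing" check can probably be written without the temporary column and join, by broadcasting the per-user flag with transform. A sketch on the question's data (the all_missing name is made up here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "user": [1, 1, 1, 2, 2],
        "default_shipping_cost": [1, 1, 15, 15, 15],
        "shipping_cost": [None, 2, None, None, None],
    }
)
df["shipping_coalesce"] = df["shipping_cost"].combine_first(df["default_shipping_cost"])
df["estimated_shipping_cost"] = df.groupby("user")["shipping_coalesce"].transform("mean")

# Per-row flag: True when every shipping_cost of that row's user is missing.
all_missing = df["shipping_cost"].isna().groupby(df["user"]).transform("all")

# Blank out the estimate for those users.
df.loc[all_missing, "estimated_shipping_cost"] = np.nan
print(df)
```

Because transform returns a Series aligned with df, the flag can be used directly as a boolean mask.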