Combine rows and average column if another column is minimum

Question:

I have a pandas dataframe:

             Server  Clock 1  Clock 2  Power   diff
0  PhysicalWindows1     3400   3300.0   58.5  100.0
1  PhysicalWindows1     3400   3500.0   63.0  100.0
2  PhysicalWindows1     3400   2900.0   25.0  500.0
3  PhysicalWindows2     3600   3300.0   83.8  300.0
4  PhysicalWindows2     3600   3500.0   65.0  100.0
5  PhysicalWindows2     3600   2900.0   10.0  700.0
6    PhysicalLinux1     2600      NaN    NaN    NaN
7    PhysicalLinux1     2600      NaN    NaN    NaN
8              Test     2700   2700.0   30.0    0.0

Basically, I would like to average the Power for each server but only if the difference is minimum. For example, if you look at the ‘PhysicalWindows1’ server, I have 3 rows, two have a diff of 100, and one has a diff of 500. Since I have two rows with a diff of 100, I would like to average out my Power of 58.5 and 63.0. For ‘PhysicalWindows2’, since there is only one row which has the least diff, we return the power for that one row – 65. If NaN, return Nan, and if there is only one match, return the power for that one match.

My resultant dataframe would look like this:

             Server  Clock 1            Power  
0  PhysicalWindows1     3400    (58.5+63.0)/2
1  PhysicalWindows2     3600             65.0
2    PhysicalLinux1     2600              NaN
3              Test     2700             30.0
Asked By: ReactNewbie123

||

Answers:

Here is a possible solution using df.groupby() and pd.merge()

grp_df = df.groupby(['Server', 'diff'])['Power'].mean().reset_index()
grp_df = grp_df.groupby('Server').first().reset_index()
grp_df = grp_df.rename(columns={'diff': 'min_diff', 'Power': 'Power_avg'})

df_out = (pd.merge(df[['Server', 'Clock 1']].drop_duplicates(), grp_df, on='Server', how='left')
            .drop(['min_diff'], axis=1))
print(df_out)

             Server  Clock 1  Power_avg
0  PhysicalWindows1     3400      60.75
1  PhysicalWindows2     3600      65.00
2    PhysicalLinux1     2600        NaN
3              Test     2700      30.00
Answered By: Jamiu S.

Use groupby with dropna=False to avoid to remove PhysicalLinux1 and sort=True to sort index level (lowest diff on top) then drop_duplicates to keep only one instance of (Server, Clock 1):

out = (df.groupby(['Server', 'Clock 1', 'diff'], dropna=False, sort=True)['Power']
         .mean().droplevel('diff').reset_index().drop_duplicates(['Server', 'Clock 1']))

# Output
             Server  Clock 1  Power
0    PhysicalLinux1     2600    NaN
1  PhysicalWindows1     3400  60.75
3  PhysicalWindows2     3600  65.00
6              Test     2700  30.00
Answered By: Corralien

Use a double groupby, first groupby.transform to mask the non-max Power, then groupby.agg to aggregate

m = df.groupby('Server')['diff'].transform('min').eq(df['diff'])

(df.assign(Power=df['Power'].where(m))
 .groupby('Server', sort=False, as_index=False)
 .agg({'Clock 1': 'first', 'Power': 'mean'})
)

Output:

             Server  Clock 1  Power
0  PhysicalWindows1     3400  60.75
1  PhysicalWindows2     3600  65.00
2    PhysicalLinux1     2600    NaN
3              Test     2700  30.00
Answered By: mozway
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Server": ['PhysicalWindows1', 'PhysicalWindows1', 'PhysicalWindows1', 'PhysicalWindows2', 
                   'PhysicalWindows2', 'PhysicalWindows2', 'PhysicalLinux1', 'PhysicalLinux1', 'Test'],
        "Clock 1": [3400, 3400, 3400, 3600, 3600, 3600, 2600, 2600, 2700],
        "Clock 2": [3300.0, 3500.0, 2900.0, 3300.0, 3500.0, 2900.0, np.nan, np.nan, 2700.0],
        "Power": [58.5, 63.0, 25.0, 83.8, 65.0, 10.0, np.nan, np.nan, 30.0],
        "diff": [100.0, 100.0, 500.0, 300.0, 100.0, 700.0, np.nan, np.nan, 0.0]
    }
)

r = (df.groupby(['Server'])
       .apply(lambda d: d[d['diff']==d['diff'].min()])
       .reset_index(drop=True)
       .groupby(['Server'])
       .agg({"Clock 1":'mean', "Power":'mean', "diff":'first'})
       .reset_index()
     )

r = (r.append(df[df['diff'].isnull()]
                           .drop_duplicates()
                           ).drop(['Clock 2', 'diff'], axis=1)
                            .reset_index(drop=True)
                            )

print(r)
             Server  Clock 1  Power
0  PhysicalWindows1   3400.0  60.75
1  PhysicalWindows2   3600.0  65.00
2              Test   2700.0  30.00
3    PhysicalLinux1   2600.0    NaN
Answered By: Laurent B.
def function1(dd:pd.DataFrame):
    dd1=dd.loc[dd.loc[:,"diff"].eq(dd.loc[:,"diff"].min())]
    return pd.DataFrame({"Clock 1":dd1[["Clock 1"]].min().squeeze(),"Power":dd1.Power.mean()},index=[dd.name])

df1.groupby('Server',sort=False).apply(function1).droplevel(1)

out

                 Clock 1  Power
Server                          
PhysicalWindows1   3400.0  60.75
PhysicalWindows2   3600.0  65.00
PhysicalLinux1        NaN    NaN
Test               2700.0  30.00
Answered By: G.G
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.