How to create new columns with name of columns in list with the highest value per ID, metioned after coma if need in Python Pandas?

Question:

I have Pandas DataFrame like below (I can add that my DataFrame is definitely bigger, so I need to do below aggregation only for selected columns):

ID   | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111  | 10          | 10          | 320       | 120
222  | 15          | 80          | 500       | 500
333  | 0           | 0           | 110       | 350
444  | 20          | 5           | 0         | 0
555  | 0           | 0           | 0         | 0
666  | 10          | 20          | 60        | 50

Requirements:

  • I need to create new column "TOP_COUNT_2" where will be name of column (COUNT_COL_A or COUNT_COL_B) with the highest value per each ID,

    • if some ID has same values in all "COUNT_" columns take to "TOP_COUNT_2" all columns names with prefix "COUNT_" mentioned after the decimal point
  • I need to create new column "TOP_SUM_2" where will be name of column (SUM_COL_A or SUM_COL_B) with the highest value per each ID,

    • if some ID has same values in all "SUM_" columns take to "TOP_SUM_2" all columns names with prefix "COUNT_" mentioned after the decimal point
  • If there is 0 in both columns with prefix COUNT_ then give NaN in column TOP_COUNT

  • If there is 0 in both columns with prefix SUM_ then give NaN in column TOP_SUM

Desire output:

ID   | CONT_COL_A  | CNT_COL_B   | SUM_COL_A | SUM_COL_B  | TOP_COUNT_2          | TOP_SUM_2   
-----|-------------|-------------|-----------|------------|----------------------|-----------
111  | 10          | 10          | 320       | 120        | CNT_COL_A, CNT_COL_B | SUM_COL_A
222  | 15          | 80          | 500       | 500        | COUNT_COL_B          | SUM_COL_A, SUM_COL_B
333  | 0           | 0           | 110       | 350        | NaN                  | SUM_COL_B
444  | 20          | 5           | 0         | 0          | COUNT_COL_A          | NaN
555  | 0           | 0           | 0         | 0          | NaN                  | NaN
666  | 10          | 20          | 60        | 50         | COUNT_COL_B          | SUM_COL_A

How can i do that in Python Pandas ?

Asked By: dingaro

||

Answers:

First create mask for processing only non 0 only rows by DataFrame.any with Series.ne with boolean indexing and DataFrame.loc, then compare by maximal values and for join column names use DataFrame.dot trick with columns names with separator:

cols1 = ['COUNT_COL_A' , 'COUNT_COL_B']
cols2 = ['SUM_COL_A','SUM_COL_B']

m1 = df[cols1].ne(0).any(axis=1)
m2 = df[cols2].ne(0).any(axis=1)

df1 = df.loc[m1, cols1]
df2 = df.loc[m2, cols2]

df['TOP_COUNT_2'] = df1.eq(df1.max(axis=1), axis=0).dot(df1.columns + ', ').str[:-2]
df['TOP_SUM_2'] = df2.eq(df2.max(axis=1), axis=0).dot(df2.columns + ', ').str[:-2]
print (df)
    ID  COUNT_COL_A  COUNT_COL_B  SUM_COL_A  SUM_COL_B  
0  111           10           10        320        120   
1  222           15           80        500        500   
2  333            0            0        110        350   
3  444           20            5          0          0   
4  555            0            0          0          0   
5  666           10           20         60         50   

                TOP_COUNT_2             TOP_SUM_2  
0  COUNT_COL_A, COUNT_COL_B             SUM_COL_A  
1               COUNT_COL_B  SUM_COL_A, SUM_COL_B  
2                       NaN             SUM_COL_B  
3               COUNT_COL_A                   NaN  
4                       NaN                   NaN  
5               COUNT_COL_B             SUM_COL_A  
Answered By: jezrael
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.