Find the column name of the second largest value of each row in a Pandas DataFrame

Question:

I am trying to find the column names associated with the largest and second-largest values in each row of a DataFrame. Here’s a simplified example (the real one has over 500 columns):

Date  val1  val2  val3  val4
1990  5     7     1     10
1991  2     1     10    3
1992  10    9     6     1
1993  50    10    2     15
1994  1     15    7     8

Needs to become:

Date  1larg   2larg
1990  val4    val2
1991  val3    val4
1992  val1    val2
1993  val1    val4
1994  val2    val4

I can find the column name with the largest value (i.e., 1larg above) with idxmax, but how can I find the second largest?

Asked By: AtotheSiv

Answers:

(You don’t have any duplicate maximum values in your rows, so I’ll guess that if you have [1,1,2,2] you want val3 and val4 to be selected.)

One way would be to use the result of argsort as an index into a Series with the column names.

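import numpy as np
import pandas as pd
df = df.set_index("Date")
# Column positions sorted by ascending value within each row
arank = df.apply(np.argsort, axis=1)
# Reverse to descending order, keep the top two positions, and look up the names
ranked_cols = df.columns.to_numpy()[arank.values[:, ::-1][:, :2]]
new_frame = pd.DataFrame(ranked_cols, index=df.index)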

produces

         0     1
Date            
1990  val4  val2
1991  val3  val4
1992  val1  val2
1993  val1  val4
1994  val2  val4
1995  val4  val3

(where I’ve added an extra 1995 [1,1,2,2] row.)
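
To make the argsort indexing concrete, here is a minimal sketch that traces a single row (the 1990 row from the question) through the same steps. The names cols, row, order and top_two are introduced only for this illustration.

```python
import numpy as np

cols = np.array(["val1", "val2", "val3", "val4"])
row = np.array([5, 7, 1, 10])  # the 1990 row from the question

order = np.argsort(row)    # column positions from smallest to largest value
top_two = order[::-1][:2]  # reverse to descending order, keep the first two

print(order.tolist())          # [2, 0, 1, 3]
print(cols[top_two].tolist())  # ['val4', 'val2']
```

The DataFrame version does the same thing for every row at once, which is why the reversed, sliced array can index the array of column names directly.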

Alternatively, you could probably melt the frame into a long format, pick out the two largest values in each Date group, and then pivot the result back to a wide format.
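
A rough sketch of that melt-based route, assuming the sample data from the question, might look like the following. The names long_df, top2 and result, and the helper columns col, value and rank, are illustrative only and not part of the original answer. How ties are ordered depends on the sort, so treat the tie behaviour as unspecified.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [1990, 1991, 1992, 1993, 1994],
    "val1": [5, 2, 10, 50, 1],
    "val2": [7, 1, 9, 10, 15],
    "val3": [1, 10, 6, 2, 7],
    "val4": [10, 3, 1, 15, 8],
})

# Long format: one row per (Date, column name, value)
long_df = df.melt(id_vars="Date", var_name="col", value_name="value")

# Largest values first within each Date, then keep the top two rows per Date
top2 = (
    long_df.sort_values(["Date", "value"], ascending=[True, False])
           .groupby("Date")
           .head(2)
           .copy()
)
# Position within each Date group: 0 for the largest, 1 for the second largest
top2["rank"] = top2.groupby("Date").cumcount()

# Back to wide format: one row per Date, one column per rank
result = top2.pivot(index="Date", columns="rank", values="col")
result.columns = ["1larg", "2larg"]
print(result)
```

With over 500 columns this builds a much longer intermediate frame than the argsort approach, so it is probably better suited to cases where the long format is needed anyway.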

Answered By: DSM

We could use idxmax to find the column name of the highest value in each row, then mask the highest value in each row and use idxmax again to find the column name of the second-highest value:

g = df.filter(like='val')
df['1larg'] = g.idxmax(axis=1)
df['2larg'] = g.mask(g.eq(g.max(axis=1), axis=0)).idxmax(axis=1)

Note that this works only if each row has a unique highest value. If a row’s maximum is tied, the second-highest value equals the highest, and masking every occurrence of the maximum discards it, so the method above returns the wrong column. In that case, use the code below, which masks only the first occurrence of the maximum in each row:

g = df.filter(like='val')
df['1larg'] = g.idxmax(axis=1)
df['2larg'] = g.mask(g.eq(g.max(axis=1), axis=0) & g.apply(lambda x: ~x.duplicated(), axis=1)).idxmax(axis=1)

Output:

   Date  val1  val2  val3  val4 1larg 2larg
0  1990     5     7     1    10  val4  val2
1  1991     2     1    10     3  val3  val4
2  1992    10     9     6     1  val1  val2
3  1993    50    10     2    15  val1  val4
4  1994     1    15     7     8  val2  val4
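
The sample data has no tied maxima, so both variants give the same output there. As a small sketch of where they differ, consider a hypothetical extra row [1, 1, 2, 2]; the names row_max, all_masked, first_only and first_masked are introduced just for this example.

```python
import pandas as pd

# Hypothetical row with a tied maximum: val3 and val4 are both 2
df = pd.DataFrame({"Date": [1995], "val1": [1], "val2": [1], "val3": [2], "val4": [2]})
g = df.filter(like="val")
row_max = g.max(axis=1)

# Variant 1: masking every maximum removes both tied columns
all_masked = g.mask(g.eq(row_max, axis=0)).idxmax(axis=1)

# Variant 2: masking only the first occurrence leaves the other tied column in place
first_only = g.eq(row_max, axis=0) & g.apply(lambda x: ~x.duplicated(), axis=1)
first_masked = g.mask(first_only).idxmax(axis=1)

print(g.idxmax(axis=1).iloc[0])  # val3
print(all_masked.iloc[0])        # val1
print(first_masked.iloc[0])      # val4
```

The first variant falls through to val1, the next distinct value, while the second reports val4 as the runner-up.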

Answered By: user7864386

What worked for me:

import numpy as np
import pandas as pd

def flatten(l):
    return [item for sublist in l for item in sublist]

df = df.set_index("Date")
arank = df.apply(np.argsort, axis=1)
# Index a NumPy array of names; a Series cannot be indexed with a 2-D array
ranked_cols = df.columns.to_numpy()[arank.values[:,::-1][:,:2]]
new_frame = pd.DataFrame(ranked_cols, index=df.index)

# highest value (.iloc makes the positional lookup explicit)
first_val = df.columns.to_series().iloc[flatten(arank.values[:,::-1][:,:1].tolist())].reset_index()

# 2nd value
sec_val = df.columns.to_series().iloc[flatten(arank.values[:,::-1][:,1:2].tolist())].reset_index()

# just pretty names
first_val.columns = ['first_cat1', 'first_cat2']
sec_val.columns = ['sec_cat1', 'sec_cat2']

#combine into new df with both columns
new_df = pd.concat([first_val['first_cat1'], sec_val['sec_cat2']], axis=1)

Answered By: Super Mario