Find the column name of the second largest value of each row in a Pandas DataFrame
Question:
I am trying to find the column names associated with the largest and second largest values in each row of a DataFrame. Here’s a simplified example (the real one has over 500 columns):
Date val1 val2 val3 val4
1990 5 7 1 10
1991 2 1 10 3
1992 10 9 6 1
1993 50 10 2 15
1994 1 15 7 8
Needs to become:
Date 1larg 2larg
1990 val4 val2
1991 val3 val4
1992 val1 val2
1993 val1 val4
1994 val2 val4
I can find the column name with the largest value (i.e., 1larg above) with idxmax, but how can I find the second largest?
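For reference, here is a minimal sketch of the sample data and the idxmax step mentioned above. How `df` is constructed is an assumption (the real data presumably comes from a file), and `value_cols` and `largest` are illustrative names.

```python
import pandas as pd

# Sample data from the question; building it from a dict is an assumption.
df = pd.DataFrame(
    {
        "Date": [1990, 1991, 1992, 1993, 1994],
        "val1": [5, 2, 10, 50, 1],
        "val2": [7, 1, 9, 10, 15],
        "val3": [1, 10, 6, 2, 7],
        "val4": [10, 3, 1, 15, 8],
    }
)

# Every column except Date holds values to compare.
value_cols = df.columns.drop("Date")

# idxmax(axis=1) returns, for each row, the label of the column holding the maximum.
largest = df[value_cols].idxmax(axis=1)
print(largest.tolist())
```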
Answers:
(You don’t have any duplicate maximum values in your rows, so I’ll guess that if you have [1,1,2,2] you want val3 and val4 to be selected.)
One way would be to use the result of argsort as an index into an array of the column names.
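import numpy as np
import pandas as pd
df = df.set_index("Date")
# positions of each row's values, in ascending order of value
arank = df.apply(np.argsort, axis=1)
# reverse to descending order, keep the top two, and look up the column names
ranked_cols = df.columns.to_numpy()[arank.to_numpy()[:, ::-1][:, :2]]
new_frame = pd.DataFrame(ranked_cols, index=df.index)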
produces
0 1
Date
1990 val4 val2
1991 val3 val4
1992 val1 val2
1993 val1 val4
1994 val2 val4
1995 val4 val3
(where I’ve added an extra 1995 [1,1,2,2] row.)
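A self-contained sketch of the argsort idea, run on the question's data plus the extra 1995 row. It sorts the underlying NumPy array directly instead of using `apply`. Passing `kind="stable"` is an addition here, made so that the order of tied values is deterministic. The output column names `1larg` and `2larg` are taken from the question.

```python
import numpy as np
import pandas as pd

# Question's data plus the tied 1995 row; Date becomes the index.
df = pd.DataFrame(
    {
        "Date": [1990, 1991, 1992, 1993, 1994, 1995],
        "val1": [5, 2, 10, 50, 1, 1],
        "val2": [7, 1, 9, 10, 15, 1],
        "val3": [1, 10, 6, 2, 7, 2],
        "val4": [10, 3, 1, 15, 8, 2],
    }
).set_index("Date")

# Column positions sorted by ascending value within each row.
ascending = np.argsort(df.to_numpy(), axis=1, kind="stable")

# Reverse each row to get descending order and keep the first two positions.
top2_pos = ascending[:, ::-1][:, :2]

# Indexing the array of column names with a 2-D integer array
# returns a 2-D array of names with the same shape.
top2 = pd.DataFrame(
    df.columns.to_numpy()[top2_pos],
    index=df.index,
    columns=["1larg", "2larg"],
)
print(top2)
```

Because the ascending order is reversed, tied values come out with the right-most column first, which is why the 1995 row gives val4 and then val3.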
Alternatively, you could probably melt the frame into a flat (long) format, pick out the largest two values in each Date group, and then pivot it back again.
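The melt route could look roughly like the sketch below. It is one possible reading of that suggestion, not a tested recipe from the answer. The names `long_df`, `col`, `value` and `rank` are illustrative, and the sample rows contain no ties, so tie order is not addressed here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": [1990, 1991, 1992, 1993, 1994],
        "val1": [5, 2, 10, 50, 1],
        "val2": [7, 1, 9, 10, 15],
        "val3": [1, 10, 6, 2, 7],
        "val4": [10, 3, 1, 15, 8],
    }
)

# Long format: one row per (Date, column name, value).
long_df = df.melt(id_vars="Date", var_name="col", value_name="value")

# Within each Date, put the largest values first and keep the top two rows.
long_df = long_df.sort_values(["Date", "value"], ascending=[True, False])
top2 = long_df.groupby("Date").head(2).copy()

# Label the kept rows by their position within each Date group.
top2["rank"] = top2.groupby("Date").cumcount().map({0: "1larg", 1: "2larg"})

# Back to wide format: one row per Date, one column per rank.
result = top2.pivot(index="Date", columns="rank", values="col")
print(result)
```

This does more work than the argsort version, so with 500 columns it is likely to be slower, but it may be easier to extend if the values themselves are also needed.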
We could use idxmax to find the column name of the highest value for each row; then mask the highest value in each row and use idxmax again to find the column names of the second highest values:
g = df.filter(like='val')
df['1larg'] = g.idxmax(axis=1)
df['2larg'] = g.mask(g.eq(g.max(axis=1), axis=0)).idxmax(axis=1)
Note that this works only if each row has a unique highest value. If a row's maximum is tied, its second highest value equals its highest value, and masking every maximum hides both, so the above method won’t work. In that case, use the code below, where we mask only the first occurrence of the max value in each row:
g = df.filter(like='val')
df['1larg'] = g.idxmax(axis=1)
df['2larg'] = g.mask(g.eq(g.max(axis=1), axis=0) & g.apply(lambda x: ~x.duplicated(), axis=1)).idxmax(axis=1)
Output:
Date val1 val2 val3 val4 1larg 2larg
0 1990 5 7 1 10 val4 val2
1 1991 2 1 10 3 val3 val4
2 1992 10 9 6 1 val1 val2
3 1993 50 10 2 15 val1 val4
4 1994 1 15 7 8 val2 val4
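To make the difference between the two masking variants concrete, here is a small sketch on two rows: the 1990 row from the question and a tied [1,1,2,2] row. The tied row and the variable names `is_max`, `first_occurrence`, `second_all` and `second_first` are illustrative additions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": [1990, 1995],
        "val1": [5, 1],
        "val2": [7, 1],
        "val3": [1, 2],
        "val4": [10, 2],
    }
)
g = df.filter(like="val")

# True wherever a cell equals its row's maximum.
is_max = g.eq(g.max(axis=1), axis=0)

# Variant 1: hide every cell equal to the row maximum, then take idxmax.
second_all = g.mask(is_max).idxmax(axis=1)

# Variant 2: hide only the first occurrence of the row maximum.
first_occurrence = g.apply(lambda row: ~row.duplicated(), axis=1)
second_first = g.mask(is_max & first_occurrence).idxmax(axis=1)

print(second_all.tolist())
print(second_first.tolist())
```

For the tied row, the first variant hides both val3 and val4 and falls through to val1, while the second variant hides only val3 and reports val4.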
What worked for me:
import numpy as np
import pandas as pd
def flatten(l):
    # turn a list of lists into a single flat list
    return [item for sublist in l for item in sublist]
df = df.set_index("Date")
arank = df.apply(np.argsort, axis=1)
ranked_cols = df.columns.to_numpy()[arank.to_numpy()[:, ::-1][:, :2]]
new_frame = pd.DataFrame(ranked_cols, index=df.index)
# highest value (.iloc selects by position)
first_val = df.columns.to_series().iloc[flatten(arank.values[:, ::-1][:, :1].tolist())].reset_index()
# 2nd value
sec_val = df.columns.to_series().iloc[flatten(arank.values[:, ::-1][:, 1:2].tolist())].reset_index()
# just pretty names
first_val.columns = ['first_cat1', 'first_cat2']
sec_val.columns = ['sec_cat1', 'sec_cat2']
# combine into new df with both columns
new_df = pd.concat([first_val['first_cat1'], sec_val['sec_cat2']], axis=1)