Get the column names for 2nd largest value for each row in a Pandas dataframe

Question:

Say I have such Pandas dataframe

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

so df looks like:

print(df)
   a   b   c
0  4  20  25
1  5  10  20
2  3  40   5
3  1  50  15
4  2  30  10

And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:

def second_largest(l = []):    
    return (l.nlargest(2).min())

print(df.apply(second_largest, axis = 1))

which gives me:

0    20
1    10
2     5
3    15
4    10
dtype: int64

But what I really want is the column names for those values, or to say:

0    b
1    b
2    c
3    c
4    c

Pandas has a function idxmax which can do the job for the largest value:

df.idxmax(axis = 1)

0    c
1    c
2    b
3    b
4    b
dtype: object

Is there any elegant way to do the same job but for the 2nd largest value?

Asked By: Xiaokang

||

Answers:

Use numpy.argsort for positions of second largest values:

df['new'] = df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
   a   b   c new
0  4  20  25   b
1  5  10  20   b
2  3  40   5   c
3  1  50  15   c
4  2  30  10   c

Your solution should working, but is slow:

def second_largest(l = []):    
    return (l.nlargest(2).idxmin())

print(df.apply(second_largest, axis = 1))
Answered By: jezrael

If efficiency is important, numpy.argpartition is quite efficient:

N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)

If you want a pure pandas (less efficient):

out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])

Output:

0    b
1    b
2    c
3    c
4    c
dtype: object
Answered By: mozway
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.