Find the column name which has the maximum value for each row

Question

I have a DataFrame like this one:

Communications and Search   Business    General Lifestyle
0   0.745763    0.050847    0.118644    0.084746
0   0.333333    0.000000    0.583333    0.083333
0   0.617021    0.042553    0.297872    0.042553
0   0.435897    0.000000    0.410256    0.153846
0   0.358974    0.076923    0.410256    0.153846

I want to get the column name which has maximum value for each row. The desired output is like this:

Communications and Search   Business    General Lifestyle  Max
0   0.745763    0.050847    0.118644    0.084746           Communications 
0   0.333333    0.000000    0.583333    0.083333           Business  
0   0.617021    0.042553    0.297872    0.042553           Communications 
0   0.435897    0.000000    0.410256    0.153846           Communications 
0   0.358974    0.076923    0.410256    0.153846           Business

Asked By: markov zain

||

Source

Answer 1

You could apply on dataframe and get argmax() of each row via axis=1

In [144]: df.apply(lambda x: x.argmax(), axis=1)
Out[144]:
0    Communications
1          Business
2    Communications
3    Communications
4          Business
dtype: object

Here’s a benchmark to compare how slow apply method is to idxmax() for len(df) ~ 20K

In [146]: %timeit df.apply(lambda x: x.argmax(), axis=1)
1 loops, best of 3: 479 ms per loop

In [147]: %timeit df.idxmax(axis=1)
10 loops, best of 3: 47.3 ms per loop

Answered By: Zero

Answer 2

You can use idxmax with axis=1 to find the column with the greatest value on each row:

>>> df.idxmax(axis=1)
0    Communications
1          Business
2    Communications
3    Communications
4          Business
dtype: object

To create the new column ‘Max’, use df['Max'] = df.idxmax(axis=1).

To find the row index at which the maximum value occurs in each column, use df.idxmax() (or equivalently df.idxmax(axis=0)).

Answered By: Alex Riley

Answer 3

And if you want to produce a column containing the name of the column with the maximum value but considering only a subset of columns then you use a variation of @ajcr’s answer:

df['Max'] = df[['Communications','Business']].idxmax(axis=1)

Answered By: user1718097

Answer 4

Another solution is to flag the position of the maximum values of each row and get the corresponding column names. In particular, this solution works well if multiple columns contain the maximum value for some rows and you want to return all column names with the maximum value for each row:¹

Code:

# look for the max values in each row
mxs = df.eq(df.max(axis=1), axis=0)
# join the column names of the max values of each row into a single string
df['Max'] = mxs.dot(mxs.columns + ', ').str.rstrip(', ')

A slight variation: If you want to pick one column randomly when multiple columns contain the maximum value:

Code:

mxs = df.eq(df.max(axis=1), axis=0)
df['Max'] = mxs.where(mxs).stack().groupby(level=0).sample(n=1).index.get_level_values(1)

You can also do this for specific columns by selecting the columns:

# for column names of max value of each row
cols = ['Communications', 'Search', 'Business']
mxs = df[cols].eq(df[cols].max(axis=1), axis=0)
df['max among cols'] = mxs.dot(mxs.columns + ', ').str.rstrip(', ')

^{1: idxmax(1) returns only the first column name with the max value if the max value is the same for multiple columns, which may not be desirable depending on the use case. This solution generalizes idxmax(1); in particular, if the max values are unique in each row, it matches the idxmax(1) solution.}

Answered By: cottontail

Answer 5

Using numpy argmax is blazing fast. I’ve tested in a dataframe with 3,744,965 rows and it takes 103ms.

%timeit df.idxmax(axis=1)
7.67 s ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.columns[df.to_numpy().argmax(axis=1)]
103 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Answered By: emada

Find the column name which has the maximum value for each row

Question:

Answers: