Get first and second highest values in pandas columns

Question:

I am using pandas to analyse some election results. I have a DF, Results, which has a row for each constituency and columns representing the votes for the various parties (over 100 of them):

In[60]: Results.columns
Out[60]: 
Index(['Constituency', 'Region', 'Country', 'ID', 'Type', 'Electorate',
       'Total', 'Unnamed: 9', '30-50', 'Above',
       ...
       'WP', 'WRP', 'WVPTFP', 'Yorks', 'Young', 'Zeb', 'Party', 'Votes',
       'Share', 'Turnout'],
      dtype='object', length=147) 

So…

In[63]: Results.head()
Out[63]: 
                         Constituency    Region   Country         ID    Type  
PAID                                                                           
1                            Aberavon     Wales     Wales  W07000049  County   
2                           Aberconwy     Wales     Wales  W07000058  County   
3                      Aberdeen North  Scotland  Scotland  S14000001   Burgh   
4                      Aberdeen South  Scotland  Scotland  S14000002   Burgh   
5     Aberdeenshire West & Kincardine  Scotland  Scotland  S14000058  County   

      Electorate  Total  Unnamed: 9  30-50  Above    ...     WP  WRP  WVPTFP  
PAID                                                 ...                       
1          49821  31523         NaN    NaN    NaN    ...    NaN  NaN     NaN   
2          45525  30148         NaN    NaN    NaN    ...    NaN  NaN     NaN   
3          67745  43936         NaN    NaN    NaN    ...    NaN  NaN     NaN   
4          68056  48551         NaN    NaN    NaN    ...    NaN  NaN     NaN   
5          73445  55196         NaN    NaN    NaN    ...    NaN  NaN     NaN   

      Yorks  Young  Zeb  Party  Votes     Share   Turnout  
PAID                                                       
1       NaN    NaN  NaN    Lab  15416  0.489040  0.632725  
2       NaN    NaN  NaN    Con  12513  0.415052  0.662230  
3       NaN    NaN  NaN    SNP  24793  0.564298  0.648550  
4       NaN    NaN  NaN    SNP  20221  0.416490  0.713398  
5       NaN    NaN  NaN    SNP  22949  0.415773  0.751528  

[5 rows x 147 columns]

The per-constituency results for each party are given in the columns Results.ix[:, 'Unnamed: 9': 'Zeb']

I can find the winning party (i.e. the party which polled highest number of votes) and the number of votes it polled using:

RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb']
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = RawResults.max(axis=1).astype(int)

But, I also need to know how many votes the second-place party got (and ideally its index/name). So is there any way in pandas to return the second highest value/index in a set of columns for each row?

Asked By: TimGJ

||

Answers:

You could just sort your results, such that the first rows will contain the max. Then you can simply use indexing to get the first n places.

RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb'].sort_values(by='votes', ascending=False)
RawResults.iloc[0, :] # First place
RawResults.iloc[1, :] # Second place
RawResults.iloc[n, :] # nth place
Answered By: Kartik

Here is a NumPy solution:

In [120]: df
Out[120]:
          a         b         c         d         e         f         g         h
0  1.334444  0.322029  0.302296 -0.841236 -0.360488 -0.860188 -0.157942  1.522082
1  2.056572  0.991643  0.160067 -0.066473  0.235132  0.533202  1.282371 -2.050731
2  0.955586 -0.966734  0.055210 -0.993924 -0.553841  0.173793 -0.534548 -1.796006
3  1.201001  1.067291 -0.562357 -0.794284 -0.554820 -0.011836  0.519928  0.514669
4 -0.243972 -0.048144  0.498007  0.862016  1.284717 -0.886455 -0.757603  0.541992
5  0.739435 -0.767399  1.574173  1.197063 -1.147961 -0.903858  0.011073 -1.404868
6 -1.258282 -0.049719  0.400063  0.611456  0.443289 -1.110945  1.352029  0.215460
7  0.029121 -0.771431 -0.285119 -0.018216  0.408425 -1.458476 -1.363583  0.155134
8  1.427226 -1.005345  0.208665 -0.674917  0.287929 -1.259707  0.220420 -1.087245
9  0.452589  0.214592 -1.875423  0.487496  2.411265  0.062324 -0.327891  0.256577

In [121]: np.sort(df.values)[:,-2:]
Out[121]:
array([[ 1.33444404,  1.52208164],
       [ 1.28237078,  2.05657214],
       [ 0.17379254,  0.95558613],
       [ 1.06729107,  1.20100071],
       [ 0.86201603,  1.28471676],
       [ 1.19706331,  1.57417327],
       [ 0.61145573,  1.35202868],
       [ 0.15513379,  0.40842477],
       [ 0.28792928,  1.42722604],
       [ 0.48749578,  2.41126532]])

or as a pandas Data Frame:

In [122]: pd.DataFrame(np.sort(df.values)[:,-2:], columns=['2nd-largest','largest'])
Out[122]:
   2nd-largest   largest
0     1.334444  1.522082
1     1.282371  2.056572
2     0.173793  0.955586
3     1.067291  1.201001
4     0.862016  1.284717
5     1.197063  1.574173
6     0.611456  1.352029
7     0.155134  0.408425
8     0.287929  1.427226
9     0.487496  2.411265

or a faster solution from @Divakar:

In [6]: df
Out[6]:
          a         b         c         d         e         f         g         h
0  0.649517 -0.223116  0.264734 -1.121666  0.151591 -1.335756 -0.155459 -2.500680
1  0.172981  1.233523  0.220378  1.188080 -0.289469 -0.039150  1.476852  0.736908
2 -1.904024  0.109314  0.045741 -0.341214 -0.332267 -1.363889  0.177705 -0.892018
3 -2.606532 -0.483314  0.054624  0.979734  0.205173  0.350247 -1.088776  1.501327
4  1.627655 -1.261631  0.589899 -0.660119  0.742390 -1.088103  0.228557  0.714746
5  0.423972 -0.506975 -0.783718 -2.044002 -0.692734  0.980399  1.007460  0.161516
6 -0.777123 -0.838311 -1.116104 -0.433797  0.599724 -0.884832 -0.086431 -0.738298
7  1.131621  1.218199  0.645709  0.066216 -0.265023  0.606963 -0.194694  0.463576
8  0.421164  0.626731 -0.547738  0.989820 -1.383061 -0.060413 -1.342769 -0.777907
9 -1.152690  0.696714 -0.155727 -0.991975 -0.806530  1.454522  0.788688  0.409516

In [7]: a = df.values

In [8]: a[np.arange(len(df))[:,None],np.argpartition(-a,np.arange(2),axis=1)[:,:2]]
Out[8]:
array([[ 0.64951665,  0.26473378],
       [ 1.47685226,  1.23352348],
       [ 0.17770473,  0.10931398],
       [ 1.50132666,  0.97973383],
       [ 1.62765464,  0.74238959],
       [ 1.00745981,  0.98039898],
       [ 0.5997243 , -0.0864306 ],
       [ 1.21819904,  1.13162068],
       [ 0.98982033,  0.62673128],
       [ 1.45452173,  0.78868785]])

To get the highest values of a column, you can use nlargest() :

df['High'].nlargest(2)

The above will give you the 2 highest values of column High.


You can also use nsmallest() to get the lowest values.

Answered By: Pedro Lobito

Here is a solution using nlargest function:

>>> df
    a   b   c
 0  4  20   2
 1  5  10   2
 2  3  40   5
 3  1  50  10
 4  2  30  15
>>> def give_largest(col,n):
...     largest = col.nlargest(n).reset_index(drop = True)
...     data = [x for x in largest]
...     index = [f'{i}_largest' for i in range(1,len(largest)+1)]
...     return pd.Series(data,index=index)
...
...
>>> def n_largest(df, axis, n):
...     '''
...     Function to return the n-largest value of each
...     column/row of the input DataFrame.
...     '''
...     return df.apply(give_largest, axis = axis, n = n)
...
>>> n_largest(df,axis = 1, n = 2)
   1_largest  2_largest
0         20          4
1         10          5
2         40          5
3         50         10
4         30         15
>>> n_largest(df,axis = 0, n = 2)
                  a           b           c     
1_largest         5          50           15
2_largest         4          40           10
Answered By: Darsh Shukla

Here is an interesting approach. What if we replace the maximum value with the minimum value and calculate. Although it is a quick hack and, not recommended!

first_highest_value_index = df.idxmax()
second_highest_value_index = df.replace(df.max(),df(min)).idxmax()

first_highest_value = df[first_highest_value_index]
second_highest_value = df[second_highest_value_index]
Answered By: Amit Ghosh
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})
def second_largest(df):
    
    return (df.nlargest(2).min())




print(df.apply(second_largest))
a     4
b    40
c    20
dtype: int64
Answered By: Felix Le
df
           a           b           c            d           e           f           g          h
0   1.334444    0.322029    0.302296    -0.841236   -0.360488   -0.860188   -0.157942   1.522082
1   2.056572    0.991643    0.160067    -0.066473   0.235132    0.533202    1.282371    -2.050731
2   0.955586    -0.966734   0.055210    -0.993924   -0.553841   0.173793    -0.534548   -1.796006
3   1.201001    1.067291    -0.562357   -0.794284   -0.554820   -0.011836   0.519928    0.514669
4   -0.243972   -0.048144   0.498007    0.862016    1.284717    -0.886455   -0.757603   0.541992
5   0.739435    -0.767399   1.574173    1.197063    -1.147961   -0.903858   0.011073    -1.404868
6   -1.258282   -0.049719   0.400063    0.611456    0.443289    -1.110945   1.352029    0.215460
7   0.029121    -0.771431   -0.285119   -0.018216   0.408425    -1.458476   -1.363583   0.155134
8   1.427226    -1.005345   0.208665    -0.674917   0.287929    -1.259707   0.220420    -1.087245
9   0.452589    0.214592    -1.875423   0.487496    2.411265    0.062324    -0.327891   0.256577

tranpose and use nlargest in a for loop to get the results order by each line:

df1=df.T
results=list()
for col in df1.columns: results.append(df1[col].nlargest(len(df.columns))

the results var is a list of pandas objects, where the first item on the list will be the df’s first row sorted in descending order and so on. Since each item on the list is a pandas object, it carries df’s column as index (it was transposed), so you will get the values and the df’s columns name of each row sorted

results
[h    1.522082
 a    1.334444
 b    0.322029
 c    0.302296
 g   -0.157942
 e   -0.360488
 d   -0.841236
 f   -0.860188
 Name: 0, dtype: float64,
 a    2.056572
 g    1.282371
 b    0.991643
 f    0.533202
 e    0.235132
 c    0.160067
 d   -0.066473
 h   -2.050731
 Name: 1, dtype: float64,
....
Answered By: Fernando

Since the sort is applied to every column, the index labels will be dropped anyway. We can use numpy to achieve this:

a=np.random.random([10,4])
a
Out[10]: 
array([[0.11965879, 0.73168303, 0.18521419, 0.13992709],
       [0.8758143 , 0.13136781, 0.22071495, 0.9369399 ],
       [0.9190763 , 0.14320333, 0.66614619, 0.60688266],
       [0.91973194, 0.25297882, 0.02097999, 0.93834247],
       [0.70570349, 0.92229061, 0.64739799, 0.21614292],
       [0.70386848, 0.58542997, 0.78042638, 0.65968369],
       [0.37626386, 0.32668468, 0.42756616, 0.00938118],
       [0.85126718, 0.76305889, 0.03047206, 0.4874788 ],
       [0.88951106, 0.65035676, 0.98210505, 0.35605393],
       [0.96369325, 0.79311159, 0.81995022, 0.10588205]])
np.sort(a, axis=0)

We have output:

array([[0.11965879, 0.13136781, 0.02097999, 0.00938118],
       [0.37626386, 0.14320333, 0.03047206, 0.10588205],
       [0.70386848, 0.25297882, 0.18521419, 0.13992709],
       [0.70570349, 0.32668468, 0.22071495, 0.21614292],
       [0.85126718, 0.58542997, 0.42756616, 0.35605393],
       [0.8758143 , 0.65035676, 0.64739799, 0.4874788 ],
       [0.88951106, 0.73168303, 0.66614619, 0.60688266],
       [0.9190763 , 0.76305889, 0.78042638, 0.65968369],
       [0.91973194, 0.79311159, 0.81995022, 0.9369399 ],
       [0.96369325, 0.92229061, 0.98210505, 0.93834247]])

Here each column is sorted and you can slice the matrix to get the desired values.

Answered By: Feng Bao
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.