Get Rankings of Column Names in Pandas Dataframe

Question:

I have pivoted the Customer ID against their most frequently purchased genres of performances:

Genre            Jazz     Dance     Music  Theatre
Customer                                        
100000000001           0      3         1        2
100000000002           0      1         6        2
100000000003           0      3        13        4
100000000004           0      5         4        1
100000000005           1     10        16       14

My desired result is to append columns that list the genre names in order of rank:

Genre            Jazz     Dance     Music  Theatre          Rank1          Rank2          Rank3          Rank4
Customer                                         
100000000001           0      3         1        2          Dance        Theatre          Music           Jazz
100000000002           0      1         6        2          Music        Theatre          Dance           Jazz
100000000003           0      3        13        4          Music        Theatre          Dance           Jazz
100000000004           0      5         4        1          Dance          Music        Theatre           Jazz
100000000005           1     10        16       14          Music        Theatre          Dance           Jazz

I have looked through some threads, but the closest thing I can find is idxmax. However, that only gives me Rank1.

Could anyone help me to get the result I need?

Thanks a lot!

Dennis

Asked By: dendoniseden


Answers:

Use:

i = np.argsort(df.to_numpy() * -1, axis=1)
r = pd.DataFrame(df.columns[i], index=df.index, columns=range(1, i.shape[1] + 1)) 
df = df.join(r.add_prefix('Rank'))

Details:

Use np.argsort along axis=1 to get the indices i that would sort the genres in descending order. Because argsort sorts in ascending order, the values are multiplied by -1 first.

print(i)
array([[1, 3, 2, 0],
       [2, 3, 1, 0],
       [2, 3, 1, 0],
       [1, 2, 3, 0],
       [2, 3, 1, 0]])
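
To reproduce this index array without the original pivot table, here is a minimal sketch that rebuilds the sample values by hand. The names `values` and `order` are illustrative, and `kind="stable"` is an addition that is not in the answer: it keeps tied genres in their left-to-right column order.

```python
import numpy as np

# Purchase counts from the question; columns are Jazz, Dance, Music, Theatre.
values = np.array([
    [0, 3, 1, 2],
    [0, 1, 6, 2],
    [0, 3, 13, 4],
    [0, 5, 4, 1],
    [1, 10, 16, 14],
])

# argsort sorts ascending, so negate the values to get descending order.
# kind="stable" (an assumption added here) makes tie-breaking deterministic.
order = np.argsort(-values, axis=1, kind="stable")
print(order)
```

Each row of `order` holds column positions, so position 1 (Dance) coming first in the first row means Dance is that customer's top genre.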

Create a new dataframe r from the columns of dataframe df taken along the indices i (i.e. df.columns[i]), then use DataFrame.join to join the dataframe r with df:

print(df)
              Jazz  Dance  Music  Theatre  Rank1    Rank2    Rank3 Rank4
Customer                                                                
100000000001     0      3      1        2  Dance  Theatre    Music  Jazz
100000000002     0      1      6        2  Music  Theatre    Dance  Jazz
100000000003     0      3     13        4  Music  Theatre    Dance  Jazz
100000000004     0      5      4        1  Dance    Music  Theatre  Jazz
100000000005     1     10     16       14  Music  Theatre    Dance  Jazz

Answered By: Shubham Sharma

Let’s try stack, cumcount and sort_values:

s = df.stack().sort_values(ascending=False).groupby(level=0).cumcount() + 1
s1 = (s.reset_index(1)
       .set_index(0, append=True)
       .unstack(1)
       .add_prefix("Rank"))
s1.columns = s1.columns.get_level_values(1)

Then join the result back to the original dataframe on the customer index:

df.join(s1)

                 Jazz  Dance  Music  Theatre  Rank1    Rank2    Rank3 Rank4
Customer_Genre                                                            
100000000001       0      3      1        2  Dance  Theatre    Music  Jazz
100000000002       0      1      6        2  Music  Theatre    Dance  Jazz
100000000003       0      3     13        4  Music  Theatre    Dance  Jazz
100000000004       0      5      4        1  Dance    Music  Theatre  Jazz
100000000005       1     10     16       14  Music  Theatre    Dance  Jazz

Answered By: Umar.H

Try this:

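# method='first' keeps the ranks unique when a row contains tied values;
# the default ('average') would give tied genres the same truncated rank.
dfp = (df.rank(ascending=False, axis=1, method='first').stack()
         .astype(int).rename('rank').reset_index(level=1))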
df.assign(**dfp.set_index('rank', append=True)['Genre'].unstack().add_prefix('Rank'))

Output:

Genre         Jazz  Dance  Music  Theatre  Rank1    Rank2    Rank3 Rank4
Customer                                                                
100000000001     0      3      1        2  Dance  Theatre    Music  Jazz
100000000002     0      1      6        2  Music  Theatre    Dance  Jazz
100000000003     0      3     13        4  Music  Theatre    Dance  Jazz
100000000004     0      5      4        1  Dance    Music  Theatre  Jazz
100000000005     1     10     16       14  Music  Theatre    Dance  Jazz

Use rank to rank each row's values, reshape the result so the ranks become columns, then add those columns to the original dataframe using assign.
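
One thing to watch with a rank-based approach is tied values: with the default `method='average'`, two genres tied for first place both receive rank 1.5, so the ranks are no longer unique per row. The sketch below is an illustration rather than the answer's exact code: it uses a small made-up dataframe with one tie, `method='first'` to force unique ranks, and `melt`/`pivot` in place of `stack`/`unstack` for the reshape.

```python
import pandas as pd

# Hypothetical two-customer sample; customer 2 has a tie between Dance and Music.
df = pd.DataFrame(
    {"Jazz": [0, 0], "Dance": [3, 2], "Music": [1, 2], "Theatre": [2, 1]},
    index=pd.Index([1, 2], name="Customer"),
)
df.columns.name = "Genre"

# Default ranking averages tied positions, producing non-integer ranks.
avg_ranks = df.rank(axis=1, ascending=False)

# method="first" assigns distinct integer ranks even when values tie.
ranks = df.rank(axis=1, ascending=False, method="first").astype(int)

# Reshape: one row per (customer, genre, rank), then one column per rank.
long = ranks.reset_index().melt(id_vars="Customer", var_name="Genre", value_name="rank")
wide = long.pivot(index="Customer", columns="rank", values="Genre").add_prefix("Rank")

print(df.join(wide))
```

Which of the tied genres gets the better rank is decided by `method='first'`; if the tie-breaking rule matters for your data, pick it explicitly.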

Answered By: Scott Boston

The np.argsort solution above works, but this line now triggers a deprecation warning:

r = pd.DataFrame(df.columns[i], index=df.index, columns=range(1, i.shape[1] + 1))

FutureWarning: Support for multi-dimensional indexing (e.g. obj[:, None]) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.

Revised:

r = pd.DataFrame(np.array(df.columns)[i], index=df.index, columns=range(1, i.shape[1] + 1))
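
For reference, here is a self-contained sketch of the argsort approach with the column labels converted to a NumPy array before indexing. The dataframe is rebuilt by hand from the question's sample; `df.columns.to_numpy()` is used as an equivalent of `np.array(df.columns)`, and `kind="stable"` is an added assumption so that ties keep column order.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Jazz": [0, 0, 0, 0, 1],
        "Dance": [3, 1, 3, 5, 10],
        "Music": [1, 6, 13, 4, 16],
        "Theatre": [2, 2, 4, 1, 14],
    },
    index=pd.Index(range(100000000001, 100000000006), name="Customer"),
)
df.columns.name = "Genre"

# Column positions that sort each row in descending order.
i = np.argsort(-df.to_numpy(), axis=1, kind="stable")

# Index a plain NumPy array of labels, not the pandas Index itself.
labels = df.columns.to_numpy()[i]

r = pd.DataFrame(labels, index=df.index, columns=range(1, i.shape[1] + 1))
result = df.join(r.add_prefix("Rank"))
print(result)
```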

Answered By: Wally Cheung

Here is a function that builds on the previous answers, with the following changes:

  • It avoids the deprecation warning mentioned by Wally by converting df.columns into a numpy array before indexing.
  • It accepts NaN values and leaves those columns out of the rank columns (their rank entries are NaN too). See the example.
  • It also returns the value behind each rank, so names and values can be matched easily.
  • It has an additional parameter, ascending, to rank in ascending or descending order.
  • It adds a column listing which columns had NaN values in each row and were therefore left out of the ranking.

# Example DataFrame
import numpy as np
import pandas as pd

dic = {'A': [0, np.nan, 2, np.nan],
      'B': [3, 0, 1, 5],
      'C': [1, 2, 0, np.nan]}
df = pd.DataFrame(dic)
print(df)

     A  B    C
0  0.0  3  1.0
1  NaN  0  2.0
2  2.0  1  0.0
3  NaN  5  NaN

# Function
def fun_rank_columns(df, ascending=False):
    factor = 1 if ascending else -1
    # Rank columns showing ranking of column names
    np_sort = np.argsort(df.to_numpy() * factor, axis=1)
    df_rank = pd.DataFrame(np.array(df.columns)[np_sort], index=df.index, columns=range(1, np_sort.shape[1] + 1))
    
    # Corresponding values for each rank column
    np_sort_value = np.sort(df.to_numpy() * factor, axis=1)
    df_rank_value = pd.DataFrame(np_sort_value, index=df.index, columns=range(1, np_sort_value.shape[1] + 1)) * factor
    
    # Columns with nan values to be replaced
    num_col_rank = df_rank.shape[1]
    df_rank['nan_value'] = df.apply(lambda row: [i for i in df.columns if np.isnan(row[i])], axis=1)
    for col in range(1, num_col_rank + 1):
        condition = df_rank.apply(lambda x: x[col] in x['nan_value'], axis=1)
        df_rank.loc[condition, col] = np.nan
        df_rank_value.loc[condition, col] = np.nan

    # Join Results
    df_rank = df_rank.add_prefix('rank_')
    df_rank_value = df_rank_value.add_prefix('rank_value_')
    df_res = df_rank.join(df_rank_value)
    return df_res

# Apply the function
df_res = fun_rank_columns(df, ascending=True)
print(df_res)

  rank_1 rank_2 rank_3 rank_nan_value  rank_value_1  rank_value_2  rank_value_3
0      A      C      B             []           0.0           1.0           3.0
1      B      C    NaN            [A]           0.0           2.0           NaN
2      C      B      A             []           0.0           1.0           2.0
3      B    NaN    NaN         [A, C]           5.0           NaN           NaN