Get Rankings of Column Names in Pandas Dataframe
Question:
I have pivoted the Customer ID against their most frequently purchased genres of performances:
Genre Jazz Dance Music Theatre
Customer
100000000001 0 3 1 2
100000000002 0 1 6 2
100000000003 0 3 13 4
100000000004 0 5 4 1
100000000005 1 10 16 14
My desired result is to append the column names according to the rankings:
Genre Jazz Dance Music Theatre Rank1 Rank2 Rank3 Rank4
Customer
100000000001 0 3 1 2 Dance Theatre Music Jazz
100000000002 0 1 6 2 Music Theatre Dance Jazz
100000000003 0 3 13 4 Music Theatre Dance Jazz
100000000004 0 5 4 1 Dance Music Theatre Jazz
100000000005 1 10 16 14 Music Theatre Dance Jazz
I have looked up some threads, but the closest thing I can find is idxmax. However, that only gives me Rank1.
Could anyone help me to get the result I need?
Thanks a lot!
Dennis
Answers:
Use:
i = np.argsort(df.to_numpy() * -1, axis=1)
r = pd.DataFrame(df.columns[i], index=df.index, columns=range(1, i.shape[1] + 1))
df = df.join(r.add_prefix('Rank'))
Details:
Use np.argsort along axis=1 to get the indices i that would sort the genres in descending order.
print(i)
array([[1, 3, 2, 0],
[2, 3, 1, 0],
[2, 3, 1, 0],
[1, 2, 3, 0],
[2, 3, 1, 0]])
Create a new dataframe r from the columns of dataframe df taken along the indices i (i.e. df.columns[i]), then use DataFrame.join to join the dataframe r with df:
print(df)
Jazz Dance Music Theatre Rank1 Rank2 Rank3 Rank4
Customer
100000000001 0 3 1 2 Dance Theatre Music Jazz
100000000002 0 1 6 2 Music Theatre Dance Jazz
100000000003 0 3 13 4 Music Theatre Dance Jazz
100000000004 0 5 4 1 Dance Music Theatre Jazz
100000000005 1 10 16 14 Music Theatre Dance Jazz
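One detail the answer above does not cover is ties: np.argsort uses a non-stable sort by default, so two genres with the same count may come out in either order. The sketch below is a self-contained variant that passes kind="stable" so tied columns keep their left-to-right order. The second row (customer 100000000099) is hypothetical data added only to show a tie, and the column names are converted to a NumPy array before indexing.

```python
import numpy as np
import pandas as pd

# First row is from the question; the second row is a hypothetical
# customer whose Jazz and Dance counts are tied.
df = pd.DataFrame(
    {"Jazz": [0, 2], "Dance": [3, 2], "Music": [1, 5], "Theatre": [2, 0]},
    index=pd.Index([100000000001, 100000000099], name="Customer"),
)

# Negate the counts so that an ascending argsort yields descending order.
# kind="stable" keeps tied columns in their original left-to-right order.
order = np.argsort(-df.to_numpy(), axis=1, kind="stable")

# Index a plain NumPy array of column names with the 2-D index array.
ranks = pd.DataFrame(
    df.columns.to_numpy()[order],
    index=df.index,
    columns=[f"Rank{k}" for k in range(1, order.shape[1] + 1)],
)

result = df.join(ranks)
print(result)
```

For the tied row, Jazz is listed before Dance because it appears first among the columns.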
Let’s try stack, cumcount and sort_values:
s = df.stack().sort_values(ascending=False).groupby(level=0).cumcount() + 1
s1 = (s.reset_index(1)
       .set_index(0, append=True)
       .unstack(1)
       .add_prefix("Rank")
      )
s1.columns = s1.columns.get_level_values(1)
Then join back on your customer genre index:
df.join(s1)
Jazz Dance Music Theatre Rank1 Rank2 Rank3 Rank4
Customer_Genre
100000000001 0 3 1 2 Dance Theatre Music Jazz
100000000002 0 1 6 2 Music Theatre Dance Jazz
100000000003 0 3 13 4 Music Theatre Dance Jazz
100000000004 0 5 4 1 Dance Music Theatre Jazz
100000000005 1 10 16 14 Music Theatre Dance Jazz
Try this:
dfp = (df.rank(ascending=False, axis=1).stack()
         .astype(int).rename('rank').reset_index(level=1))
df.assign(**dfp.set_index('rank', append=True)['Genre'].unstack().add_prefix('Rank'))
Output:
Genre Jazz Dance Music Theatre Rank1 Rank2 Rank3 Rank4
Customer
100000000001 0 3 1 2 Dance Theatre Music Jazz
100000000002 0 1 6 2 Music Theatre Dance Jazz
100000000003 0 3 13 4 Music Theatre Dance Jazz
100000000004 0 5 4 1 Dance Music Theatre Jazz
100000000005 1 10 16 14 Music Theatre Dance Jazz
Use rank and reshape the dataframe, then join back to the original dataframe using assign.
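A caveat with the rank approach: DataFrame.rank defaults to method='average', so tied counts get fractional ranks, and after astype(int) two genres in the same row can share a rank, which makes the reshape fail on duplicate entries. The sketch below passes method="first" so every genre in a row gets a distinct integer rank. It reshapes with melt and pivot rather than stack and unstack, and the second row is hypothetical data added to show a tie.

```python
import pandas as pd

# First row is from the question; the second row is a hypothetical
# customer whose Jazz and Dance counts are tied.
df = pd.DataFrame(
    {"Jazz": [0, 2], "Dance": [3, 2], "Music": [1, 5], "Theatre": [2, 0]},
    index=pd.Index([100000000001, 100000000099], name="Customer"),
)

# method="first" gives tied values distinct ranks, so the ranks in each
# row are a permutation of 1..n and can safely become column labels.
ranks = df.rank(axis=1, ascending=False, method="first").astype(int)

# Long format: one row per (Customer, Genre) pair with its rank.
long = ranks.reset_index().melt(
    id_vars="Customer", var_name="Genre", value_name="rank"
)

# Wide format: one column per rank, holding the genre name.
wide = long.pivot(index="Customer", columns="rank", values="Genre").add_prefix("Rank")

result = df.join(wide)
print(result)
```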
The np.argsort solution above works, but it now produces the following deprecation warning:
r = pd.DataFrame(df.columns[i], index=df.index, columns=range(1, i.shape[1] + 1))
FutureWarning: Support for multi-dimensional indexing (e.g. obj[:, None]) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
Revised:
r = pd.DataFrame(np.array(df.columns)[i], index=df.index, columns=range(1, i.shape[1] + 1))
Here is a function that improves the previous answers, considering the following:
- It solves the deprecation warning mentioned by Wally, by converting the df.columns into a numpy array before indexing them.
- It also allows including NaN values and avoids using those columns for the rank columns (leaving their values as NaN too). Check the example.
- It also adds the corresponding rank values to map them easily.
- Has an additional parameter in case you want to rank them in ascending or descending order.
- Adds an additional column specifying which columns had NaN values and were not included in the rank columns. Those values are added in a list.
# Example DataFrame
import numpy as np
import pandas as pd
dic = {'A': [0, np.nan, 2, np.nan],
       'B': [3, 0, 1, 5],
       'C': [1, 2, 0, np.nan]}
df = pd.DataFrame(dic)
print(df)
A B C
0 0.0 3 1.0
1 NaN 0 2.0
2 2.0 1 0.0
3 NaN 5 NaN
# Function
def fun_rank_columns(df, ascending=False):
    factor = 1 if ascending else -1
    # Rank columns showing ranking of column names
    np_sort = np.argsort(df.to_numpy() * factor, axis=1)
    df_rank = pd.DataFrame(np.array(df.columns)[np_sort], index=df.index, columns=range(1, np_sort.shape[1] + 1))
    # Corresponding values for each rank column
    np_sort_value = np.sort(df.to_numpy() * factor, axis=1)
    df_rank_value = pd.DataFrame(np_sort_value, index=df.index, columns=range(1, np_sort_value.shape[1] + 1)) * factor
    # Columns with nan values to be replaced
    num_col_rank = df_rank.shape[1]
    df_rank['nan_value'] = df.apply(lambda row: [i for i in df.columns if np.isnan(row[i])], axis=1)
    for col in range(1, num_col_rank + 1):
        condition = df_rank.apply(lambda x: x[col] in x['nan_value'], axis=1)
        df_rank.loc[condition, col] = np.nan
        df_rank_value.loc[condition, col] = np.nan
    # Join Results
    df_rank = df_rank.add_prefix('rank_')
    df_rank_value = df_rank_value.add_prefix('rank_value_')
    df_res = df_rank.join(df_rank_value)
    return df_res
# Apply the function
df_res = fun_rank_columns(df, ascending=True)
print(df_res)
rank_1 rank_2 rank_3 rank_nan_value rank_value_1 rank_value_2 rank_value_3
0 A C B [] 0.0 1.0 3.0
1 B C NaN [A] 0.0 2.0 NaN
2 C B A [] 0.0 1.0 2.0
3 B NaN NaN [A, C] 5.0 NaN NaN
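The NaN handling above goes row by row with apply. As a possible alternative, here is a minimal sketch that masks the NaN positions in one vectorized step, relying on np.argsort placing NaN last. The function name rank_columns_masked is hypothetical, and it leaves out the nan_value list column from the function above.

```python
import numpy as np
import pandas as pd


def rank_columns_masked(df, ascending=False):
    """Rank column names per row; positions that held NaN stay NaN."""
    values = df.to_numpy(dtype=float)
    factor = 1 if ascending else -1

    # NaN sorts last in either direction, because NaN * -1 is still NaN.
    order = np.argsort(values * factor, axis=1, kind="stable")

    # Reorder the original values row by row using the same indices.
    sorted_values = np.take_along_axis(values, order, axis=1)

    # Object dtype so that NaN can sit next to string column names.
    names = df.columns.to_numpy().astype(object)[order]
    names[np.isnan(sorted_values)] = np.nan

    cols = range(1, values.shape[1] + 1)
    rank_names = pd.DataFrame(names, index=df.index, columns=cols).add_prefix("rank_")
    rank_values = pd.DataFrame(sorted_values, index=df.index, columns=cols).add_prefix("rank_value_")
    return rank_names.join(rank_values)


df = pd.DataFrame({"A": [0, np.nan, 2, np.nan],
                   "B": [3, 0, 1, 5],
                   "C": [1, 2, 0, np.nan]})
res = rank_columns_masked(df, ascending=True)
print(res)
```

On the example DataFrame this gives the same rank_ and rank_value_ columns as the output shown above.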