Return combination of pairs of columns in groupby with max values in another column

Question:

df = pd.DataFrame({'A':['AXQ00084.1', 'AXQ00134.1', 'AZI75768.1', 'AZI75768.1','AZI75801.1','AZI75801.1'],
                     'B':['AZI75768.1', 'AZI75768.1', 'AXQ00084.1', 'AXQ00134.1','AXQ00106.1','AXQ00107.1'],
                     'X': [607, 272, 595, 323,30,100],
                     'Y':[99.67213115, 83.23170732, 97.70491803, 98.7804878,99,99]})

I want to return the pairs of A and B which have the largest values of X and Y. For the df provided it would be the first and the last row. In the case where a column’s value repeats e.g. A,B: AZI75801.1, AXQ00106.1 and AXQ00106.1, AXQ00107.1, return the pair AZI75801.1, AXQ00107.1 as they have the largest X and Y value.

enter image description here

How I’m attempting it:

  1. Find combinations of A and B which have already appeared but are swapped
df2 = df.assign(swap=df.apply(lambda r: ((df.loc[(df.A.eq(r.B)&df.B.eq(r.A))].index.values)<r.name).any(), axis=1))
  1. Split into 2 tables based on pairs of values that have been seen before vs. not
df_1=df2[df2.swap]
df_0=df2[~df2.swap]
df_1=df_1.sort_values(['A','X','Y'],ascending=False).groupby('A').first()
df_0=df_0.sort_values(['A','X','Y'],ascending=False).groupby('B').first()
df_1.reset_index(inplace=True)
df_0.reset_index(inplace=True)
    
df_1_grp=df_1.sort_values(['A','X','Y'],ascending=False).groupby('A').first().reset_index()
df_0_grp=df_0.sort_values(['A','X','Y'],ascending=False).groupby('A').first().reset_index()
    
merge_df=pd.merge(df_1_grp, df_0_grp, left_on= ['A', 'B'],
                  right_on= ['B', 'A'],
                  how='outer',suffixes=('_1','_2'))
  1. Grouping to get the max value and then merging which gives:
    A_1     B_1     X_1     Y_1     swap_1   A_2        B_2         X_2      Y_2        swap_2
0   AZI75768.1  AXQ00084.1  595.0   97.704918   True    NaN     NaN     NaN     NaN     NaN
1   NaN     NaN     NaN     NaN     NaN     AXQ00134.1  AZI75768.1  272.0   83.231707   False
2   NaN     NaN     NaN     NaN     NaN     AZI75801.1  AXQ00107.1  100.0   99.000000   False

But this still doesn’t solve my problem as it returns the pair AXQ00134.1, AZI75768.1 which is incorrect as AZI75768.1 has the highest X,Y value with AZI75768.1.

Any help is greatly appreciated.

Extension of mozway’s answer. Thanks so much for making me conceptualize this as a graph! This solves my problem:

g= nx.from_pandas_edgelist(df, 'A', 'B', edge_attr='X')
g.edges(data=True)

sorted(g.edges(data=True),key= lambda x: x[2]['X'],reverse=True)
Asked By: skiventist

||

Answers:

IIUC, this is a graph problem. You have connected components and want to find the max of the disjoint subgraphs.

You can use networkx to identify the disjoint subgraphs and use this to form groups for pandas’ groupby:

import networkx as nx

G = nx.from_pandas_edgelist(df, source='A', target='B')

group = {k: v for v,s in enumerate(nx.connected_components(G)) for k in s}
# {'AZI75768.1': 0, 'AXQ00084.1': 0, 'AXQ00134.1': 0,
#  'AXQ00106.1': 1, 'AXQ00107.1': 1, 'AZI75801.1': 1}

out = (df
       .sort_values(by=['X', 'Y'])
       .groupby(df['A'].map(group))
       .last()
       )

Output:

            A           B    X          Y
A                                        
0  AXQ00084.1  AZI75768.1  607  99.672131
1  AZI75801.1  AXQ00107.1  100  99.000000

Your graph:

graph

Answered By: mozway
double_max = []
for (a, b), grouped_data in df.groupby(['A', 'B'])
   x_max_index= np.argmax(grouped_data['X']) 
   y_max_index= np.argmax(grouped_data['Y'])

   if x_max_index == y_max_index:
      max_row = grouped_data.iloc[x_max_index]
      double_max.append([a, b, max_row ['X'], max_row['Y'])
Answered By: Franco Milanese
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.