Selecting the first dataframe row per group of columns and removing rows whose value in another column has already appeared

Question:

I have 3 dataframes

df1

CAT1    CAT2    CAT3    ID_X
A1        B      C       X1
A1        B      C       X2
A2        B      C       X3
A2        B      C       X4
A2        B      C       X5
A3        B      C       X6
A4        B      C       X7

df2

CAT1    CAT2    CAT3    ID_Y
A1       B       C       Y1
A1       B       C       Y2
A1       B       C       Y3
A2       B       C       Y4
A2       B       C       Y5
A3       B       C       Y6
A5       B       C       Y7

df3

ID_X    ID_Y    ID_XY
X1      Y1      X1Y1
X2      Y3      X2Y3
X3      Y4      X3Y4
X4      Y5      X4Y5
X6      Y6      X6Y6

There are three steps to get the end result

Step 1: Inner join df1 and df2 on CAT1, CAT2 and CAT3, then create the ID_XY column by concatenating ID_X and ID_Y, giving a new dataframe df_merge

Script

import pandas as pd
df_merge = pd.merge(df1, df2, how="inner", on=["CAT1", "CAT2", "CAT3"])
# Build the combined key by concatenating the two ID columns
df_merge['ID_XY'] = df_merge['ID_X'] + df_merge['ID_Y']

Step 2: Remove from df_merge every row whose ID_XY also appears in df3

Script

df_merge1 = df_merge[~df_merge.ID_XY.isin(df3.ID_XY)]

df_merge1

CAT1    CAT2    CAT3    ID_X    ID_Y    ID_XY
A1       B       C        X1     Y2     X1Y2
A1       B       C        X1     Y3     X1Y3
A1       B       C        X2     Y1     X2Y1
A1       B       C        X2     Y2     X2Y2
A2       B       C        X3     Y5     X3Y5
A2       B       C        X4     Y4     X4Y4
A2       B       C        X5     Y4     X5Y4
A2       B       C        X5     Y5     X5Y5
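
For reference, Steps 1 and 2 can be reproduced end to end with the sample data above. This is only a minimal sketch: the dataframes are typed in by hand from the tables in the question, and the `reset_index(drop=True)` call is an assumption added so that the rows are numbered 0 to 7.

```python
import pandas as pd

# Sample data copied from the tables in the question.
df1 = pd.DataFrame({
    "CAT1": ["A1", "A1", "A2", "A2", "A2", "A3", "A4"],
    "CAT2": ["B"] * 7,
    "CAT3": ["C"] * 7,
    "ID_X": ["X1", "X2", "X3", "X4", "X5", "X6", "X7"],
})
df2 = pd.DataFrame({
    "CAT1": ["A1", "A1", "A1", "A2", "A2", "A3", "A5"],
    "CAT2": ["B"] * 7,
    "CAT3": ["C"] * 7,
    "ID_Y": ["Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7"],
})
df3 = pd.DataFrame({
    "ID_X": ["X1", "X2", "X3", "X4", "X6"],
    "ID_Y": ["Y1", "Y3", "Y4", "Y5", "Y6"],
})
df3["ID_XY"] = df3["ID_X"] + df3["ID_Y"]

# Step 1: inner join on the three category columns, then build ID_XY.
df_merge = pd.merge(df1, df2, how="inner", on=["CAT1", "CAT2", "CAT3"])
df_merge["ID_XY"] = df_merge["ID_X"] + df_merge["ID_Y"]

# Step 2: drop the pairs already listed in df3.
# reset_index is an assumption, used only to renumber the remaining rows.
df_merge1 = df_merge[~df_merge["ID_XY"].isin(df3["ID_XY"])].reset_index(drop=True)
print(df_merge1)
```

A4 and A5 disappear in the inner join because each occurs in only one of the two dataframes.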

Step 3: Select the first row for each combination of CAT1, CAT2, CAT3 and ID_X, and drop a row if its ID_Y value has already appeared in a previously selected row

The final output is the result of Step 3 and should look like this:

df_final

CAT1    CAT2    CAT3    ID_X    ID_Y    ID_XY
A1        B      C      X1       Y2     X1Y2    
A1        B      C      X2       Y1     X2Y1
A2        B      C      X3       Y5     X3Y5
A2        B      C      X4       Y4     X4Y4

Edit 1

Note:
Think of ID_X as a job and ID_Y as a candidate. In df_merge1, if I select Y2 for X1 (with the other columns held constant), then I can't also select Y3 for the same X1. Similarly, if X4 and Y4 are paired for the same CAT1, CAT2 and CAT3, then Y4 can't be allocated to X5.

Edit 2:

My try

merge3 = df_merge1.copy()
df_X1 = merge3[merge3['ID_X']=='X1']
df_X2 = merge3[merge3['ID_X']=='X2']
df_X3 = merge3[merge3['ID_X']=='X3']
df_X4 = merge3[merge3['ID_X']=='X4']
df_X5 = merge3[merge3['ID_X']=='X5']

selected_list = []

df_X1 = df_X1.iloc[:1]
selected_list.append(df_X1['ID_Y'].values[0])
df_X2 = df_X2[~df_X2.ID_Y.isin(selected_list)].iloc[:1]
selected_list.append(df_X2['ID_Y'].values[0])
df_X3 = df_X3[~df_X3.ID_Y.isin(selected_list)].iloc[:1]
selected_list.append(df_X3['ID_Y'].values[0])
df_X4 = df_X4[~df_X4.ID_Y.isin(selected_list)].iloc[:1]
selected_list.append(df_X4['ID_Y'].values[0])
df_X5 = df_X5[~df_X5.ID_Y.isin(selected_list)].iloc[:1]
df_output = pd.concat([df_X1,df_X2,df_X3,df_X4,df_X5])

Any help will be really appreciated

Asked By: AB14


Answers:

Before answering the question: why wouldn't you keep the second row of df_merge1?

CAT1    CAT2    CAT3    ID_X    ID_Y    ID_XY
A1       B       C        X1     Y3     X1Y3  

Y3 (in the ID_Y column) did not appear before, so shouldn't it be kept?

Answered By: noeljbf

If you need to remove duplicates by CAT1/CAT2/CAT3/ID_X and then by ID_Y, use:

df = df_merge1.drop_duplicates(['CAT1','CAT2','CAT3','ID_X']).drop_duplicates('ID_Y')
print (df)
  CAT1 CAT2 CAT3 ID_X ID_Y ID_XY
0   A1    B    C   X1   Y2  X1Y2
2   A1    B    C   X2   Y1  X2Y1
4   A2    B    C   X3   Y5  X3Y5
5   A2    B    C   X4   Y4  X4Y4

Or, removing duplicates by CAT1/CAT2/CAT3/ID_Y first and then by ID_X:

df = df_merge1.drop_duplicates(['CAT1','CAT2','CAT3','ID_Y']).drop_duplicates('ID_X')
print (df)
  CAT1 CAT2 CAT3 ID_X ID_Y ID_XY
0   A1    B    C   X1   Y2  X1Y2
2   A1    B    C   X2   Y1  X2Y1
4   A2    B    C   X3   Y5  X3Y5
5   A2    B    C   X4   Y4  X4Y4
Answered By: jezrael
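
One point worth checking on other data: the chained drop_duplicates calls above keep only the first row per ID_X and then discard it if its ID_Y is already taken, so they never fall back to that ID_X's next candidate. On the sample data this makes no difference. If the fallback is wanted, as in the asker's own attempt, a greedy loop is one option. The following is a minimal sketch, not a definitive implementation: `greedy_first_match` is a hypothetical helper name, df_merge1 is typed in from the question, and ID_Y is assumed to be unique across all categories (as in the asker's `selected_list`).

```python
import pandas as pd

# df_merge1 as shown in the question (output of Step 2).
df_merge1 = pd.DataFrame({
    "CAT1": ["A1", "A1", "A1", "A1", "A2", "A2", "A2", "A2"],
    "CAT2": ["B"] * 8,
    "CAT3": ["C"] * 8,
    "ID_X": ["X1", "X1", "X2", "X2", "X3", "X4", "X5", "X5"],
    "ID_Y": ["Y2", "Y3", "Y1", "Y2", "Y5", "Y4", "Y4", "Y5"],
})
df_merge1["ID_XY"] = df_merge1["ID_X"] + df_merge1["ID_Y"]


def greedy_first_match(df):
    """Hypothetical helper: walk the rows in order and keep a row only if
    neither its ID_X nor its ID_Y has been used by an earlier kept row."""
    used_x, used_y, keep = set(), set(), []
    for idx, x, y in zip(df.index, df["ID_X"], df["ID_Y"]):
        if x not in used_x and y not in used_y:
            used_x.add(x)
            used_y.add(y)
            keep.append(idx)
    return df.loc[keep]


df_final = greedy_first_match(df_merge1)
print(df_final)
```

The result depends on row order, so sort df_merge1 first if some pairings should take priority. This is a first-come, first-served allocation and does not try to maximise the number of matched pairs.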