Selecting dataframe first row based on specific columns and removing rows if the column value in specific column appeared previously
Question:
I have 3 dataframes
df1
CAT1 CAT2 CAT3 ID_X
A1 B C X1
A1 B C X2
A2 B C X3
A2 B C X4
A2 B C X5
A3 B C X6
A4 B C X7
df2
CAT1 CAT2 CAT3 ID_Y
A1 B C Y1
A1 B C Y2
A1 B C Y3
A2 B C Y4
A2 B C Y5
A3 B C Y6
A5 B C Y7
df3
ID_X ID_Y ID_XY
X1 Y1 X1Y1
X2 Y3 X2Y3
X3 Y4 X3Y4
X4 Y5 X4Y5
X6 Y6 X6Y6
There are three steps to get the end result.
Step 1: Inner join df1 and df2 on CAT1, CAT2 and CAT3, then create the ID_XY column from ID_X and ID_Y to get a new dataframe, df_merge.
Script
import pandas as pd
df_merge = pd.merge(df1, df2, how="inner", on=["CAT1", "CAT2", "CAT3"])
df_merge['ID_XY'] = df_merge['ID_X'] + df_merge['ID_Y']
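For reference, Step 1 can be reproduced end to end with the sample data. This is a minimal sketch that rebuilds df1 and df2 from the tables above, so the literal values are copied from the question and nothing else is assumed.

```python
import pandas as pd

# Sample frames rebuilt from the tables in the question.
df1 = pd.DataFrame({
    "CAT1": ["A1", "A1", "A2", "A2", "A2", "A3", "A4"],
    "CAT2": ["B"] * 7,
    "CAT3": ["C"] * 7,
    "ID_X": ["X1", "X2", "X3", "X4", "X5", "X6", "X7"],
})
df2 = pd.DataFrame({
    "CAT1": ["A1", "A1", "A1", "A2", "A2", "A3", "A5"],
    "CAT2": ["B"] * 7,
    "CAT3": ["C"] * 7,
    "ID_Y": ["Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7"],
})

# The inner join keeps only category combinations present in both frames,
# so A4 (df1 only) and A5 (df2 only) drop out.
df_merge = pd.merge(df1, df2, how="inner", on=["CAT1", "CAT2", "CAT3"])

# Build the combined key by concatenating the two ID strings.
df_merge["ID_XY"] = df_merge["ID_X"] + df_merge["ID_Y"]

print(len(df_merge))  # 13 candidate pairs: 6 for A1, 6 for A2, 1 for A3
```

Every ID_X is paired with every ID_Y that shares its CAT1/CAT2/CAT3 values, which is why the later steps have to thin the pairs out again.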
Step 2: Remove the rows of df_merge whose ID_XY already appears in df3.
Script
df_merge1 = df_merge[~df_merge.ID_XY.isin(df3.ID_XY)]
df_merge1
CAT1 CAT2 CAT3 ID_X ID_Y ID_XY
A1 B C X1 Y2 X1Y2
A1 B C X1 Y3 X1Y3
A1 B C X2 Y1 X2Y1
A1 B C X2 Y2 X2Y2
A2 B C X3 Y5 X3Y5
A2 B C X4 Y4 X4Y4
A2 B C X5 Y4 X5Y4
A2 B C X5 Y5 X5Y5
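One possible alternative to comparing concatenated ID_XY strings is an anti-join on the ID_X and ID_Y columns directly. It may be safer if IDs of different lengths could ever concatenate to the same string (for example X1+Y11 and X11+Y1, a hypothetical case that does not occur in this data). The sketch below rebuilds the Step 1 result from the sample data and uses the `indicator=True` option of `DataFrame.merge`.

```python
import pandas as pd

# Candidate pairs after the inner join (Step 1), rebuilt from the question's data.
pairs = [
    ("A1", "X1", "Y1"), ("A1", "X1", "Y2"), ("A1", "X1", "Y3"),
    ("A1", "X2", "Y1"), ("A1", "X2", "Y2"), ("A1", "X2", "Y3"),
    ("A2", "X3", "Y4"), ("A2", "X3", "Y5"),
    ("A2", "X4", "Y4"), ("A2", "X4", "Y5"),
    ("A2", "X5", "Y4"), ("A2", "X5", "Y5"),
    ("A3", "X6", "Y6"),
]
df_merge = pd.DataFrame(pairs, columns=["CAT1", "ID_X", "ID_Y"])
df_merge.insert(1, "CAT2", "B")
df_merge.insert(2, "CAT3", "C")
df_merge["ID_XY"] = df_merge["ID_X"] + df_merge["ID_Y"]

df3 = pd.DataFrame({
    "ID_X": ["X1", "X2", "X3", "X4", "X6"],
    "ID_Y": ["Y1", "Y3", "Y4", "Y5", "Y6"],
})

# Anti-join: a left merge with indicator=True adds a "_merge" column that is
# "left_only" for pairs that have no match in df3.
marked = df_merge.merge(df3, on=["ID_X", "ID_Y"], how="left", indicator=True)
df_merge1 = marked[marked["_merge"] == "left_only"].drop(columns="_merge")

print(df_merge1["ID_XY"].tolist())
```

On the sample data this keeps the same eight rows as the `isin` filter on ID_XY.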
Step 3: Keep the first row for each combination of CAT1, CAT2, CAT3 and ID_X, and remove a row if its ID_Y value has already been used by a previously selected row.
The final output is the result of Step 3 and should look like this:
df_final
CAT1 CAT2 CAT3 ID_X ID_Y ID_XY
A1 B C X1 Y2 X1Y2
A1 B C X2 Y1 X2Y1
A2 B C X3 Y5 X3Y5
A2 B C X4 Y4 X4Y4
Edit 1
Note:
Think of ID_X as a job and ID_Y as a candidate. In df_merge1, if I select Y2 for X1 (with the other columns held constant), then I can’t also select Y3 for the same X1. Similarly, if X4 is allocated to Y4 for the same CAT1, CAT2 and CAT3, then Y4 can’t be allocated to X5.
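The rule in this note amounts to a one-to-one constraint: each job (ID_X) and each candidate (ID_Y) may appear in at most one row of the result. A small check like the one below can confirm that any proposed result satisfies it. The helper name `is_one_to_one` is made up for this illustration.

```python
import pandas as pd

# Expected final result, copied from the question (ID columns only).
df_final = pd.DataFrame({
    "ID_X": ["X1", "X2", "X3", "X4"],
    "ID_Y": ["Y2", "Y1", "Y5", "Y4"],
})

def is_one_to_one(df, x_col="ID_X", y_col="ID_Y"):
    """Return True if no job and no candidate appears in more than one row."""
    return bool(df[x_col].is_unique and df[y_col].is_unique)

print(is_one_to_one(df_final))  # True

# Adding X5 -> Y4 would reuse candidate Y4, so the check fails.
extra = pd.DataFrame({"ID_X": ["X5"], "ID_Y": ["Y4"]})
violating = pd.concat([df_final, extra], ignore_index=True)
print(is_one_to_one(violating))  # False
```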
Edit 2:
My try
merge3 = df_merge1.copy()
df_X1 = merge3[merge3['ID_X']=='X1']
df_X2 = merge3[merge3['ID_X']=='X2']
df_X3 = merge3[merge3['ID_X']=='X3']
df_X4 = merge3[merge3['ID_X']=='X4']
df_X5 = merge3[merge3['ID_X']=='X5']
selected_list = []
df_X1 = df_X1.iloc[:1]
selected_list.append(df_X1['ID_Y'].values[0])
df_X2 = df_X2[~df_X2.ID_Y.isin(selected_list)].iloc[:1]
selected_list.append(df_X2['ID_Y'].values[0])
df_X3 = df_X3[~df_X3.ID_Y.isin(selected_list)].iloc[:1]
selected_list.append(df_X3['ID_Y'].values[0])
df_X4 = df_X4[~df_X4.ID_Y.isin(selected_list)].iloc[:1]
selected_list.append(df_X4['ID_Y'].values[0])
df_X5 = df_X5[~df_X5.ID_Y.isin(selected_list)].iloc[:1]
df_output = pd.concat([df_X1,df_X2,df_X3,df_X4,df_X5])
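The per-ID_X slicing above can be generalized into a single pass over the rows, so it does not need one variable per ID_X value. The sketch below is one way to do that, under the same assumption the attempt makes: IDs are treated as globally unique rather than scoped per CAT1/CAT2/CAT3 group. The function name `greedy_first_match` is hypothetical.

```python
import pandas as pd

# df_merge1 rebuilt from the table in the question.
df_merge1 = pd.DataFrame({
    "CAT1": ["A1", "A1", "A1", "A1", "A2", "A2", "A2", "A2"],
    "CAT2": ["B"] * 8,
    "CAT3": ["C"] * 8,
    "ID_X": ["X1", "X1", "X2", "X2", "X3", "X4", "X5", "X5"],
    "ID_Y": ["Y2", "Y3", "Y1", "Y2", "Y5", "Y4", "Y4", "Y5"],
})
df_merge1["ID_XY"] = df_merge1["ID_X"] + df_merge1["ID_Y"]

def greedy_first_match(df):
    """Walk rows in order; keep a row only if its ID_X and ID_Y are both unused."""
    used_x, used_y, keep = set(), set(), []
    for idx, x, y in zip(df.index, df["ID_X"], df["ID_Y"]):
        if x not in used_x and y not in used_y:
            used_x.add(x)
            used_y.add(y)
            keep.append(idx)
    return df.loc[keep]

df_output = greedy_first_match(df_merge1)
print(df_output["ID_XY"].tolist())  # ['X1Y2', 'X2Y1', 'X3Y5', 'X4Y4']
```

Unlike the manual version, this loop simply skips an ID_X that has no unused ID_Y left, whereas `values[0]` on an empty selection would raise an IndexError. The result depends on row order, so it is a first-come assignment, not necessarily the largest possible matching.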
Any help will be really appreciated
Answers:
Before answering the question: why wouldn’t you keep the second line of df_merge1?
CAT1 CAT2 CAT3 ID_X ID_Y ID_XY
A1 B C X1 Y3 X1Y3
Y3 (in the ID_Y column) did not appear before, so shouldn’t it be kept?
If you need to remove duplicates by CAT1/CAT2/CAT3/ID_X and then by ID_Y, use:
df = df_merge1.drop_duplicates(['CAT1','CAT2','CAT3','ID_X']).drop_duplicates('ID_Y')
print(df)
CAT1 CAT2 CAT3 ID_X ID_Y ID_XY
0 A1 B C X1 Y2 X1Y2
2 A1 B C X2 Y1 X2Y1
4 A2 B C X3 Y5 X3Y5
5 A2 B C X4 Y4 X4Y4
Or:
df = df_merge1.drop_duplicates(['CAT1','CAT2','CAT3','ID_Y']).drop_duplicates('ID_X')
print(df)
CAT1 CAT2 CAT3 ID_X ID_Y ID_XY
0 A1 B C X1 Y2 X1Y2
2 A1 B C X2 Y1 X2Y1
4 A2 B C X3 Y5 X3Y5
5 A2 B C X4 Y4 X4Y4
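Both variants give the expected result on the sample data, but it may be worth checking how chained `drop_duplicates` behaves when the first ID_Y of one ID_X is already taken. The toy frame below is hypothetical (it is not from the question) and only illustrates that case.

```python
import pandas as pd

# Hypothetical toy data: X2's first candidate is Y1, but Y2 is also available.
toy = pd.DataFrame({
    "ID_X": ["X1", "X2", "X2"],
    "ID_Y": ["Y1", "Y1", "Y2"],
})

# First row per ID_X, then first row per ID_Y.
result = toy.drop_duplicates("ID_X").drop_duplicates("ID_Y")
print(result)
```

Here only the X1/Y1 row survives: X2's first row is dropped because Y1 is already used, and its X2/Y2 row was discarded by the first `drop_duplicates`. If X2 should fall back to Y2 in such a case, a row-by-row pass that tracks used IDs would be needed instead.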