PySpark: check whether rows of one DataFrame are contained in another DataFrame
Question:
Assume I have two DataFrames:
DF1: DATA1, DATA1, DATA2, DATA2
DF2: DATA2
I want to exclude every row that appears in DF2 while keeping the duplicates in DF1. What should I do?
Expected result: DATA1, DATA1
Answers:
Use a left anti join.
When you join two DataFrames with a left anti join (how='leftanti' or 'left_anti'), the result contains only columns from the left DataFrame, and only the rows that have no match in the right DataFrame.
df3 = df1.join(df2, df1['id']==df2['id'], how='left_anti')
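A minimal, self-contained sketch of this on the data from the question, assuming a single hypothetical column named value holding the DATA1/DATA2 strings:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Single-column DataFrames matching the example in the question (column name "value" is assumed)
df1 = spark.createDataFrame([("DATA1",), ("DATA1",), ("DATA2",), ("DATA2",)], ["value"])
df2 = spark.createDataFrame([("DATA2",)], ["value"])

# Left anti join: keep only rows of df1 that have no match in df2.
# Duplicates in df1 are preserved, because each df1 row is checked independently.
df3 = df1.join(df2, on="value", how="left_anti")
df3.show()
# +-----+
# |value|
# +-----+
# |DATA1|
# |DATA1|
# +-----+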
df1.exceptAll(df2)
will give you the rows that are present in df1 but not in df2, keeping the duplicates. (In PySpark the Scala except method is not available because except is a reserved Python keyword; exceptAll() gives EXCEPT ALL semantics, while subtract() is the EXCEPT DISTINCT variant and would collapse the duplicate DATA1 rows.)
credits: https://sanori.github.io/2019/08/Compare-Two-Tables-in-SQL/
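A short sketch of the exceptAll approach on the same hypothetical single-column data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("DATA1",), ("DATA1",), ("DATA2",), ("DATA2",)], ["value"])
df2 = spark.createDataFrame([("DATA2",)], ["value"])

# EXCEPT ALL semantics: removes rows that appear in df2 but keeps duplicates of the rest
df1.exceptAll(df2).show()
# +-----+
# |value|
# +-----+
# |DATA1|
# |DATA1|
# +-----+

# df1.subtract(df2) is EXCEPT DISTINCT and would return a single DATA1 row instead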