PySpark: check whether rows of one DataFrame are contained in another DataFrame
Question:
Assume I have two DataFrames:
DF1: DATA1, DATA1, DATA2, DATA2
DF2: DATA2
I want to exclude every row that appears in DF2 while keeping the duplicates in DF1. What should I do?
Expected result: DATA1, DATA1
Answers:
Use a left anti join.
When you join two DataFrames with a left anti join (how='leftanti' or 'left_anti'), the result contains only columns from the left DataFrame, and only the rows that have no match in the right DataFrame.
df3 = df1.join(df2, df1['id']==df2['id'], how='left_anti')
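A minimal, self-contained sketch of this on the data from the question, assuming a single hypothetical column named value holding the DATA1/DATA2 strings:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Single-column DataFrames matching the example in the question (column name "value" is assumed)
df1 = spark.createDataFrame([("DATA1",), ("DATA1",), ("DATA2",), ("DATA2",)], ["value"])
df2 = spark.createDataFrame([("DATA2",)], ["value"])

# Left anti join: keep only rows of df1 that have no match in df2.
# Duplicates in df1 are preserved, because each df1 row is checked independently.
df3 = df1.join(df2, on="value", how="left_anti")
df3.show()
# +-----+
# |value|
# +-----+
# |DATA1|
# |DATA1|
# +-----+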
df1.exceptAll(df2)
will give you the rows that are present in df1 but not in df2, keeping the duplicates. (In PySpark the Scala except method is not available because except is a reserved Python keyword; exceptAll() gives EXCEPT ALL semantics, while subtract() is the EXCEPT DISTINCT variant and would collapse the duplicate DATA1 rows.)
credits: https://sanori.github.io/2019/08/Compare-Two-Tables-in-SQL/
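A short sketch of the exceptAll approach on the same hypothetical single-column data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("DATA1",), ("DATA1",), ("DATA2",), ("DATA2",)], ["value"])
df2 = spark.createDataFrame([("DATA2",)], ["value"])

# EXCEPT ALL semantics: removes rows that appear in df2 but keeps duplicates of the rest
df1.exceptAll(df2).show()
# +-----+
# |value|
# +-----+
# |DATA1|
# |DATA1|
# +-----+

# df1.subtract(df2) is EXCEPT DISTINCT and would return a single DATA1 row instead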