PySpark: check whether rows of one DataFrame are contained in another DataFrame

Question:

Assume I have two DataFrames:

DF1: DATA1, DATA1, DATA2, DATA2

DF2: DATA2

I want to exclude every row that appears in DF2 while keeping the duplicates in DF1. What should I do?

Expected result: DATA1, DATA1

Asked By: TommyQu


Answers:

Use a left anti join.
When you join two DataFrames with a left anti join (how='left_anti'), the result contains only the columns of the left DataFrame and only the rows that have no match in the right DataFrame, so the duplicates in DF1 are kept.

df3 = df1.join(df2, df1['id']==df2['id'], how='left_anti')
Answered By: Rafa
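
A minimal runnable sketch of the left-anti approach (the column name value and the toy data below are assumptions for illustration; joining on the column name avoids a duplicated join column in the result):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("DATA1",), ("DATA1",), ("DATA2",), ("DATA2",)], ["value"])
df2 = spark.createDataFrame([("DATA2",)], ["value"])

# Keep only df1 rows that have no match in df2; duplicates on the left survive.
df3 = df1.join(df2, on="value", how="left_anti")
df3.show()
# +-----+
# |value|
# +-----+
# |DATA1|
# |DATA1|
# +-----+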

df1.subtract(df2) will give you the rows that are present in df1 but not in df2. (In PySpark the method is called subtract rather than except, since except is a reserved word in Python; exceptAll is the variant with SQL EXCEPT ALL semantics, which preserves duplicates.)

credits: https://sanori.github.io/2019/08/Compare-Two-Tables-in-SQL/

Answered By: Ram Ghadiyaram
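
For comparison, a minimal sketch of the set-difference methods (the column name value and the toy data are assumed as above). subtract follows SQL EXCEPT DISTINCT and exceptAll follows SQL EXCEPT ALL, so for the exact expected output (DATA1, DATA1) the left anti join above is the closest fit:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("DATA1",), ("DATA1",), ("DATA2",), ("DATA2",)], ["value"])
df2 = spark.createDataFrame([("DATA2",)], ["value"])

# subtract() = SQL EXCEPT DISTINCT: one row per distinct surviving value.
df1.subtract(df2).show()    # -> DATA1 (once)

# exceptAll() = SQL EXCEPT ALL: removes one occurrence per matching row in df2,
# so one DATA2 remains here (df1 has two, df2 has only one).
df1.exceptAll(df2).show()   # -> DATA1, DATA1, DATA2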