Subtract 2 pyspark dataframes based on column

Question:

I have two PySpark dataframes:

i
+---+-----+
| ID|COL_A|
+---+-----+
|  1|  123|
|  2|  456|
|  3|  111|
|  4|  678|
+---+-----+
j
+----+-----+
|ID_B|COL_B|
+----+-----+
|   2|  456|
|   3|  111|
|   4|  876|
+----+-----+

I’m trying to subtract j from i based on the values of a particular column, i.e., keep only the rows of i whose COL_A value does not appear in COL_B of j.

The expected output is:

diff
+---+-----+
| ID|COL_A|
+---+-----+
|  1|  123|
|  4|  678|
+---+-----+

This is my code:

common = i.join(j.withColumnRenamed('COL_B', 'COL_A'), ['COL_A'], 'leftsemi')
diff = i.subtract(common)
diff.show()

But the output is wrong; it still contains every row of i:

diff
+---+-----+
| ID|COL_A|
+---+-----+
|  2|  456|
|  1|  123|
|  4|  678|
|  3|  111|
+---+-----+

Am I doing something wrong here? Thanks in advance.

Asked By: Phillip


Answers:

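Try a left join, then keep only the rows of i that found no match in j. After a left join from i, COL_A is never null, so the null check has to be on the column that comes from j:

left_join = i.join(j, j.COL_B == i.COL_A, how='left')
left_join.filter(left_join.COL_B.isNull()).select('ID', 'COL_A').show()

If the column names are passed in as arguments (cola for i, colb for j), you can write:

left_join = i.join(j, j[colb] == i[cola], how='left')
left_join.filter(left_join[colb].isNull()).select([i[c] for c in i.columns]).show()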
Answered By: Mayank Porwal