PySpark: filter a DataFrame column based on a column in another DataFrame without a join

Question:

I have a pyspark dataframe called df1 that looks like this:

ID1 ID2
aaaa a1
bbbb a2
aaaa a3
bbbb a4
cccc a2

And I have another dataframe called df2 that looks like this:

ID2_1 ID2_2
a2 a1
a3 a2
a2 a3
a2 a1

where the values of ID2 in the first dataframe match the values in columns ID2_1 and ID2_2 in the second dataframe.

So the resultant dataframe will look like this:

ID1 ID2
aaaa a1
bbbb a2
aaaa a3
cccc a2

(the fourth row, bbbb/a4, was filtered out because a4 does not appear anywhere in df2)

I want to filter the column ID2 to contain only values that appear in one of the columns ID2_1 or ID2_2.
I tried doing

filtered = df1.filter((f.col("ID2").isin(df2.ID2_1)) |
                      (f.col("ID2").isin(df2.ID2_2)))

But this doesn’t seem to work.
I have seen other suggestions to use a join between the two dataframes, but that operation is too heavy and I’m trying to avoid it. Any suggestions on how to accomplish this?

Asked By: Ofek Glick


Answers:

Not sure why you would want to avoid a join, because the alternatives can be just as computationally expensive.
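
For reference, the join-based version the question mentions would be a left-semi join against the union of df2’s two ID columns; a minimal sketch, assuming the column names shown above:

from pyspark.sql import functions as f

# Build one column of candidate IDs from df2's two columns,
# then keep only the df1 rows whose ID2 appears in it (left-semi join).
ids = (df2.select(f.col("ID2_1").alias("ID2"))
          .union(df2.select(f.col("ID2_2").alias("ID2")))
          .distinct())
result = df1.join(ids, on="ID2", how="left_semi")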

Anyway

  1. Collect the values of the df2 columns into a flat Python list.
  2. Attach the distinct values to df1 as an array column.
  3. Keep only the rows where that array contains ID2.

Code below

from pyspark.sql.functions import array, array_contains, array_distinct, col, lit

g = df2.collect()  # list of Rows; each Row holds the ID2_1 and ID2_2 values
vals = [item for sublist in g for item in sublist]  # flatten into one list

new = (df1.withColumn('x', array_distinct(array(*[lit(v) for v in vals])))
          .where(array_contains(col('x'), col('ID2')))
          .drop('x'))

new.show(truncate=False)

+----+---+
|ID1 |ID2|
+----+---+
|aaaa|a1 |
|bbbb|a2 |
|aaaa|a3 |
|cccc|a2 |
+----+---+
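
A note on the design choice: since df2’s values are collected to the driver anyway, a plain isin against the collected list does the same filtering without the helper array column. A sketch, under the same assumption that df2 is small enough to collect:

vals = {v for row in df2.select('ID2_1', 'ID2_2').collect() for v in row}
new2 = df1.filter(col('ID2').isin(list(vals)))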
Answered By: wwnde