PySpark: filter a DataFrame column based on a column in another DataFrame without a join

Question:

I have a pyspark dataframe called df1 that looks like this:

ID1 ID2
aaaa a1
bbbb a2
aaaa a3
bbbb a4
cccc a2

And I have another dataframe called df2 that looks like this:

ID2_1 ID2_2
a2 a1
a3 a2
a2 a3
a2 a1

where the values of ID2 in the first dataframe match the values in columns ID2_1 and ID2_2 in the second dataframe.

So the resultant dataframe will look like this:

ID1 ID2
aaaa a1
bbbb a2
aaaa a3
cccc a2

(the fourth row, bbbb/a4, was filtered out because a4 does not appear anywhere in df2)

I want to filter the column ID2 to contain only values that appear in one of the columns ID2_1 or ID2_2.
I tried doing

filtered = df1.filter((f.col("ID2").isin(df2.ID2_1)) |
                      (f.col("ID2").isin(df2.ID2_2)))

But this doesn’t seem to work.
I have seen other suggestions to use a join between the two dataframes, but that operation is too heavy and I’m trying to avoid it. Any suggestions on how to accomplish this?

Asked By: Ofek Glick


Answers:

Not sure why you would want to avoid a join, because the alternatives can be just as computationally expensive.
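
For reference, the join-based version the question mentions would be a left-semi join against the union of df2’s two ID columns; a minimal sketch, assuming the column names shown above:

from pyspark.sql import functions as f

# Build one column of candidate IDs from df2's two columns,
# then keep only the df1 rows whose ID2 appears in it (left-semi join).
ids = (df2.select(f.col("ID2_1").alias("ID2"))
          .union(df2.select(f.col("ID2_2").alias("ID2")))
          .distinct())
result = df1.join(ids, on="ID2", how="left_semi")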

Anyway

  1. Collect the values of the df2 columns into a flat Python list.
  2. Attach the distinct values to df1 as an array column.
  3. Keep only the rows where that array contains ID2.

Code below

from pyspark.sql.functions import array, array_contains, array_distinct, col, lit

g = df2.collect()  # list of Rows; each Row holds the ID2_1 and ID2_2 values
vals = [item for sublist in g for item in sublist]  # flatten into one list

new = (df1.withColumn('x', array_distinct(array(*[lit(v) for v in vals])))
          .where(array_contains(col('x'), col('ID2')))
          .drop('x'))

new.show(truncate=False)

+----+---+
|ID1 |ID2|
+----+---+
|aaaa|a1 |
|bbbb|a2 |
|aaaa|a3 |
|cccc|a2 |
+----+---+
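
A note on the design choice: since df2’s values are collected to the driver anyway, a plain isin against the collected list does the same filtering without the helper array column. A sketch, under the same assumption that df2 is small enough to collect:

vals = {v for row in df2.select('ID2_1', 'ID2_2').collect() for v in row}
new2 = df1.filter(col('ID2').isin(list(vals)))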
Answered By: wwnde