Comparing two dataframes on given columns

Question:

I have two dataframes, DF1 and DF2, with different column sets:

DF1
col1  col2  col3   col4  col5  col6  col7
-----------------------------------------------
Asr   Suh   dervi  xyz   yes   NY    2022-04-11
Bas   nav   dervi  xyz   yes   CA    2022-04-11
Naz   adn   otc    xyz   no    NJ    2022-05-01

DF2
col1  col2  col3   col4
------------------------
Asr   Suh   dervi  xyz
Bas   nav   dervi  xyz

I want to compare these two dataframes on col2 and col3 and extract the rows of DF1 that have no match in DF2.

So the compare operation should give the row below as the result:

df3
Naz  adn  otc  xyz  no  NJ  2022-05-01
Asked By: Koushur


Answers:

There is no built-in left anti join in pandas, so do an outer join with indicator=True first and filter on the resulting _merge column.

cols = ['col2', 'col3']

# add a _merge indicator column via outer join on the key columns
_df = df1[cols].merge(df2[cols], how='outer', indicator=True)
_df
col2    col3    _merge
Suh     dervi   both
nav     dervi   both
adn     otc     left_only

# keep only the left_only rows, then drop the _merge column
_df = _df[_df._merge == 'left_only'].drop('_merge', axis=1)
_df
col2    col3
adn     otc

# inner join _df back to df1 to recover the full non-matching rows
df3 = df1.merge(_df, on=cols)
df3

col1    col2    col3    col4    col5    col6    col7
Naz     adn     otc     xyz     no      NJ      2022-05-01
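
For reference, the three steps above can also be collapsed into a single chain on df1 itself; this is a sketch assuming the same df1, df2 and cols as above, where drop_duplicates() guards against duplicate keys in df2 multiplying rows:

# one-shot left anti join: left-merge with indicator, keep unmatched rows
df3 = (df1.merge(df2[cols].drop_duplicates(), on=cols, how='left', indicator=True)
          .query('_merge == "left_only"')
          .drop(columns='_merge'))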
Answered By: ndr

You can use PySpark's left anti join to get your desired result. When you join two DataFrames using a left anti join ("leftanti"), it returns only columns from the left DataFrame, and only for non-matched records.

Preparing the dataframes:

from pyspark.sql import SparkSession

# reuses an active SparkSession if one exists, otherwise creates one
spark = SparkSession.builder.getOrCreate()

data_1 = [('Asr', 'Suh', 'dervi', 'xyz', 'yes', 'NY', '2022-04-11'),
          ('Bas', 'nav', 'dervi', 'xyz', 'yes', 'CA', '2022-04-11'),
          ('Naz', 'adn', 'otc', 'xyz', 'no', 'NJ', '2022-05-01')]

col_1 = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']

df_1 = spark.createDataFrame(data=data_1, schema=col_1)

df_1.show()
+----+----+-----+----+----+----+----------+
|col1|col2| col3|col4|col5|col6|      col7|
+----+----+-----+----+----+----+----------+
| Asr| Suh|dervi| xyz| yes|  NY|2022-04-11|
| Bas| nav|dervi| xyz| yes|  CA|2022-04-11|
| Naz| adn|  otc| xyz|  no|  NJ|2022-05-01|
+----+----+-----+----+----+----+----------+

data_2 = [('Asr', 'Suh', 'dervi', 'xyz'),
          ('Bas', 'nav', 'dervi', 'xyz')]

col_2 = ['col1', 'col2', 'col3', 'col4']

df_2 = spark.createDataFrame(data=data_2, schema=col_2)

df_2.show()
+----+----+-----+----+
|col1|col2| col3|col4|
+----+----+-----+----+
| Asr| Suh|dervi| xyz|
| Bas| nav|dervi| xyz|
+----+----+-----+----+

So, in your case, you could try something like this:

df_1.join(df_2, (df_1.col2 == df_2.col2) & (df_1.col3 == df_2.col3), "leftanti").show(truncate=False)
+----+----+----+----+----+----+----------+
|col1|col2|col3|col4|col5|col6|col7      |
+----+----+----+----+----+----+----------+
|Naz |adn |otc |xyz |no  |NJ  |2022-05-01|
+----+----+----+----+----+----+----------+

or

df_1.join(df_2, ['col2', 'col3'], "leftanti").select(df_1.columns).show(truncate=False)
+----+----+----+----+----+----+----------+
|col1|col2|col3|col4|col5|col6|col7      |
+----+----+----+----+----+----+----------+
|Naz |adn |otc |xyz |no  |NJ  |2022-05-01|
+----+----+----+----+----+----+----------+

Both of them are the same, just different syntax; use whichever you prefer.
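
For completeness, the same anti join can also be sketched as a plain left join followed by a null filter; the matched helper column below is hypothetical, introduced only to flag matched rows, and distinct() guards against duplicate keys in df_2:

from pyspark.sql import functions as F

# left join df_1 against the distinct join keys of df_2, then keep unmatched rows
keys = df_2.select('col2', 'col3').distinct().withColumn('matched', F.lit(True))
df3 = (df_1.join(keys, on=['col2', 'col3'], how='left')
           .where(F.col('matched').isNull())
           .drop('matched'))
df3.show(truncate=False)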

Answered By: Tushar Patil