PySpark – Compare DataFrames

Question:

I’m new to PySpark, So apoloigies if this is a little simple, I have found other questions that compare dataframes but not one that is like this, therefore I do not consider it to be a duplicate.
I’m trying to compare two dateframes with similar structure. The ‘name’ will be unique, yet the counts could be different.

So if the count is different I would like it to produce a dataframe or a python dictionary. just like below. Any ideas on how I would achieved something like this?

DF1

+-------+---------+
|name   | count_1 |
+-------+---------+
|  Alice|   1500  |
|    Bob|   1000  |
|Charlie|   150   |
| Dexter|   100   |
+-------+---------+

DF2

+-------+---------+
|name   | count_2 |
+-------+---------+
|  Alice|   1500  |
|    Bob|   200   |
|Charlie|   150   |
| Dexter|   10    |
+-------+---------+

To produce the outcome:

Mismatch

+-------+-------------+--------------+
|name   | df1_count   | df2_count    |
+-------+-------------+--------------+
|    Bob|   1000      |    200       |
| Dexter|   100       |     10       |
+-------+-------------+--------------+

Match

+-------+-------------+--------------+
|name   | df1_count   | df2_count    |
+-------+-------------+--------------+
|  Alice|   1500      |   1500       |
|Charlie|   150       |    150       |
+-------+-------------+--------------+
Asked By: user10176078

||

Answers:

So I create a third DataFrame, joining DataFrame1 and DataFrame2, and then filter by the counts fields to check if they are equal or not:

Mismatch:

df3 = df1.join(df2, [df1.name == df2.name] , how = 'inner' )
df3.filter(df3.df1_count != df3.df2_count).show()

Match:

df3 = df1.join(df2, [df1.name == df2.name] , how = 'inner' )
df3.filter(df3.df1_count == df3.df2_count).show()

Hope this comes in useful for someone

Answered By: user10176078

You can create a temp view on top of each dataframe and write Spark SQL query in order to do the join. Direct dataframe join and Spark SQL joins are two options that can be explored here. SQL is more intuitive so it might be easy for newbies.

Answered By: Prashant

MATCH

Df1.join(Df2,Df1.col(“name”) === Df2.col(“name”) && Df1.col(“count_1”) === Df2.col(“count_2”),”inner”).drop(Df2.col(“name”)).show

MISMATCH

Df1.join(Df2,Df1.col(“name”) === Df2.col(“name”) && Df1.col(“count_1”) === Df2.col(“count_2”),”leftanti”).as(“Df3”).join(Df2,Df2.col(“name”) === col(“Df3.name”),”inner”).drop(Df2.col(“name”)).show
Answered By: Chandan Ray

For small DataFrame comparisons, you can use the chispa library. This is particularly useful when performing DataFrame comparisons in a test suite. For big datasets, the accepted answer that uses a join is the best approach.

In this example, chispa.assert_df_equality(df1, df2), will output this error message:

enter image description here

The rows that mismatch are red and the rows that match are blue. This post has more info on testing PySpark code.

There’s a cool library called deequ that is good for "data unit tests", but I’m not sure if there is a PySpark implementation.

Answered By: Powers

The easy way is to use the diff transformation from spark-extension (Scala and Python):

from gresearch.spark.diff import *

left = spark.createDataFrame([("Alice", 1500), ("Bob", 1000), ("Charlie", 150), ("Dexter", 100)], ["name", "count"])
right = spark.createDataFrame([("Alice", 1500), ("Bob", 200), ("Charlie", 150), ("Dexter", 10)], ["name", "count"])

diff = left.diff(right, 'name')
diff.show()
+----+-------+----------+-----------+                                           
|diff|   name|left_count|right_count|
+----+-------+----------+-----------+
|   N|  Alice|      1500|       1500|
|   C|    Bob|      1000|        200|
|   N|Charlie|       150|        150|
|   C| Dexter|       100|         10|
+----+-------+----------+-----------+

This shows you mismatch (C) and match (N) in one DataFrame.

And, of course, you can filter to get mismatches and matches only:

diff.where(diff['diff'] == 'C').show()
+----+------+----------+-----------+                                            
|diff|  name|left_count|right_count|
+----+------+----------+-----------+
|   C|   Bob|      1000|        200|
|   C|Dexter|       100|         10|
+----+------+----------+-----------+

diff.where(diff['diff'] == 'N').show()
+----+-------+----------+-----------+                                           
|diff|   name|left_count|right_count|
+----+-------+----------+-----------+
|   N|  Alice|      1500|       1500|
|   N|Charlie|       150|        150|
+----+-------+----------+-----------+

While this is a simple example, diffing DataFrames can become complicated when wide schemas, insertions, deletions, null values, float or double values, white space changes are involved, which is fully supported by this solution.

Answered By: EnricoM

I just discovered a wonderful package for pyspark that compares two dataframes. The name of the package is datacompy

https://capitalone.github.io/datacompy/

example code:

import datacompy as dc

comparison = dc.SparkCompare(spark, base_df=df1, compare_df=df2, join_columns=common_keys, match_rates=True)
comparison.report()

The above code will generate a summary report, and the one below it will give you the mismatches.

comparison.rows_both_mismatch.display()

There are also more fearures that you can explore.

Answered By: George Sotiropoulos