What is the difference between querying tables using Delta format with Pyspark-SQL versus Pyspark?

Question

I am querying a table in two different ways and getting different results. I would like to understand why.

I created a table with USING DELTA and a LOCATION on Amazon S3, and I want to query the data stored at that location.

I created the table like this:

spark.sql("CREATE TABLE bronze_client_trackingcampaigns.TRACKING_BOUNCES (ClientID INT, SendID INT, SubscriberKey STRING) USING DELTA LOCATION 's3://example/bronze/client/trackingcampaigns/TRACKING_BOUNCES/delta'")

I query the data with the following line:

spark.sql("SELECT count(*) FROM bronze_client_trackingcampaigns.TRACKING_BOUNCES")

But the result is wrong: it should be 41832, but it returns 1.

When I ran the same query another way:

spark.read.option("header", True).option("inferSchema", True).format("delta").table("bronze_client_trackingcampaigns.TRACKING_BOUNCES").count()

I obtained the result 41832.

I want to get the same result both ways.

Asked By: Eric Gabriel Bellet Locker

Answer 1

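The 1 you got back is the number of rows in the query result, not the count itself. spark.sql returns a DataFrame, and a COUNT(*) query without GROUP BY produces exactly one row, which holds the count. Assign the result to a DataFrame and show it: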

df = spark.sql("SELECT COUNT(*) FROM bronze_client_trackingcampaigns.TRACKING_BOUNCES")
df.show()

This should now display 41832, the same value the DataFrame API returned.

Answered By: simon_dmorias
