What is the difference between querying Delta tables with PySpark SQL versus the PySpark DataFrame API?
Question:
I am querying a table in two different ways and getting different results, and I would like to understand why.
I created a table from a Delta location, and I want to query the data stored in that location. I’m using Amazon S3.
I created the table like this:
spark.sql("CREATE TABLE bronze_client_trackingcampaigns.TRACKING_BOUNCES (ClientID INT, SendID INT, SubscriberKey STRING) USING DELTA LOCATION 's3://example/bronze/client/trackingcampaigns/TRACKING_BOUNCES/delta'")
I want to query the data with the following line:
spark.sql("SELECT count(*) FROM bronze_client_trackingcampaigns.TRACKING_BOUNCES")
But the result is wrong: it should be 41832, but it returns 1.
When I ran the same query another way:
spark.read.option("header", True).option("inferSchema", True).format("delta").table("bronze_client_trackingcampaigns.TRACKING_BOUNCES").count()
I obtained the result 41832.
I want to get the same result both ways.
Answers:
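The 1 you got back is the number of rows in the query result, not the value of the count. spark.sql returns a DataFrame, and a COUNT(*) query without GROUP BY always produces exactly one row, whose single column holds the count. To see that value, change the SQL statement to: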
df = spark.sql("SELECT COUNT(*) FROM bronze_client_trackingcampaigns.TRACKING_BOUNCES")
df.show()
You should now get the same result.
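To make the distinction concrete, here is a minimal sketch of the two numbers involved. It assumes an active SparkSession is passed in as spark; the helper names result_row_count and table_row_count and the column alias n are made up for illustration and are not part of the original question.

```python
def result_row_count(spark, table_name):
    """Count the rows of the *result* of the aggregate query.

    A COUNT(*) query without GROUP BY yields a single row,
    so this returns 1 no matter how large the table is.
    """
    result = spark.sql(f"SELECT COUNT(*) AS n FROM {table_name}")
    return result.count()


def table_row_count(spark, table_name):
    """Read the value stored in the single result row.

    This is the number of rows in the table itself, and it should
    match spark.read.table(table_name).count().
    """
    result = spark.sql(f"SELECT COUNT(*) AS n FROM {table_name}")
    # `n` is a hypothetical alias chosen here so the column can be read by name.
    return result.first()["n"]
```

With the table from the question, table_row_count(spark, "bronze_client_trackingcampaigns.TRACKING_BOUNCES") would be expected to return the same value as the spark.read version, while result_row_count would return 1.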