What is the point in using PySpark over Pandas?

Question:

I’ve been learning Spark recently (PySpark to be more precise) and at first it seemed really useful and powerful to me. You can process gigabytes of data in parallel, so it should be much faster than processing it with classical tools… right? So I wanted to try it myself to be convinced.

So I downloaded a CSV file of almost 1 GB, about ten million rows (link: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz), and wanted to try processing it with Spark and with Pandas to see the difference.

The goal was just to read the file and count how many rows there were for a certain date. I tried with PySpark:

Preprocess with PySpark
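Roughly something like this (a minimal sketch; the pickup_datetime column name and the date 2021-01-15 below are only placeholders for illustration):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Start a local Spark session
spark = SparkSession.builder.appName("fhvhv-count").getOrCreate()

# Read the ~1 GB CSV (assumed already unzipped) and count rows for one date
df = spark.read.csv("fhvhv_tripdata_2021-01.csv", header=True, inferSchema=True)

count = df.filter(F.to_date(F.col("pickup_datetime")) == "2021-01-15").count()
print(count)
```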

and with Pandas:

Preprocess with Pandas
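Again just a sketch, using the same placeholder column and date:

```python
import pandas as pd

# Read the same CSV with Pandas, parsing the pickup timestamp column
df = pd.read_csv("fhvhv_tripdata_2021-01.csv", parse_dates=["pickup_datetime"])

# Count rows whose pickup date matches the target day
count = (df["pickup_datetime"].dt.date == pd.Timestamp("2021-01-15").date()).sum()
print(count)
```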

Both obviously give the same result, but it takes about 1 min 30 s with PySpark and only (!) about 30 s with Pandas.

I feel like I’m missing something but I don’t know what. Why does it take so much more time with PySpark? Shouldn’t it be the other way around?

EDIT: I did not show my Spark configuration because I am just running it locally, so maybe that is the explanation?

Asked By: WLD


Answers:

Spark is a distributed processing framework. That means that, in order to use it at its full potential, you must deploy it on a cluster of machines (called nodes): the processing is then parallelized and distributed across them. This usually happens on cloud platforms like Google Cloud or AWS. Another interesting option to check out is Databricks.

If you use it on your local machine, it runs on a single node, so it is effectively just a slower version of Pandas. That is fine for learning purposes, but it is not the way it is meant to be used.
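For illustration, the difference shows up in the master URL you pass when building the session; the cluster address below is purely hypothetical:

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors all live in one JVM on your machine,
# parallelised only across local CPU cores.
spark_local = (
    SparkSession.builder
    .master("local[*]")      # use all local cores, but still a single node
    .appName("local-test")
    .getOrCreate()
)

# Cluster mode (hypothetical standalone cluster): work is distributed
# to executors running on the cluster's worker nodes.
# spark_cluster = (
#     SparkSession.builder
#     .master("spark://my-cluster-host:7077")
#     .appName("cluster-job")
#     .getOrCreate()
# )
```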

For more information about how a Spark cluster works, check the documentation: https://spark.apache.org/docs/latest/cluster-overview.html
Keep in mind that this is a very deep topic, and it takes a while to understand everything properly…

Answered By: Liutprand