Databricks – Pyspark vs Pandas

Question:

I have a Python script that uses pandas to transform and manipulate my data. I know some blocks of the code are "inefficient". My question is: if PySpark is supposed to be much faster, can I just replace those blocks with PySpark, or does everything need to be in PySpark? And if I’m in Databricks, how much does this really matter, since it’s already running on a Spark cluster?

Asked By: chicagobeast12

Answers:

If the data is small enough to process with pandas, then you likely don’t need PySpark. Spark is useful when the data is too large to fit into memory on one machine, since it can distribute the computation across a cluster. That being said, if the computation is complex enough to benefit from a lot of parallelization, you could see an efficiency boost from PySpark. I’m more comfortable with PySpark’s APIs than with pandas, so I might end up using PySpark anyway, but whether you’ll see an efficiency boost depends a lot on the problem.
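
To make the "replace just one block" part of the question concrete, here is a minimal sketch, assuming a hypothetical table with `store` and `amount` columns. It writes the same aggregation once with pandas and once with PySpark, and shows the conversion calls for crossing between the two. The column and function names are made up for illustration, and `toPandas()` collects every row to the driver, so it only suits results that fit in memory there.

```python
import pandas as pd


def total_by_store_pandas(pdf: pd.DataFrame) -> pd.DataFrame:
    """Sum `amount` per `store` on a single machine with pandas."""
    return pdf.groupby("store", as_index=False)["amount"].sum()


def total_by_store_spark(pdf: pd.DataFrame) -> pd.DataFrame:
    """Same aggregation, but executed by Spark for this one block only."""
    # Imported lazily so the pandas path works without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    # On Databricks a session already exists; getOrCreate() reuses it.
    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame(pdf)  # pandas -> Spark
    result = sdf.groupBy("store").agg(F.sum("amount").alias("amount"))
    return result.toPandas()  # Spark -> pandas (collects to the driver)


if __name__ == "__main__":
    # Hypothetical sample data, small enough that pandas alone is fine.
    sales = pd.DataFrame({"store": ["a", "b", "a"], "amount": [10.0, 5.0, 2.5]})
    print(total_by_store_pandas(sales))
```

So mixing is possible, but each conversion has a cost, and plain pandas code runs on the driver node only. Being on a Spark cluster does not by itself speed up the pandas blocks.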

Answered By: rchome

pandas runs operations on a single machine, whereas PySpark can run them across multiple machines. If you are working on a machine learning application with large datasets, PySpark is the better fit, and on such data it can be many times (up to 100x) faster than pandas.

PySpark is very efficient for processing large datasets. After preprocessing and data exploration, though, you can convert the Spark DataFrame to a pandas DataFrame and train machine learning models with sklearn.
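
A minimal sketch of that hand-off, assuming a hypothetical Parquet dataset with numeric columns `x` and `y`. The path, the column names, the function names, and the choice of `LinearRegression` are illustrative assumptions, not part of the original answer. The key call is `toPandas()`, which requires the reduced data to fit in the driver's memory.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_on_pandas(pdf: pd.DataFrame) -> LinearRegression:
    """Train a scikit-learn model on an in-memory pandas DataFrame."""
    model = LinearRegression()
    model.fit(pdf[["x"]], pdf["y"])
    return model


def preprocess_with_spark_then_fit(path: str) -> LinearRegression:
    """Clean the data with Spark, then hand the reduced data to scikit-learn."""
    # Imported lazily so fit_on_pandas is usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.read.parquet(path)  # hypothetical input location
    cleaned = sdf.dropna(subset=["x", "y"]).select("x", "y")
    return fit_on_pandas(cleaned.toPandas())  # must fit in driver memory
```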

Answered By: Raha Moosavi

Let’s compare apples with apples, please: pandas is not an alternative to PySpark, as pandas cannot do distributed or out-of-core computation. What you can pit Spark against is Dask on Ray Core (see docs), and you don’t even have to learn a different API as you would with Spark, since Dask is intended to be a distributed drop-in replacement for pandas and NumPy (and so is Dask-ML for popular ML packages such as scikit-learn and XGBoost).
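
To illustrate how close the two APIs are, here is a minimal sketch, assuming a hypothetical table with `store` and `amount` columns. It uses Dask's default local scheduler rather than Ray, and the function names and partition count are illustrative assumptions. The main visible differences are the partitioning step and the explicit `.compute()` that triggers the lazy computation.

```python
import pandas as pd


def mean_by_store_pandas(pdf: pd.DataFrame) -> pd.Series:
    """Average `amount` per `store` with plain pandas."""
    return pdf.groupby("store")["amount"].mean()


def mean_by_store_dask(pdf: pd.DataFrame, npartitions: int = 2) -> pd.Series:
    """Same aggregation with Dask, split across partitions."""
    # Imported lazily so the pandas version works without Dask installed.
    import dask.dataframe as dd

    ddf = dd.from_pandas(pdf, npartitions=npartitions)
    lazy_result = ddf.groupby("store")["amount"].mean()  # builds a task graph
    return lazy_result.compute()  # runs the graph, returns a pandas Series
```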

Answered By: mirekphd