Dealing with very small static tables in pySpark

Question:

I am currently using Databricks to process data coming from our Azure Data Lake. The majority of the data is read into pySpark dataframes and consists of relatively big datasets. However, I do have to perform some joins against smaller static tables to fetch additional attributes.

Currently, the only way I can do this is by converting those smaller static tables into pySpark dataframes as well. I’m just curious whether using such a small table as a pySpark dataframe is bad practice. I know pySpark is meant for large datasets that need to be distributed, but given that my large dataset is in a pySpark dataframe, I assumed I would have to convert the smaller static table into a pySpark dataframe as well in order to make the appropriate joins.
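Roughly, what I am doing today looks like the sketch below (the path, table contents and column names are purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Large dataset read from the data lake
large_df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/events/")

# Small static lookup table, converted into a pySpark dataframe as well
lookup_df = spark.createDataFrame(
    [("A", "Attribute 1"), ("B", "Attribute 2")],
    ["code", "attribute"],
)

# Join to pick up the extra attributes
enriched_df = large_df.join(lookup_df, on="code")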

Any tips on best practices for joining with very small datasets would be appreciated. Maybe I am overcomplicating something that isn’t even a big deal, but I was curious. Thanks in advance!

Asked By: Boris D'Souza


Answers:

Take a look at Broadcast joins. Wonderfully explained here https://mungingdata.com/apache-spark/broadcast-joins/
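In pySpark the hint is the broadcast() function from pyspark.sql.functions; Spark will also broadcast automatically when a table is smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default). A rough sketch, reusing illustrative dataframe and column names (spark is the SparkSession, available by default on Databricks):

from pyspark.sql.functions import broadcast

# Explicitly hint that lookup_df is small enough to ship to every executor
enriched_df = large_df.join(broadcast(lookup_df), on="code")

# Check the physical plan; it should contain a BroadcastHashJoin
enriched_df.explain()

# Or adjust the size (in bytes) below which Spark broadcasts automatically
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)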

Answered By: bastian

The best practice in your case is to broadcast your small df and join the broadcasted df to your large df, like the code below (in pySpark the broadcast hint comes from pyspark.sql.functions):

from pyspark.sql.functions import broadcast

# "key" stands in for your actual join column
joinedDF = largeDF.join(broadcast(smallDF), on="key")
Answered By: Hossein Torabi