PySpark: how often should I create a new Spark session?

Question:

I have a pipeline written as a class with several methods. Each method processes some data. Example:

class Pipeline:

    def load_users(self):
        pass

    def load_sessions(self):
        pass

Should I initialize a new Spark session in every method, each with a custom config? Or is it better to initialize it once in the __init__ method?

Asked By: Slavka

Answers:

You can create the session once up front and change Spark properties as you go through your various actions and pipelines, using spark.conf.set("prop", "val"). That is what most people do, and there are few examples to be found to the contrary.
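A minimal sketch of that approach, applied to the Pipeline class from the question. The app name, the optional spark constructor argument, and the choice of spark.sql.shuffle.partitions with these values are assumptions for illustration, not part of the original question. Keep in mind that spark.conf.set only takes effect for properties that can be changed at runtime (mostly spark.sql.* settings); resource settings such as spark.executor.memory are fixed when the application starts.

```python
class Pipeline:
    def __init__(self, spark=None):
        # Build the session once and share it across all methods.
        # Passing an existing session in is optional (handy for tests).
        self.spark = spark if spark is not None else self._build_session()

    @staticmethod
    def _build_session():
        # Imported lazily so the class can be defined without pyspark installed.
        from pyspark.sql import SparkSession

        return (
            SparkSession.builder
            .appName("pipeline")  # hypothetical app name
            .getOrCreate()        # reuses an active session if one exists
        )

    def load_users(self):
        # Adjust a runtime SQL property for this step only (illustrative value).
        self.spark.conf.set("spark.sql.shuffle.partitions", "50")
        # ... read and process user data with self.spark here ...

    def load_sessions(self):
        # A different setting for a heavier step (illustrative value).
        self.spark.conf.set("spark.sql.shuffle.partitions", "200")
        # ... read and process session data with self.spark here ...
```

Since getOrCreate() returns the already active session when there is one, calling it inside every method would usually hand back the same session anyway, so creating it once in __init__ is the simpler option.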

If you want more insight, see "How many SparkSessions can a single application have?", answered by the master himself. It raises some points you could consider in relation to your question. The real question is whether you need to consider this at all.

Answered By: thebluephantom