PySpark executing queries from different processes

Question:

Is there any way to have two separate processes executing queries on Spark? Something like:

from multiprocessing import Process

from pyspark.sql import SparkSession


def process_1():
    spark_context = SparkSession.builder.getOrCreate()
    data1 = spark_context.sql("SELECT * FROM table1").toPandas()
    do_processing(data1)


def process_2():
    spark_context = SparkSession.builder.getOrCreate()
    data2 = spark_context.sql("SELECT * FROM table2").toPandas()
    do_processing(data2)


p1 = Process(target=process_1)
p1.start()
p2 = Process(target=process_2)
p2.start()

p1.join()
p2.join()

The problem is: how do I create a separate SparkContext for each process, or how do I pass a single context between the processes?

Asked By: Pawel


Answers:

PySpark holds its SparkContext as a singleton. The Spark documentation states:

Only one SparkContext should be active per JVM. You must stop()
the active SparkContext before creating a new one.

SparkContext instance is not supported to share across multiple
processes out of the box, and PySpark does not guarantee
multi-processing execution. Use threads instead for concurrent
processing purpose.
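A minimal sketch of the thread-based approach, sharing one SparkSession across worker threads; the table names (table1, table2) and the use of pandas DataFrames downstream are assumptions taken from your example:

from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

# One SparkSession (and hence one SparkContext) shared by all threads.
spark = SparkSession.builder.getOrCreate()


def run_query(query):
    # Each thread submits its own Spark job; Spark schedules them concurrently.
    return spark.sql(query).toPandas()


with ThreadPoolExecutor(max_workers=2) as executor:
    future1 = executor.submit(run_query, "SELECT * FROM table1")
    future2 = executor.submit(run_query, "SELECT * FROM table2")
    data1 = future1.result()
    data2 = future2.result()

Because the heavy lifting happens on the Spark executors (in the JVM), Python's GIL is not the bottleneck here, so threads are usually sufficient for running queries in parallel.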

As for the "out of memory" problem in your code: it is likely caused by DF.toPandas(), which collects the entire result set onto the driver and significantly increases memory usage.
Consider writing the loaded data to Parquet files instead and doing further computation with pyarrow.
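A hedged sketch of that approach; the output path and table name below are placeholders, not part of your setup:

import pyarrow.parquet as pq
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write the query result to Parquet on the executors instead of
# collecting everything onto the driver with toPandas().
spark.sql("SELECT * FROM table1") \
    .write.mode("overwrite") \
    .parquet("/tmp/table1_parquet")

# Read the Parquet directory back with pyarrow; convert to pandas
# only if the result actually fits in driver memory.
table = pq.read_table("/tmp/table1_parquet")
df = table.to_pandas()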

Answered By: RomanPerekhrest