PySpark executing queries from different processes
Question:
Is there any way to have two separate processes executing queries on Spark? Something like:
from multiprocessing import Process

from pyspark.sql import SparkSession


def process_1():
    spark_context = SparkSession.builder.getOrCreate()
    data1 = spark_context.sql("SELECT * FROM table_1").toPandas()
    do_processing(data1)


def process_2():
    spark_context = SparkSession.builder.getOrCreate()
    data2 = spark_context.sql("SELECT * FROM table_2").toPandas()
    do_processing(data2)


p1 = Process(target=process_1)
p1.start()
p2 = Process(target=process_2)
p2.start()
p1.join()
p2.join()
The problem is: how do I create a separate SparkContext for each process, or how do I pass one context between processes?
Answers:
PySpark holds its SparkContext as a singleton object. Only one SparkContext should be active per JVM; you must stop() the active SparkContext before creating a new one. A SparkContext instance cannot be shared across multiple processes out of the box, and PySpark does not guarantee multi-processing execution. Use threads instead for concurrent processing purposes.
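A minimal sketch of the thread-based approach, assuming the same placeholder tables (table_1, table_2) and do_processing function from the question. Both threads share the single SparkSession (and its SparkContext) of the driver process, and Spark schedules the two jobs concurrently:

from threading import Thread

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def run_query(query):
    # Threads share the one SparkSession/SparkContext of this driver process.
    data = spark.sql(query).toPandas()
    do_processing(data)  # placeholder from the question


t1 = Thread(target=run_query, args=("SELECT * FROM table_1",))
t2 = Thread(target=run_query, args=("SELECT * FROM table_2",))
t1.start()
t2.start()
t1.join()
t2.join()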
As for your "out of memory" problem (in your code): it is likely caused by DF.toPandas(), which collects the entire result to the driver as a pandas DataFrame and therefore significantly increases memory usage. Consider writing the loaded data to Parquet files instead and optimizing the computations with pyarrow.
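A hedged sketch of that suggestion, assuming the placeholder table_1 table and a writable /tmp/table_1 path: Spark writes the query result as Parquet once, and pyarrow reads it back without routing the whole result through toPandas() on the driver:

import pyarrow.parquet as pq
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Persist the query result as Parquet instead of collecting it with toPandas().
spark.sql("SELECT * FROM table_1").write.mode("overwrite").parquet("/tmp/table_1")

# Read the Parquet files back with pyarrow; select only the needed columns to
# keep memory usage down, and convert to pandas only when actually required.
table = pq.read_table("/tmp/table_1")
data1 = table.to_pandas()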