How to use new Spark Context

Question:

I am currently running a Jupyter notebook on GCP Dataproc and hoping to increase the memory available to Spark via my config:

I first stopped my spark context:

import pyspark

sc = spark.sparkContext
sc.stop()

I waited before running the next code block so that sc.stop() could finish:

conf = pyspark.SparkConf().setAll([('spark.driver.maxResultSize','8g')])
sc = pyspark.SparkContext(conf=conf)

However, when I run data = spark.read.parquet('link to data bucket'), it raises a

Py4JJavaError: An error occurred while calling o152.parquet.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
...

The currently active SparkContext was created at:
...

The line above runs fine if I use the SparkContext originally provided when starting up a new PySpark notebook. The error implies that even though I created a new SparkContext, whenever I call methods via spark it is still pointing to the old, stopped context. How would I go about using the new SparkContext I created?

Asked By: Curl


Answers:

You’ve created a SparkContext, not a new SparkSession.

You will need to use spark = SparkSession.builder.config(key, value).getOrCreate() after stopping the context.
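
A minimal sketch of that pattern (assuming the bucket path is filled in and that no other non-default settings need to be carried over):

from pyspark.sql import SparkSession

# Stop the old session/context first (harmless if sc.stop() was already called)
spark.stop()

# getOrCreate() builds a fresh session (and SparkContext) once the old one is stopped,
# applying the new setting
spark = (SparkSession.builder
         .config('spark.driver.maxResultSize', '8g')
         .getOrCreate())

# Reads now go through the new session's context
data = spark.read.parquet('link to data bucket')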

Alternatively (and recommended), you should also be able to set PYSPARK_SUBMIT_ARGS='-c spark.driver.maxResultSize=8g' in the notebook’s environment variables, which should accomplish a similar goal.
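
A hedged sketch of that approach (the exact form is an assumption; on many setups the variable must be set before pyspark is first imported, often with a trailing pyspark-shell token, and --conf is the long form of -c):

import os

# Assumption: set this before any SparkContext/SparkSession is created in the notebook
os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.driver.maxResultSize=8g pyspark-shell'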

Aside: 8g for the notebook driver is a bit excessive. Perhaps you meant to change the executor memory? The DataFrame from your Parquet read would be distributed across executors anyway, so I still don’t think you’ll need that much on the driver.
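
If executor memory was the intent, the same builder pattern sketched above would apply; spark.executor.memory is the relevant setting, it only takes effect when a new context is created, and it is still bounded by what the Dataproc/YARN containers allow:

spark = (SparkSession.builder
         .config('spark.executor.memory', '8g')
         .getOrCreate())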

Answered By: OneCricketeer