Python version different in worker and driver

Question:

The question I am trying to answer is:

Create an RDD

Use map to create an RDD of NumPy arrays from the specified columns. The RDD should be named Rows.

My code: Rows = df.select(col).rdd.map(make_array)

When I run this, I get a strange error that basically says: Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.


I know I am working in a Python 3.6 environment. I am not sure whether this specific line of code is what triggers the error. What do you think?

Just to note, this isn't my first line of code in this Jupyter notebook.
If you need more information, please let me know and I will provide it. I can't understand why this is happening.

Asked By: Learning Everyday


Answers:

Your workers and your driver are not using the same version of Python, which triggers this error any time you use Spark.
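
You can see exactly which version each side runs with a trivial job. A minimal sketch, assuming an existing SparkContext named sc, as in a typical PySpark notebook:

import sys

# Version the driver is running
print("driver:", sys.version_info[:2])

def worker_version(_):
    import sys  # imported on the executor, not the driver
    return sys.version_info[:2]

# Version the workers are running: ship a tiny task to an executor
print("worker:", sc.parallelize([0], 1).map(worker_version).first())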

Make sure Python 3.6 is installed on your workers, then (on Linux) edit your spark/conf/spark-env.sh file to add PYSPARK_PYTHON=/usr/local/bin/python3.6 (adjust this to wherever the Python 3.6 executable lives on your workers). Note that PYSPARK_PYTHON must point to the interpreter executable itself, not to a library directory such as /usr/local/lib/python3.6.
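
After updating spark-env.sh and restarting the workers, you can confirm which interpreter the executors actually launch. A minimal sketch, again assuming a SparkContext named sc:

def worker_interpreter(_):
    import sys
    return sys.executable  # path of the Python binary the executor started

print(sc.parallelize([0], 1).map(worker_interpreter).first())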

Answered By: Pierre Gourseaud

In a recent notebook, I had to add these lines at the beginning to sync the Python versions:

import os
import sys

# Point both the workers and the driver at the notebook's own interpreter
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
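
Note that this only helps if it runs before the SparkContext is created, because the variables are read when Spark launches its Python processes. A minimal end-to-end sketch, assuming local mode (where driver and workers share the notebook's machine, so sys.executable is valid on both sides):

import os
import sys

# Must be set before any SparkContext / SparkSession exists
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

On a real cluster, sys.executable is only correct if the same path exists on every worker; otherwise point PYSPARK_PYTHON at an interpreter path that does.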
Answered By: rfs