environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON

Question:

I installed pyspark recently, and it installed correctly. But when I run the following simple program in Python, I get an error.

>>> from pyspark import SparkContext
>>> sc = SparkContext()
>>> data = range(1,1000)
>>> rdd = sc.parallelize(data)
>>> rdd.collect()

While running the last line, I get an error whose key line seems to be:

[Stage 0:>                                                          (0 + 0) / 4]18/01/15 14:36:32 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I have the following variables in .bashrc

export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python3

I am using Python 3.

Asked By: Akash Kumar


Answers:

You should set the following environment variables in $SPARK_HOME/conf/spark-env.sh:

export PYSPARK_PYTHON=/usr/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/bin/python

If spark-env.sh doesn’t exist, you can rename spark-env.sh.template to spark-env.sh.
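
If you want to confirm the fix took effect, a quick comparison like the sketch below checks that the driver and the workers now report the same Python version (the local master and app name are illustrative assumptions, not part of this answer; adjust the exports above to whichever interpreter you actually want):

import sys
from pyspark import SparkContext

# Compare the Python version seen by the driver with the one reported by a worker task
sc = SparkContext("local[2]", "version-check")
driver_version = "%d.%d" % sys.version_info[:2]
worker_version = sc.parallelize([0], 1).map(
    lambda _: "%d.%d" % __import__("sys").version_info[:2]
).collect()[0]
print("driver:", driver_version, "worker:", worker_version)  # these should match
sc.stop()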

Answered By: Alex

By the way, if you use PyCharm, you can add PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the run/debug configurations, as in the image below.
(screenshot: PyCharm run/debug configuration with PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON set as environment variables)

Answered By: buxizhizhoum

I got the same issue, and I set both variables in .bash_profile:

export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3

But my problem was still there.

Then I found out that the problem was my default Python version: typing python --version showed Python 2.7.

So I solved the problem by following the page below:
How to set Python's default version to 3.x on OS X?

Answered By: Ruxi Zhang

I tried two methods for this question; the method in the picture works.

Add the environment variables:

PYSPARK_PYTHON=/usr/local/bin/python3.7;PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.7;PYTHONUNBUFFERED=1

Answered By: Eric Cheng

Apache-Spark 2.4.3 on Archlinux

I’ve just installed Apache-Spark-2.3.4 from the Apache-Spark website. I’m using the Arch Linux distribution; it’s a simple and lightweight distribution. I installed it and put the apache-spark directory at /opt/apache-spark/, and now it’s time to export our environment variables. Remember, I’m using Arch Linux, so keep in mind to use your own $JAVA_HOME, for example.

Exporting the environment variables

echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/jre' >> /home/user/.bashrc
echo 'export SPARK_HOME=/opt/apache-spark'  >> /home/user/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH'  >> /home/user/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH'  >> /home/user/.bashrc
source ../.bashrc 

Testing

emanuel@hinton ~ $ echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/jre' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export SPARK_HOME=/opt/apache-spark'  >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH'  >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH'  >> /home/emanuel/.bashrc
emanuel@hinton ~ $ source .bashrc 
emanuel@hinton ~ $ python
Python 3.7.3 (default, Jun 24 2019, 04:54:02) 
[GCC 9.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> 

Everything works fine once you have correctly exported the environment variables for SparkContext.
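
Beyond importing pyspark, a tiny job is a handy sanity check that the context can actually schedule tasks. This is just a sketch with an assumed local master, not part of the original setup:

from pyspark import SparkContext

# Run a trivial job on a local master to confirm the workers start correctly
sc = SparkContext("local[2]", "smoke-test")
print(sc.parallelize(range(10)).sum())  # expect 45
sc.stop()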

Using Apache-Spark on Arch Linux via a Docker image

For my purposes, I’ve created a Docker image with Python, jupyter-notebook, and apache-spark-2.3.4.

Running the image:

docker run -ti -p 8888:8888 emanuelfontelles/spark-jupyter

Then just go to your browser and type:

http://localhost:8888/tree

You will be prompted with an authentication page; go back to the terminal, copy the token, and voilà, you will have an Arch Linux container running an Apache-Spark distribution.

Answered By: Emanuel Fontelles

Just run the code below at the very beginning of your code. I am using Python 3.7. You might need to run locate python3.7 to get your Python path.

import os
os.environ["PYSPARK_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7"

Answered By: James Chang

I’m using Jupyter Notebook to study PySpark, and this is what worked for me.
Find where python3 is installed by running this in a terminal:

which python3

Here it points to /usr/bin/python3.
Now, at the beginning of the notebook (or .py script), do:

import os

# Set spark environments
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'

Restart your notebook session and it should work!

Answered By: igorkf

This may also happen if you’re working within a virtual environment. In that case, it may be harder to retrieve the correct path to the Python executable (and in any case, I think it’s not a good idea to hardcode the path if you want to share the code with others).

If you run the following lines at the beginning of your script/notebook (at least before you create the SparkSession/SparkContext), the problem is solved:

import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

The os module lets you set environment variables; sys.executable gives the string with the absolute path of the Python interpreter binary.
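
For example, in a notebook the whole pattern might look like the sketch below; the local master and app name are assumptions for illustration, and the key point is simply that the environment variables are set before the session is created:

import os
import sys

# Point both the driver and the workers at the interpreter running this notebook/script
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
print(spark.sparkContext.parallelize([1, 2, 3]).collect())
spark.stop()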

Answered By: Davide Frison

If you are using PyCharm, go to Run -> Edit Configurations and click on Environment variables to add them as below (basically, PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON should point to the same version of Python). This solution worked for me. Thanks to the above posts.
(screenshot: PyCharm run/debug configuration with PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables)

Answered By: RaHuL VeNuGoPaL

To make it easier to see: instead of having to set a specific path like /usr/bin/python3, you can do this.

I put these lines in my ~/.zshrc:

export PYSPARK_PYTHON=python3.8
export PYSPARK_DRIVER_PYTHON=python3.8

When I type python3.8 in my terminal, Python 3.8 starts. I think it’s because I installed pipenv.
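
If you want to check what that bare interpreter name resolves to before handing it to Spark, the standard library can do roughly the same PATH lookup the shell performs (a sketch; python3.8 stands in for whatever name you exported):

import shutil

# Resolve the interpreter name against PATH, similar to what the shell does
print(shutil.which("python3.8"))  # e.g. /usr/local/bin/python3.8, or None if not found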

Another good website to reference for getting your SPARK_HOME set up is https://towardsdatascience.com/how-to-use-pyspark-on-your-computer-9c7180075617
(for permission-denied issues, use sudo mv).

Answered By: S.Doe_Dude

1. Download and install Java (JRE).
2. Then there are two options; you can choose one of the following solutions:

## -------- Temporary Solution -------- ##
Just put the paths into the following code in your Jupyter notebook and RUN IT EVERY TIME:

import os

os.environ["PYSPARK_PYTHON"] = r"C:UsersLAPTOP0534miniconda3envspyspark_v3.3.0"

os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:UsersLAPTOP0534miniconda3envspyspark_v3.3.0"

os.environ["JAVA_HOME"] = r"C:Program FilesJavajre1.8.0_333"  

---- OR ----

## -------- Permanent Solution -------- ##
Set the above 3 variables in your Environment Variables.

(screenshot: Windows Environment Variables dialog)

I have gone through many answers, but nothing worked for me. Both of these resolutions, however, worked for me and resolved my error.
Thanks

Answered By: Shubham Tomar

import os
import sys

# Point Spark at the local Java/Spark installs and at this interpreter
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-19"
os.environ["SPARK_HOME"] = r"C:\Program Files\Spark\spark-3.3.1-bin-hadoop2"
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

This worked for me in a Jupyter notebook, as the os library makes it easy to set the environment variables. Make sure to run this cell before creating the SparkSession.

Answered By: Ajay krishna