environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
Question:
I installed pyspark recently and it installed correctly. But when I run the following simple program in Python, I get an error.
>>> from pyspark import SparkContext
>>> sc = SparkContext()
>>> data = range(1,1000)
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
While running the last line I get an error whose key line seems to be:
[Stage 0:> (0 + 0) / 4]18/01/15 14:36:32 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I have the following variables in .bashrc
export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python3
I am using Python 3.
Answers:
You should set the following environment variables in $SPARK_HOME/conf/spark-env.sh:
export PYSPARK_PYTHON=/usr/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/bin/python
If spark-env.sh doesn’t exist, you can rename spark-env.sh.template to spark-env.sh.
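Once those variables point at the same interpreter, a quick way to confirm that the driver and the workers really agree is to compare versions from inside a job. This is only a minimal sketch (it assumes pyspark is importable and that a local SparkContext can be created); if the versions still differ, the parallelize/map step will raise the same exception as in the question.
import sys
from pyspark import SparkContext

def worker_python_version(_):
    import sys  # this import runs on the executor, not the driver
    return "%d.%d" % sys.version_info[:2]

sc = SparkContext()
print("driver:", "%d.%d" % sys.version_info[:2])
print("worker:", sc.parallelize([0], 1).map(worker_python_version).first())
sc.stop()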
I got the same issue, and I set both variables in .bash_profile:
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
But my problem was still there.
Then I found out, by typing python --version, that my default Python version was 2.7.
So I solved the problem by following the page below:
How to set Python's default version to 3.x on OS X?
I tried two of the methods from that question, and one of them worked.
PYSPARK_PYTHON=/usr/local/bin/python3.7;PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.7;PYTHONUNBUFFERED=1
Apache-Spark 2.4.3 on Archlinux
I’ve just installed Apache-Spark-2.3.4 from the Apache-Spark website. I’m using the Archlinux distribution, which is simple and lightweight. I installed and put the apache-spark directory in /opt/apache-spark/, and now it’s time to export our environment variables. Remember, I’m using Archlinux, so keep in mind to use your own $JAVA_HOME, for example.
Exporting the environment variables
echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/jre' >> /home/user/.bashrc
echo 'export SPARK_HOME=/opt/apache-spark' >> /home/user/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH' >> /home/user/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH' >> /home/user/.bashrc
source ~/.bashrc
Testing
emanuel@hinton ~ $ echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/jre' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export SPARK_HOME=/opt/apache-spark' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ source .bashrc
emanuel@hinton ~ $ python
Python 3.7.3 (default, Jun 24 2019, 04:54:02)
[GCC 9.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>>
Everything works fine once you have correctly exported the environment variables that SparkContext needs.
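If you want to double-check that the pyspark you import is really the copy under /opt/apache-spark (and not, say, a pip-installed one), a small sketch like the following can help; the expected locations are assumptions based on the layout above.
import os
import pyspark

# With the exports above, SPARK_HOME should be /opt/apache-spark and
# pyspark should be loaded from somewhere under $SPARK_HOME/python/.
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
print("pyspark loaded from:", pyspark.__file__)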
Using Apache-Spark on Archlinux via DockerImage
For my own purposes I’ve created a Docker image with python, jupyter-notebook and apache-spark-2.3.4. Running the image:
docker run -ti -p 8888:8888 emanuelfontelles/spark-jupyter
Then just go to your browser, open http://localhost:8888/tree, and you will be prompted with an authentication page. Go back to the terminal, copy the token, and voilà: you will have an Archlinux container running an Apache-Spark distribution.
Just run the code below at the very beginning of your code. I am using Python 3.7. You might need to run locate python3.7 to get your Python path.
import os
os.environ["PYSPARK_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7"
I’m using Jupyter Notebook to study PySpark, and that’s what worked for me.
Find where python3 is installed by running, in a terminal:
which python3
Here it points to /usr/bin/python3.
Now, at the beginning of the notebook (or .py script), do:
import os
# Set spark environments
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'
Restart your notebook session and it should work!
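If you’d rather not hardcode /usr/bin/python3 in the notebook, one option (not part of the answer above, just a sketch) is to let Python resolve the same path that which python3 would find:
import os
import shutil
import sys

# Resolve python3 from PATH; fall back to the current interpreter if it's not found.
python3_path = shutil.which("python3") or sys.executable

os.environ["PYSPARK_PYTHON"] = python3_path
os.environ["PYSPARK_DRIVER_PYTHON"] = python3_path
print("Using", python3_path)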
This may also happen if you’re working within a virtual environment. In that case it may be harder to retrieve the correct path to the Python executable (and anyway I think it’s not a good idea to hardcode the path if you want to share it with others).
If you run the following lines at the beginning of your script/notebook (at least before you create the SparkSession/SparkContext) the problem is solved:
import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
The os module lets you set environment variables; sys.executable is the string with the absolute path of the Python interpreter binary that is currently running.
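Putting it together with the question’s original snippet, a minimal end-to-end sketch (assuming pyspark is installed and a local SparkContext works) looks like this:
import os
import sys
from pyspark import SparkContext

# Point both the driver and the workers at the interpreter running this script.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

sc = SparkContext()
rdd = sc.parallelize(range(1, 1000))
print(rdd.collect()[:10])  # no version-mismatch error now
sc.stop()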
To make it easier for people to see: instead of having to set a specific path like /usr/bin/python3, you can do this. I put these lines in my ~/.zshrc:
export PYSPARK_PYTHON=python3.8
export PYSPARK_DRIVER_PYTHON=python3.8
When I type python3.8 in my terminal, Python 3.8 starts. I think it’s because I installed pipenv.
Another good website to reference for setting up your SPARK_HOME is https://towardsdatascience.com/how-to-use-pyspark-on-your-computer-9c7180075617 (for permission-denied issues, use sudo mv).
1. Download and install Java (JRE).
2. You have two options; choose one of the following solutions:
## ——– Temporary Solution ——– ##
Just set the paths in your Jupyter notebook with the following code and run it every time:
import os
os.environ["PYSPARK_PYTHON"] = r"C:UsersLAPTOP0534miniconda3envspyspark_v3.3.0"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:UsersLAPTOP0534miniconda3envspyspark_v3.3.0"
os.environ["JAVA_HOME"] = r"C:Program FilesJavajre1.8.0_333"
—-OR—-
## ——– Permanent Solution ——– ##
Set the above 3 variables in your system environment variables.
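Whichever option you choose, creating the session in the next notebook cell might then look roughly like the sketch below. The python.exe path is a hypothetical conda-env location (not taken from the answer above), so adjust all three paths to your machine.
import os

# Hypothetical paths: point PYSPARK_PYTHON at the python.exe inside your env
# and JAVA_HOME at your actual JRE/JDK install.
os.environ["PYSPARK_PYTHON"] = r"C:\Users\<you>\miniconda3\envs\pyspark_v3.3.0\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\<you>\miniconda3\envs\pyspark_v3.3.0\python.exe"
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jre1.8.0_333"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.range(5).count())  # quick smoke test
spark.stop()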
I had gone through many answers and nothing worked for me, but both of these resolutions did, and this resolved my error. Thanks.
import os
os.environ["JAVA_HOME"] = "C:Program FilesJavajdk-19"
os.environ["SPARK_HOME"] = "C:Program FilesSparkspark-3.3.1-bin-hadoop2"
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
This worked for me in a Jupyter notebook, as the os library makes it easy to set up the environment variables. Make sure to run this cell before creating the SparkSession.