Importing pyspark in the Python shell
Question:
This is a copy of someone else’s question on another forum that was never answered, so I thought I’d re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736)
I have Spark installed properly on my machine and am able to run python programs with the pyspark modules without error when using ./bin/pyspark as my python interpreter.
However, when I run the regular Python shell and try to import pyspark modules, I get this error:
from pyspark import SparkContext
and it says
"No module named pyspark".
How can I fix this? Is there an environment variable I need to set to point Python to the pyspark headers/libraries/etc.? If my spark installation is /spark/, which pyspark paths do I need to include? Or can pyspark programs only be run from the pyspark interpreter?
Answers:
It turns out that the pyspark binary launches Python and automatically loads the correct library paths. Check out $SPARK_HOME/bin/pyspark:
export SPARK_HOME=/some/path/to/apache-spark
# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
I added these lines to my .bashrc file and the modules are now correctly found!
If you see an error like:
ImportError: No module named py4j.java_gateway
Please add $SPARK_HOME/python/build to PYTHONPATH:
export SPARK_HOME=/Users/pzhang/apps/spark-1.1.0-bin-hadoop2.4
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
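Once those exports are in place (and your shell has re-sourced .bashrc), a quick way to confirm that the plain Python shell now sees the module is to print where the import resolves from:
import pyspark
print(pyspark.__file__)  # should point under $SPARK_HOME/python/pyspark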
On Mac, I use Homebrew to install Spark (formula “apache-spark”). Then, I set the PYTHONPATH this way so the Python import works:
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.2.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
Replace “1.2.0” with the actual apache-spark version on your Mac.
Don’t run your .py file as python filename.py; instead use: spark-submit filename.py
Source: https://spark.apache.org/docs/latest/submitting-applications.html
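For reference, here is a minimal sketch of what such a filename.py might look like (the app name and numbers are just placeholders):
# filename.py -- run with: spark-submit filename.py
from pyspark import SparkContext

sc = SparkContext(appName="SubmitExample")
print(sc.parallelize(range(100)).sum())  # 4950
sc.stop()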
Exporting the Spark path and the Py4j path made it work:
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
So, if you don’t want to type these every time you fire up the Python shell, you might want to add them to your .bashrc file.
Assuming one of the following:
- Spark is downloaded on your system and you have an environment variable SPARK_HOME pointing to it
- You have run pip install pyspark
Here is a simple method (if you don’t care about how it works):
Use findspark
- Go to your Python shell (after running pip install findspark):
import findspark
findspark.init()
- Import the necessary modules:
from pyspark import SparkContext
from pyspark import SparkConf
- Done! (A complete session is sketched below.)
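Putting those steps together, a minimal session might look like this (the master setting and app name are illustrative placeholders):
import findspark
findspark.init()  # locates Spark via SPARK_HOME, or pass the path explicitly

from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf().setAppName("findspark-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)
print(sc.version)
sc.stop()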
I got this error because the Python script I was trying to submit was called pyspark.py (facepalm). The fix was to set my PYTHONPATH as recommended above, rename the script to pyspark_test.py, and clean up the pyspark.pyc that had been created from my script’s original name; that cleared the error up.
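A quick way to spot this kind of name shadowing is to ask Python where the import actually resolves:
import importlib.util

spec = importlib.util.find_spec("pyspark")
# A path ending in your own pyspark.py (or .pyc) means a local file is
# shadowing the real package.
print(spec.origin if spec else "pyspark not found")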
In the case of DSE (DataStax Cassandra & Spark), the following location needs to be added to PYTHONPATH:
export PYTHONPATH=/usr/share/dse/resources/spark/python:$PYTHONPATH
Then use dse pyspark to get the modules on the path:
dse pyspark
I had this same problem and would add one thing to the proposed solutions above. When using Homebrew on Mac OS X to install Spark, you will need to correct the py4j path to include libexec (remembering to change the py4j version to the one you have):
PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.9-src.zip:$PYTHONPATH
To get rid of ImportError: No module named py4j.java_gateway, you need to add the following lines:
import os
import sys

# Windows paths to the Spark installation (adjust to your own location).
os.environ['SPARK_HOME'] = "D:\\python\\spark-1.4.1-bin-hadoop2.4"
sys.path.append("D:\\python\\spark-1.4.1-bin-hadoop2.4\\python")
sys.path.append("D:\\python\\spark-1.4.1-bin-hadoop2.4\\python\\lib\\py4j-0.8.2.1-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("success")
except ImportError as e:
    print("error importing spark modules", e)
    sys.exit(1)
On Windows 10 the following worked for me. I added the following environment variables using Settings > Edit environment variables for your account:
SPARK_HOME=C:\Programming\spark-2.0.1-bin-hadoop2.7
PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
(change “C:\Programming\…” to the folder in which you have installed Spark)
For Linux users, the following is the correct (and non-hard-coded) way of including the pyspark library in PYTHONPATH. Both path parts are necessary:
- The path to the pyspark Python module itself, and
- The path to the zipped library that the pyspark module relies on when imported
Notice below that the zipped library version is determined dynamically, so we do not hard-code it.
export PYTHONPATH=${SPARK_HOME}/python/:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}
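The same version-agnostic lookup can be done from inside Python, for sessions where editing the shell profile isn’t an option (a sketch, assuming SPARK_HOME is set):
import glob
import os
import sys

spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
# Pick up the bundled py4j zip without hard-coding its version.
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])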
I am running a Spark cluster on a CentOS VM, installed from Cloudera yum packages.
I had to set the following variables to run pyspark.
export SPARK_HOME=/usr/lib/spark;
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=/home/user/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
This is what I did to use my Anaconda distribution with Spark. It is Spark-version independent. You can change the PYSPARK_PYTHON line to point to your own Python binary. Also, as of Spark 2.2.0, PySpark is available as a stand-alone package on PyPI, but I have yet to test it out.
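For what it’s worth, once PySpark is pip-installed, a plain Python interpreter should be able to build a session with no extra path setup (a minimal sketch, consistent with the untested note above):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         .appName("pip-test")
         .getOrCreate())
print(spark.version)
spark.stop()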
I had the same problem.
Also make sure you are using the right Python version and that you install pyspark with the matching pip version. In my case I had both Python 2.7 and 3.x, so I installed pyspark with
pip2.7 install pyspark
and it worked.
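When juggling multiple Python versions, it also helps to check which interpreter (and hence which site-packages) the shell actually runs:
import sys
print(sys.executable)    # e.g. .../python2.7 vs .../python3.x
print(sys.version_info)  # should match the pip you installed with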
For a Spark execution in pyspark, two components are required to work together:
- the pyspark Python package
- a Spark instance in a JVM
When launching things with spark-submit or pyspark, these scripts take care of both: they set up your PYTHONPATH, PATH, etc., so that your script can find pyspark, and they also start the Spark instance, configured according to your parameters, e.g. --master X.
Alternatively, it is possible to bypass these scripts and run your Spark application directly in the Python interpreter, like python myscript.py. This is especially interesting when Spark scripts start to become more complex and eventually receive their own arguments. In that case:
- Ensure the pyspark package can be found by the Python interpreter. As already discussed, either add the spark/python dir to PYTHONPATH or install pyspark directly with pip install.
- Set the parameters of the Spark instance from your script (those that used to be passed to pyspark):
- Spark configurations that you would normally set with --conf are defined with a config object (or string configs) in SparkSession.builder.config.
- Main options (like --master or --driver-memory) can, for the moment, be set by writing to the PYSPARK_SUBMIT_ARGS environment variable. To make things cleaner and safer, you can set it from within Python itself, and Spark will read it when starting.
- Start the instance, which just requires you to call getOrCreate() on the builder object.
Your script can therefore have something like this (spark_main_opts is defined here for illustration):
import os
from pyspark.sql import SparkSession

# Options you would otherwise pass on the command line (may be an empty string).
spark_main_opts = "--master local[4]"

if __name__ == "__main__":
    if spark_main_opts:
        # Set main options; the trailing "pyspark-shell" token is required.
        os.environ['PYSPARK_SUBMIT_ARGS'] = spark_main_opts + " pyspark-shell"

    # Set spark config
    spark = (SparkSession.builder
             .config("spark.checkpoint.compress", True)
             .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11")
             .getOrCreate())
You can also create a Docker container with Alpine as the OS and then install Python and PySpark as packages. That way everything is containerised.
In my case, pyspark was getting installed into a different Python’s dist-packages (Python 3.5) whereas I was using Python 3.6, so the following helped:
python -m pip install pyspark
You can get the pyspark path from within Python using pip (if you have installed pyspark with pip), as below:
pip show pyspark
Run
!pip install pyspark
in a Jupyter notebook or Google Colab. Do not forget to restart the runtime (the Restart Runtime option shown at the top of the Colab notebook).