How to start a standalone cluster using pyspark?
Question:
I am using PySpark under Ubuntu with Python 2.7.
I installed it using
pip install pyspark --user
and I am trying to follow the instructions to set up a Spark cluster.
I can’t find the script start-master.sh.
I assume that it has to do with the fact that I installed pyspark and not regular Spark.
I found here that I can connect a worker node to the master via pyspark, but how do I start the master node with pyspark?
Answers:
https://pypi.python.org/pypi/pyspark
The Python packaging for Spark is not intended to replace all … use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) – but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Well, I did a bit of a mix-up in the OP.
You need to get Spark on the machine that should run as the master. You can download it here.
After extracting it, you have a spark/sbin folder, which contains the start-master.sh script; you need to start it with the -h argument.
Please note that you need to create a spark-env file as explained here and define the Spark local and master variables; this is important on the master machine.
After that, on the worker nodes, use the start-slave.sh script to start the worker processes.
And you are good to go: you can use a Spark context inside Python to use the cluster (a minimal sketch follows below).
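Here is a minimal sketch of such a connection from Python, assuming the master was started with -h on a host reachable as spark-master-host (a placeholder; substitute your own address and port):
from pyspark import SparkConf, SparkContext

# Point the driver at the standalone master started with start-master.sh.
# "spark-master-host" is a placeholder for the address passed via -h.
conf = (
    SparkConf()
    .setAppName("standalone-smoke-test")
    .setMaster("spark://spark-master-host:7077")
)
sc = SparkContext(conf=conf)

# A tiny job to confirm the workers are reachable.
print(sc.parallelize(range(1000)).sum())
sc.stop()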
If you are already using pyspark through a conda / pip installation, there is no need to install Spark and set up environment variables again for the cluster setup.
A conda / pip pyspark installation is only missing the 'conf', 'sbin', 'kubernetes', and 'yarn' folders. You can simply download Spark and move those folders into the folder where pyspark is installed (usually the site-packages folder of your Python environment); a short sketch for locating it follows below.
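If you are not sure where the pyspark package lives, a quick check (assuming pyspark is importable in the current environment) is:
import os
import pyspark

# Prints the installation directory of the pyspark package,
# e.g. .../site-packages/pyspark - the downloaded 'conf', 'sbin',
# 'kubernetes' and 'yarn' folders would go in here.
print(os.path.dirname(pyspark.__file__))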
After you have installed pyspark via pip install pyspark, you can start the Spark standalone cluster master process using this command:
spark-class org.apache.spark.deploy.master.Master -h 127.0.0.1
And then you can add some workers (executors), which will process the jobs:
spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077 -c 4 -m 8G
The -c and -m flags specify the number of CPU cores and the amount of memory provided by the worker.
The 127.0.0.1 loopback address is used here for security reasons (it isn’t good if anyone who copy/pastes these lines ends up exposing an "arbitrary code execution service" on their network). For a distributed standalone Spark cluster setup a different address should be used (e.g., a private IP address on an isolated network reachable only by the cluster nodes and their intended users), and the official Spark security guide should be read.
The spark-class script is contained in the "pyspark" Python package; it is a wrapper that loads the environment variables from spark-env.sh and adds the corresponding Spark jar locations to the -cp flag of the java command.
If you need to configure the environment, consult the official Spark docs, but the setup above also works and may be suitable for regular usage with the default parameters. Also, see the available flags for the master/worker commands by running them with --help.
This is an example of how to connect to this standalone cluster using the pyspark script with an ipython shell:
PYSPARK_DRIVER_PYTHON=ipython pyspark --master spark://127.0.0.1:7077 --num-executors 2 --executor-cores 2 --executor-memory 4G
The code for instantiating a Spark session manually, e.g. in Jupyter:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://127.0.0.1:7077")
    # the number of executors this job needs
    .config("spark.executor.instances", 2)
    # the number of CPU cores this job needs from each executor;
    # they will be reserved on the worker
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4G")
    .getOrCreate()
)
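As a quick sanity check that the session actually reaches the workers (a minimal sketch; the generated range and the aggregation are arbitrary), you could run:
# run a small distributed job, then stop the session
spark.range(1000000).selectExpr("sum(id)").show()
spark.stop()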