How to start a standalone cluster using pyspark?

Question:

I am using pyspark under Ubuntu with Python 2.7.
I installed it using

pip install pyspark --user 

And I am trying to follow the instructions to set up a Spark cluster.

I can’t find the script start-master.sh
I assume that it has to do with the fact that I installed pyspark and not regular Spark.

I found here that I can connect a worker node to the master via pyspark, but how do I start the master node with pyspark?

Asked By: thebeancounter


Answers:

https://pypi.python.org/pypi/pyspark

The Python packaging for Spark is not intended to replace all … use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) – but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.

Answered By: OneCricketeer

Well, I did a bit of a mix-up in the OP.

You need to get Spark onto the machine that should run as the master.
You can download it here.

After extracting it, you have a spark/sbin folder containing the start-master.sh script. You need to start it with the -h argument.
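For example, assuming the archive was extracted into a spark folder and the master machine has the (hypothetical) private address 192.168.1.10:

./spark/sbin/start-master.sh -h 192.168.1.10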

Please note that you need to create a spark-env file, as explained here, and define the Spark local and master variables (SPARK_LOCAL_IP and SPARK_MASTER_HOST); this is important on the master machine.
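A minimal spark-env.sh sketch, assuming the same hypothetical master address (the template at spark/conf/spark-env.sh.template can be copied as a starting point):

# spark/conf/spark-env.sh
SPARK_MASTER_HOST=192.168.1.10
SPARK_LOCAL_IP=192.168.1.10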

After that, use the start-slave.sh script on the worker nodes to start them, as shown below.
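For example, on each worker machine, pointing it at the master started above (in recent Spark releases the same script is also available as start-worker.sh):

./spark/sbin/start-slave.sh spark://192.168.1.10:7077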

And you are good to go; you can use a Spark context inside Python to use the cluster!
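A minimal sketch of that, assuming the hypothetical master address from above:

from pyspark import SparkContext

# connect to the standalone master started above
sc = SparkContext(master="spark://192.168.1.10:7077", appName="test")
# run a trivial job to verify the cluster works
print(sc.parallelize(range(100)).sum())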

Answered By: thebeancounter

If you are already using pyspark through a conda / pip installation, there’s no need to install Spark and set up environment variables again for the cluster setup.

The conda / pip pyspark installation is missing only the 'conf', 'sbin', 'kubernetes', and 'yarn' folders. You can simply download Spark and move those folders into the folder where pyspark is located (usually the site-packages folder inside your Python installation).
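A sketch of how to do that, assuming a downloaded Spark distribution extracted into a spark folder (the paths are illustrative):

python -c "import pyspark; print(pyspark.__path__[0])"   # shows where pyspark lives
cp -r spark/conf spark/sbin spark/kubernetes spark/yarn <path-printed-above>/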

Answered By: Matthew Son

After you have installed pyspark via pip install pyspark, you can start the Spark standalone cluster master process using this command:

spark-class org.apache.spark.deploy.master.Master -h 127.0.0.1

And then you can add some workers (executors), which will process the jobs:

spark-class org.apache.spark.deploy.worker.Worker \
    spark://127.0.0.1:7077 \
    -c 4 -m 8G

The -c and -m flags specify the number of CPU cores and the amount of memory provided by the worker.

The 127.0.0.1 local address is used here for security reasons (it isn’t good if anyone copy/pasting these lines exposed an "arbitrary code execution service" on their network), but for a distributed standalone Spark cluster setup a different address should be used (e.g. a private IP address on an isolated network, available only to the cluster nodes and their intended users), and the official Spark security guide should be read.

The spark-class script is contained in the "pyspark" Python package; it is a wrapper that loads the environment variables from spark-env.sh and adds the corresponding Spark jar locations to the -cp flag of the java command.

If you need to configure the environment, consult the official Spark docs, but the setup also works and may be suitable for regular usage with the default parameters. Also, see the available flags for the master/worker commands using their --help option.
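For example:

spark-class org.apache.spark.deploy.master.Master --help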

This is an example of how to connect to this standalone cluster using the pyspark script with the ipython shell:

PYSPARK_DRIVER_PYTHON=ipython \
    pyspark --master spark://127.0.0.1:7077 \
    --num-executors 2 \
    --executor-cores 2 \
    --executor-memory 4G

The code for instantiating a Spark session manually, e.g. in Jupyter:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://127.0.0.1:7077")
    # the number of executors this job needs
    .config("spark.executor.instances", 2)
    # the number of CPU cores this job needs from each executor;
    # they will be reserved on the worker
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4G")
    .getOrCreate()
)
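
To quickly check that the session is actually talking to the cluster, any trivial job will do, for example:

# run a small job across the executors
spark.range(1000).selectExpr("sum(id)").show()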
Answered By: ei-grad