How to include external Spark library while using PySpark in Jupyter notebook
Question:
I am trying to run the following PySpark-Kafka streaming example in a Jupyter Notebook. Here is the first part of the code I am using in my notebook:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext(master='local[*]', appName="PySpark streaming")
ssc = StreamingContext(sc, 2)
topic = "my-topic"
brokers = "localhost:9092"
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
If I run the cell, I receive the following error/description:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.3.0.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
My questions are: how can I pass the --jars or --packages argument to a Jupyter Notebook? Or can I download this package and link it permanently to Python/Jupyter (maybe via an export in .bashrc)?
Answers:
There are at least two ways of doing so, corresponding roughly to the two options suggested in the error message:
The first way is to update your Jupyter kernel accordingly (if you are not already using Jupyter kernels, you should; see this answer for the details of using kernels in Jupyter with PySpark).
More specifically, you should update the kernel.json configuration file of your PySpark kernel with the following entry under env (if you use something other than --master local, modify accordingly):
"PYSPARK_SUBMIT_ARGS": "--master local --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0 pyspark-shell"
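For concreteness, here is a sketch of what such a kernel.json could look like; the argv entries and the display name are placeholders that depend on your own installation:

```json
{
  "display_name": "PySpark",
  "language": "python",
  "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "env": {
    "PYSPARK_SUBMIT_ARGS": "--master local --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0 pyspark-shell"
  }
}
```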
The second way is to put the following entry in your spark-defaults.conf
file:
spark.jars.packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0
In both cases, you don’t need to download anything manually – the first time you run Pyspark with the updated configuration, the necessary files will be downloaded and put in the appropriate directories.
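A related option, when you cannot (or don't want to) edit the kernel configuration, is to set PYSPARK_SUBMIT_ARGS from inside the notebook itself, before the SparkContext is created. A minimal sketch, assuming the same package coordinate as above:

```python
import os

# Must run before pyspark creates the SparkContext, or it has no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master local "
    "--packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0 "
    "pyspark-shell"
)
```

Like the kernel.json approach, this makes spark-submit download the package (and its dependencies) on first use.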
This is how I configured PySpark (Scala 2.12, Spark 3.2.1) to run Structured Streaming with Kafka in JupyterLab.
First, I downloaded five jar files and put them in a /jars folder under my current project folder (this is for a local run, I think):
- spark-sql-kafka-0-10_2.12-3.2.1.jar
- kafka-clients-2.1.1.jar
- spark-streaming-kafka-0-10-assembly_2.12-3.2.1.jar
- commons-pool2-2.8.0.jar
- spark-token-provider-kafka-0-10_2.12-3.2.1.jar
The value of the spark.jars config looks like this: "<path-to-jar/test1.jar>,<path-to-jar/test2.jar>"
This is the actual code:
import os
from pyspark.sql import SparkSession

# spark.jars expects a comma-separated list of paths
spark_jars = ",".join(os.path.join(os.getcwd(), "jars", jar) for jar in [
    "spark-sql-kafka-0-10_2.12-3.2.1.jar",
    "kafka-clients-2.1.1.jar",
    "spark-streaming-kafka-0-10-assembly_2.12-3.2.1.jar",
    "commons-pool2-2.8.0.jar",
    "spark-token-provider-kafka-0-10_2.12-3.2.1.jar",
])
spark = SparkSession.builder.config("spark.jars", spark_jars).appName("Structured_Redpanda_WordCount").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 1)
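As an aside, when the machine running the notebook has network access, the same dependencies can be resolved from Maven Central instead of being downloaded by hand. This is a sketch, assuming the coordinate that matches the jar versions listed above:

```python
# Maven coordinate for the Kafka source matching Scala 2.12 / Spark 3.2.1.
# Spark pulls the transitive dependencies (kafka-clients, commons-pool2,
# the token provider, ...) automatically when resolving this coordinate.
kafka_package = "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1"

# Used in place of spark.jars:
#   SparkSession.builder.config("spark.jars.packages", kafka_package)
```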