How to include external Spark library while using PySpark in Jupyter notebook

Question:

I am trying to run the following PySpark-Kafka streaming example in a Jupyter Notebook. Here is the first part of the code I am using in my notebook:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(master='local[*]', appName="PySpark streaming")
ssc = StreamingContext(sc, 2)

topic = "my-topic"
brokers = "localhost:9092"
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

If I run the cell, I receive the following error/description:

Spark Streaming's Kafka libraries not found in class path. Try one of the following.

1. Include the Kafka library and its dependencies with in the
 spark-submit command as

$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0 ...

2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.3.0.
Then, include the jar in the spark-submit command as

$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...

My questions are: how can I pass the --jars or --packages argument to Jupyter Notebook? Or, can I download this package and link it permanently to Python/Jupyter (maybe via an export in .bashrc)?

Asked By: Marcel


Answers:

There are at least two ways of doing so, corresponding roughly to the two options suggested in the error message:

The first way is to update your respective Jupyter kernel accordingly (if you are not already using Jupyter kernels, you should; see this answer for the general details of using kernels in Jupyter with PySpark).

More specifically, you should update your respective kernel.json configuration file for PySpark with the following entry under env (if you use something other than --master local, modify accordingly):

"PYSPARK_SUBMIT_ARGS": "--master local --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0 pyspark-shell"

The second way is to put the following entry in your spark-defaults.conf file (typically located under $SPARK_HOME/conf):

spark.jars.packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0

In both cases, you don't need to download anything manually; the first time you run PySpark with the updated configuration, the necessary files will be downloaded and placed in the appropriate directories (typically under ~/.ivy2/).

Answered By: desertnaut

This is how I configured PySpark (built with Scala 2.12, Spark 3.2.1) to run Structured Streaming with Kafka in JupyterLab.

First, I downloaded 5 JAR files and put them in the jars/ folder under my current project folder (just for local runs, I think):

  • spark-sql-kafka-0-10_2.12-3.2.1.jar
  • kafka-clients-2.1.1.jar
  • spark-streaming-kafka-0-10-assembly_2.12-3.2.1.jar
  • commons-pool2-2.8.0.jar
  • spark-token-provider-kafka-0-10_2.12-3.2.1.jar

The value of the spark.jars config is a comma-separated list of paths, like this: "<path-to-jar/test1.jar>,<path-to-jar/test2.jar>"

This is the actual code:

spark_jars =  ("{},{},{},{},{}".format(os.getcwd() + "/jars/spark-sql-kafka-0-10_2.12-3.2.1.jar",  
                                      os.getcwd() + "/jars/kafka-clients-2.1.1.jar", 
                                      os.getcwd() + "/jars/spark-streaming-kafka-0-10-assembly_2.12-3.2.1.jar", 
                                      os.getcwd() + "/jars/commons-pool2-2.8.0.jar",  
                                      os.getcwd() + "/jars/spark-token-provider-kafka-0-10_2.12-3.2.1.jar"))


spark = SparkSession.builder.config("spark.jars", spark_jars).appName("Structured_Redpanda_WordCount").getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", 1)
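
With the session configured this way, a minimal Structured Streaming read from Kafka might look like the sketch below (the broker address and topic name are placeholders for your own setup):

from pyspark.sql.functions import col

# Placeholder broker and topic; replace with your own Kafka/Redpanda setup
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .load())

# Kafka values arrive as bytes; cast to string and print batches to the console
query = (df.select(col("value").cast("string").alias("value"))
         .writeStream
         .format("console")
         .outputMode("append")
         .start())

# query.awaitTermination()  # uncomment to block until the stream is stopped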
Answered By: Hoang Trung Nghia