How to add third-party Java JAR files for use in PySpark
Question:
I have some third-party database client libraries in Java. I want to access them through
java_gateway.py
E.g.: to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:
java_import(gateway.jvm, "org.mydatabase.MyDBClient")
It is not clear where to add the third-party libraries to the JVM classpath. I tried to add to file compute-classpath.sh, but that did not seem to work. I get:
Py4jError: Trying to call a package
Also, when comparing to Hive: the hive JAR files are not loaded via file compute-classpath.sh, so that makes me suspicious. There seems to be some other mechanism happening to set up the JVM side classpath.
Answers:
You can add external jars as arguments to pyspark
pyspark --jars file1.jar,file2.jar
You could add --jars xxx.jar
when using spark-submit
./bin/spark-submit --jars xxx.jar your_spark_script.py
or set the enviroment variable SPARK_CLASSPATH
SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py
your_spark_script.py
was written by pyspark API
- Extract the downloaded jar file.
- Edit system environment variable
- Add a variable named SPARK_CLASSPATH and set its value to pathtotheextractedjarfile.
Eg: you have extracted the jar file in C drive in folder named sparkts
its value should be: C:sparkts
- Restart your cluster
You could add the path to jar file using Spark configuration at Runtime.
Here is an example :
conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext( conf=conf)
Refer the document for more information.
One more thing you can do is to add the Jar in the pyspark jar folder where pyspark is installed. Usually /python3.6/site-packages/pyspark/jars
Be careful if you are using a virtual environment that the jar needs to go to the pyspark installation in the virtual environment.
This way you can use the jar without sending it in command line or load it in your code.
All the above answers did not work for me
What I had to do with pyspark was
pyspark --py-files /path/to/jar/xxxx.jar
For Jupyter Notebook:
spark = (SparkSession
.builder
.appName("Spark_Test")
.master('yarn-client')
.config("spark.sql.warehouse.dir", "/user/hive/warehouse")
.config("spark.executor.cores", "4")
.config("spark.executor.instances", "2")
.config("spark.sql.shuffle.partitions","8")
.enableHiveSupport()
.getOrCreate())
# Do this
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")
Link to the source where I found it:
https://github.com/graphframes/graphframes/issues/104
I’ve worked around this by dropping the jars into a directory drivers and then creating a spark-defaults.conf file in conf folder. Steps to follow;
To get the conf path:
cd ${SPARK_HOME}/conf
vi spark-defaults.conf
spark.driver.extraClassPath /Users/xxx/Documents/spark_project/drivers/*
run your Jupyter notebook.
Apart from the accepted answer, you also have below options:
-
if you are in virtual environment then you can place it in
e.g. lib/python3.7/site-packages/pyspark/jars
-
if you want java to discover it then you can place where your jre is installed under ext/
directory
java/scala libs from pyspark both --jars
and spark.jars
are not working in version 2.4.0 and earlier (I didn’t check newer version). I’m surprised how many guys are claiming that it is working.
The main problem is that for classloader retrieved in following way:
jvm = SparkSession.builder.getOrCreate()._jvm
clazz = jvm.my.scala.class
# or
clazz = jvm.java.lang.Class.forName('my.scala.class')
it works only when you copy jar files to ${SPARK_HOME}/jars (this one works for me).
But when your only way is using --jars
or spark.jars
there is another classloader used (which is child class loader) which is set in current thread. So your python code needs to look like:
clazz = jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(f"{object_name}$")
Hope it explains your troubles. Give me a shout if not.
I have some third-party database client libraries in Java. I want to access them through
java_gateway.py
E.g.: to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:
java_import(gateway.jvm, "org.mydatabase.MyDBClient")
It is not clear where to add the third-party libraries to the JVM classpath. I tried to add to file compute-classpath.sh, but that did not seem to work. I get:
Py4jError: Trying to call a package
Also, when comparing to Hive: the hive JAR files are not loaded via file compute-classpath.sh, so that makes me suspicious. There seems to be some other mechanism happening to set up the JVM side classpath.
You can add external jars as arguments to pyspark
pyspark --jars file1.jar,file2.jar
You could add --jars xxx.jar
when using spark-submit
./bin/spark-submit --jars xxx.jar your_spark_script.py
or set the enviroment variable SPARK_CLASSPATH
SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py
your_spark_script.py
was written by pyspark API
- Extract the downloaded jar file.
- Edit system environment variable
- Add a variable named SPARK_CLASSPATH and set its value to pathtotheextractedjarfile.
Eg: you have extracted the jar file in C drive in folder named sparkts
its value should be: C:sparkts
- Restart your cluster
You could add the path to jar file using Spark configuration at Runtime.
Here is an example :
conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext( conf=conf)
Refer the document for more information.
One more thing you can do is to add the Jar in the pyspark jar folder where pyspark is installed. Usually /python3.6/site-packages/pyspark/jars
Be careful if you are using a virtual environment that the jar needs to go to the pyspark installation in the virtual environment.
This way you can use the jar without sending it in command line or load it in your code.
All the above answers did not work for me
What I had to do with pyspark was
pyspark --py-files /path/to/jar/xxxx.jar
For Jupyter Notebook:
spark = (SparkSession
.builder
.appName("Spark_Test")
.master('yarn-client')
.config("spark.sql.warehouse.dir", "/user/hive/warehouse")
.config("spark.executor.cores", "4")
.config("spark.executor.instances", "2")
.config("spark.sql.shuffle.partitions","8")
.enableHiveSupport()
.getOrCreate())
# Do this
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")
Link to the source where I found it:
https://github.com/graphframes/graphframes/issues/104
I’ve worked around this by dropping the jars into a directory drivers and then creating a spark-defaults.conf file in conf folder. Steps to follow;
To get the conf path:
cd ${SPARK_HOME}/conf
vi spark-defaults.conf
spark.driver.extraClassPath /Users/xxx/Documents/spark_project/drivers/*
run your Jupyter notebook.
Apart from the accepted answer, you also have below options:
-
if you are in virtual environment then you can place it in
e.g.
lib/python3.7/site-packages/pyspark/jars
-
if you want java to discover it then you can place where your jre is installed under
ext/
directory
java/scala libs from pyspark both --jars
and spark.jars
are not working in version 2.4.0 and earlier (I didn’t check newer version). I’m surprised how many guys are claiming that it is working.
The main problem is that for classloader retrieved in following way:
jvm = SparkSession.builder.getOrCreate()._jvm
clazz = jvm.my.scala.class
# or
clazz = jvm.java.lang.Class.forName('my.scala.class')
it works only when you copy jar files to ${SPARK_HOME}/jars (this one works for me).
But when your only way is using --jars
or spark.jars
there is another classloader used (which is child class loader) which is set in current thread. So your python code needs to look like:
clazz = jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(f"{object_name}$")
Hope it explains your troubles. Give me a shout if not.