How to use a Scala class inside Pyspark
Question:
I’ve been searching for a while for a way to use a Scala class in PySpark, and I haven’t found any documentation or guide on the subject.
Let’s say I create a simple class in Scala that uses some Apache Spark libraries, something like:
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
  def exe(): DataFrame = {
    import sqlContext.implicits._
    df.select(col(column))
  }
}
- Is there any possible way to use this class in PySpark?
- Is it too tough?
- Do I have to create a .py file?
- Is there any guide that shows how to do that?
By the way, I also looked at the Spark source code and felt a bit lost; I was incapable of replicating its functionality for my own purposes.
Answers:
Yes, it is possible, although it can be far from trivial. Typically you want a Java-friendly wrapper so you don’t have to deal with Scala features that cannot be easily expressed in plain Java and, as a result, don’t play well with the Py4J gateway.
Assuming your class is in the package com.example and you have a Python DataFrame called df
df = ... # Python DataFrame
you’ll have to:
- Build a jar using your favorite build tool.
- Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code you may have to pass it using --jars as well.
- Extract the JVM instance from the Python SparkContext instance:
jvm = sc._jvm
- Extract the Scala SQLContext from the SQLContext instance:
ssqlContext = sqlContext._ssql_ctx
- Extract the Java DataFrame from df:
jdf = df._jdf
- Create a new instance of SimpleClass:
simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v")
- Call the exe method and wrap the result in a Python DataFrame:
from pyspark.sql import DataFrame
DataFrame(simpleObject.exe(), ssqlContext)
The result should be a valid PySpark DataFrame. You can of course combine all the steps into a single call.
Important: This approach is possible only if the Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See How to use Java/Scala function from an action or a transformation? for details.
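Put together, the steps above reduce to one small driver-side helper. This is a sketch, not a tested recipe: it assumes the jar containing com.example.SimpleClass is already on the driver classpath, and the run_simple_class name is made up for illustration:

```python
def run_simple_class(sc, sqlContext, df, column):
    """Driver-only sketch: call com.example.SimpleClass.exe through Py4J.

    Assumes the jar with SimpleClass is on the driver classpath; sc,
    sqlContext, and df are the usual PySpark objects.
    """
    from pyspark.sql import DataFrame  # lazy import; needs PySpark installed

    jvm = sc._jvm                        # JVM gateway from the SparkContext
    ssqlContext = sqlContext._ssql_ctx   # underlying Scala SQLContext
    jdf = df._jdf                        # underlying Java DataFrame
    simple_object = jvm.com.example.SimpleClass(ssqlContext, jdf, column)
    return DataFrame(simple_object.exe(), ssqlContext)
```

Called as run_simple_class(sc, sqlContext, df, "v"), this mirrors the listed steps in a single call, and keeps all the `_jvm`/`_jdf` plumbing in one place.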
As an update to @zero323’s answer, given that Spark’s APIs have evolved over the last six years, a recipe that works in Spark 3.2 is as follows:
- Compile your Scala code into a JAR file (e.g. using sbt assembly).
- Include the JAR file in the --jars argument to spark-submit, together with any --py-files arguments needed for local package definitions.
- Extract the JVM instance within Python:
jvm = spark._jvm
- Extract a Java representation of the SparkSession:
jSess = spark._jsparkSession
- Extract the Java handle for the PySpark DataFrame "df" that you want to pass into the Scala method:
jdf = df._jdf
- Create a new instance of SimpleClass from within PySpark:
simpleObject = jvm.com.example.SimpleClass(jSess, jdf, "v")
- Call the exe method and convert its output into a PySpark DataFrame:
from pyspark.sql import DataFrame
result = DataFrame(simpleObject.exe(), spark)
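As with the earlier answer, the whole recipe can be wrapped into a single driver-side function. A sketch under the same assumptions (the jar is on the classpath via --jars, Spark 3.x, and the call_simple_class name is hypothetical):

```python
def call_simple_class(spark, df, column):
    """Driver-only sketch for Spark 3.x: invoke com.example.SimpleClass.exe.

    spark is an active SparkSession and df a PySpark DataFrame; the jar
    containing SimpleClass must already be on the JVM classpath (--jars).
    """
    from pyspark.sql import DataFrame  # lazy import; needs PySpark installed

    jvm = spark._jvm               # Py4J gateway to the JVM
    j_sess = spark._jsparkSession  # underlying Java SparkSession
    jdf = df._jdf                  # underlying Java DataFrame
    simple_object = jvm.com.example.SimpleClass(j_sess, jdf, column)
    return DataFrame(simple_object.exe(), spark)
```

The only differences from the pre-2.0 version are the entry point (SparkSession instead of SQLContext) and the wrapper argument passed back to the Python DataFrame constructor.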
If you need to pass additional parameters, such as a Python dictionary, PySpark may automatically convert them into corresponding Java types before they emerge in your Scala methods. Scala provides the JavaConverters package to help translate these into more natural Scala datatypes. For example, a Python dictionary could be passed into a Scala method and immediately converted from a Java HashMap into a Scala (mutable) Map:
def processDict(spark: SparkSession, jparams: java.util.Map[String, Any]): Unit = {
  import scala.collection.JavaConverters._
  val params = jparams.asScala
  ...
}
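The conversion itself needs no Spark at all; here is a minimal standalone sketch of what jparams.asScala gives you, with the map contents invented for illustration:

```scala
import scala.collection.JavaConverters._

// A Java map, shaped like what Py4J hands to the JVM for a Python dict
val jparams = new java.util.HashMap[String, Any]()
jparams.put("column", "v")
jparams.put("threshold", 0.5)

// asScala wraps (rather than copies) it as a Scala mutable Map
val params: scala.collection.mutable.Map[String, Any] = jparams.asScala
println(params("column"))
```

Note that asScala produces a live view over the Java map, so later changes to jparams are visible through params as well; call .toMap on it if you want an immutable snapshot.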