Cartesian product of two RDD in Spark
Question:
I am completely new to Apache Spark and I am trying to take the Cartesian product of two RDDs. For example, I have A and B like:
A = {(a1,v1),(a2,v2),...}
B = {(b1,s1),(b2,s2),...}
I need a new RDD like:
C = {((a1,v1),(b1,s1)), ((a1,v1),(b2,s2)), ...}
Any idea how I can do this? As simple as possible 🙂
Thanks in advance
PS: I finally did it like this as suggested by @Amit Kumar:
cartesianProduct = A.cartesian(B)
Answers:
That’s not the dot product, that’s the Cartesian product. Use the cartesian method:

def cartesian[U](other: spark.api.java.JavaRDDLike[U, _]): JavaPairRDD[T, U]

Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in this and b is in other.
You can do it like the following:
A = {(a1,v1),(a2,v2),...}
B = {(b1,s1),(b2,s2),...}
C = A.cartesian(B)
And if you do:
C.take(5)
You can see that this is what you want.
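For reference, the pairs that cartesian produces can be modeled in plain Python with itertools.product (a sketch without a running Spark cluster; the tuples below are stand-ins for the (key, value) records in A and B):

```python
from itertools import product

# Stand-ins for the records held in the RDDs A and B
A = [("a1", "v1"), ("a2", "v2")]
B = [("b1", "s1"), ("b2", "s2")]

# A.cartesian(B) pairs every element of A with every element of B,
# yielding ((a, v), (b, s)) tuples just like the desired RDD C
C = list(product(A, B))

print(C[0])    # (('a1', 'v1'), ('b1', 's1'))
print(len(C))  # 2 * 2 = 4 pairs
```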
In case you are curious how to do this with multiple lists, here’s an example in pyspark:
>>> a = [1,2,3]
>>> b = [5,6,7,8]
>>> c = [11,22,33,44,55]
>>> import itertools
>>> abcCartesianRDD = sc.parallelize(itertools.product(a,b,c))
>>> abcCartesianRDD.count() #Test
60
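Note that chaining cartesian calls does not produce the same tuple shape as itertools.product: a_rdd.cartesian(b_rdd).cartesian(c_rdd) yields nested pairs like ((1, 5), 11), while itertools.product(a, b, c) yields flat triples like (1, 5, 11). A plain-Python sketch of the difference (no Spark needed, since product models what cartesian returns):

```python
from itertools import product

a = [1, 2, 3]
b = [5, 6, 7, 8]
c = [11, 22, 33, 44, 55]

# itertools.product over three lists gives flat triples
flat = list(product(a, b, c))

# chaining pairwise products, as a_rdd.cartesian(b_rdd).cartesian(c_rdd)
# would, gives nested pairs instead
nested = list(product(product(a, b), c))

print(flat[0])    # (1, 5, 11)
print(nested[0])  # ((1, 5), 11)
print(len(flat), len(nested))  # both 60
```

If you need flat tuples from chained cartesian calls, add a map step to unnest them.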