How does the pyspark mapPartitions function work?
Question:
I am trying to learn Spark using Python (PySpark). I want to know how the mapPartitions function works: what input it takes and what output it gives. I couldn't find a proper example on the internet. Let's say I have an RDD containing lists, such as the one below.
[ [1, 2, 3], [3, 2, 4], [5, 2, 7] ]
I want to remove the element 2 from all the lists. How would I achieve that using mapPartitions?
Answers:
mapPartitions should be thought of as a map operation over partitions, not over the elements of a partition. Its input is the set of current partitions, and its output is another set of partitions.
The function you pass to map must take an individual element of your RDD.
The function you pass to mapPartitions must take an iterable of your RDD's element type and return an iterable of the same or some other type.
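To make the two contracts concrete without a cluster, here is a minimal pure-Python sketch. The helpers `simulate_map` and `simulate_map_partitions` are hypothetical stand-ins written for illustration, not Spark APIs, and the split of the sample data into two partitions is an assumption; Spark decides the real partitioning.

```python
# Assumed layout: the three lists spread over two partitions.
partitions = [
    [[1, 2, 3], [3, 2, 4]],  # partition 0
    [[5, 2, 7]],             # partition 1
]

def simulate_map(parts, func):
    # Stand-in for map: func is called once per element.
    return [[func(element) for element in part] for part in parts]

def simulate_map_partitions(parts, func):
    # Stand-in for mapPartitions: func is called once per partition,
    # receives an iterator over that partition's elements,
    # and must return an iterable.
    return [list(func(iter(part))) for part in parts]

def drop_2(line):
    # Works on a single element (one list).
    return [x for x in line if x != 2]

def drop_2_from_partition(lines):
    # Works on an iterator of elements (many lists).
    for line in lines:
        yield [x for x in line if x != 2]

per_element = simulate_map(partitions, drop_2)
per_partition = simulate_map_partitions(partitions, drop_2_from_partition)
print(per_element)
print(per_partition)
```

Both calls produce the same nested result here; the difference is only in how often the function is invoked and what it receives.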
In your case you probably just want to do something like:
def filter_out_2(line):
    return [x for x in line if x != 2]

filtered_lists = data.map(filter_out_2)
If you wanted to use mapPartitions, it would be:
def filter_out_2_from_partition(list_of_lists):
    final_iterator = []
    for sub_list in list_of_lists:
        final_iterator.append([x for x in sub_list if x != 2])
    return iter(final_iterator)

filtered_lists = data.mapPartitions(filter_out_2_from_partition)
It's easier to use mapPartitions with a generator function using the yield syntax:
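def filter_out_2(partition):
    # Note: this compares each top-level element to 2, so it filters
    # an RDD of plain numbers, not the lists from the question.
    for element in partition:
        if element != 2:
            yield element

filtered_lists = data.mapPartitions(filter_out_2)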
For an RDD of lists, you need a final inner iteration over each sub-list:
def filter_out_2(partition):
    for element in partition:
        sec_iterator = []
        for i in element:
            if i != 2:
                sec_iterator.append(i)
        yield sec_iterator

filtered_lists = data.mapPartitions(filter_out_2)
for i in filtered_lists.collect():
    print(i)
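Because a function passed to mapPartitions only needs an iterator in and an iterable out, it can be checked without Spark by calling it directly. A minimal sketch, assuming the whole sample sits in a single partition (the name `sample_partition` is made up for illustration):

```python
def filter_out_2(partition):
    # partition is an iterator over the lists held in one partition
    for element in partition:
        # yield a new list with every 2 removed
        yield [i for i in element if i != 2]

# Stand-in for what Spark would hand to the function for one partition.
sample_partition = iter([[1, 2, 3], [3, 2, 4], [5, 2, 7]])
result = list(filter_out_2(sample_partition))
print(result)  # [[1, 3], [3, 4], [5, 7]]
```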
def func(l):
    for i in l:
        yield i + "ajbf"

mylist = ['madhu', 'sdgs', 'sjhf', 'mad']
rdd = sc.parallelize(mylist)
t = rdd.mapPartitions(func)
for i in t.collect():
    print(i)
for i in t.collect():
    print(i)
In the above code I am able to get data from the 2nd for..in loop.
As per generator behavior, it should not show values once the loop has iterated over it.
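A likely explanation: a single generator object is indeed exhausted after one pass, but t is an RDD, not the generator itself. collect() is an action, and unless the RDD has been cached, each action evaluates the transformation again, so func is called again and builds a fresh generator. The difference can be sketched in plain Python; no Spark API is used here, and the variable names are illustrative:

```python
def func(items):
    for item in items:
        yield item + "ajbf"

mylist = ['madhu', 'sdgs', 'sjhf', 'mad']

# One generator object: the second pass finds it exhausted.
gen = func(iter(mylist))
first_pass = list(gen)
second_pass = list(gen)

# Calling the function again builds a new generator each time,
# which is roughly what happens on every collect().
rerun_a = list(func(iter(mylist)))
rerun_b = list(func(iter(mylist)))

print(second_pass)         # []
print(rerun_a == rerun_b)  # True
```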