Spark RDD – Mapping with extra arguments
Question:
Is it possible to pass extra arguments to the mapping function in pySpark?
Specifically, I have the following code recipe:
import json
raw_data_rdd = sc.textFile("data.json", use_unicode=True)
json_data_rdd = raw_data_rdd.map(lambda line: json.loads(line))
mapped_rdd = json_data_rdd.flatMap(processDataLine)
The function processDataLine takes extra arguments in addition to the JSON object:
def processDataLine(dataline, arg1, arg2)
How can I pass the extra arguments arg1 and arg2 to the function given to flatMap?
Answers:
-
You can use an anonymous function, either directly in the flatMap call:
json_data_rdd.flatMap(lambda j: processDataLine(j, arg1, arg2))
or to curry processDataLine first:
f = lambda j: processDataLine(j, arg1, arg2)
json_data_rdd.flatMap(f)
-
You can generate processDataLine like this:
def processDataLine(arg1, arg2):
    def _processDataLine(dataline):
        return ... # Do something with dataline, arg1, arg2
    return _processDataLine
json_data_rdd.flatMap(processDataLine(arg1, arg2))
-
The toolz library provides a useful curry decorator:
from toolz.functoolz import curry
@curry
def processDataLine(arg1, arg2, dataline):
    return ... # Do something with dataline, arg1, arg2
json_data_rdd.flatMap(processDataLine(arg1, arg2))
Note that I’ve pushed the dataline argument to the last position. It is not required, but this way we don’t have to use keyword args.
-
Finally, there is functools.partial, already mentioned by Avihoo Mamka in the comments.
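The functools.partial option is mentioned above without code, so here is a minimal sketch of how it might look. The body of processDataLine, the sample records, and the values "id" and "name" are hypothetical and only there for illustration. A plain map-and-flatten over a list stands in for flatMap, so the snippet runs without a Spark context.

```python
from functools import partial
from itertools import chain

def processDataLine(dataline, arg1, arg2):
    # Hypothetical body: emit one (key, value) pair per requested field.
    return [(key, dataline.get(key)) for key in (arg1, arg2)]

# Bind the extra arguments by keyword; dataline stays the only free parameter.
f = partial(processDataLine, arg1="id", arg2="name")

# Local stand-in for json_data_rdd.flatMap(f): map, then flatten one level.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
result = list(chain.from_iterable(map(f, records)))

# With Spark, the corresponding call would be: json_data_rdd.flatMap(f)
```

Because dataline is the first positional parameter in the original signature, the extra arguments are bound by keyword here. If dataline were moved to the last position, as in the toolz variant, partial(processDataLine, arg1, arg2) would work positionally.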