Yield both max and min in a single mapreduce
Question:
I am a beginner just getting started with writing MapReduce programs in Python using the MRJob library.
One of the examples worked out in the video tutorial finds the max temperature by location_id. Following on from that, writing another program to find the min temperature by location_id is straightforward too.
I am wondering, is there a way to yield both the max and min temperature by location_id in a single MapReduce program? Below is my go at it:
from mrjob.job import MRJob
'''Sample Data
ITE00100554,18000101,TMAX,-75,,,E,
ITE00100554,18000101,TMIN,-148,,,E,
GM000010962,18000101,PRCP,0,,,E,
EZE00100082,18000101,TMAX,-86,,,E,
EZE00100082,18000101,TMIN,-135,,,E,
ITE00100554,18000102,TMAX,-60,,I,E,
ITE00100554,18000102,TMIN,-125,,,E,
GM000010962,18000102,PRCP,0,,,E,
EZE00100082,18000102,TMAX,-44,,,E,
Output I am expecting to see:
ITE00100554 32.3 20.2
EZE00100082 34.4 19.6
'''
class MaxMinTemperature(MRJob):
    def mapper(self, _, line):
        location, datetime, measure, temperature, w, x, y, z = line.split(',')
        temperature = float(temperature)/10
        if measure == 'TMAX' or measure == 'TMIN':
            yield location, temperature

    def reducer(self, location, temperatures):
        yield location, max(temperatures), min(temperatures)

if __name__ == '__main__':
    MaxMinTemperature.run()
I get the following error:
File "MaxMinTemperature.py", line 12, in reducer
yield location, max(temperatures), min(temperatures)
ValueError: min() arg is an empty sequence
Is this possible?
Thank you for your assistance.
Shiv
Answers:
You have two problems in the reducer:
- If you check the type of the temperatures argument, you will see that it’s a generator. A generator can be traversed only once, so you cannot pass the same generator to both min() and max(). The right solution is to traverse it manually. A wrong solution, converting it to a list, may cause an out-of-memory error on large enough input, because a list holds all its elements in memory and a generator does not.
- The result of the reducer must be a two-element tuple, so you need to combine your min and max temperatures into another tuple.
Complete working solution:
from mrjob.job import MRJob

class MaxMinTemperature(MRJob):
    def mapper(self, _, line):
        location, datetime, measure, temperature, w, x, y, z = line.split(',')
        temperature = float(temperature)/10
        if measure in ('TMAX', 'TMIN'):
            yield location, temperature

    def reducer(self, location, temperatures):
        # Take the first value as the starting point, then scan the rest once
        min_temp = next(temperatures)
        max_temp = min_temp
        for item in temperatures:
            min_temp = min(item, min_temp)
            max_temp = max(item, max_temp)
        # Emit a (key, value) pair; the value is itself a tuple
        yield location, (min_temp, max_temp)

if __name__ == '__main__':
    MaxMinTemperature.run()
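The single-pass traversal in the reducer above can also be tried outside of mrjob. The following is only a minimal sketch: the helper name running_min_max is made up for illustration, and returning None for an empty input is an assumption of this sketch (the reducer above would raise StopIteration from next() in that case).

```python
def running_min_max(values):
    """Return (min, max) of an iterable in one pass, or None if it is empty."""
    iterator = iter(values)
    try:
        # The first element initialises both running values.
        lowest = highest = next(iterator)
    except StopIteration:
        return None
    for value in iterator:
        if value < lowest:
            lowest = value
        elif value > highest:
            highest = value
    return lowest, highest


# A generator of readings for one location, converted from tenths of a degree.
sample = (t / 10 for t in (-75, -148, -60, -125))
print(running_min_max(sample))  # (-14.8, -6.0)
```

Because only two numbers are kept while iterating, memory use does not grow with the number of readings per location.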
The problem is that temperatures in your reducer method is a generator.
To understand this better, let’s create a simple generator and look at its behavior:
def my_gen(an_iterable):
    for item in an_iterable:
        yield item

my_generator = my_gen([1,2,3,4,5])
print(type(my_generator)) # <class 'generator'>
One of the features of such an object is that once it is exhausted, you can’t reuse it:
print(list(my_generator)) # [1, 2, 3, 4, 5]
print(list(my_generator)) # []
Therefore sequential execution of max() and min() leads to an error:
my_generator = my_gen([1,2,3,4,5])
print(max(my_generator)) # 5
print(min(my_generator)) # ValueError: min() arg is an empty sequence
So, you can’t use the same generator with both the max() and min() built-in functions, because by the time of the second call the generator is already exhausted.
Instead you can:
1) convert the generator to a list and work with it:
my_generator = my_gen([1,2,3,4,5])
my_list = list(my_generator)
print(max(my_list)) # 5
print(min(my_list)) # 1
2) or extract the min and max values of the generator in a single pass:
from functools import reduce

my_generator = my_gen([1,2,3,4,5])
# The accumulator x is a (max so far, min so far) pair; y is the next item
val_max, val_min = reduce(lambda x,y: (max(y, x[0]), min(y, x[1])), my_generator, (float('-inf'), float('inf')))
print(val_max, val_min) # 5 1
So, the following edit of the reducer:
def reducer(self, location, temperatures):
    tempr_list = list(temperatures)
    # A reducer must yield a (key, value) pair, so max and min go into one tuple
    yield location, (max(tempr_list), min(tempr_list))
should fix the error.
Can anybody help with how to find the average along with the maximum and minimum in this problem?
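One possible way to get the average as well, as a rough sketch rather than a tested mrjob job: keep a running count and sum next to the running min and max, all in the same single pass over the generator. The function name min_max_mean is hypothetical, and returning None for an empty input is an assumption of this sketch.

```python
def min_max_mean(values):
    """Return (min, max, mean) of an iterable in one pass, or None if empty."""
    count = 0
    total = 0.0
    lowest = float('inf')
    highest = float('-inf')
    for value in values:
        count += 1
        total += value
        lowest = min(lowest, value)
        highest = max(highest, value)
    if count == 0:
        return None
    return lowest, highest, total / count


print(min_max_mean(iter([1, 2, 3, 4, 5])))  # (1, 5, 3.0)
```

Inside the reducer this could presumably be used as yield location, min_max_mean(temperatures), which keeps the result a two-element (key, value) pair. Note that with the mapper shown in the question the average would be taken over the TMAX and TMIN readings together.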