Reduce a key-value pair into a key-list pair with Apache Spark

Question:

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something of the flavor:

My_KMV = My_KV.reduce(lambda a, b: a.append([b]))

The error that I get when this occurs is:

‘NoneType’ object has no attribute ‘append’.

My keys are integers and values V1,…,Vn are tuples. My goal is to create a single pair with the key and a list of the values (tuples).

Asked By: TravisJ


Answers:

Map and ReduceByKey

The input type and output type of reduce must be the same; therefore, if you want to aggregate values into a list, you have to map the input to lists first. Afterwards you combine the lists into one list.
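
A minimal plain-Python sketch of the same principle (using functools.reduce locally, not Spark; the sample values are made up):

from functools import reduce

values = [(1, 2), (3, 4), (5, 6)]          # the V1, ..., Vn for one key
wrapped = [[v] for v in values]            # map each value to a one-element list
combined = reduce(lambda a, b: a + b, wrapped)
# combined is [(1, 2), (3, 4), (5, 6)]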

Combining lists

You’ll need a method to combine lists into one list. Python provides some methods to combine lists.

append modifies the first list in place, nesting its argument as a single element, and always returns None.

x = [1, 2, 3]
x.append([4, 5])
# x is [1, 2, 3, [4, 5]]

extend also modifies the list in place, but adds the elements of the second list individually instead of nesting it:

x = [1, 2, 3]
x.extend([4, 5])
# x is [1, 2, 3, 4, 5]

Both methods return None, but you need an expression that returns the combined list, so just use the plus operator.

x = [1, 2, 3] + [4, 5]
# x is [1, 2, 3, 4, 5]

Spark

file = spark.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
          .map(lambda actor: (actor.split(",")[0], actor))
          # transform each value into a list
          .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
          # combine lists: ([1, 2, 3] + [4, 5]) becomes [1, 2, 3, 4, 5]
          .reduceByKey(lambda a, b: a + b))

CombineByKey

It’s also possible to solve this with combineByKey, which is used internally to implement reduceByKey, but it’s more complex, and “using one of the specialized per-key combiners in Spark can be much faster”. Your use case is simple enough for the solution above.
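
For reference only, a minimal sketch of how the combineByKey variant might look for the same example (same assumed input as the snippet above):

counts = (file.flatMap(lambda line: line.split(" "))
          .map(lambda actor: (actor.split(",")[0], actor))
          .combineByKey(lambda v: [v],             # createCombiner: value -> [value]
                        lambda acc, v: acc + [v],  # mergeValue: add one value to a list
                        lambda a, b: a + b))       # mergeCombiners: concatenate two lists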

GroupByKey

It’s also possible to solve this with groupByKey, but it reduces parallelization and therefore could be much slower for big data sets.
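
For comparison, a minimal groupByKey sketch under the same assumptions; mapValues(list) turns the grouped iterables into plain lists:

grouped = (file.flatMap(lambda line: line.split(" "))
           .map(lambda actor: (actor.split(",")[0], actor))
           .groupByKey()
           .mapValues(list))   # materialize each group as a list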

Answered By: Christian Strempfer

OK, I hope I got this right. Your input is something like this:

kv_input = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 5)]

and you want to get something like this:

kmv_output = [("a", [1, 2, 3]), ("b", [1, 5])]

Then this might do the job:

d = dict()
for k, v in kv_input:
    d.setdefault(k, list()).append(v)
kmv_output = list(d.items())
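
Equivalently, collections.defaultdict can be used; a small variation on the same idea, not part of the original answer:

from collections import defaultdict

d = defaultdict(list)
for k, v in kv_input:
    d[k].append(v)
kmv_output = list(d.items())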

If I got this wrong, please tell me, so I might adjust this to your needs.

P.S.: a.append([b]) always returns None. You might want to observe either [b] or a, but not the result of append.

Answered By: Dave J

If you want to do a reduceByKey where the type in the reduced KV pairs is different than the type in the original KV pairs, then one can use the function combineByKey. What the function does is take KV pairs and combine them (by Key) into KC pairs where C is a different type than V.

One specifies three functions: createCombiner, mergeValue, and mergeCombiners. The first specifies how to transform a type V into a type C, the second describes how to combine a type C with a type V, and the last specifies how to combine a type C with another type C. My code creates the K-V pairs.

Define the 3 functions as follows:

def Combiner(a):    #Turns value a (a tuple) into a list of a single tuple.
    return [a]

def MergeValue(a, b): #a is the new type [(,), (,), ..., (,)] and b is the old type (,)
    a.extend([b])
    return a

def MergeCombiners(a, b): #a is the new type [(,),...,(,)] and so is b, combine them
    a.extend(b)
    return a

Then, My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)
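
A quick usage sketch with made-up data (integer keys and tuple values, as in the question; output order may vary):

My_KV = sc.parallelize([(1, ('a', 'b')), (1, ('c', 'd')), (2, ('e', 'f'))])
My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)
print(My_KMV.collect())
# [(1, [('a', 'b'), ('c', 'd')]), (2, [('e', 'f')])]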

The best resource I found on using this function is: http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/

As others have pointed out, a.append(b) and a.extend(b) return None. So reduceByKey(lambda a, b: a.append(b)) returns None for the first pair of KV pairs, and then fails on the next pair because None has no append method. You could work around this by defining a separate function:

def My_Extend(a, b):
    a.extend(b)
    return a

Then call reduceByKey(lambda a, b: My_Extend(a, b)). (The use of the lambda function here may be unnecessary, but I have not tested this case.)
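
If the lambda really is unnecessary, the function should be usable directly, since reduceByKey only needs a two-argument function. An untested sketch, assuming the values have first been wrapped in lists (hypothetical variable names):

My_KV_lists = My_KV.mapValues(lambda v: [v])   # wrap each value so extend works
My_KMV = My_KV_lists.reduceByKey(My_Extend)    # pass the function directly, no lambda needed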

Answered By: TravisJ

I’m kind of late to the conversation, but here’s my suggestion:

>>> foo = sc.parallelize([(1, ('a','b')), (2, ('c','d')), (1, ('x','y'))])
>>> foo.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda p, q: p + q).collect()
[(1, [('a', 'b'), ('x', 'y')]), (2, [('c', 'd')])]
Answered By: alreich

I hit this page while looking for a Java example of the same problem. (If your case is similar, here is my example.)

The trick is: you need to group by key.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

public class SparkMRExample {

    public static void main(String[] args) {
        // spark context initialisation
        SparkConf conf = new SparkConf()
                .setAppName("WordCount")
                .setMaster("local");
        JavaSparkContext context = new JavaSparkContext(conf);

        //input for testing;
        List<String> input = Arrays.asList("Lorem Ipsum is simply dummy text of the printing and typesetting industry.",
                "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.",
                "It has survived not only for centuries, but also the leap into electronic typesetting, remaining essentially unchanged.",
                "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing");
        JavaRDD<String> inputRDD = context.parallelize(input);


        // the map phase of word count example
        JavaPairRDD<String, Integer> mappedRDD =
                inputRDD.flatMapToPair( line ->                      // for this input, each string is a line
                        Arrays.stream(line.split("\\s+"))           // splitting into words, converting into stream
                                .map(word -> new Tuple2<>(word, 1))  // each word is assigned with count 1
                                .collect(Collectors.toList()));      // stream to iterable

        // group the tuples by key
        // (String,Integer) -> (String, Iterable<Integer>)
        JavaPairRDD<String, Iterable<Integer>> groupedRDD = mappedRDD.groupByKey();

        // the reduce phase of word count example
        //(String, Iterable<Integer>) -> (String,Integer)
        JavaRDD<Tuple2<String, Integer>> resultRDD =
                groupedRDD.map(group ->                                      //input is a tuple (String, Iterable<Integer>)
                        new Tuple2<>(group._1,                              // the output key is same as input key
                        StreamSupport.stream(group._2.spliterator(), true)  // converting to stream
                                .reduce(0, (f, s) -> f + s)));              // the sum of counts
        // collecting the RDD so that we can print
        List<Tuple2<String, Integer>> result = resultRDD.collect();
        // print each tuple
        result.forEach(System.out::println);
    }
}
Answered By: Thamme Gowda

You can use the RDD groupByKey method.

Input:

data = [(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
rdd = sc.parallelize(data)
result = rdd.groupByKey().mapValues(list).collect()  # mapValues(list) turns the grouped iterables into plain lists

Output:

[(1, ['a', 'b']), (2, ['c', 'd', 'e']), (3, ['f'])]
Answered By: Marius Ion

The error message stems from the type for ‘a’ in your closure.

 My_KMV = My_KV.reduce(lambda a, b: a.append([b]))

You need pySpark to evaluate a explicitly as a list. For instance,

My_KMV = My_KV.reduceByKey(lambda a,b:[a].extend([b]))

In many cases, reduceByKey will be preferable to groupByKey; refer to:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

Answered By: Seung-Hwan Lim

I tried with combineByKey; here are my steps.

combineddatardd = sc.parallelize([("A", 3), ("A", 9), ("A", 12), ("B", 4), ("B", 10), ("B", 11)])

combineddatardd.combineByKey(lambda v: [v], lambda x, y: x + [y], lambda x, y: x + y).collect()

Output:

[('A', [3, 9, 12]), ('B', [4, 10, 11])]

  1. Define a createCombiner function that sets the accumulator to the first value it encounters for a key inside a partition, converting that value to a list in this step.

  2. Define a mergeValue function that merges a new value for the same key into the accumulator captured in step 1. Note: wrap the new value in a list inside this function, since the accumulator was converted to a list in the first step.

  3. Define a mergeCombiners function that merges the combiner outputs of the individual partitions (a named-function version of these three steps is sketched below).
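
The same three steps written as named functions instead of lambdas (an equivalent sketch, not part of the original answer):

def to_list(v):             # step 1: createCombiner, first value in a partition -> [value]
    return [v]

def append_value(acc, v):   # step 2: mergeValue, wrap the new value and add it to the list
    return acc + [v]

def merge_lists(a, b):      # step 3: mergeCombiners, merge the per-partition lists
    return a + b

combineddatardd.combineByKey(to_list, append_value, merge_lists).collect()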

Answered By: krishna rachur

tl;dr If you really require an operation like this, use groupByKey as suggested by @MariusIon. Every other solution proposed here is either plainly inefficient or at best suboptimal compared to direct grouping.

reduceByKey with list concatenation is not an acceptable solution because:

  • Requires initialization of O(N) lists.
  • Each application of + to a pair of lists requires a full copy of both lists (O(N)), effectively increasing the overall complexity to O(N²).
  • Doesn’t address any of the problems introduced by groupByKey. The amount of data that has to be shuffled, as well as the size of the final structure, is the same.
  • Contrary to what one of the answers suggests, there is no difference in the level of parallelism between implementations using reduceByKey and groupByKey.

combineByKey with list.extend is a suboptimal solution because:

  • Creates O(N) list objects in MergeValue (this could be optimized by using list.append directly on the new item).
  • If optimized with list.append, it is exactly equivalent to the old (Spark <= 1.3) implementation of groupByKey and ignores all of the optimizations introduced by SPARK-3074, which enabled external (on-disk) grouping of larger-than-memory structures.
Answered By: zero323

Suppose you have input data like this:

10 1
10 2
20 4
20 7
20 9

and you want the output to be something like this:

10-1,2
20-4,7,9

You can do something like this:

rdd = sc.textFile("location_of_file")

def parse(line):
    fields = line.split(" ")
    return (fields[0], fields[1])

rdd1 = rdd.map(parse)  # parse turns each line into a (key, value) pair
rdd1.groupByKey().mapValues(list).collect()
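
With the sample input above, the collected result would look something like this (the values stay as strings unless parsed, and ordering may vary):

[('10', ['1', '2']), ('20', ['4', '7', '9'])]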
Answered By: Chandan Kumar Sahu