rdd

Splitting a text file based on empty lines in Spark

Splitting a text file based on empty lines in Spark Question: I am working on a really big file, a text document of almost 2 GB. Something like this – #*MOSFET table look-up models for circuit simulation #t1984 #cIntegration, the VLSI Journal #index1 #*The verification of the protection mechanisms of high-level language machines …

Total answers: 1
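
A minimal sketch of one common approach to the question above: set a custom Hadoop record delimiter so every blank-line-separated block becomes one RDD element. The input path here is hypothetical, since the question does not give one.

```python
from pyspark import SparkContext

sc = SparkContext(appName="split-on-blank-lines")

# Hypothetical input path. Setting the Hadoop record delimiter to a blank
# line ("\n\n") turns each paragraph separated by an empty line into one record.
records = sc.newAPIHadoopFile(
    "hdfs:///data/citations.txt",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\n\n"},
).map(lambda kv: kv[1])  # drop the byte-offset key, keep the text block

print(records.take(2))
```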

How to group and count values in RDD to return a small summary using pyspark?

How to group and count values in RDD to return a small summary using pyspark? Question: Some example data: new_data = [{'name': 'Tom', 'subject': "maths", 'exam_score': 85}, {'name': 'Tom', 'subject': "science", 'exam_score': 55}, {'name': 'Tom', 'subject': "history", 'exam_score': 68}, {'name': 'Ivy', 'subject': "maths", 'exam_score': 72}, {'name': 'Ivy', 'subject': "science", 'exam_score': 67}, {'name': 'Ivy', 'subject': "history", …

Total answers: 2
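
A minimal sketch of a grouped count with reduceByKey, assuming the summary wanted is how many exam records each student has (the exact summary the question asks for is truncated above):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-summary")

new_data = [
    {'name': 'Tom', 'subject': "maths",   'exam_score': 85},
    {'name': 'Tom', 'subject': "science", 'exam_score': 55},
    {'name': 'Ivy', 'subject': "maths",   'exam_score': 72},
]

rdd = sc.parallelize(new_data)

# Emit (name, 1) per record, then add the ones up to get a per-student count.
counts = rdd.map(lambda d: (d['name'], 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('Tom', 2), ('Ivy', 1)]
```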

How to filter RDD by attribute/key and then apply function using pyspark?

How to filter RDD by attribute/key and then apply function using pyspark? Question: I have some example data: my_data = [{'id': '001', 'name': 'Sam', 'class': "classA", 'age': 15, 'exam_score': 90}, {'id': '002', 'name': 'Tom', 'class': "classA", 'age': 15, 'exam_score': 78}, {'id': '003', 'name': 'Ben', 'class': "classB", 'age': 16, 'exam_score': 91}, {'id': '004', 'name': 'Max', 'class': …

Total answers: 2
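
A minimal sketch of the filter-then-map pattern, assuming the goal is to keep one class and then apply a per-record function; the function applied here is illustrative, since the question's own function is not shown:

```python
from pyspark import SparkContext

sc = SparkContext(appName="filter-then-map")

my_data = [
    {'id': '001', 'name': 'Sam', 'class': "classA", 'age': 15, 'exam_score': 90},
    {'id': '002', 'name': 'Tom', 'class': "classA", 'age': 15, 'exam_score': 78},
    {'id': '003', 'name': 'Ben', 'class': "classB", 'age': 16, 'exam_score': 91},
]

rdd = sc.parallelize(my_data)

# Keep only classA records, then map each one to an illustrative (name, score) pair.
class_a = (
    rdd.filter(lambda d: d['class'] == "classA")
       .map(lambda d: (d['name'], d['exam_score']))
)
print(class_a.collect())  # [('Sam', 90), ('Tom', 78)]
```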

How to get distinct keys as a list from an RDD in pyspark?

How to get distinct keys as a list from an RDD in pyspark? Question: Here is some example data turned into an RDD: my_data = [{'id': '001', 'name': 'Sam', 'class': "classA", 'age': 15, 'exam_score': '90'}, {'id': '002', 'name': 'Tom', 'class': "classA", 'age': 15, 'exam_score': '78'}, {'id': '003', 'name': 'Ben', 'class': "classB", 'age': 16, 'exam_score': '91'}, …

Total answers: 1
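
A minimal sketch using flatMap plus distinct, assuming "keys" means the dictionary keys of each record:

```python
from pyspark import SparkContext

sc = SparkContext(appName="distinct-keys")

my_data = [
    {'id': '001', 'name': 'Sam', 'class': "classA", 'age': 15, 'exam_score': '90'},
    {'id': '002', 'name': 'Tom', 'class': "classA", 'age': 15, 'exam_score': '78'},
]

rdd = sc.parallelize(my_data)

# Flatten every record's keys into one RDD, de-duplicate, and collect to a plain list.
distinct_keys = rdd.flatMap(lambda d: list(d.keys())).distinct().collect()
print(distinct_keys)  # e.g. ['id', 'name', 'class', 'age', 'exam_score'] (order not guaranteed)
```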

pyspark- how to add a column to spark dataframe from a list

pyspark- how to add a column to spark dataframe from a list Question: I'm looking for a way to add a new column to a Spark DF from a list. With the pandas approach it is very easy to deal with, but in Spark it seems to be relatively difficult. Please find an examp #pandas …

Total answers: 1
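
A minimal sketch of one common workaround: pair both the DataFrame rows and the list with a positional index and join on it. The column names and values below are hypothetical, since the question's data is truncated.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("add-column-from-list").getOrCreate()

df = spark.createDataFrame([("a",), ("b",), ("c",)], ["col1"])
new_values = [10, 20, 30]  # hypothetical list, one entry per existing row

# Attach a positional index to each row, do the same for the list, then join on it.
df_indexed = (
    df.rdd.zipWithIndex()
      .map(lambda row_idx: (row_idx[1], row_idx[0]["col1"]))
      .toDF(["idx", "col1"])
)
list_df = spark.createDataFrame(list(enumerate(new_values)), ["idx", "new_col"])

# Note: the join may not preserve the original row order.
result = df_indexed.join(list_df, "idx").drop("idx")
result.show()
```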

Pyspark rdd : 'RDD' object has no attribute 'flatmap'

Pyspark rdd : 'RDD' object has no attribute 'flatmap' Question: I am new to PySpark and I am trying to build a flatmap out of a PySpark RDD object. However, even though this function clearly exists for the PySpark RDD class according to the documentation, I can't manage to use it and get the following …

Total answers: 1
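
A minimal sketch of the likely fix: the RDD method is camelCase flatMap, so calling flatmap raises the AttributeError in the title.

```python
from pyspark import SparkContext

sc = SparkContext(appName="flatmap-casing")

rdd = sc.parallelize(["a b", "c d"])

# rdd.flatmap(...) raises AttributeError; the method name is camelCase flatMap.
words = rdd.flatMap(lambda line: line.split(" "))
print(words.collect())  # ['a', 'b', 'c', 'd']
```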

How to extract an element from an array in PySpark

How to extract an element from an array in PySpark Question: I have a data frame of the following type: col1|col2|col3|col4 xxxx|yyyy|zzzz|[1111],[2222] I want my output to be of the following type: col1|col2|col3|col4|col5 xxxx|yyyy|zzzz|1111|2222 My col4 is an array, and I want to split it into separate columns. What needs to be done? I saw …

Total answers: 2
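
A minimal sketch using getItem to pull each array position into its own column, assuming col4 really is a two-element array column as shown:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("array-to-columns").getOrCreate()

df = spark.createDataFrame(
    [("xxxx", "yyyy", "zzzz", [1111, 2222])],
    ["col1", "col2", "col3", "col4"],
)

# Read col5 from the original array first, then overwrite col4 with element 0.
result = (
    df.withColumn("col5", F.col("col4").getItem(1))
      .withColumn("col4", F.col("col4").getItem(0))
)
result.show()
```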

Spark union of multiple RDDs

Spark union of multiple RDDs Question: In my Pig code I do this: all_combined = Union relation1, relation2, relation3, relation4, relation5, relation6. I want to do the same with Spark. However, unfortunately, I see that I have to keep doing it pairwise: first = rdd1.union(rdd2) second = first.union(rdd3) third = second.union(rdd4) # …. and …

Total answers: 3
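
A minimal sketch of the usual answer: SparkContext.union takes a whole list of RDDs, so no pairwise chaining is needed.

```python
from pyspark import SparkContext

sc = SparkContext(appName="multi-union")

rdd1 = sc.parallelize([1, 2])
rdd2 = sc.parallelize([3, 4])
rdd3 = sc.parallelize([5, 6])

# sc.union accepts a list, mirroring Pig's multi-relation UNION in one call.
all_combined = sc.union([rdd1, rdd2, rdd3])
print(all_combined.collect())  # [1, 2, 3, 4, 5, 6]
```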

Spark RDD – Mapping with extra arguments

Spark RDD – Mapping with extra arguments Question: Is it possible to pass extra arguments to the mapping function in PySpark? Specifically, I have the following code recipe: raw_data_rdd = sc.textFile("data.json", use_unicode=True) json_data_rdd = raw_data_rdd.map(lambda line: json.loads(line)) mapped_rdd = json_data_rdd.flatMap(processDataLine) The function processDataLine takes extra arguments in addition to the JSON object, as: def processDataLine(dataline, …

Total answers: 1
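
A minimal sketch of the two usual options, a closing lambda or functools.partial. The extra argument names and the body of processDataLine below are hypothetical, since the question truncates them.

```python
import json
from functools import partial
from pyspark import SparkContext

sc = SparkContext(appName="map-extra-args")

def processDataLine(dataline, arg1, arg2):
    # Hypothetical body; the real function's extra arguments are not shown.
    return [(dataline.get("name"), arg1, arg2)]

raw_data_rdd = sc.parallelize(['{"name": "Tom"}', '{"name": "Ivy"}'])
json_data_rdd = raw_data_rdd.map(json.loads)

# Option 1: close over the extra arguments with a lambda.
mapped_rdd = json_data_rdd.flatMap(lambda line: processDataLine(line, "a", "b"))

# Option 2: bind them up front with functools.partial.
mapped_rdd = json_data_rdd.flatMap(partial(processDataLine, arg1="a", arg2="b"))

print(mapped_rdd.collect())
```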

'PipelinedRDD' object has no attribute 'toDF' in PySpark

'PipelinedRDD' object has no attribute 'toDF' in PySpark Question: I’m trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I’ve just installed a fresh Spark 1.5.0 on an Ubuntu 14.04 (no spark-env.sh configured). My my_script.py is: from pyspark.mllib.util import MLUtils from …

Total answers: 2
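
A minimal sketch of the standard fix for that Spark 1.5-era setup: toDF is only attached to RDDs once a SQLContext (or SparkSession) exists, so create one before calling it. The LibSVM path below is hypothetical, since the question's file is not shown.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="todf-fix")

# Creating a SQLContext is what patches the toDF method onto RDDs.
sqlContext = SQLContext(sc)

# Hypothetical LibSVM file path; the question's actual file is not shown.
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
df = data.toDF()
df.show()
```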