How to get distinct keys as a list from an RDD in pyspark?

Question:

Here is some example data turned into an RDD:

my_data = [{'id': '001', 'name': 'Sam', 'class': "classA", 'age': 15, 'exam_score': '90'},
           {'id': '002', 'name': 'Tom', 'class': "classA", 'age': 15, 'exam_score': '78'},
           {'id': '003', 'name': 'Ben', 'class': "classB", 'age': 16, 'exam_score': '91'},
           {'id': '004', 'name': 'Max', 'class': "classB", 'age': 16, 'exam_score': '76'},
           {'id': '005', 'name': 'Ana', 'class': "classA", 'age': 15, 'exam_score': '88'},
           {'id': '006', 'name': 'Ivy', 'class': "classA", 'age': 16, 'exam_score': '77'},
           {'id': '007', 'name': 'Eva', 'class': "classB", 'age': 15, 'exam_score': '86'},
           {'id': '008', 'name': 'Zoe', 'class': "classB", 'age': 16, 'exam_score': '89'}]

my_rdd = sc.parallelize(my_data)

Running my_rdd by itself returns:

#>>> ParallelCollectionRDD[117] at readRDDFromFile at PythonRDD.scala:274

And I know you can display the RDD with my_rdd.collect() which returns:

#[{'age': 15,
#  'class': 'classA',
#  'exam_score': '90',
#  'id': '001',
#  'name': 'Sam'},
# {'age': 15,
#  'class': 'classA',
#  'exam_score': '78',
#  'id': '002',
#  'name': 'Tom'}, ...]

I have found that I can access the keys by running my_rdd.keys(), but this returns:

#>>> PythonRDD[121] at RDD at PythonRDD.scala:53

I want to return a list of all the distinct keys in the RDD (I know the keys are the same for each record here, but I'd like to know how to handle a scenario where they aren't), so something that looks like this:

#>>> ['id', 'name', 'class', 'age', 'exam_score']

So I assumed I could get this by running my_rdd.keys().distinct.collect(), but I get an error instead.

I'm still learning pyspark, so I would really appreciate it if someone could provide some input 🙂

Asked By: cinnamon662


Answers:

Iterating over a Python dict yields its keys, so flatMap with the identity function flattens each record into its key names:

my_data = [{'id': '001', 'name': 'Sam', 'class': "classA", 'age': 15, 'exam_score': '90'},
           {'id': '008', 'name': 'Zoe', 'xxxxx': "classB", 'age': 16, 'exam_score': '89'},
           {'id': '007', 'name': 'Eva', 'class': "classB", 'age': 15, 'exam_score': '86'},
        ]

my_rdd = sc.parallelize(my_data)

key_rdd = my_rdd.flatMap(lambda x: x)  # iterating each dict yields its keys

print( key_rdd.collect() )
# ['id', 'name', 'class', 'age', 'exam_score', 'id', 'name', 'xxxxx', 'age', 'exam_score', 'id', 'name', 'class', 'age', 'exam_score']
print( key_rdd.distinct().collect() )
# ['class', 'id', 'exam_score', 'age', 'name', 'xxxxx']
Answered By: Linus
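As a side note, the original attempt fails for two reasons: `distinct` is missing its parentheses (it should be `distinct()`), and `RDD.keys()` is meant for pair RDDs of (key, value) tuples, so it maps each element to `x[0]`, which raises a KeyError on a dict. The flatMap trick above works because iterating a dict in Python yields its keys. Here is a plain-Python sketch of the same flatten-then-deduplicate idea (no Spark needed; the two sample dicts are just illustrative):

```python
my_data = [
    {'id': '001', 'name': 'Sam', 'class': 'classA'},
    {'id': '008', 'name': 'Zoe', 'xxxxx': 'classB'},
]

# Iterating a dict yields its keys; this is what flatMap(lambda x: x)
# does to each record in the RDD.
all_keys = [key for record in my_data for key in record]
print(all_keys)
# ['id', 'name', 'class', 'id', 'name', 'xxxxx']

# Deduplicate while preserving first-seen order
# (dict.fromkeys keeps insertion order in Python 3.7+).
distinct_keys = list(dict.fromkeys(all_keys))
print(distinct_keys)
# ['id', 'name', 'class', 'xxxxx']
```

Note that Spark's `distinct()` gives no ordering guarantee, which is why the answer above prints the keys in an arbitrary order, whereas the local sketch preserves first-seen order.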