How can I access data from a nested dynamic frame to properly format it in PySpark?

Question:

I’ve uploaded some semi-structured data into AWS Glue using a DynamicFrame. From the DynamicFrame I just need the payload element, which I selected by executing the following code in a Glue notebook:

df_p = df.select_fields(["payload"])

I’m trying to convert it to a spark dataframe by executing the following:

Spark_df = df_p.toDF()

Instead of one column per element, I get a single column titled payload. How can I un-nest the data so that each key becomes a column name and its value appears in the rows of the dataframe?

Asked By: abent

Answers:

What you are looking for is called the explode function. It unnests one layer of an array or map column, producing one row per element.

In your case, you would apply it to the spark DF as follows:

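from pyspark.sql.functions import explode

# Keep only the payload field, then convert the DynamicFrame to a Spark DataFrame
df_p = df.select_fields(["payload"])
spark_df = df_p.toDF()

# explode emits one row per array element (or per key/value pair of a map)
exploded_df = spark_df.select(explode("payload"))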

You might need to apply explode again if the content is nested several times, but that is the way to go. Let me know if it helps.

Answered By: Sergio