PySpark problem flattening array with nested JSON and other elements

Question:

I’m struggling with the correct syntax to flatten some data.

I have a DLT (Delta Live Tables) table with a column (named lorem for the sake of this example) where each row looks like this:

[{"field1": {"field1_1": null, "field1_2": null}, 
  "field2": "blabla", "field3": 13209914, 
  "field4": {"field4_1": null, "field4_2": null}, "field5": 4}, ...
]

I want to create a new table based on the first, with one row per element of the array shown above.

The table should look like this:

|field1_1|field1_2|field2|field3|field4_1|field4_2|field5|
|:-------|:-------|:-----|:-----|:-------|:-------|:------|
|null|null|blabla|13209914|null|null|4|

However, when I explode with select(explode("lorem")), I do not get the desired output: I get only the exploded element as a single column, and the other fields come through except for everything inside field4.

My question is: in what other way should I be flattening this data?
I can provide a clearer example if needed.
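
For reference, the attempt looked roughly like this (a minimal sketch; df is assumed to be the source table loaded as a DataFrame, and lorem is the array column from above):

from pyspark.sql.functions import explode

# df is assumed to hold the source table with the lorem array column
df.select(explode("lorem")).show()
# this yields one row per array element, but as a single struct
# column named "col", not as the flattened fields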

Asked By: António Mendes


Answers:

Use withColumn to add the additional columns you need. A simple example:

%%pyspark
from pyspark.sql.functions import col

# Read the raw JSON file (the abfss path below is a placeholder)
df = spark.read.json("abfss://[email protected]/raw/flattenJson.json")

# Promote the nested field4 struct members to top-level columns
df2 = df \
    .withColumn("field4_1", col("field4.field4_1")) \
    .withColumn("field4_2", col("field4.field4_2"))

df2.show()

My results: (screenshot of the df2.show() output omitted)
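
For the asker's case, where lorem is an array of structs, the explode and the nested-field selection can be combined in one pass. A minimal sketch, assuming the schema from the question (the inline sample data is made up to keep it runnable):

from pyspark.sql.functions import col, explode

# Sample data shaped like the question's lorem column (values illustrative)
data = [
    ([
        {"field1": {"field1_1": None, "field1_2": None},
         "field2": "blabla", "field3": 13209914,
         "field4": {"field4_1": None, "field4_2": None}, "field5": 4},
    ],)
]
schema = (
    "lorem array<struct<"
    "field1:struct<field1_1:string,field1_2:string>,"
    "field2:string,field3:long,"
    "field4:struct<field4_1:string,field4_2:string>,field5:long>>"
)
df = spark.createDataFrame(data, schema)

# One row per array element, then flatten the nested structs
flat = (
    df.select(explode("lorem").alias("e"))
      .select(
          col("e.field1.field1_1"),
          col("e.field1.field1_2"),
          col("e.field2"),
          col("e.field3"),
          col("e.field4.field4_1"),
          col("e.field4.field4_2"),
          col("e.field5"),
      )
)
flat.show()

This produces one row per array element with the seven columns from the expected table above.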

Answered By: wBob – MSFT