PySpark problem flattening array with nested JSON and other elements
Question:
I’m struggling with the correct syntax to flatten some data.
I have a dlt table with a column (named lorem for the sake of the example) where each row looks like this:
[{"field1": {"field1_1": null, "field1_2": null},
"field2": "blabla", "field3": 13209914,
"field4": {"field4_1": null, "field4_2": null}, "field5": 4}, ...
]
I want my output to create a new table based on the first that basically creates a row per each element in the array I shared above.
Table should look like:
|field1_1|field1_2|field2|field3|field4_1|field4_2|field5|
|:-------|:-------|:-----|:-----|:-------|:-------|:------|
|null|null|blabla|13209914|null|null|4|
However, when I explode like select(explode("lorem")), I do not get the wanted output: the array elements are exploded into rows, and I can see the other fields, but nothing inside field4 gets flattened into its own column.
My question is, in what other way should I be flattening this data?
I can provide a clearer example if needed.
Answers:
Use withColumn to add the additional columns you need. A simple example:
%%pyspark
from pyspark.sql.functions import col
df = spark.read.json("abfss://[email protected]/raw/flattenJson.json")
# Pull the nested field4 members up into top-level columns.
df2 = (
    df
    .withColumn("field4_1", col("field4.field4_1"))
    .withColumn("field4_2", col("field4.field4_2"))
)
df2.show()
My results: