PySpark: How to extract variables from a struct nested in a struct inside an array?
Question:
The following is a toy example, a subset of my actual data’s schema, abbreviated for brevity.
I am looking to build a PySpark dataframe that contains 3 fields, ID, Type, and TIMESTAMP, that I would then save as a Hive table. I am struggling with the PySpark code to extract the relevant columns.
|-- Records: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FileID: long (nullable = true)
| | |-- SrcFields: struct (nullable = true)
| | | |-- ID: string (nullable = true)
| | | |-- Type: string (nullable = true)
| | | |-- TIMESTAMP: string (nullable = true)
Thus far, I imagine my solution should look something like:
from pyspark.sql.functions import col, explode
df.withColumn("values", explode("values")).select(
    "*", col("values")["name"].alias("name"), col("values")["id"].alias("id")
)
However, the solution above doesn’t account for the extra nesting in my use case, and I’m unable to figure out the additional syntax required.
Answers:
In PySpark you can access the subfields of a struct using dot notation, so something like this should work:
- Explode the array
- Use dot notation to access the subfields of the struct
(
    df.withColumn("values", explode("Records"))
    .select(
        col("values.SrcFields.ID").alias("id"),
        col("values.SrcFields.Type").alias("type"),
        col("values.SrcFields.TIMESTAMP").alias("timestamp"),
    )
)