spark dataframe convert a few flattened columns to one array of struct column

Question:

I’d like some guidance on which Spark DataFrame functions, together with Scala or Python code, can achieve this transformation.

given a dataframe which has below columns

columnA, columnB, columnA1, columnB1, columnA2, columnB2 .... columnA10, columnB10
eg.
Fat Value, Fat Measure, Salt Value, Salt Measure, Iron Value, Iron Measure
10, mg, 2, mg, etc.

I’d like to convert these into a single column of type array of struct.
eg.

type=Fat
amount=10
measure=mg

type=Salt
amount=2
measure=mg

Answers:

The structure of your data is a bit problematic: the first row holds the category names, while the remaining rows hold the values for each category. I don’t know how you’re loading the data, but if possible I would load that first row as the header; failing that, there is a workaround to promote the first row of the data to the header. This makes it much easier to construct the F.struct objects, because you can pass the column names as string literals instead of reading them from the first row.

Here is a sample pyspark dataframe similar to yours:

+---------+-----------+----------+------------+----------+------------+
|  columnA|    columnB|  columnA1|    columnB1|  columnA2|    columnB2|
+---------+-----------+----------+------------+----------+------------+
|Fat Value|Fat Measure|Salt Value|Salt Measure|Iron Value|Iron Measure|
|       10|         mg|         2|          mg|         1|          mg|
|       20|         mg|        22|          mg|        12|          mg|
+---------+-----------+----------+------------+----------+------------+

And here is the modified one:

from pyspark.sql import functions as F

# Promote the first row to column names, then rebuild the dataframe
# from the remaining rows (df.tail requires Spark 3.0+).
new_schema = [x for x in df.collect()[0]]
df2 = spark.createDataFrame(df.tail(df.count() - 1), new_schema)

+---------+-----------+----------+------------+----------+------------+
|Fat Value|Fat Measure|Salt Value|Salt Measure|Iron Value|Iron Measure|
+---------+-----------+----------+------------+----------+------------+
|       10|         mg|         2|          mg|         1|          mg|
|       20|         mg|        22|          mg|        12|          mg|
+---------+-----------+----------+------------+----------+------------+

Assuming the desired information always comes in pairs of columns, we can group each consecutive pair: [["Fat Value", "Fat Measure"], ["Salt Value", "Salt Measure"], ...]. Then we can create each struct from the values in the columns as well as the column names themselves.

col_groups = [[df2.columns[2*i], df2.columns[2*i+1]] for i in range(len(df2.columns) // 2)]
# [['Fat Value', 'Fat Measure'],
#  ['Salt Value', 'Salt Measure'],
#  ['Iron Value', 'Iron Measure']]

df3 = df2.select(
    [
        F.struct(
            F.col(value_col).alias("Value"),
            F.col(measure_col).alias("Measure"),
            F.lit(value_col.split(" ")[0]).alias("Type"),
        ).alias(value_col.split(" ")[0] + " info")
        for value_col, measure_col in col_groups
    ]
)

+-------------+--------------+--------------+
|     Fat info|     Salt info|     Iron info|
+-------------+--------------+--------------+
|{10, mg, Fat}| {2, mg, Salt}| {1, mg, Iron}|
|{20, mg, Fat}|{22, mg, Salt}|{12, mg, Iron}|
+-------------+--------------+--------------+

And below is the schema of df3:

root
 |-- Fat info: struct (nullable = false)
 |    |-- Value: string (nullable = true)
 |    |-- Measure: string (nullable = true)
 |    |-- Type: string (nullable = false)
 |-- Salt info: struct (nullable = false)
 |    |-- Value: string (nullable = true)
 |    |-- Measure: string (nullable = true)
 |    |-- Type: string (nullable = false)
 |-- Iron info: struct (nullable = false)
 |    |-- Value: string (nullable = true)
 |    |-- Measure: string (nullable = true)
 |    |-- Type: string (nullable = false)
Answered By: Derek O