Explode a PySpark column with the root name intact

Question:

I have a PySpark DataFrame whose schema looks like this:

 |-- col1: timestamp (nullable = true)
 |-- col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- NM: string (nullable = true)

How can I explode col2 so that the final column names look like col1, col2.NM, etc.?

Asked By: ista120


Answers:

Update:

Since you have multiple such columns, you can create a list of those columns and use the following:

import pyspark.sql.functions as F

cols_to_explode = ["col2", "col3"]
# Columns that pass through unchanged
other_cols = [F.col(c) for c in df.schema.names if c not in cols_to_explode]
# One aliased column per struct field, e.g. col2.NM becomes col2_NM
struct_cols = [
    F.col(col + "." + c).alias(col + "_" + c)
    for col in df.schema.names if col in cols_to_explode
    for c in df.withColumn(col, F.explode(col)).selectExpr(col + ".*").columns
]

(
    df
    .withColumn("asZipped", F.arrays_zip(*cols_to_explode))
    .withColumn("asZipped", F.explode("asZipped"))
    .select(other_cols + [F.col("asZipped." + col).alias(col) for col in df.schema.names if col in cols_to_explode])
    .select(other_cols + struct_cols)
    .show(truncate=False)
)

Input: (image: Multi_col_input)

Output: (image: Output)
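
Since the screenshots are not reproduced here, a rough plain-Python model of what the zip-then-explode step does to one row may help. This is not Spark code: `zip_and_explode`, the sample row and the `ID` field in col3 are made up for illustration, and it assumes every array is non-null. The point is that arrays_zip pairs elements by position and pads the shorter array with null, so you get one output row per index rather than the cross product that two chained explodes would give.

```python
from itertools import zip_longest


def zip_and_explode(row, array_cols):
    """Plain-Python model of arrays_zip followed by explode (hypothetical helper)."""
    # Columns that are carried through unchanged.
    others = {k: v for k, v in row.items() if k not in array_cols}
    arrays = [row[c] for c in array_cols]
    out = []
    # zip_longest pads the shorter arrays with None, mirroring arrays_zip.
    for elements in zip_longest(*arrays):
        new_row = dict(others)
        for col, element in zip(array_cols, elements):
            new_row[col] = element
        out.append(new_row)
    return out


# Made-up sample row: col2 has two elements, col3 only one.
row = {"col1": 1, "col2": [{"NM": "a"}, {"NM": "b"}], "col3": [{"ID": 10}]}
zipped = zip_and_explode(row, ["col2", "col3"])
```

Here `zipped` has two rows, and the second one has `None` for col3 because that array ran out first.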


For a single column, this would work:

import pyspark.sql.functions as F

exploded = df.withColumn("col2", F.explode("col2"))
exploded.select(
    [F.col(c) for c in df.schema.names if c != "col2"]
    + [F.col("col2." + c).alias("col2_" + c) for c in exploded.selectExpr("col2.*").columns]
).show()

Input DF: (image: Input)

Output: (image: Output)
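
To make the expected result concrete without the screenshots, here is a small plain-Python sketch of the same row-level transformation. It is only a model of the semantics, not Spark code: `explode_and_flatten` and the sample rows are hypothetical, and it assumes every array element is a non-null struct (a dict here). It also shows one thing worth knowing about F.explode: a row whose array is null or empty produces no output rows at all (F.explode_outer keeps such rows).

```python
def explode_and_flatten(rows, array_col):
    """Plain-Python model of explode plus struct flattening (hypothetical helper)."""
    out = []
    for row in rows:
        # Like F.explode, a null or empty array yields no output rows.
        for element in row.get(array_col) or []:
            new_row = {k: v for k, v in row.items() if k != array_col}
            # Prefix each struct field with the root column name.
            for field, value in element.items():
                new_row[f"{array_col}_{field}"] = value
            out.append(new_row)
    return out


# Made-up sample data matching the schema in the question.
rows = [
    {"col1": "2021-01-01", "col2": [{"NM": "a"}, {"NM": "b"}]},
    {"col1": "2021-01-02", "col2": []},
]
result = explode_and_flatten(rows, "col2")
```

The first row turns into two rows with columns col1 and col2_NM, and the second row disappears because its array is empty.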

Answered By: Ronak Jain