How to unstack a column into multiple columns in PySpark?
Question:
I have a csv file which contains data in the below format:

| row_num | classes |
|---|---|
| 1 | 0:0.2,1:0.3,2:0.5 |
| 2 | 0:0.1,1:0.5,2:0.4 |
| 3 | 0:0.4,1:0.5,2:0.1 |
| 4 | 0:0.2,1:0.4,2:0.4 |
I want it to be converted as follows:

| row_num | class_0 | class_1 | class_2 |
|---|---|---|---|
| 1 | 0.2 | 0.3 | 0.5 |
| 2 | 0.1 | 0.5 | 0.4 |
| 3 | 0.4 | 0.5 | 0.1 |
| 4 | 0.2 | 0.4 | 0.4 |
Please help me with this transformation using pyspark.
Answers:
TL;DR:

```python
from pyspark.sql import functions as F

(df.select("row_num", F.explode(F.split("classes", ",")).alias("keyValue"))
   .select("row_num",
           F.split("keyValue", ":")[0].alias("key"),
           F.split("keyValue", ":")[1].alias("value"))
   .groupBy("row_num")
   .pivot("key")
   .agg(F.first("value"))
   .show())
```
Output: one row per `row_num`, with one pivoted column per class key (`0`, `1`, `2`).
Essentially:

- Split the column to form an array:

```python
df.select("row_num", F.split("classes", ",").alias("as_array")).show(truncate=False)
```

- Explode it to get one row per array element, keeping the corresponding `row_num`:

```python
df.select("row_num", F.explode("as_array").alias("asKeyValue"))
```

- Split the keys and values into separate columns:

```python
df.select("row_num",
          F.split("asKeyValue", ":")[0].alias("key"),
          F.split("asKeyValue", ":")[1].alias("value"))
```

- Pivot on the key: group by `row_num`, pivot on `key`, and take the first `value` for each cell.
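To see what these split/explode/pivot steps compute without spinning up a Spark session, here is a minimal pure-Python sketch of the same logic (an illustration only, not part of the PySpark answer; the sample data is taken from the question):

```python
# Sample rows in the question's "key:value,key:value" format.
rows = {
    1: "0:0.2,1:0.3,2:0.5",
    2: "0:0.1,1:0.5,2:0.4",
}

pivoted = {}
for row_num, classes in rows.items():
    # Split on "," (the explode step), then on ":" (the key/value step).
    pairs = (kv.split(":") for kv in classes.split(","))
    # Pivot: one entry per key, matching the class_0/class_1/class_2 columns.
    pivoted[row_num] = {f"class_{k}": float(v) for k, v in pairs}

print(pivoted[1])  # {'class_0': 0.2, 'class_1': 0.3, 'class_2': 0.5}
```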
Alternatively, you can do the transformation with plain Python's `csv` module:
```python
import csv

# Open the input CSV file
with open('input.csv', 'r') as input_file:
    reader = csv.reader(input_file)
    # Skip the header row
    next(reader)

    # Open the output CSV file
    with open('output.csv', 'w', newline='') as output_file:
        writer = csv.writer(output_file)
        # Write the header row
        writer.writerow(['row_num', 'class_0', 'class_1', 'class_2'])

        # Loop over the rows in the input file
        for row in reader:
            # Split the 'classes' field on ',' and parse each "key:value" pair
            class_values = row[1].split(',')
            class_dict = {int(x.split(':')[0]): float(x.split(':')[1]) for x in class_values}
            # Write the row; missing classes default to 0.0
            writer.writerow([row[0], class_dict.get(0, 0.0), class_dict.get(1, 0.0), class_dict.get(2, 0.0)])
```
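As a quick sanity check, the same parsing logic can be run against an in-memory sample with `io.StringIO` instead of real files (the sample data comes from the question):

```python
import csv
import io

# Sample input matching the question's format, header included.
input_rows = [
    ["row_num", "classes"],
    ["1", "0:0.2,1:0.3,2:0.5"],
    ["2", "0:0.1,1:0.5,2:0.4"],
]

output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["row_num", "class_0", "class_1", "class_2"])

for row in input_rows[1:]:  # skip the header row
    # Same dict comprehension as in the script above.
    class_dict = {int(x.split(":")[0]): float(x.split(":")[1]) for x in row[1].split(",")}
    writer.writerow([row[0], class_dict.get(0, 0.0), class_dict.get(1, 0.0), class_dict.get(2, 0.0)])

print(output.getvalue())
```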