How to unstack a column to create multiple columns out of it in pyspark?

Question:

I have csv file which contains data in below format

row_num classes
1 0:0.2,1:0.3,2:0.5
2 0:0.1,1:0.5:2:0.4
3 0:0.4,1:0.5:2:0.1
4 0:0.2,1:0.4:2:0.4

I want it to be converted as follows:

row_num class_0 class_1 class_2
1 0.2 0.3 0.5
2 0.1 0.5 0.4
3 0.4 0.5 0.1
4 0.2 0.4 0.4

Please help me with this transformation using pyspark.

Asked By: suraj jadhav

||

Answers:

TLDR-

df.select("row", F.explode(F.split("classes",",")).alias("keyValue")).select("row", F.split("keyValue",":")[0].alias("key"), F.split("keyValue",":")[1].alias("value")).groupBy("row").pivot("key").agg(F.first("value")).show()

Output

Output

Essentially,

  1. Split the column to form an array

    df.select("row", F.split("classes",",").alias("as_array")).show(truncate=False)

AsArray

  1. Explode it to form rows for each element in the array corresponding to column "row_num"

df.select("row", F.explode("as_array").alias("asKeyValue"))

Explode

  1. Split key and values as separate columns

df.select("row", F.split("asKeyValue",":")[0].alias("key"), F.split("asKeyValue",":")[1].alias("value"))

SplitKV

  1. Pivot based on Row Number
Answered By: Ronak Jain

you can do transformation with this :

import csv

# Open the input CSV file
with open('input.csv', 'r') as input_file:
  # Create a CSV reader object
  reader = csv.reader(input_file)
  # Skip the header row
  next(reader)
  
  # Open the output CSV file
  with open('output.csv', 'w', newline='') as output_file:
    # Create a CSV writer object
    writer = csv.writer(output_file)
    # Write the header row
    writer.writerow(['row_num', 'class_0', 'class_1', 'class_2'])
    
    # Loop over the rows in the input file
    for row in reader:
      # Split the 'classes' field on ','
      class_values = row[1].split(',')
      # Convert the values to a dictionary
      class_dict = {int(x.split(':')[0]): float(x.split(':')[1]) for x in class_values}
      # Write the row to the output file
      writer.writerow([row[0], class_dict.get(0, 0.0), class_dict.get(1, 0.0), class_dict.get(2, 0.0)])

Result :

enter image description here

Answered By: Mohammed Chaaraoui