How to display a row as a dictionary from a PySpark DataFrame?

Question:

Very new to pyspark.

I have two datasets, Events & Gadgets. They look like so:

Events

[screenshot of the Events dataset]

Gadgets

[screenshot of the Gadgets dataset]

I can read and join the two DataFrames like so, selecting only the needed columns in the last line:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PySpark Read CSV').getOrCreate()

# Read the CSV files, treating the first row as the header
events = spark.read.option("header", True).csv("events.csv")
events.printSchema()

gadgets = spark.read.option("header", True).csv("gadgets.csv")
gadgets.printSchema()

# Join on device id and keep all event columns plus the gadget's User column
enrich = events.join(gadgets, events.deviceId == gadgets.ID).select(events["*"], gadgets["User"])

My assignment asks that I present the data as a dictionary object, like so:

Enrichment Tasks:

  • Enrich the event object with user data provided by the device.
  • Ensure the enriched event looks like the following:
{
    sessionId: string
    deviceId: string
    timestamp: timestamp
    type: enum(ADDED_TO_CART | APP_OPENED)
    total_price: 50.00
    user: string
}

I can handle the dtype changes and the column renaming the assignment asks for, but how do I deliver my results in the dictionary format above?
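For reference, here is roughly how I am handling those casts and renames (a sketch; the column names are assumed from the target schema above):

from pyspark.sql.functions import col

# Cast to the target types and align the column names with the spec
# (the "User" column name comes from the gadgets file; assumed here)
enrich = (enrich
          .withColumn("timestamp", col("timestamp").cast("timestamp"))
          .withColumn("total_price", col("total_price").cast("double"))
          .withColumnRenamed("User", "user"))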

I am also not sure how I can even show my results if I use this line, since it only returns a lazy RDD and prints nothing:

enrich.rdd.map(lambda row: row.asDict())
Asked By: RustyShackleford


Answers:

Use the create_map() function to build a (key, value) entry from each column name and its value.

create_map() expects its input as a flat sequence (key1, value1, key2, value2, …). To flatten the per-column pairs into that shape, use itertools.chain().

df = spark.createDataFrame(
    data=[
        ["sess1", "dev1", "2022-12-19", "enum(ADDED_TO_CART | APP_OPENED)", "50.00", "usr1"],
        ["sess2", "dev2", "2022-12-18", "enum(ADDED_TO_CART | APP_OPENED)", "100.00", "usr2"],
    ],
    schema=["sessionId", "deviceId", "timestamp", "type", "total_price", "user"],
)

import pyspark.sql.functions as F
import itertools

# For every column, pair the column name (as a literal) with its value,
# then flatten the pairs into the single flat list create_map() expects
df = df.withColumn("map",
                   F.create_map(
                       list(itertools.chain(
                           *((F.lit(x), F.col(x)) for x in df.columns)
                       ))
                   ))

df.show(truncate=False)

Output:

+---------+--------+----------+--------------------------------+-----------+----+----------------------------------------------------------------------------------------------------------------------------------------------+
|sessionId|deviceId|timestamp |type                            |total_price|user|map                                                                                                                                           |
+---------+--------+----------+--------------------------------+-----------+----+----------------------------------------------------------------------------------------------------------------------------------------------+
|sess1    |dev1    |2022-12-19|enum(ADDED_TO_CART | APP_OPENED)|50.00      |usr1|{sessionId -> sess1, deviceId -> dev1, timestamp -> 2022-12-19, type -> enum(ADDED_TO_CART | APP_OPENED), total_price -> 50.00, user -> usr1} |
|sess2    |dev2    |2022-12-18|enum(ADDED_TO_CART | APP_OPENED)|100.00     |usr2|{sessionId -> sess2, deviceId -> dev2, timestamp -> 2022-12-18, type -> enum(ADDED_TO_CART | APP_OPENED), total_price -> 100.00, user -> usr2}|
+---------+--------+----------+--------------------------------+-----------+----+----------------------------------------------------------------------------------------------------------------------------------------------+

You can also serialize the map column to a JSON string using:

df = df.withColumn("json", F.to_json("map"))
Answered By: Azhar Khan