PySpark DataFrame Function

Question:

The problem I’m having is converting the following Python code to PySpark. I’m extremely new to PySpark. I have a column of float data, and for each row I want to perform a calculation based on the floor of that row’s value: I want to add up the discounted values for every year leading up to the ‘ExpirationPeriod’ and write that sum into a new ‘output’ column. This is my Python code:

import pandas as pd

df = pd.DataFrame({'ExpirationPeriod':[1.2,2.0,3.0,4.5]})

def sum_data(row):
    return sum([100/1.02**i + 5 for i in range(int(row['ExpirationPeriod']))])

df['output'] = df.apply(sum_data, axis=1)

And this is what my attempt in PySpark looks like:

from pyspark.sql.functions import udf, struct, col
from pyspark.sql.types import IntegerType, DoubleType

df = spark.createDataFrame(
    [(1.2, ),  # create your data here, be consistent in the types.
     (2.0, ),
     (3.0, ),
     (4.5, )],
    ["ExpirationPeriod"])

function = udf(lambda row: [sum([i for i in range(x)]) for x in row], DoubleType())

df = df.withColumn("Counts", df["ExpirationPeriod"].cast(IntegerType()))
new_df = df.withColumn("new", function(struct([df['Counts']])))

new_df.show()

I’m not sure what’s wrong with my code or what I might have to change to mirror my Python code.

Asked By: Mike P.


Answers:

You can use the sum_data function that works in Pandas directly in Spark with a minor change, even without using the Pandas API on Spark:

from pyspark.sql.functions import udf, struct, col
from pyspark.sql.types import IntegerType, DoubleType

df = spark.createDataFrame(
    [(1.2, ),  # create your data here, be consistent in the types.
     (2.0, ),
     (3.0, ),
     (4.5, )],
    ["ExpirationPeriod"])

@udf(returnType=DoubleType())
def sum_data(row):
    return sum([100/1.02**i + 5 for i in range(row)])

df = df.withColumn("Counts", df["ExpirationPeriod"].cast(IntegerType()))
new_df = df.withColumn("new", sum_data(col('Counts')))

new_df.show()
+----------------+------+------------------+
|ExpirationPeriod|Counts|               new|
+----------------+------+------------------+
|             1.2|     1|             105.0|
|             2.0|     2| 208.0392156862745|
|             3.0|     3|309.15609381007306|
|             4.5|     4| 408.3883272647775|
+----------------+------+------------------+
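
As an aside, if you did want to stay close to your original Pandas code, the Pandas API on Spark mentioned above lets you reuse an apply almost verbatim. Here is a minimal sketch (assuming Spark 3.2+, where pyspark.pandas ships with Spark; psdf is just an illustrative name):

import pyspark.pandas as ps

# Pandas-on-Spark DataFrame: Pandas-style syntax, executed by Spark.
psdf = ps.DataFrame({'ExpirationPeriod': [1.2, 2.0, 3.0, 4.5]})
psdf['output'] = psdf['ExpirationPeriod'].apply(
    lambda x: sum([100/1.02**i + 5 for i in range(int(x))])
)
print(psdf)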

There are two reasons why your Spark code is not working:

  1. Passing struct([df['Counts']]) to your UDF: you only need to pass the required column, Counts, so col('Counts') is all that is needed.
  2. When you use a UDF this way in Spark, for x in row does not work, because the value passed in is a plain integer, not an iterable (see the sketch below for how the struct variant would have to unpack a Row).
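
For illustration, here is a minimal sketch of what the struct variant would have to look like (the sum_data_from_row name is just for this example): because the UDF receives a Row rather than a plain integer, it has to unpack the value itself before calling range:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import DoubleType

# Hypothetical variant that keeps the struct(...) wrapper: the UDF now
# receives a Row object, so the integer must be unpacked explicitly.
@udf(returnType=DoubleType())
def sum_data_from_row(row):
    n = row['Counts']  # pull the integer out of the Row
    return sum([100/1.02**i + 5 for i in range(n)])

df.withColumn("new", sum_data_from_row(struct(df['Counts']))).show()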

Edit 1 on 2023-04-07

The decorator itself doesn’t have any impact on your function; it’s just a design pattern that wraps your sum_data function. Written in your original coding style, it would look like this:

from pyspark.sql import functions as func

new_df = df.withColumn(
    "new",
    func.udf(lambda row: sum([100/1.02**i + 5 for i in range(row)]), returnType=DoubleType())(func.col('Counts'))
)

What really affects sum_data is func.udf. A Spark DataFrame is a JVM structure, while your function is implemented in Python, so to apply your Python logic the data has to be serialized and deserialized as it moves between the JVM and the Python workers. That is why func.udf is needed whenever you want to perform a custom transformation on a Spark DataFrame.
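
If you want to avoid that serialization round trip altogether, one alternative (a sketch, assuming Spark 2.4+ where the sequence and aggregate higher-order SQL functions are available, and that Counts is always at least 1) is to express the same discounted sum with built-in functions so the whole computation stays inside the JVM:

from pyspark.sql import functions as F

# Same calculation with built-in SQL functions: build the array
# [0, 1, ..., Counts-1] and fold it into the discounted sum.
no_udf_df = df.withColumn(
    "new",
    F.expr("aggregate(sequence(0, Counts - 1), cast(0 as double), (acc, i) -> acc + 100/power(1.02, i) + 5)")
)
no_udf_df.show()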

Answered By: Jonathan Lam