PySpark DataFrame Function
Question:
The problem I’m having is converting the following Python code to PySpark. I’m extremely new to PySpark, but I have a column of float data and, for each row, I want to perform a calculation based on the floor of the value in that row. I want to add the discounted values for every year leading up to the ‘ExpirationPeriod’ and output this value into a new ‘output’ column. This is my Python code:
import pandas as pd

df = pd.DataFrame({'ExpirationPeriod': [1.2, 2.0, 3.0, 4.5]})

def sum_data(row):
    return sum([100/1.02**i + 5 for i in range(int(row['ExpirationPeriod']))])

df['output'] = df.apply(sum_data, axis=1)
And this is what my attempt in PySpark looks like:
from pyspark.sql.functions import udf, struct, col
from pyspark.sql.types import IntegerType, DoubleType

df = spark.createDataFrame(
    [(1.2, ),  # create your data here, be consistent in the types.
     (2.0, ),
     (3.0, ),
     (4.5, )],
    ["ExpirationPeriod"])

function = udf(lambda row: [sum([i for i in range(x)]) for x in row], DoubleType())

df = df.withColumn("Counts", df["ExpirationPeriod"].cast(IntegerType()))
new_df = df.withColumn("new", function(struct([df['Counts']])))
new_df.show()
I’m not sure what isn’t working in my code, or what I need to change to mirror my Python code.
Answers:
You can use the sum_data function that works in Pandas directly in Spark with a minor change, even without using the Pandas API on Spark:
from pyspark.sql.functions import udf, struct, col
from pyspark.sql.types import IntegerType, DoubleType

df = spark.createDataFrame(
    [(1.2, ),  # create your data here, be consistent in the types.
     (2.0, ),
     (3.0, ),
     (4.5, )],
    ["ExpirationPeriod"])

@udf(returnType=DoubleType())
def sum_data(row):
    return sum([100/1.02**i + 5 for i in range(row)])

df = df.withColumn("Counts", df["ExpirationPeriod"].cast(IntegerType()))
new_df = df.withColumn("new", sum_data(col('Counts')))
new_df.show()
+----------------+------+------------------+
|ExpirationPeriod|Counts| new|
+----------------+------+------------------+
| 1.2| 1| 105.0|
| 2.0| 2| 208.0392156862745|
| 3.0| 3|309.15609381007306|
| 4.5| 4| 408.3883272647775|
+----------------+------+------------------+
There are two reasons why your Spark code is not working:
- Passing struct([df['Counts']]) to your UDF: you only need to pass the required column, Counts, into the UDF, so col('Counts') alone is enough (see the sketch after this list).
- When you use a UDF this way, for x in row does not work, because the value passed into the function is a single integer, not an iterable.
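If you did want to keep the struct(...) approach instead, note that the UDF then receives a Row object and you have to index into it yourself. Here is a minimal sketch, assuming the same df as above (sum_data_row is just an illustrative name, not part of the original code):

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import DoubleType

@udf(returnType=DoubleType())
def sum_data_row(row):
    # struct(...) arrives in Python as a Row, so pull the field out by name
    n = row['Counts']
    return sum([100/1.02**i + 5 for i in range(n)])

new_df = df.withColumn("new", sum_data_row(struct(df['Counts'])))
new_df.show()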
Edit 1 on 2023-04-07
The decorator itself doesn’t have any impact on your function; it’s just a design pattern that wraps your sum_data function. If I use your coding style, it would look like this:
import pyspark.sql.functions as func
from pyspark.sql.types import DoubleType

new_df = df.withColumn(
    "new",
    func.udf(lambda row: sum([100/1.02**i + 5 for i in range(row)]), returnType=DoubleType())(func.col('Counts'))
)
What really matters for your sum_data is func.udf. A Spark DataFrame is a JVM structure, while your function is implemented in Python, so to run logic written in Python the data has to be serialized and deserialized as it moves between the JVM and the Python worker. That is why func.udf is needed whenever you want to apply a customized transformation to data in a Spark DataFrame.
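As a side note, if you want to avoid that serialization cost entirely, the same discounted sum can be expressed with Spark’s built-in higher-order functions so everything stays in the JVM. This is a rough sketch, assuming Spark 3.1+ (for functions.aggregate) and Counts >= 1:

from pyspark.sql import functions as F

no_udf_df = df.withColumn(
    "new",
    F.aggregate(
        F.sequence(F.lit(0), F.col("Counts") - 1),   # array [0, 1, ..., Counts-1]
        F.lit(0.0),                                   # start the running sum at 0.0
        lambda acc, i: acc + F.lit(100.0) / F.pow(F.lit(1.02), i) + F.lit(5.0),
    ),
)
no_udf_df.show()

Because no Python UDF is involved, there is no round trip to a Python worker for each row.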