How to use Pandas_UDF function in Pyspark program

Question

I have a PySpark DataFrame with a million records. It has a column holding a Persian date as a string, and I need to convert it to a Miladi (Gregorian) date. I tried several approaches. First I used a Python UDF, which did not
perform well. Then I wrote a UDF in Scala and used its JAR in the PySpark program, but performance did not change much. I searched and found that pandas_udf is faster,
so I decided to use it; however, I could not get it to work. I used pandas_udf in these ways:

First:

 import pandas as pd
 
 @pandas_udf('long', PandasUDFType.SCALAR)
 def f1(v: pd.Series) -> pd.Series:
   return v.map(lambda x: JalaliDate(int(str(x[1])[0:4]), int(str(x[1])[4:6]), int(str(x[1])[6:8])).to_gregorian())
 
 df.withColumn('date_miladi', f1(df.trx_date)).show()
 
 Error:  TypeError: 'decimal.Decimal' object is not subscriptable

Second:

 import pandas as pd
 from typing import Iterator
 @pandas_udf(DateType())
 def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
     for date in iterator:
         return pd.Series(JalaliDate(int(str(date[1])[0:4]), int(str(date[1])[4:6]), int(str(date[1])[6:8])).to_gregorian())

 df.withColumn('date_miladi', f1(df.trx_date)).show()
 
 Error: TypeError: Return type of the user-defined function should be Pandas.Series, but is <class 'datetime.date'>

Third:

import pandas as pd
@pandas_udf('long', PandasUDFType.SCALAR) 
def f1(v: pd.Series) -> pd.Series:
   return v.map(lambda x: JalaliDate(int(str(x[1])[0:4]), int(str(x[1])[4:6]), int(str(x[1])[6:8])).to_gregorian())

df.withColumn('date_miladi', f1(df.trx_date)).show()    

Error: TypeError: 'decimal.Decimal' object is not subscriptable

Fourth:

import pandas as pd
@pandas_udf(DateType())
def f1(col1: pd.Series) -> pd.Series:
    return (JalaliDate(int(str(col1[1])[0:4]), int(str(col1[1])[4:6]), int(str(col1[1])[6:8])).to_gregorian())
    
df.withColumn('date_miladi', f1(df.trx_date)).show()


Error: Return type of the user-defined function should be Pandas.Series, but is <class 'datetime.date'>

Update:
I also tried an iterator in this way, but it still raises an error:

@pandas_udf("string",PandasUDFType.SCALAR_ITER)
def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    #  making empty Iterator list
    for date in iterator:
        print('type date:', type(date[1]))
        yield str(JalaliDate(int(str(date[1])[0:4]), int(str(date[1])[4:6]), int(str(date[1])[6:8])).to_gregorian())

 Error: AttributeError: 'str' object has no attribute 'isnull'

Dataframe is like this:

 +-----------+-------------+
 |id         | persian_date|
 +-----------+-------------+
 |13085178737| 14010901    |
 |13098336049| 14010901    |
 |13098486609| 14010901    |
 |13097770966| 14010901    |
 |13099744296| 14010901    |
 |13101233891| 14010901    |
 |13100358276| 14010901    |
 +-----------+-------------+

Result should be like this:

 +-----------+-------------+--------------+
 |id         | persian_date| date_miladi  |
 +-----------+-------------+--------------+
 |13085178737| 14010901    |2022-11-22    |
 |13098336049| 14010901    |2022-11-22    |
 |13098486609| 14010901    |2022-11-22    |
 |13097770966| 14010901    |2022-11-22    |
 |13099744296| 14010901    |2022-11-22    |
 |13101233891| 14010901    |2022-11-22    |
 |13100358276| 14010901    |2022-11-22    |
 +-----------+-------------+--------------+

Would you please guide me on the correct way to use pandas_udf in a PySpark program?

Any help is really appreciated.

Asked By: M_Gh

Answer 1

Solution

Import required modules

from typing import Iterator

import pandas as pd
from pyspark.sql import functions as F
from persiantools.jdatetime import JalaliDate

First, define a utility function that converts a persian_date value to a Gregorian date:

def parse_date(s: str):
    s = str(s)
    return JalaliDate(*map(int, [s[:4], s[4:6], s[6:8]])).to_gregorian()
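
The `str(s)` call above is what avoids the "'decimal.Decimal' object is not subscriptable" error from the question: each element of the series is a single scalar, so indexing it with `[1]` tries to subscript one value. A minimal sketch of just the slicing step, using only the standard library; the name `split_yyyymmdd` is hypothetical, and it assumes the value prints as eight plain digits (no exponent notation):

```python
from decimal import Decimal


def split_yyyymmdd(value):
    """Split a yyyymmdd value (str, int, or Decimal) into (year, month, day)."""
    s = str(value)  # Decimal and int cannot be indexed, but their string form can
    return int(s[:4]), int(s[4:6]), int(s[6:8])


raw = Decimal("14010901")  # stand-in for one element of a decimal column

try:
    raw[1]  # the indexing used in the failing attempts
except TypeError as exc:
    error_message = str(exc)  # reports that Decimal is not subscriptable

parts = split_yyyymmdd(raw)  # year, month, day as integers
```

The three integers are what `JalaliDate(...)` receives in `parse_date`; the calendar conversion itself is left to persiantools.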

Now you can try two approaches. I would prefer Approach 1: since you are not doing any heavy initialization in the UDF, there is no point in using iterators.

Approach 1: Series UDF

@F.pandas_udf('date')
def parse_date_pdf(series: pd.Series) -> pd.Series:
    return series.map(parse_date)

Approach 2: Series iterator UDF

@F.pandas_udf('date')
def parse_date_pdf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # looping over iterator yields a pandas series
    for series in iterator:
        yield series.map(parse_date)
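
The shape of the iterator variant can be sketched without Spark. In this rough illustration plain lists stand in for the pandas Series batches, `map_batches` is a hypothetical name, and the lambda only reformats the string as a placeholder (it does not do any calendar conversion). The point is that the generator yields one whole converted batch per input batch, rather than returning from inside the loop or yielding a single string:

```python
from typing import Callable, Iterator, List


def map_batches(
    batches: Iterator[List[str]], fn: Callable[[str], str]
) -> Iterator[List[str]]:
    """Yield one converted batch per input batch, preserving batch length."""
    for batch in batches:
        # Convert every element, then yield the whole batch at once.
        yield [fn(value) for value in batch]


# Two input batches of different sizes (lists stand in for pd.Series here).
batches = iter([["14010901", "14010902"], ["14010903"]])

# Placeholder element function: splits yyyymmdd into yyyy-mm-dd text only.
result = list(map_batches(batches, lambda s: f"{s[:4]}-{s[4:6]}-{s[6:8]}"))
```

In the real UDF, `series.map(parse_date)` plays the role of the list comprehension and returns a Series of the same length as its input.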

Result

df = df.withColumn('date_miladi', parse_date_pdf('persian_date'))
df.show()

+-----------+------------+-----------+
|         id|persian_date|date_miladi|
+-----------+------------+-----------+
|13085178737|    14010901| 2022-11-22|
|13098336049|    14010901| 2022-11-22|
|13098486609|    14010901| 2022-11-22|
|13097770966|    14010901| 2022-11-22|
|13099744296|    14010901| 2022-11-22|
|13101233891|    14010901| 2022-11-22|
|13100358276|    14010901| 2022-11-22|
+-----------+------------+-----------+

Answered By: Shubham Sharma
