How to use Pandas_UDF function in Pyspark program
Question:
I have a Pyspark dataframe with million records. It has a column with string persian date and need to convert it to miladi date.I tried several approuches, first I used UDF function in Python which did not
have good performance. Then I wrote UDF function in Scala and used its Jar in Pyspark program; but performace did not change very much. I searched and found that pandas_UDF has better speed;
so, I decided to use it, however, it did not work very well. I used Pandas_UDF in these ways:
First:
import pandas as pd
@pandas_udf('long', PandasUDFType.SCALAR)
def f1(v: pd.Series) -> pd.Series:
return v.map(lambda x: JalaliDate(int(str(x[1])[0:4]), int(str(x[1])[4:6]), int(str(x[1])[6:8])).to_gregorian())
df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: TypeError: 'decimal.Decimal' object is not subscriptable
Second:
import pandas as pd
from typing import Iterator
@pandas_udf(DateType())
def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
for date in iterator:
return pd.Series(JalaliDate(int(str(date[1])[0:4]), int(str(date[1])[4:6]), int(str(date[1])[6:8])).to_gregorian())
df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: TypeError: Return type of the user-defined function should be Pandas.Series, but is <class 'datetime.date'>
Thirth:
import pandas as pd
@pandas_udf('long', PandasUDFType.SCALAR)
def f1(v: pd.Series) -> pd.Series:
return v.map(lambda x: JalaliDate(int(str(x[1])[0:4]), int(str(x[1])[4:6]), int(str(x[1])[6:8])).to_gregorian())
df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: TypeError: 'decimal.Decimal' object is not subscriptable
Fourth:
import pandas as pd
@pandas_udf(DateType())
def f1(col1: pd.Series) -> pd.Series:
return (JalaliDate(int(str(col1[1])[0:4]), int(str(col1[1])[4:6]), int(str(col1[1])[6:8])).to_gregorian())
df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: Return type of the user-defined function should be Pandas.Series, but is <class 'datetime.date'>
Update:
I use iterate
in this way, but it still has error:
@pandas_udf("string",PandasUDFType.SCALAR_ITER)
def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
# making empty Iterator list
for date in iterator:
print('type date:', type(date[1]))
yield str(JalaliDate(int(str(date[1])[0:4]), int(str(date[1])[4:6]), int(str(date[1])[6:8])).to_gregorian())
Error: AttributeError: 'str' object has no attribute 'isnull'
Dataframe is like this:
+-----------+-------------+
|id | persian_date|
+-----------+-------------+
|13085178737| 14010901 |
|13098336049| 14010901 |
|13098486609| 14010901 |
|13097770966| 14010901 |
|13099744296| 14010901 |
|13101233891| 14010901 |
|13100358276| 14010901 |
+-----------+-------------+
Result should be like this:
+-----------+-------------+--------------+
|id | persian_date| date_miladi |
+-----------+-------------+--------------+
|13085178737| 14010901 |2022-11-22 |
|13098336049| 14010901 |2022-11-22 |
|13098486609| 14010901 |2022-11-22 |
|13097770966| 14010901 |2022-11-22 |
|13099744296| 14010901 |2022-11-22 |
|13101233891| 14010901 |2022-11-22 |
|13100358276| 14010901 |2022-11-22 |
+-----------+-------------+--------------+
Would you please guide me what is the correct way to use Pandas_UDF in Pyspark program?
Any help is really appreciated.
Answers:
Solution
Import required modules
from typing import Iterator
from pyspark.sql import functions as F
from persiantools.jdatetime import JalaliDate
First define a utility function to parse persian_date
to gregorian
def parse_date(s: str):
s = str(s)
return JalaliDate(*map(int, [s[:4], s[4:6], s[6:8]])).to_gregorian()
Now you can try two approaches although I would prefer approach 1 since your are not doing any heavy initialization in UDF so no point in using iterators:
Approach 1: Series UDF
@F.pandas_udf('date')
def parse_date_pdf(series: pd.Series) -> pd.Series:
return series.map(parse_date)
Approach 2: Series iterator UDF
@F.pandas_udf('date')
def parse_date_pdf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
# looping over iterator yields a pandas series
for series in iterator:
yield series.map(parse_date)
Result
df = df.withColumn('date_miladi', parse_date_pdf('persian_date'))
df.show()
+-----------+------------+-----------+
| id|persian_date|date_miladi|
+-----------+------------+-----------+
|13085178737| 14010901| 2022-11-22|
|13098336049| 14010901| 2022-11-22|
|13098486609| 14010901| 2022-11-22|
|13097770966| 14010901| 2022-11-22|
|13099744296| 14010901| 2022-11-22|
|13101233891| 14010901| 2022-11-22|
|13100358276| 14010901| 2022-11-22|
+-----------+------------+-----------+
I have a Pyspark dataframe with million records. It has a column with string persian date and need to convert it to miladi date.I tried several approuches, first I used UDF function in Python which did not
have good performance. Then I wrote UDF function in Scala and used its Jar in Pyspark program; but performace did not change very much. I searched and found that pandas_UDF has better speed;
so, I decided to use it, however, it did not work very well. I used Pandas_UDF in these ways:
First:
import pandas as pd
@pandas_udf('long', PandasUDFType.SCALAR)
def f1(v: pd.Series) -> pd.Series:
return v.map(lambda x: JalaliDate(int(str(x[1])[0:4]), int(str(x[1])[4:6]), int(str(x[1])[6:8])).to_gregorian())
df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: TypeError: 'decimal.Decimal' object is not subscriptable
Second:
import pandas as pd
from typing import Iterator
@pandas_udf(DateType())
def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
for date in iterator:
return pd.Series(JalaliDate(int(str(date[1])[0:4]), int(str(date[1])[4:6]), int(str(date[1])[6:8])).to_gregorian())
df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: TypeError: Return type of the user-defined function should be Pandas.Series, but is <class 'datetime.date'>
Thirth:
import pandas as pd
@pandas_udf('long', PandasUDFType.SCALAR)
def f1(v: pd.Series) -> pd.Series:
return v.map(lambda x: JalaliDate(int(str(x[1])[0:4]), int(str(x[1])[4:6]), int(str(x[1])[6:8])).to_gregorian())
df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: TypeError: 'decimal.Decimal' object is not subscriptable
Fourth:
import pandas as pd
@pandas_udf(DateType())
def f1(col1: pd.Series) -> pd.Series:
return (JalaliDate(int(str(col1[1])[0:4]), int(str(col1[1])[4:6]), int(str(col1[1])[6:8])).to_gregorian())
df.withColumn('date_miladi', f1(df.trx_date)).show()
Error: Return type of the user-defined function should be Pandas.Series, but is <class 'datetime.date'>
Update:
I use iterate
in this way, but it still has error:
@pandas_udf("string",PandasUDFType.SCALAR_ITER)
def f1(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
# making empty Iterator list
for date in iterator:
print('type date:', type(date[1]))
yield str(JalaliDate(int(str(date[1])[0:4]), int(str(date[1])[4:6]), int(str(date[1])[6:8])).to_gregorian())
Error: AttributeError: 'str' object has no attribute 'isnull'
Dataframe is like this:
+-----------+-------------+
|id | persian_date|
+-----------+-------------+
|13085178737| 14010901 |
|13098336049| 14010901 |
|13098486609| 14010901 |
|13097770966| 14010901 |
|13099744296| 14010901 |
|13101233891| 14010901 |
|13100358276| 14010901 |
+-----------+-------------+
Result should be like this:
+-----------+-------------+--------------+
|id | persian_date| date_miladi |
+-----------+-------------+--------------+
|13085178737| 14010901 |2022-11-22 |
|13098336049| 14010901 |2022-11-22 |
|13098486609| 14010901 |2022-11-22 |
|13097770966| 14010901 |2022-11-22 |
|13099744296| 14010901 |2022-11-22 |
|13101233891| 14010901 |2022-11-22 |
|13100358276| 14010901 |2022-11-22 |
+-----------+-------------+--------------+
Would you please guide me what is the correct way to use Pandas_UDF in Pyspark program?
Any help is really appreciated.
Solution
Import required modules
from typing import Iterator
from pyspark.sql import functions as F
from persiantools.jdatetime import JalaliDate
First define a utility function to parse persian_date
to gregorian
def parse_date(s: str):
s = str(s)
return JalaliDate(*map(int, [s[:4], s[4:6], s[6:8]])).to_gregorian()
Now you can try two approaches although I would prefer approach 1 since your are not doing any heavy initialization in UDF so no point in using iterators:
Approach 1: Series UDF
@F.pandas_udf('date')
def parse_date_pdf(series: pd.Series) -> pd.Series:
return series.map(parse_date)
Approach 2: Series iterator UDF
@F.pandas_udf('date')
def parse_date_pdf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
# looping over iterator yields a pandas series
for series in iterator:
yield series.map(parse_date)
Result
df = df.withColumn('date_miladi', parse_date_pdf('persian_date'))
df.show()
+-----------+------------+-----------+
| id|persian_date|date_miladi|
+-----------+------------+-----------+
|13085178737| 14010901| 2022-11-22|
|13098336049| 14010901| 2022-11-22|
|13098486609| 14010901| 2022-11-22|
|13097770966| 14010901| 2022-11-22|
|13099744296| 14010901| 2022-11-22|
|13101233891| 14010901| 2022-11-22|
|13100358276| 14010901| 2022-11-22|
+-----------+------------+-----------+