Does PySpark pandas support the pandas pct_change function?

Question:

I saw that the pct_change function is partially implemented in the pandas API on Spark, with some parameters missing.

https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html

Yet, when I tried

data_pd = data.toPandas
data_pd.pct_change()

I got AttributeError: 'function' object has no attribute 'pct_change'

Is it simply not implemented yet? If it is implemented, what is the correct way to use pct_change in the PySpark pandas API? Thank you.

Asked By: Dicer

||

Answers:

You can call pct_change() on a PySpark pandas DataFrame or a PySpark pandas Series. The error, however, indicates that pct_change() was called on a function object.

The following is a demonstration of how you can use this function.

  • Using Pyspark pandas Dataframe:
from pyspark import pandas

df = pandas.DataFrame([[10, 18, 11], [20, 15, 8], [30, 20, 3]])
print(type(df))
print(df.pct_change())


  • Using Pyspark pandas Series:
data = pandas.Series([90, 91, 85], index=[2, 4, 1])
print(type(data))
print(data.pct_change())
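
For reference, here is a minimal pure-Python sketch of what pct_change() computes with its default of one period: the fractional change of each element relative to the one before it. The helper name pct_change_sketch is hypothetical and not part of pandas or PySpark, and the sketch assumes no element is zero.

```python
import math


def pct_change_sketch(values):
    """Fractional change from the previous element (one period back).

    Hypothetical helper for illustration only; assumes no zero values,
    since dividing by a zero predecessor would raise ZeroDivisionError.
    """
    if not values:
        return []
    # The first element has no predecessor, so it is reported as NaN.
    changes = [math.nan]
    for prev, curr in zip(values, values[1:]):
        changes.append((curr - prev) / prev)
    return changes


# Same values as the Series example above.
result = pct_change_sketch([90, 91, 85])
print(result)
```

The second entry is (91 - 90) / 90 and the third is (85 - 91) / 91, so the result is a fraction rather than a percentage.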


UPDATE:

  • The error occurs because DataFrame.toPandas is not the same as DataFrame.toPandas().

  • Without the parentheses, data.toPandas returns the method object itself instead of calling it. Calling pct_change() on that object raises the AttributeError.


  • DataFrame.toPandas() returns a pandas DataFrame, on which you can call pct_change(). Modify the code as follows:
data_pd = data.toPandas()
print(type(data_pd))

op = data_pd.pct_change()
print(op)
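
The difference between referencing a method and calling it can be reproduced without Spark. The sketch below uses a hypothetical stand-in class, FakeSparkFrame, which is not part of PySpark and returns a plain list in place of a real DataFrame.

```python
class FakeSparkFrame:
    """Hypothetical stand-in for a Spark DataFrame; not part of PySpark."""

    def toPandas(self):
        # A plain list stands in for the pandas DataFrame a real call returns.
        return [1, 2, 3]


data = FakeSparkFrame()

without_call = data.toPandas    # no parentheses: the bound method itself
with_call = data.toPandas()     # parentheses: the method's return value

print(type(without_call).__name__)          # method
print(type(with_call).__name__)             # list
print(hasattr(without_call, "pct_change"))  # False
```

Any attribute lookup on without_call, such as pct_change, fails because the method object carries none of the attributes of the value it would return.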


Answered By: Saideep Arikontham

After a chat with @Saideep Arikontham, we found that pandas_api() solves the problem.

    # Convert the Spark DataFrame to a pandas-on-Spark DataFrame
    data_pd = data.pandas_api()

    data_pd.pct_change()
Answered By: Dicer