Does Pyspark Pandas support Pandas pct_change function?
Question:
I saw that the pct_change
function is partially implemented, with some parameters missing:
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html
Yet, when I tried
data_pd = data.toPandas
data_pd.pct_change()
there was an AttributeError: 'function' object has no attribute 'pct_change'.
I want to know whether it is not implemented yet. If it is implemented, what is the correct way to use the pct_change
function in the pyspark pandas API? Thank you.
Answers:
You can use the pct_change()
function on a PySpark pandas DataFrame or a PySpark pandas Series. The error, however, indicates that pct_change() was called on a function (bound method)
object.
The following is a demonstration of how you can use this function.
- Using a PySpark pandas DataFrame:
from pyspark import pandas
df = pandas.DataFrame([[10, 18, 11], [20, 15, 8], [30, 20, 3]])
print(type(df))
print(df.pct_change())
- Using a PySpark pandas Series:
data = pandas.Series([90, 91, 85], index=[2, 4, 1])
print(type(data))
print(data.pct_change())
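For reference, pct_change computes the fractional change between consecutive elements, (current − previous) / previous, with the first entry undefined. A minimal pure-Python sketch of the same arithmetic (so the example does not require Spark; pandas would emit NaN where this returns None):

```python
def pct_change(values):
    """Fractional change between consecutive elements.
    The first entry has no predecessor, so it is None
    (pandas pct_change would emit NaN there)."""
    result = [None]
    for prev, curr in zip(values, values[1:]):
        result.append((curr - prev) / prev)
    return result

print(pct_change([10, 20, 30]))  # → [None, 1.0, 0.5]
```

This mirrors what pct_change() does column-wise on the DataFrame above and element-wise on the Series.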
UPDATE:
- The error occurs because DataFrame.toPandas (without parentheses) is different from DataFrame.toPandas().
- In this case, data.toPandas returns an object of type method. Calling pct_change() on that object raises the error.
- DataFrame.toPandas() returns a pandas DataFrame, on which you can call pct_change(). So modify the code as follows to achieve the requirement:
data_pd = data.toPandas()
print(type(data_pd))
op = data_pd.pct_change()
print(op)
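The pitfall can be reproduced without Spark, since it is ordinary Python attribute access. A minimal sketch using a hypothetical class (Data here is a stand-in, not a real pyspark class):

```python
class Data:
    """Hypothetical stand-in for a Spark DataFrame, used only to show
    the difference between a bound method and its return value."""
    def toPandas(self):
        return [1, 2, 3]

d = Data()
print(type(d.toPandas))    # <class 'method'> — the parentheses were omitted
print(type(d.toPandas()))  # <class 'list'>   — the method was actually called
# d.toPandas.pct_change would raise AttributeError, just like in the question
```

Omitting the parentheses hands you the method object itself, which has no pct_change attribute; calling it hands you the converted data, which does.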
After having a chat with @Saideep Arikontham, we found that pandas_api()
solves the problem:
# Convert Spark DataFrame to pandas-on-Spark DataFrame
data_pd = data.pandas_api()
data_pd.pct_change()