pyspark to pandas dataframe: datetime compatibility
Question:
I am using PySpark to do most of the data wrangling, but at the end I need to convert to a pandas dataframe. When converting, columns that I have formatted as date become "object" dtype in pandas.
Are datetimes between PySpark and pandas incompatible? How can I keep the date format after the PySpark -> pandas dataframe conversion?
EDIT: converting to timestamp is a workaround, as suggested in another question. How can I find out more about data type compatibility between PySpark and pandas? There is not much info in the documentation.
Answers:
Check out the Spark documentation; it is more informative than the Databricks documentation you linked in the question.
I think the cleanest solution is to use the timestamp type rather than the date type in your Spark code, as you said.
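To see why the date type loses its dtype, note that toPandas() returns a Spark DateType column as Python datetime.date objects, which pandas can only store with "object" dtype, while a TimestampType column maps natively to datetime64[ns]. A minimal pandas-only sketch of that mapping (no Spark session needed):

```python
import datetime
import pandas as pd

# A DateType column arrives from Spark as Python datetime.date objects,
# which pandas has no native dtype for, so the series is "object".
dates = pd.Series([datetime.date(2022, 1, 1), datetime.date(2022, 1, 2)])
print(dates.dtype)  # object

# A TimestampType column arrives as datetime values, which pandas
# stores natively as datetime64[ns].
timestamps = pd.Series(
    [datetime.datetime(2022, 1, 1), datetime.datetime(2022, 1, 2)]
)
print(timestamps.dtype)  # datetime64[ns]
```

This is why casting to timestamp on the Spark side, before calling toPandas(), keeps the column as a proper datetime dtype.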
The other way to do it (which I wouldn't recommend) would be to convert from object back to datetime in the pandas dataframe using the pandas to_datetime function. Something like this:
>>> import pandas as pd
>>> object_series = pd.Series(["2022-01-01", "2022-01-02"])
>>> df = pd.DataFrame({'dates': object_series})
>>> df.dtypes
dates    object
dtype: object
>>> df = df.assign(dates_2=pd.to_datetime(df.dates))
>>> df.dtypes
dates              object
dates_2    datetime64[ns]
dtype: object
>>> df
        dates    dates_2
0  2022-01-01 2022-01-01
1  2022-01-02 2022-01-02