pyspark to pandas dataframe: datetime compatibility

Question:

I am using pyspark for most of the data wrangling, but at the end I need to convert to a pandas dataframe. After converting, columns that I had formatted as dates become "object" dtype in pandas.

Are datetimes incompatible between pyspark and pandas? How can I keep the date format after the pyspark -> pandas dataframe conversion?

EDIT: converting to timestamp is a workaround, as suggested in another question. How can I find out more about data type compatibility between pyspark and pandas? There is not much info in the documentation.

Asked By: euh


Answers:

Check out the Spark documentation; it is more informative than the Databricks documentation you linked in the question.

I think the cleanest solution is to use the timestamp type rather than the date type in your Spark code, as you said.
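The reason date columns come back as "object" is that Spark's DateType maps to Python datetime.date objects, which pandas can only store in a generic object-dtype column, while Spark's TimestampType maps to pandas' native datetime64[ns]. A pandas-only sketch of the difference (the values below stand in for what toPandas() produces):

```python
import datetime
import pandas as pd

# A Spark DateType column arrives in pandas as Python datetime.date
# objects, which pandas keeps in an object-dtype column:
from_date_type = pd.Series([datetime.date(2022, 1, 1), datetime.date(2022, 1, 2)])
print(from_date_type.dtype)  # object

# A Spark TimestampType column arrives as native pandas timestamps:
from_timestamp_type = pd.Series([datetime.datetime(2022, 1, 1), datetime.datetime(2022, 1, 2)])
print(from_timestamp_type.dtype)  # datetime64[ns]
```

So casting the column on the Spark side, e.g. `df.withColumn("d", F.col("d").cast("timestamp"))`, before calling `toPandas()` gives you a proper datetime dtype without any post-conversion fix-up.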

The other way to do it (which I wouldn't recommend) would be to convert from object back to datetime in the pandas dataframe using the pandas to_datetime function. Something like this:

>>> import pandas as pd
>>> object_series = pd.Series(["2022-01-01", "2022-01-02"])
>>> df = pd.DataFrame({'dates': object_series})
>>> df.dtypes
dates    object
dtype: object
>>> df = df.assign(dates_2=pd.to_datetime(df.dates))
>>> df.dtypes
dates              object
dates_2    datetime64[ns]
dtype: object
>>> df
        dates    dates_2
0  2022-01-01 2022-01-01
1  2022-01-02 2022-01-02
Answered By: Ben Jeffrey