Keep consistent dtype and timezone when concatenating with NaT in pandas

Question:

I have two pandas DataFrames containing time series that must be concatenated for further processing. One DataFrame contains localized timestamps, while the other contains only NaT in the time column. After concatenation, the column dtype changes from datetime64[ns, America/New_York] to object, which hinders further analysis.

My goal is to keep a localized time column, even after concatenating with NaT.

Example code

import pandas as pd

a = pd.DataFrame(
    {
        'DateTime': pd.date_range(
            start='2022-10-10',
            periods=7,
            freq='1D',
            tz='America/New_York'
        ),
        'Value': range(7)
    }
)
b = pd.DataFrame(
    {
        'DateTime': pd.NaT,
        'Value': range(10, 20),
    }
)
c = pd.concat([a, b], axis=0, ignore_index=True)

The dtypes of a and b are different:

>>> print(a.dtypes)
DateTime    datetime64[ns, America/New_York]
Value                                  int64
dtype: object

>>> print(b.dtypes)
DateTime    datetime64[ns]
Value                int64
dtype: object

Since the timestamps in a are localized but those in b are not, the concatenation results in a DateTime column of object dtype.

>>> print(c.dtypes)
DateTime    object
Value        int64
dtype: object

When trying to localize b, I get a TypeError:

>>> b['DateTime'] = b['DateTime'].tz_localize('America/New_York')
Traceback (most recent call last):
  File "/tmp/so-pandas-nat.py", line 27, in <module>
    b['DateTime'] = b['DateTime'].tz_localize('America/New_York')
  File ".venv/lib/python3.10/site-packages/pandas/core/generic.py", line 9977, in tz_localize
    ax = _tz_localize(ax, tz, ambiguous, nonexistent)
  File ".venv/lib/python3.10/site-packages/pandas/core/generic.py", line 9959, in _tz_localize
    raise TypeError(
TypeError: index is not a valid DatetimeIndex or PeriodIndex
Asked By: Phenyl


Answers:

Use Series.dt.tz_localize to localize the values of a column. Series.tz_localize operates on the index instead and expects a DatetimeIndex, so it raises the error here because b has a RangeIndex:

b['DateTime'] = b['DateTime'].dt.tz_localize('America/New_York')
c = pd.concat([a, b], axis=0, ignore_index=True)

print(c.dtypes)
DateTime    datetime64[ns, America/New_York]
Value                                  int64
dtype: object
Answered By: jezrael
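
As a supplementary sketch (not part of the answer above), the following contrasts the two methods and wraps the fix in a small helper. The helper name concat_with_matching_tz and its parameters are hypothetical, chosen only for illustration; it assumes the left frame's column is already timezone-aware.

```python
import pandas as pd

tz = "America/New_York"

# A naive all-NaT Series with a default RangeIndex, like b['DateTime'].
s = pd.Series([pd.NaT] * 3, dtype="datetime64[ns]")

# Series.tz_localize targets the index, so a RangeIndex raises TypeError.
try:
    s.tz_localize(tz)
    index_error = None
except TypeError as exc:
    index_error = exc

# Series.dt.tz_localize targets the values: NaT stays NaT,
# but the dtype becomes timezone-aware.
localized = s.dt.tz_localize(tz)

# With a DatetimeIndex, Series.tz_localize works, but it changes the index only.
t = pd.Series(range(3), index=pd.date_range("2022-10-10", periods=3, freq="D"))
t_localized = t.tz_localize(tz)


def concat_with_matching_tz(left, right, column):
    """Hypothetical helper: give right[column] the timezone of left[column],
    then concatenate. Assumes left[column] is timezone-aware."""
    target_tz = left[column].dt.tz
    right = right.copy()  # avoid modifying the caller's frame
    if right[column].dt.tz is None:
        # Naive values (including NaT) are localized.
        right[column] = right[column].dt.tz_localize(target_tz)
    else:
        # Already-aware values are converted instead.
        right[column] = right[column].dt.tz_convert(target_tz)
    return pd.concat([left, right], axis=0, ignore_index=True)


a = pd.DataFrame(
    {
        "DateTime": pd.date_range("2022-10-10", periods=7, freq="D", tz=tz),
        "Value": range(7),
    }
)
b = pd.DataFrame({"DateTime": pd.NaT, "Value": range(10, 20)})

c = concat_with_matching_tz(a, b, "DateTime")
print(c.dtypes)
```

Taking the timezone from a's column rather than hard-coding it keeps the two frames in sync if the zone ever changes; whether that is preferable to an explicit tz argument depends on the use case.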