pd.to_datetime producing "Reindexing only valid with uniquely valued Index objects"

Question:

I’m getting quite an unexpected behaviour using pandas’ pd.to_datetime. My pd.Series is as follows:

0         2017-01-06 14:37:16
1         2017-01-27 00:00:00
2         2017-01-18 00:00:00
3         2017-01-26 00:00:00
4                        None
                 ...         
454823    2019-10-22 11:20:03
454824                   None
454825    2019-07-11 00:00:00
454826                   None
454827    2019-07-15 00:00:00
Name: colx, Length: 454828, dtype: object

And when casting to datetime I’m getting:

pd.to_datetime(df.colx, errors='coerce')

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

While this seems to indicate that there are duplicate values in the index, I ran the following tests to check if that was the cause:

all(df.colx.index == range(df.colx.shape[0]))
# True

df.index.duplicated().any()
# False

So apparently there aren’t any duplicate indexes. What could be causing this error?

Asked By: yatu


Answers:

Let's start by clearing up a misunderstanding

I noticed a flaw in your check procedure:

Calling df.index.duplicated().any() checks only whether the index has
duplicates; it says nothing about duplicate values in colx.
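
A quick way to see the distinction (a minimal sketch, reusing df from the question):

df.index.duplicated().any()   # False -> the index labels are unique
df.colx.duplicated().any()    # True  -> but the values in colx repeat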

To investigate the issue, I created my input file from your data (just 10 data rows):

colx
2017-01-06 14:37:16
2017-01-27 00:00:00
2017-01-18 00:00:00
2017-01-26 00:00:00
None
2019-10-22 11:20:03
None
2019-07-11 00:00:00
None
2019-07-15 00:00:00

I read it with read_csv, called df.duplicated().any(), and the result
was True, so there are duplicate values in the colx column.

Run df.duplicated() and you will see that True is printed for the
rows with index 6 and 8 (the second and third instances of the None string).
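
In code (a minimal sketch; 'input.csv' is a hypothetical name for a file holding the 10 sample rows above):

import pandas as pd

df = pd.read_csv('input.csv')

df.duplicated().any()   # True
df.duplicated()         # True at rows 6 and 8, False elsewhere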

Another check: Run df.info() and you will get:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
colx    10 non-null object
dtypes: object(1)
memory usage: 208.0+ bytes

This confirms that no element holds a “true” None value (all 10
entries are non-null); there are only strings containing “None”.
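
You can confirm this element-wise as well (row 4 is the first None in the sample):

repr(df.colx.iloc[4])   # "'None'" -> the literal string, not a missing value
type(df.colx.iloc[4])   # <class 'str'>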

Another remark: all(df.colx.index == range(df.colx.shape[0])) checks
only that the index contains consecutive numbers, which says
nothing about the content of colx.

How you read your DataFrame

I suppose you read your DataFrame with e.g. read_csv, without any
conversion, so the colx column is of object (actually string) type.

In that case, an attempt to call pd.to_datetime fails on the first
element containing None (a string), because it cannot be converted
to a datetime.
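
For example (the exact exception type and message vary across pandas versions):

import pandas as pd

pd.to_datetime('None')   # raises a parsing error - the string 'None'
                         # is not a recognized date format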

What to do

Try the following approach:

  1. When reading the DataFrame, pass the na_values=['None'] parameter.
    This ensures that elements containing None are not left as strings,
    but are converted to NaNs (see the runnable sketch after this list).

  2. Print the DataFrame (read from my limited source).
    Instead of None (a string) there will be NaN – a special case of float.

  3. Run df.info(). This time the printout will be:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 1 columns):
    colx    7 non-null object
    dtypes: object(1)
    memory usage: 208.0+ bytes
    

    Note that there are only 7 non-null values out of 10 in total,
    so the 3 remaining entries are “true” missing values, which pandas prints as NaN.

  4. Run pd.to_datetime(df.colx). This time there should be no error.
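
Putting the steps together, here is a runnable sketch (io.StringIO stands in for your actual CSV file):

import io
import pandas as pd

# The 10-row sample shown above
data = """colx
2017-01-06 14:37:16
2017-01-27 00:00:00
2017-01-18 00:00:00
2017-01-26 00:00:00
None
2019-10-22 11:20:03
None
2019-07-11 00:00:00
None
2019-07-15 00:00:00
"""

# Step 1: treat the literal string 'None' as a missing value while reading
df = pd.read_csv(io.StringIO(data), na_values=['None'])

df.info()                             # step 3: only 7 non-null entries now
df['colx'] = pd.to_datetime(df.colx)  # step 4: no error this time
df.colx.dtype                         # datetime64[ns]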

Answered By: Valdi_Bo

The error you are receiving isn’t related to your Index. It’s related to this issue, which should have been fixed as of pandas 0.25.0. The issue was caused by the following inconsistent handling of multiple null types when using pd.to_datetime:

import pandas as pd

pd.core.algorithms.unique([pd.NaT, None])
#array([NaT, None], dtype=object) # That is, `pd.NaT` and `None` are unique

pd.Index([pd.NaT, None]).is_unique
#False   # However, Index.unique considers them duplicated

The handling of this different behavior should have been fixed in 0.25.0. If it’s still giving you a problem, the solution is to ensure you have a single representation of null values in the column you need to convert. In this case you can .fillna with the datetime null value:

pd.to_datetime(df.colx.fillna(pd.NaT), errors='coerce')

pd.__version__
#'0.25.0'

pd.to_datetime([pd.NaT, None])
#DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
Answered By: ALollz

I had the same error; the following worked for me:

pd.to_datetime(df.colx.astype(str))
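
This works because astype(str) gives the column a single, uniform type, avoiding the mixed-null-type comparison described above. If the column also contains strings that to_datetime cannot parse (such as the literal 'None'), a hedged variant is to coerce those to NaT as well:

pd.to_datetime(df.colx.astype(str), errors='coerce')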
Answered By: Alexey Seleznev