pd.to_datetime producing "Reindexing only valid with uniquely valued Index objects"
Question:
I’m getting quite an unexpected behaviour using pandas’ pd.to_datetime. My pd.Series is as follows:
0 2017-01-06 14:37:16
1 2017-01-27 00:00:00
2 2017-01-18 00:00:00
3 2017-01-26 00:00:00
4 None
...
454823 2019-10-22 11:20:03
454824 None
454825 2019-07-11 00:00:00
454826 None
454827 2019-07-15 00:00:00
Name: colx, Length: 454828, dtype: object
And when casting to datetime I’m getting:
pd.to_datetime(df.colx, errors='coerce')
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
While this seems to indicate that there are duplicate values in the index, I ran the following tests to check if that was the cause:
all(df.colx.index == range(df.colx.shape[0]))
# True
df.index.duplicated().any()
# False
So apparently there aren’t any duplicate indexes. What could be causing this error?
Answers:
Start by clearing up a misunderstanding
I noticed an error in your check procedure: calling df.index.duplicated().any()
checks only that the index has no duplicates; it says nothing about
duplicated values in the data itself.
To investigate the issue, I created my input file from your data (just 10 data rows):
colx
2017-01-06 14:37:16
2017-01-27 00:00:00
2017-01-18 00:00:00
2017-01-26 00:00:00
None
2019-10-22 11:20:03
None
2019-07-11 00:00:00
None
2019-07-15 00:00:00
I read it with read_csv and called df.duplicated().any(); the result
was True, so there are duplicate values in the colx column.
Run df.duplicated() and you will see that True is printed for
rows with index 6 and 8 (the second and third occurrences of the None string).
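To illustrate the difference between the two checks, here is a minimal sketch with hypothetical data in the style of the question:

```python
import pandas as pd

# Hypothetical data in the style of the question: "None" is a literal string
df = pd.DataFrame({"colx": ["2017-01-06 14:37:16", "None",
                            "2019-10-22 11:20:03", "None"]})

# Checks the index labels only: the default RangeIndex is unique
print(df.index.duplicated().any())  # False

# Checks the row values: the second "None" string repeats the first
print(df.duplicated().any())        # True
```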
Another check: Run df.info()
and you will get:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
colx 10 non-null object
dtypes: object(1)
memory usage: 208.0+ bytes
This confirms that no element holds a “true” None value.
There are only strings containing “None”.
Another remark: all(df.colx.index == range(df.colx.shape[0])) checks
only that the index contains consecutive numbers, which says
nothing about the content of colx.
How you read your DataFrame
I suppose you read your DataFrame with e.g. read_csv, without any
conversion, so the colx column is of object (actually string) type.
In that case an attempt to call pd.to_datetime fails on the first
element containing None (a string), because it cannot be converted
to datetime.
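A small sketch of what such a column looks like after reading, with hypothetical data:

```python
import pandas as pd

# A column as it looks after reading text data without any conversion:
# the missing entries are the literal string "None", not real null values
s = pd.Series(["2017-01-06 14:37:16", "None"], name="colx")

print(s.dtype)              # object: plain Python strings
print(s.iloc[1] == "None")  # True: a string, not a missing value
print(s.isna().any())       # False: pandas sees no null values here
```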
What to do
Try the following approach:
- When reading the DataFrame, pass the na_values=['None'] parameter.
This ensures that elements containing None are not left as strings,
but are converted to NaN.
- Print the DataFrame (read from my limited source).
Instead of None (a string) there will be NaN, a special case of float.
- Run df.info(). This time the printout will be:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
colx 7 non-null object
dtypes: object(1)
memory usage: 208.0+ bytes
Note that there are only 7 non-null values out of a total of 10,
so the 3 remaining ones are “true” None values, which pandas prints as NaN.
- Run pd.to_datetime(df.colx). This time there should be no error.
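The steps above can be sketched as follows, using an in-memory stand-in for the CSV file (the data is a shortened, hypothetical version of the question’s):

```python
import io
import pandas as pd

# An in-memory stand-in for the CSV file from the answer (3 rows here)
csv_data = "colx\n2017-01-06 14:37:16\nNone\n2019-07-15 00:00:00\n"

# na_values=['None'] turns the "None" strings into real NaN values
df = pd.read_csv(io.StringIO(csv_data), na_values=["None"])
print(df["colx"].isna().sum())  # 1 missing value instead of a "None" string

# With a single null representation, the conversion goes through cleanly
converted = pd.to_datetime(df["colx"])
print(converted.dtype)          # datetime64[ns]
```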
The error you are receiving isn’t related to your Index. It’s related to this issue, which should have been fixed as of 0.25.0. The issue was the following inconsistent handling of multiple null types when using pd.to_datetime:
import pandas as pd
pd.core.algorithms.unique([pd.NaT, None])
#array([NaT, None], dtype=object) # That is, `pd.NaT` and `None` are unique
pd.Index([pd.NaT, None]).is_unique
#False # However, Index.unique considers them duplicated
The handling of this inconsistency should have been fixed with 0.25.0. If it’s still giving you a problem, the solution is to ensure you have a single representation of null values in the column you need to convert. In this case you can .fillna with the datetime null value:
pd.to_datetime(df.colx.fillna(pd.NaT), errors='coerce')
pd.__version__
#'0.25.0'
pd.to_datetime([pd.NaT, None])
#DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
I had the same error; the following worked for me:
pd.to_datetime(df.colx.astype(str))