Error when using pandas DataFrame.drop_duplicates()
Question:
I am trying to use DataFrame.drop_duplicates() while considering only a subset of the columns, but I get the error KeyError: Index(['days'], dtype='object')
The columns are as follows:
id, event_description, attribute1, attribute 2, attribute 3, days, days_supply, days_equivalent
I want to ignore attribute 2 and attribute 3, so I ran the following:
df = df.drop_duplicates(subset=['id', 'event_description', 'attribute1', 'days', 'days_supply', 'days_equivalent'])
Which returns:
KeyError Traceback (most recent call last)
<ipython-input-4-3f7da32b380f> in <module>
7
8 df = df.drop_duplicates(subset=['id', 'event_description', 'attribute1', 'days',
-> 9 'days_supply', 'days_equivalent'])
10
11 print(df)
/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in drop_duplicates(self, subset, keep, inplace)
4892
4893 inplace = validate_bool_kwarg(inplace, "inplace")
-> 4894 duplicated = self.duplicated(subset, keep=keep)
4895
4896 if inplace:
/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in duplicated(self, subset, keep)
4949 diff = Index(subset).difference(self.columns)
4950 if not diff.empty:
-> 4951 raise KeyError(diff)
4952
4953 vals = (col.values for name, col in self.items() if name in subset)
KeyError: Index(['days'], dtype='object')
Once I remove days from the subset, drop_duplicates runs without error, but I do need to make sure days is considered. What does the error require me to fix?
Answers:
Try with inplace=True:
df.drop_duplicates(subset=['id', 'event_description', 'attribute1', 'days', 'days_supply', 'days_equivalent'], inplace=True)
From:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
Maybe your df is not well formed. If you think the issue has to do with the dtype, you could use apply to check every value of df['date'], like this:
def checkType(someDate):
    # Do the verification here. Placeholder: the value is returned unchanged.
    dateCorrected = someDate
    return dateCorrected

df['date'] = df['date'].apply(checkType)
I had to re-check the column names: Days vs days. Column labels are case-sensitive.
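A minimal sketch of this kind of case mismatch, assuming a toy frame whose column is labelled Days (the data and the normalisation step are illustrative and not from the original post). Labels are matched exactly, so lower-casing and stripping them before calling drop_duplicates is one possible way to make the subset resolve:

```python
import pandas as pd

# Toy data (hypothetical): the column is labelled 'Days', not 'days'.
df = pd.DataFrame({"id": ["001", "001", "002"], "Days": [3, 3, 7]})

# Normalise the labels: strip stray whitespace and lower-case them.
df.columns = df.columns.str.strip().str.lower()

# The subset now matches the column labels exactly.
deduped = df.drop_duplicates(subset=["id", "days"])
print(deduped)
```

Renaming only the one offending column with df.rename(columns={"Days": "days"}) would work as well if the other labels should stay untouched.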
Also check that the column has not been dropped or renamed for some reason, perhaps as the result of a merge:
df.columns
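Going one step beyond printing df.columns, the check that appears in the traceback (Index(subset).difference(self.columns)) can be run by hand to see which requested labels are missing. This is only a sketch; the toy frame with a trailing space in 'days ' is an assumption made for illustration:

```python
import pandas as pd

# Toy data (hypothetical): 'days ' carries a trailing space.
df = pd.DataFrame({"id": ["001", "002"], "days ": [3, 7], "days_supply": [1, 5]})
subset = ["id", "days", "days_supply"]

# Labels requested in subset that are absent from the frame.
missing = pd.Index(subset).difference(df.columns)

# repr() exposes invisible differences such as trailing spaces.
print([repr(c) for c in df.columns])
print(list(missing))
```

Any label listed in missing is one that drop_duplicates would report in the KeyError.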
I reproduced a somewhat similar situation: a DataFrame with misconfigured columns (a superfluous pair of square brackets) is created without complaint and looks fine when displayed (Fig. 1).
import pandas as pd

array = [
    ['001', 3, 3, 3, 1, 5, 4, 3],
    ['002', 7, 2, 1, 1, 1, 5, 1],
    ['003', 1, 6, 7, 6, 6, 7, 7]]

# NG configuration of the columns: the extra pair of square brackets
# turns the column labels into a one-level MultiIndex.
df_NG = pd.DataFrame(
    array,
    columns=[
        ['id', 'event_description', 'attribute1', 'attribute 2', 'attribute 3',
         'days', 'days_supply', 'days_equivalent']])
Fig. 1 Pseudo-OK DataFrame (but rotten inside)
But if you try to drop duplicates,
df_NG = df_NG.drop_duplicates(
    subset=[
        'id', 'event_description', 'attribute1',
        'days', 'days_supply', 'days_equivalent'])
Pandas returns:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [71], in <cell line: 1>()
----> 1 df_NG = df_NG.drop_duplicates(
2 subset=[
3 'id', 'event_description', 'attribute1',
4 'days', 'days_supply', 'days_equivalent'])
File /usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
305 if len(args) > num_allow_args:
306 warnings.warn(
307 msg.format(arguments=arguments),
308 FutureWarning,
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
File /usr/local/lib/python3.9/site-packages/pandas/core/frame.py:6125, in DataFrame.drop_duplicates(self, subset, keep, inplace, ignore_index)
6123 inplace = validate_bool_kwarg(inplace, "inplace")
6124 ignore_index = validate_bool_kwarg(ignore_index, "ignore_index")
-> 6125 duplicated = self.duplicated(subset, keep=keep)
6127 result = self[-duplicated]
6128 if ignore_index:
File /usr/local/lib/python3.9/site-packages/pandas/core/frame.py:6259, in DataFrame.duplicated(self, subset, keep)
6257 diff = Index(subset).difference(self.columns)
6258 if not diff.empty:
-> 6259 raise KeyError(diff)
6261 vals = (col.values for name, col in self.items() if name in subset)
6262 labels, shape = map(list, zip(*map(f, vals)))
KeyError: Index(['attribute1', 'days', 'days_equivalent', 'days_supply',
'event_description', 'id'],
dtype='object')
So I followed David’s suggestion and found the culprit!
>>> df_NG.columns
MultiIndex([( 'id',),
('event_description',),
( 'attribute1',),
( 'attribute 2',),
( 'attribute 3',),
( 'days',),
( 'days_supply',),
( 'days_equivalent',)],
)
The correct configuration is, of course, as follows:
df_OK = pd.DataFrame(
    array,
    columns=[
        'id', 'event_description', 'attribute1', 'attribute 2', 'attribute 3',
        'days', 'days_supply', 'days_equivalent'])
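If the frame already exists with such columns and cannot easily be rebuilt, one possible repair is to flatten the one-level MultiIndex in place. The following is a sketch under that assumption; df_fix and names are illustrative names that do not appear in the original answer:

```python
import pandas as pd

array = [
    ['001', 3, 3, 3, 1, 5, 4, 3],
    ['002', 7, 2, 1, 1, 1, 5, 1],
    ['003', 1, 6, 7, 6, 6, 7, 7]]
names = ['id', 'event_description', 'attribute1', 'attribute 2', 'attribute 3',
         'days', 'days_supply', 'days_equivalent']

# Reproduce the malformed frame: the extra brackets create a MultiIndex.
df_fix = pd.DataFrame(array, columns=[names])

if isinstance(df_fix.columns, pd.MultiIndex) and df_fix.columns.nlevels == 1:
    # Collapse the single level back into a flat Index of strings.
    df_fix.columns = df_fix.columns.get_level_values(0)

# With flat string labels, the subset can be resolved.
df_fix = df_fix.drop_duplicates(
    subset=['id', 'event_description', 'attribute1',
            'days', 'days_supply', 'days_equivalent'])
print(df_fix.columns)
```

The nlevels check keeps the flattening from silently discarding levels of a genuine multi-level header.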