Error when using Pandas.drop_duplicates()

Question:

I am attempting to use Pandas.drop_duplicates() on only a certain subset of columns, but I am getting the error KeyError: Index(['days'], dtype='object')

The Index is as follows:
id, event_description, attribute1, attribute 2, attribute 3, days, days_supply, days_equivalent

I want to ignore attribute 2 and attribute 3, so I ran the following:

df = df.drop_duplicates(subset=['id', 'event_description', 'attribute1', 'days', 'days_supply', 'days_equivalent'])

Which returns:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-3f7da32b380f> in <module>
      7 
      8 df = df.drop_duplicates(subset=['id', 'event_description', 'attribute1', 'days', 
->    9 'days_supply', 'days_equivalent'])
     10 
     11 print(df)

/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in drop_duplicates(self, subset, keep, inplace)
   4892 
   4893         inplace = validate_bool_kwarg(inplace, "inplace")
-> 4894         duplicated = self.duplicated(subset, keep=keep)
   4895 
   4896         if inplace:

/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in duplicated(self, subset, keep)
   4949         diff = Index(subset).difference(self.columns)
   4950         if not diff.empty:
-> 4951             raise KeyError(diff)
   4952 
   4953         vals = (col.values for name, col in self.items() if name in subset)

KeyError: Index(['days'], dtype='object')

Once I remove days from the subset, drop_duplicates runs without error, but I do need to consider days. What does the error require me to fix?

Asked By: Hayden

||

Answers:

Try with

df.drop_duplicates(subset=['id', 'event_description', 'attribute1', 'days', 'days_supply', 'days_equivalent'],inplace=True)

From:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

Maybe your df is not well formed. In any case, if you think the issue has to do with the dtype, you could use apply to check every value in df['date'], like this:

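def checkType(someDate):
    # Do the verification or correction here.
    # Placeholder: this sketch returns the value unchanged.
    dateCorrected = someDate
    return dateCorrected

df['date'] = df['date'].apply(checkType)

Answered By: Maria Blanco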

I had to re-check the column names. The mismatch was Days vs. days.
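
A quick way to spot this kind of mismatch is to compare the subset against the actual column labels before calling drop_duplicates. The following is only a minimal sketch: the sample data, the capitalised Days column, and the variable names (missing, suggestions) are assumptions made up for illustration, not the original poster's data.

```python
import pandas as pd

# Hypothetical frame: note the capitalised 'Days' column.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "Days": [30, 30, 60],
    "days_supply": [30, 30, 60],
})

subset = ["id", "days", "days_supply"]

# Names requested in `subset` that are not actual column labels.
missing = [name for name in subset if name not in df.columns]
print(missing)  # ['days']

# Case-insensitive lookup to suggest the column that was probably meant.
lowered = {str(col).lower(): col for col in df.columns}
suggestions = {name: lowered.get(name.lower()) for name in missing}
print(suggestions)  # {'days': 'Days'}

# One possible fix: rename to the expected spelling, then deduplicate.
df = df.rename(columns={v: k for k, v in suggestions.items() if v is not None})
deduped = df.drop_duplicates(subset=subset)
```

Renaming is just one option here; passing the real label ('Days') in subset would work equally well.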

Answered By: Hayden

Also check that your column names have not been dropped or changed for some reason, perhaps as a result of merging:

df.columns
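
As an illustration of how a merge can change labels, the sketch below uses two made-up frames (left, right) that share a non-key days column. With the default suffixes, pandas renames the overlapping column, so 'days' no longer exists afterwards. The second part shows a different, assumed failure mode (stray whitespace in a label) that the same df.columns check would reveal.

```python
import pandas as pd

# Two hypothetical frames that both carry a 'days' column.
left = pd.DataFrame({"id": [1, 2], "days": [30, 60]})
right = pd.DataFrame({"id": [1, 2], "days": [30, 90]})

# The overlapping non-key column gets the default '_x' / '_y' suffixes.
merged = left.merge(right, on="id")
print(merged.columns.tolist())  # ['id', 'days_x', 'days_y']

# Stray whitespace is easier to see in a plain list than in a printed frame.
padded = pd.DataFrame({"id": [1], " days": [30]})
print(padded.columns.tolist())  # ['id', ' days']

# Strip surrounding whitespace from every label.
padded.columns = padded.columns.str.strip()
```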

Answered By: David Dehghan

I reproduced a somewhat similar situation: a DataFrame whose columns are misconfigured (with a superfluous pair of square brackets) still displays a result that looks OK (Fig. 1).

import pandas as pd

array = [
    ['001', 3, 3, 3, 1, 5, 4, 3],
    ['002', 7, 2, 1, 1, 1, 5, 1],
    ['003', 1, 6, 7, 6, 6, 7, 7]]

# NG configuration of the columns.
df_NG = pd.DataFrame(
    array,
    columns=[
        ['id', 'event_description', 'attribute1', 'attribute 2', 'attribute 3',
         'days', 'days_supply', 'days_equivalent']])

Fig. 1 Pseudo-OK DataFrame (but rotten inside)

But if you try to drop duplicates,

df_NG = df_NG.drop_duplicates(
    subset=[
        'id', 'event_description', 'attribute1',
        'days', 'days_supply', 'days_equivalent'])

Pandas returns:

 ---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [71], in <cell line: 1>()
----> 1 df_NG = df_NG.drop_duplicates(
      2     subset=[
      3         'id', 'event_description', 'attribute1',
      4         'days', 'days_supply', 'days_equivalent'])

File /usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File /usr/local/lib/python3.9/site-packages/pandas/core/frame.py:6125, in DataFrame.drop_duplicates(self, subset, keep, inplace, ignore_index)
   6123 inplace = validate_bool_kwarg(inplace, "inplace")
   6124 ignore_index = validate_bool_kwarg(ignore_index, "ignore_index")
-> 6125 duplicated = self.duplicated(subset, keep=keep)
   6127 result = self[-duplicated]
   6128 if ignore_index:

File /usr/local/lib/python3.9/site-packages/pandas/core/frame.py:6259, in DataFrame.duplicated(self, subset, keep)
   6257 diff = Index(subset).difference(self.columns)
   6258 if not diff.empty:
-> 6259     raise KeyError(diff)
   6261 vals = (col.values for name, col in self.items() if name in subset)
   6262 labels, shape = map(list, zip(*map(f, vals)))

KeyError: Index(['attribute1', 'days', 'days_equivalent', 'days_supply',
       'event_description', 'id'],
      dtype='object')

So I followed David’s suggestion and found the culprit!

>>> df_NG.columns

MultiIndex([(               'id',),
        ('event_description',),
        (       'attribute1',),
        (      'attribute 2',),
        (      'attribute 3',),
        (             'days',),
        (      'days_supply',),
        (  'days_equivalent',)],
       )

The correct configuration is, of course, as follows:

df_OK = pd.DataFrame(
    array,
    columns=[
        'id', 'event_description', 'attribute1', 'attribute 2', 'attribute 3',
        'days', 'days_supply', 'days_equivalent'])
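
If the DataFrame already exists and cannot easily be rebuilt, the columns can be checked and flattened after the fact. This is a minimal sketch with made-up data and names (rows, df_ng, df_fixed); it assumes the MultiIndex has a single level, as in the reproduction above.

```python
import pandas as pd

rows = [["001", 3, 5], ["002", 7, 1]]

# The extra pair of brackets produces a single-level MultiIndex.
df_ng = pd.DataFrame(rows, columns=[["id", "attribute1", "days"]])
print(isinstance(df_ng.columns, pd.MultiIndex))  # True

# Flatten by taking the only level as a plain Index.
df_fixed = df_ng.copy()
df_fixed.columns = df_ng.columns.get_level_values(0)
print(isinstance(df_fixed.columns, pd.MultiIndex))  # False

# drop_duplicates now finds the plain string labels.
deduped = df_fixed.drop_duplicates(subset=["id", "days"])
```

For a MultiIndex with more than one level, taking level 0 alone could discard information, so this shortcut only fits the single-level case.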
Answered By: yeiichi