How to filter out rows in a DataFrame whose value begins with a certain word

Question:

My df looks like this:

    name                    type    info
90  Sizeer - Annopol 2      shoe    duplicate SIZEER 
91  InterSport - Arkadia    sport   duplicate INTERSPORT 
92  InterSport - Złota 59   sport   NaN
...

I want to remove all rows where the value in the info column starts with the word "duplicate". It's a bit tricky because this column contains not only strings but also booleans. Moreover, the values I want to delete are not just "duplicate"; they have more text after it.

I tried this:

duplicates = []
for i in range(df.shape[0]):
    if str(df['info'])[i][:10] == 'duplicate':
         duplicates.append(i)

to get their IDs so I can delete them later, but it doesn't do anything. If I remove str() from if str(df['info'])[i][:10] == 'duplicate': I get an error:

TypeError: 'float' object is not subscriptable

I also tried this:

dupli = df[df['info'] np.where('duplicate' in df['info'])]

but it’s just a syntax error. I don’t really know how to do this properly 😀

Asked By: adamDud


Answers:

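The simplest way to drop rows where the text starts with the word ‘duplicate’ is a negated boolean mask. na=False makes non-string values such as NaN count as non-matches, so those rows are kept. Use df['info'] rather than df.info, because df.info refers to the DataFrame.info method, not the column:

df = df[~df['info'].str.startswith('duplicate', na=False)]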
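
As a rough illustration of how that mask behaves on a column with mixed types, here is a minimal sketch. The frame is a hypothetical sample modeled on the rows in the question; the row with index 93 and its boolean value are assumptions added only to show a non-string entry.

```python
import numpy as np
import pandas as pd

# Hypothetical sample modeled on the question; row 93 is made up
# to include a boolean in the "info" column.
df = pd.DataFrame(
    {
        "name": ["Sizeer - Annopol 2", "InterSport - Arkadia",
                 "InterSport - Złota 59", "Example Shop"],
        "type": ["shoe", "sport", "sport", "shoe"],
        "info": ["duplicate SIZEER", "duplicate INTERSPORT", np.nan, True],
    },
    index=[90, 91, 92, 93],
)

# Non-string entries (NaN, True) have no string result of their own;
# na=False makes them count as "does not start with 'duplicate'".
is_duplicate = df["info"].str.startswith("duplicate", na=False)

# Keep only the rows that are NOT flagged.
filtered = df[~is_duplicate]
print(filtered)
```

Here only rows 92 and 93 should remain, since the NaN and the boolean are treated as non-matches.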

If you want the same filter but matching ‘duplicate’ anywhere in the text:

df = df[~df['info'].str.contains('duplicate', na=False)]
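
To make the difference between the two filters concrete, here is a small sketch on a made-up Series (the values are assumptions, not data from the question). It passes regex=False so that the pattern is treated as a literal substring; that choice is mine and is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one starts with the word, one only contains it.
info = pd.Series(["duplicate SIZEER", "possible duplicate", np.nan, "ok"])

# True only when the text begins with "duplicate".
starts = info.str.startswith("duplicate", na=False)

# True when "duplicate" appears anywhere; regex=False means a plain
# substring match rather than a regular expression.
anywhere = info.str.contains("duplicate", na=False, regex=False)

print(pd.DataFrame({"info": info, "starts": starts, "anywhere": anywhere}))
```

The second entry is flagged only by the contains-based mask, so which filter to use depends on whether a mid-text match should also drop the row.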
Answered By: gtomer