How to filter out rows in a DataFrame whose value begins with a certain word
Question:
My df looks like this:
    name                   type   info
90  Sizeer - Annopol 2     shoe   duplicate SIZEER
91  InterSport - Arkadia   sport  duplicate INTERSPORT
92  InterSport - Złota 59  sport  NaN
...
What I want to do is remove all rows where the value in the info column starts with the word "duplicate". It's a bit tricky because this column holds not only strings but also booleans. Moreover, the values I want to delete are not just ‘duplicate’; they have more text afterwards.
I tried doing this:
duplicates = []
for i in range(df.shape[0]):
    if str(df['info'])[i][:10] == 'duplicate':
        duplicates.append(i)
to get their IDs so I can delete them later, but it doesn't do anything. If I remove str() from
if str(df['info'])[i][:10] == 'duplicate':
there's an error:
TypeError: 'float' object is not subscriptable
I also tried this:
dupli = df[df['info'] np.where('duplicate' in df['info'])]
but it's just a syntax error. I don't really know how to do this properly 😀
Answers:
The simplest way, when the word ‘duplicate’ is at the start of the text:
df = df[~df.info.str.startswith('duplicate', na=False)]
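To see what that line does, here is a small sketch on data modelled on the question. The frame is rebuilt by hand, and the last row (index 93, with a boolean in info) is invented purely to show a non-string value; the names is_duplicate and filtered are mine as well.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data modelled on the question; the row with
# index 93 is invented to show a non-string value in the column.
df = pd.DataFrame(
    {
        "name": [
            "Sizeer - Annopol 2",
            "InterSport - Arkadia",
            "InterSport - Złota 59",
            "Example shop",
        ],
        "type": ["shoe", "sport", "sport", "shoe"],
        "info": ["duplicate SIZEER", "duplicate INTERSPORT", np.nan, True],
    },
    index=[90, 91, 92, 93],
)

# True only where info is a string that begins with "duplicate".
# NaN and non-string entries come out as False because of na=False.
is_duplicate = df["info"].str.startswith("duplicate", na=False)

# Keep every row that is not flagged.
filtered = df[~is_duplicate]
print(filtered)
```

Without na=False the mask would contain missing values for the NaN and boolean rows, and it could not be used reliably for boolean indexing. The mask also avoids both problems in the loop from the question: str(df['info']) turns the whole column into one long string, so indexing it with [i] gives a single character, and without str() the NaN entries are floats, which cannot be sliced.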
If you want the same thing, but matching the word anywhere in the text:
df = df[~df.info.str.contains('duplicate', na=False)]
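The difference between the two variants shows up only when the word is not at the start of the value. A minimal sketch, assuming a few extra values that are not in the question (indexes 93 and 94 are made up), and passing regex=False, which is my addition so that the pattern is treated as a literal substring rather than a regular expression:

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the first three mirror the question's sample.
info = pd.Series(
    ["duplicate SIZEER", "duplicate INTERSPORT", np.nan, "possible duplicate", "open"],
    index=[90, 91, 92, 93, 94],
)

# Matches only values that begin with the word.
starts = info.str.startswith("duplicate", na=False)

# Matches the word anywhere; regex=False makes this a literal substring search.
anywhere = info.str.contains("duplicate", na=False, regex=False)

kept_startswith = info[~starts]    # "possible duplicate" survives
kept_contains = info[~anywhere]    # "possible duplicate" is dropped as well
print(kept_startswith)
print(kept_contains)
```

For a plain word like ‘duplicate’ the default regex matching gives the same result; regex=False only matters once the search text contains characters such as . or ( that have a special meaning in regular expressions.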