How to delete duplicates in pandas
Question:
I need to check whether one column of a dataframe contains duplicate values using Pandas and, if it does, delete the entire row for each duplicate.
I only need to check the first column.
Example:
object type
apple fruit
ball toy
banana fruit
xbox videogame
banana fruit
apple fruit
What I need is:
object type
apple fruit
ball toy
banana fruit
xbox videogame
I can delete the ‘object’ duplicates with the following code, but I can’t delete the entire row that contains the duplicate as the second column won’t be deleted.
df = pd.read_csv(directory, header=None,)
objects= df[0]
for object in df[0]:
Answers:
Select by duplicated mask and negate it
df = df[~df["object"].duplicated()]
Which gives
object type
0 apple fruit
1 ball toy
2 banana fruit
3 xbox videogame
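To make the mask itself visible, here is a minimal sketch that rebuilds the example data by hand (the frame below is an assumption standing in for the asker's CSV). It prints what duplicated() returns and also shows the keep="last" variant, which keeps the final occurrence of each value instead of the first.

```python
import pandas as pd

# Hand-built stand-in for the asker's data (an assumption, not the real CSV)
df = pd.DataFrame({
    "object": ["apple", "ball", "banana", "xbox", "banana", "apple"],
    "type": ["fruit", "toy", "fruit", "videogame", "fruit", "fruit"],
})

# True for every value that already appeared higher up in the column
mask = df["object"].duplicated()
print(mask.tolist())

# Negating the mask keeps the first occurrence of each object
first_kept = df[~mask]

# keep="last" flags the earlier occurrences instead, so the last one survives
last_kept = df[~df["object"].duplicated(keep="last")]
print(first_kept)
print(last_kept)
```

Because the mask is built from a single column but used to index the whole frame, the matching 'type' values are removed together with the duplicate objects.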
Use the drop_duplicates method:
d = pd.DataFrame(
{'object': ['apple', 'ball', 'banana', 'xbox', 'banana', 'apple'],
'type': ['fruit', 'toy', 'fruit', 'videogame', 'fruit', 'fruit']}
)
d.drop_duplicates()
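Called with no arguments, it compares entire rows, which is enough here because the repeated rows are identical in both columns. There are several keyword arguments that might come in handy (like inplace=True if you want your dataframe d to be updated).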
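As a rough illustration of those keyword arguments, the sketch below reuses the same d and tries keep=False and inplace=True. Restricting the check to the 'object' column with subset is my own choice for the example, not something this answer specified.

```python
import pandas as pd

d = pd.DataFrame(
    {'object': ['apple', 'ball', 'banana', 'xbox', 'banana', 'apple'],
     'type': ['fruit', 'toy', 'fruit', 'videogame', 'fruit', 'fruit']}
)

# keep=False drops every row whose 'object' value occurs more than once,
# so only the values that were unique to begin with remain
unique_only = d.drop_duplicates(subset='object', keep=False)
print(unique_only)

# inplace=True modifies d itself and returns None
result = d.drop_duplicates(subset='object', inplace=True)
print(result)
print(len(d))
```

Without inplace=True the original frame is left untouched, so the result has to be assigned to a name if you want to keep it.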
You can use .drop_duplicates() with the parameter subset='object' to select the column you want to check, as follows:
df_out = df.drop_duplicates(subset='object')
Result:
print(df_out)
object type
0 apple fruit
1 ball toy
2 banana fruit
3 xbox videogame
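One caveat for the code in the question: it reads the file with header=None, in which case pandas labels the columns 0 and 1 and there is no column called 'object'. A minimal sketch of that situation follows, assuming a comma-separated file with no header row (the inline csv_text is a hypothetical stand-in for the real file).

```python
import io
import pandas as pd

# Hypothetical file contents; the real file and its separator are not shown in the question
csv_text = (
    "apple,fruit\n"
    "ball,toy\n"
    "banana,fruit\n"
    "xbox,videogame\n"
    "banana,fruit\n"
    "apple,fruit\n"
)

# header=None makes pandas label the columns 0, 1, ...
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Refer to the first column by position so the code works whatever its label is
first_col = df.columns[0]
df_out = df.drop_duplicates(subset=[first_col])
print(df_out)
```

Using df.columns[0] avoids hard-coding a label, so the same line works whether or not the file has a header row.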
To get the number of rows removed as duplicates (the length after dropping them is simply len(df.drop_duplicates())):
n_removed = len(df) - len(df.drop_duplicates())