df.unique() on whole DataFrame based on a column

Question:

I have a DataFrame df with rows and columns, where there are duplicate Ids:

Index   Id   Type
0       a1   A
1       a2   A
2       b1   B
3       b3   B
4       a1   A
...
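
A minimal sketch to rebuild this sample (assuming the Index column is the DataFrame's index):

import pandas as pd

df = pd.DataFrame(
    {"Id": ["a1", "a2", "b1", "b3", "a1"], "Type": ["A", "A", "B", "B", "A"]},
    index=pd.RangeIndex(5, name="Index"),
)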

When I use:

uniqueId = df["Id"].unique() 

I get an array of the unique IDs.

How can I apply this filtering to the whole DataFrame so that its structure is kept but the duplicates (based on "Id") are removed?

Asked By: JohnAndrews


Answers:

It seems you need DataFrame.drop_duplicates with the parameter subset, which specifies the column(s) to test for duplicates:

# keep the first occurrence of each Id
df = df.drop_duplicates(subset=['Id'])
print(df)
       Id Type
Index
0      a1    A
1      a2    A
2      b1    B
3      b3    B

# keep the last occurrence of each Id
df = df.drop_duplicates(subset=['Id'], keep='last')
print(df)
       Id Type
Index
1      a2    A
2      b1    B
3      b3    B
4      a1    A

# remove all rows whose Id is duplicated
df = df.drop_duplicates(subset=['Id'], keep=False)
print(df)
       Id Type
Index
1      a2    A
2      b1    B
3      b3    B
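
If duplicates should instead be defined by a combination of columns, subset accepts a list of column names; a minimal sketch:

# treat rows as duplicates only when both Id and Type repeat
df = df.drop_duplicates(subset=['Id', 'Type'])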
Answered By: jezrael

It’s also possible to call duplicated() to flag the duplicates and filter the DataFrame with the negation of those flags.

df = df[~df.duplicated(subset=['Id'])].copy()

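duplicated() accepts the same keep parameter as drop_duplicates(), so the keep='last' and keep=False variants from the first answer can be reproduced the same way; a minimal sketch:

# keep the last occurrence of each Id
df_last = df[~df.duplicated(subset=['Id'], keep='last')].copy()

# remove every row whose Id occurs more than once
df_none = df[~df.duplicated(subset=['Id'], keep=False)].copy()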

This is particularly useful if you want to drop duplicates conditionally, e.g. only the duplicates of a specific value. For example, the following code drops duplicate 'a1' rows from column Id (other duplicates are kept).

new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()

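To see why this works, it helps to inspect the boolean mask itself; a minimal sketch on the sample frame, with the expected output shown below the call:

mask = ~df['Id'].duplicated() | df['Id'].ne('a1')
print(mask)
Index
0     True
1     True
2     True
3     True
4    False
Name: Id, dtype: bool

Only row 4, the second 'a1', is flagged False and therefore dropped.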

Answered By: cottontail