How to find duplicated rows with different values in some columns in a DataFrame in Python Pandas?

Question:

I have a DataFrame in Python like the one below, where we can see duplicates for some IDs:

ID COL1 COL2 COL3
123 XX 111 ENG
123 abc 111 ENG
444 ccc 2 o
444 ccc 2 o
67 a 89 xx

And I need to select the rows in a situation like the one for ID = 123, where rows share an ID but have a different value in one or more columns, so as an output I need something like below:

ID COL1 COL2 COL3
123 XX 111 ENG
123 abc 111 ENG

How can I do that in Python Pandas? I should add that my real dataset has many more columns, so I need a solution that works for more columns than just ID, COL1, COL2, COL3 🙂
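For reference, the sample frame above can be rebuilt like this (a minimal sketch; the dtypes are assumed from the table):

import pandas as pd

# Rebuild the sample data from the table above (dtypes assumed)
df = pd.DataFrame({
    "ID":   [123, 123, 444, 444, 67],
    "COL1": ["XX", "abc", "ccc", "ccc", "a"],
    "COL2": [111, 111, 2, 2, 89],
    "COL3": ["ENG", "ENG", "o", "o", "xx"],
})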

Asked By: unbik


Answers:

Using .query

df = df.query("ID.eq(123)").drop_duplicates().reset_index(drop=True)
print(df)

    ID COL1  COL2 COL3
0  123   XX   111  ENG
1  123  abc   111  ENG

Or, if you aren’t also trying to filter by ID:

df = df.drop_duplicates().reset_index(drop=True)
print(df)

    ID COL1  COL2 COL3
0  123   XX   111  ENG
1  123  abc   111  ENG
2  444  ccc     2    o
3   67    a    89   xx
Answered By: Jason Baker
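A note on the .query approach above: the ID doesn't have to be hardcoded into the expression, since query strings can reference local Python variables with @. A small sketch, where target_id is an illustrative variable name, not from the answer:

# Hypothetical variable holding the ID whose duplicates we want to inspect
target_id = 123

# @ makes .query look up the local Python variable
df = df.query("ID == @target_id").drop_duplicates().reset_index(drop=True)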

Here is one way to do it:

# drop the duplicates
df.drop_duplicates(inplace=True)

# groupby ID and filter the ones where group size is greater than 1
df[df.groupby('ID')['ID'].transform('size')>1]

    ID COL1  COL2 COL3
0  123   XX   111  ENG
1  123  abc   111  ENG

Alternatively,

# preserve the original DF and create a secondary DF with non-duplicate rows
df2 = df.drop_duplicates()

# using loc, select the rows in df2 that have a group size exceeding 1
df2.loc[df2.groupby('ID')['ID'].transform('size')>1]

Answered By: Naveed
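Why the transform trick works: groupby(...).transform('size') returns a Series of group sizes aligned to df2's index, so it can be used directly as a boolean mask. A quick illustration on the sample data (the expected values in the comments assume drop_duplicates kept the first of each identical pair):

print(df2.groupby('ID')['ID'].transform('size'))
# 0    2
# 1    2
# 2    1
# 4    1
# Name: ID, dtype: int64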

First drop duplicates across all columns, then find duplicated values in the ID column. Finally, select the rows with those IDs.

df = df.drop_duplicates()
mask = df.duplicated(subset=['ID'],keep=False)
df = df[mask]
Answered By: Clegane
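Since the question mentions many columns, this last approach generalizes naturally into a small helper. A sketch under illustrative names (conflicting_duplicates and the key parameter are not from the answers):

def conflicting_duplicates(df, key="ID"):
    """Return rows whose key is duplicated but whose other columns differ."""
    # Step 1: drop rows that are identical across all columns
    deduped = df.drop_duplicates()
    # Step 2: keep rows that still share a key value with another row
    return deduped[deduped.duplicated(subset=[key], keep=False)]

# Usage: returns only the two ID=123 rows from the sample data
print(conflicting_duplicates(df))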