Grouping by multiple columns to find duplicate rows in pandas

Question:

I have a df

id    val1     val2
 1     1.1      2.2
 1     1.1      2.2
 2     2.1      5.5
 3     8.8      6.2
 4     1.1      2.2
 5     8.8      6.2

I want to group by val1 and val2 and get back a similar DataFrame containing only the rows whose combination of val1 and val2 occurs more than once.

Final df:

id    val1     val2
 1     1.1      2.2
 4     1.1      2.2
 3     8.8      6.2
 5     8.8      6.2
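
For reference, a minimal sketch that builds this sample DataFrame (values taken from the tables above):

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'id':   [1, 1, 2, 3, 4, 5],
    'val1': [1.1, 1.1, 2.1, 8.8, 1.1, 8.8],
    'val2': [2.2, 2.2, 5.5, 6.2, 2.2, 6.2],
})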
Asked By: Shubham R


Answers:

You need duplicated with the subset parameter to specify which columns to check, and keep=False to mark all duplicates; this produces a boolean mask you can use to filter with boolean indexing:

df = df[df.duplicated(subset=['val1','val2'], keep=False)]
print(df)
   id  val1  val2
0   1   1.1   2.2
1   1   1.1   2.2
3   3   8.8   6.2
4   4   1.1   2.2
5   5   8.8   6.2

Detail:

print(df.duplicated(subset=['val1','val2'], keep=False))
0     True
1     True
2    False
3     True
4     True
5     True
dtype: bool
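
Note that boolean indexing keeps the original row order. If you also want equal val1/val2 combinations grouped together, as in the question's expected output, one option (a sketch, not part of the original answer) is to sort the filtered result:

out = df[df.duplicated(subset=['val1','val2'], keep=False)]
# a stable sort keeps the original row order within each (val1, val2) group
out = out.sort_values(['val1', 'val2'], kind='stable')
print(out)
   id  val1  val2
0   1   1.1   2.2
1   1   1.1   2.2
4   4   1.1   2.2
3   3   8.8   6.2
5   5   8.8   6.2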
Answered By: jezrael

Another method is to compute the size of each group and keep only the rows whose group size is greater than 1.

msk = df.groupby(['val1', 'val2'])['val1'].transform('size') > 1
df1 = df[msk]

print(df1)
   id  val1  val2
0   1   1.1   2.2
1   1   1.1   2.2
3   3   8.8   6.2
4   4   1.1   2.2
5   5   8.8   6.2
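
As a quick check (not part of the original answer), this mask is the same boolean Series that duplicated produces in the first answer:

# both approaches mark exactly the same rows
msk_dup = df.duplicated(subset=['val1', 'val2'], keep=False)
print(msk.equals(msk_dup))
True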

Answered By: cottontail