Python finding duplicates more than 2 occurrences in all columns

Question:

I have a dataframe with 2 columns

id  Date
1   2022-01-01
1   2022-01-01
1   2022-01-01
1   2022-01-02
1   2022-01-02
1   2022-01-02
2   2022-01-01
2   2022-01-01
3   2022-01-01
3   2022-01-01
3   2022-01-01
3   2022-01-01

I need to find the duplicated rows (or duplicated IDs) when there are same values in all columns and those rows should appear more than twice.

The result should be like this

id  Date
1   2022-01-01
1   2022-01-02
3   2022-01-01
Asked By: ML_Enthu

||

Answers:

Here’s a way to find rows duplicated more than twice:

df2 = df.assign(x=0).groupby(list(df.columns)).count().query('x>2').drop(columns='x').reset_index()

Ouput:

   id        Date
0   1  2022-01-01
1   1  2022-01-02
2   3  2022-01-01

Explanation:

  • add a temporary column x
  • use groupby().count() to obtain unique combinations of the id, Date key with the number of occurrences per group in column x
  • use query() to filter for rows that occur more than twice using the condition x>2
  • drop the temporary column x and use reset_index() to restore id and Date as columns.
Answered By: constantstranger

You can also use .value_counts instead of groupby, while the rest is similar to the answer of @constantstranger:

(df
   # count duplicate rows
   .value_counts()

   # convert to DataFrame to query the number of occurrences later on
   .to_frame()

   # MultiIndex to column and renaming
   .reset_index().rename(columns={0:'count'})

   # Filter out only 3+ duplicates
   .query('count > 2')

   # Remove the temporary columns
   .drop(columns='count')
)

giving

   id        Date                                                                                 
0   3  2022-01-01                                                                                 
1   1  2022-01-01                                                                                 
2   1  2022-01-02   
Answered By: Ali Değer Ozbakir
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.