Python finding duplicates more than 2 occurrences in all columns
Question:
I have a dataframe with 2 columns
id Date
1 2022-01-01
1 2022-01-01
1 2022-01-01
1 2022-01-02
1 2022-01-02
1 2022-01-02
2 2022-01-01
2 2022-01-01
3 2022-01-01
3 2022-01-01
3 2022-01-01
3 2022-01-01
I need to find the duplicated rows (or duplicated IDs) where the values are the same in all columns and that appear more than twice.
The result should be like this
id Date
1 2022-01-01
1 2022-01-02
3 2022-01-01
Answers:
Here’s a way to find rows duplicated more than twice:
df2 = df.assign(x=0).groupby(list(df.columns)).count().query('x>2').drop(columns='x').reset_index()
Output:
id Date
0 1 2022-01-01
1 1 2022-01-02
2 3 2022-01-01
Explanation:
- add a temporary column x
- use groupby().count() to obtain the unique combinations of the id, Date key with the number of occurrences per group in column x
- use query() to filter for rows that occur more than twice using the condition x>2
- drop the temporary column x and use reset_index() to restore id and Date as columns.
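The steps above can be reproduced end-to-end on the sample data from the question (a minimal sketch; the DataFrame construction here is an assumption for illustration):

```python
import pandas as pd

# Build the sample DataFrame from the question
df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Date": ["2022-01-01"] * 3 + ["2022-01-02"] * 3
            + ["2022-01-01"] * 2 + ["2022-01-01"] * 4,
})

# Count occurrences of each (id, Date) combination via a temporary column x,
# keep combinations seen more than twice, then restore the key as columns
df2 = (df.assign(x=0)
         .groupby(list(df.columns))
         .count()
         .query('x > 2')
         .drop(columns='x')
         .reset_index())
print(df2)
```

Note that list(df.columns) is evaluated on the original df, so the grouping key is ['id', 'Date'] even though assign() has already added x.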
You can also use .value_counts instead of groupby, while the rest is similar to the answer of @constantstranger:
(df
# count duplicate rows
.value_counts()
# convert to DataFrame to query the number of occurrences later on
.to_frame()
# MultiIndex to column and renaming
.reset_index().rename(columns={0:'count'})
# Filter out only 3+ duplicates
.query('count > 2')
# Remove the temporary column
.drop(columns='count')
)
giving
id Date
0 3 2022-01-01
1 1 2022-01-01
2 1 2022-01-02
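If you want to keep the rows in their original order rather than sorted by count, a transform-based variant is also possible (a sketch under the same sample data, not from the answers above): broadcast each group's size back to every row with groupby(...).transform('size'), filter, then deduplicate.

```python
import pandas as pd

# Sample DataFrame from the question
df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Date": ["2022-01-01"] * 3 + ["2022-01-02"] * 3
            + ["2022-01-01"] * 2 + ["2022-01-01"] * 4,
})

# Size of each (id, Date) group, aligned back to every original row
sizes = df.groupby(list(df.columns))["id"].transform("size")

# Keep rows whose combination occurs more than twice, one row per combination
df3 = df[sizes > 2].drop_duplicates().reset_index(drop=True)
print(df3)
```

This avoids the temporary count column entirely and never moves the key into the index.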