Flag if a row is duplicated an attach if it's the 1st, 2nd, etc duplicated row

Question

I’d like to flag if a row is duplicated, and attach if it’s the 1st, 2nd, 3rd, etc duplicated column in a Pandas DataFrame.

More visually, I’d like to go from:

id	Country	City
1	France	Paris
2	France	Paris
3	France	Lyon
4	France	Lyon
5	France	Lyon

to

id	Country	City	duplicated_flag
1	France	Paris	1
2	France	Paris	1
3	France	Lyon	2
4	France	Lyon	2
5	France	Lyon	2

Note that id is not taken into account to see if the row is duplicated.

Asked By: Marta Campos

||

Source

Answer 1

Two options:

First, if you have lots of columns that you need to compare, you can use:

comparison_df = df.drop("id", axis=1)
df["duplicated_flag"] = (comparison_df != comparison_df.shift()).any(axis=1).cumsum()

We drop the columns that aren’t needed in the comparison. Then, we check whether each row is equivalent to the one above it using .shift() and .any(). Finally, we read off the value of duplicated_flag using .cumsum().

But, if you only have two columns to compare (or if for some reason you have lots of columns that you need to drop), you can find mismatched rows one at a time, and then use .cumsum() to get the value of duplicated_flag for each row. It’s a bit more verbose so I’m not super happy with this option, but I’m leaving this here for completeness in case this suits your use case better:

country_comparison = df["Country"].ne(df["Country"].shift())
city_comparison = df["City"].ne(df["City"].shift())
df["duplicated_flag"] = (country_comparison | city_comparison).cumsum()
print(df)

These output:

     id    Country      City  duplicated_flag
0  1     France     Paris                   1
1  2     France     Paris                   1
2  3     France      Lyon                   2
3  4     France      Lyon                   2
4  5     France      Lyon                   2

Answered By: BrokenBenchmark

Flag if a row is duplicated an attach if it's the 1st, 2nd, etc duplicated row

Question:

Answers: