Flag if a row is duplicated an attach if it's the 1st, 2nd, etc duplicated row

Question:

I’d like to flag if a row is duplicated, and attach if it’s the 1st, 2nd, 3rd, etc duplicated column in a Pandas DataFrame.

More visually, I’d like to go from:

id Country City
1 France Paris
2 France Paris
3 France Lyon
4 France Lyon
5 France Lyon

to

id Country City duplicated_flag
1 France Paris 1
2 France Paris 1
3 France Lyon 2
4 France Lyon 2
5 France Lyon 2

Note that id is not taken into account to see if the row is duplicated.

Asked By: Marta Campos

||

Answers:

Two options:

First, if you have lots of columns that you need to compare, you can use:

comparison_df = df.drop("id", axis=1)
df["duplicated_flag"] = (comparison_df != comparison_df.shift()).any(axis=1).cumsum()

We drop the columns that aren’t needed in the comparison. Then, we check whether each row is equivalent to the one above it using .shift() and .any(). Finally, we read off the value of duplicated_flag using .cumsum().


But, if you only have two columns to compare (or if for some reason you have lots of columns that you need to drop), you can find mismatched rows one at a time, and then use .cumsum() to get the value of duplicated_flag for each row. It’s a bit more verbose so I’m not super happy with this option, but I’m leaving this here for completeness in case this suits your use case better:

country_comparison = df["Country"].ne(df["Country"].shift())
city_comparison = df["City"].ne(df["City"].shift())
df["duplicated_flag"] = (country_comparison | city_comparison).cumsum()
print(df)

These output:

     id    Country      City  duplicated_flag
0  1     France     Paris                   1
1  2     France     Paris                   1
2  3     France      Lyon                   2
3  4     France      Lyon                   2
4  5     France      Lyon                   2
Answered By: BrokenBenchmark