Confirm if duplicated employees are in sequence in pandas dataframe

Question

Imagine I have the following dataframe, in which the same person (identified by first name and last name) can appear more than once:

ID      FirstName   LastName    Country
1       Paulo       Cortez      Brasil
2       Paulo       Cortez      Brasil
3       Paulo       Cortez      Espanha
1       Maria       Lurdes      Espanha
1       Maria       Lurdes      Espanha
1       John        Page        USA
2       Felipe      Cardoso     Brasil
2       John        Page        USA
3       Felipe      Cardoso     Espanha
2       Steve       Xis         UK
1       Peter       Dave        UK
np.nan  Peter       Dave        UK

The rule is: if a person appears only once, their ID must be 1. If a person appears more than once (matching on first name and last name only), their IDs must be sequential, starting at 1 (in any row) and increasing by 1 for each additional duplicated record.

I need a way to filter this dataframe to find the people who do not follow this rule, returning either the single record or, for a duplicated person, all of that person's records. The expected result is:

ID      FirstName   LastName    Country
1       Maria       Lurdes      Espanha
1       Maria       Lurdes      Espanha
2       Felipe      Cardoso     Brasil
3       Felipe      Cardoso     Espanha
2       Steve       Xis         UK
1       Peter       Dave        UK
np.nan  Peter       Dave        UK

What would be the best way to achieve this?

Asked By: Paulo Cortez

Answer 1

Since a valid ID sequence for a person is simply the list from 1 to n, where n is the number of entries for that person, you can use a groupby and a filter to find the groups that break the rule.

First the code, then an explanation:

>>> df.groupby(["FirstName","LastName"]).filter(lambda x: sorted(x.ID) != list(range(1, x.ID.count()+1)))
     ID FirstName LastName  Country
3   1.0     Maria   Lurdes  Espanha
4   1.0     Maria   Lurdes  Espanha
6   2.0    Felipe  Cardoso   Brasil
8   3.0    Felipe  Cardoso  Espanha
9   2.0     Steve      Xis       UK
10  1.0     Peter     Dave       UK
11  NaN     Peter     Dave       UK

First, we do a .groupby(["FirstName","LastName"]) so that each group contains only rows with exactly the same first and last name.
We then filter the groups with .filter(lambda x: sorted(x.ID) != list(range(1, x.ID.count()+1))). The lambda sorts the group's IDs and compares them with the ideal list going from 1 to n, where n is the number of non-missing IDs in the group; filter keeps only the groups where the two lists differ. A missing ID (NaN) also makes the lists differ, so that person is returned as well.
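
For anyone who wants to reproduce this, here is a minimal, self-contained sketch. It rebuilds the sample data from the question and moves the lambda into a named helper. The helper name breaks_sequence and the variable bad are only illustrative, and the sketch uses len(group) rather than count(); on this data both flag the same rows, because a NaN ID never compares equal to a number.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (np.nan marks the missing ID).
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 1, 1, 1, 2, 2, 3, 2, 1, np.nan],
        "FirstName": ["Paulo", "Paulo", "Paulo", "Maria", "Maria", "John",
                      "Felipe", "John", "Felipe", "Steve", "Peter", "Peter"],
        "LastName": ["Cortez", "Cortez", "Cortez", "Lurdes", "Lurdes", "Page",
                     "Cardoso", "Page", "Cardoso", "Xis", "Dave", "Dave"],
        "Country": ["Brasil", "Brasil", "Espanha", "Espanha", "Espanha", "USA",
                    "Brasil", "USA", "Espanha", "UK", "UK", "UK"],
    }
)


def breaks_sequence(group: pd.DataFrame) -> bool:
    """Return True when the group's IDs are not exactly 1, 2, ..., n."""
    # Hypothetical helper name; n is the number of rows for this person.
    expected = list(range(1, len(group) + 1))
    # A NaN ID never equals an integer, so a missing ID also fails the check.
    return sorted(group["ID"]) != expected


# filter keeps every row of each group for which the helper returns True.
bad = df.groupby(["FirstName", "LastName"]).filter(breaks_sequence)
print(bad)
```

The rows come back in their original order with their original index, so the result can be compared directly against the output shown above.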

Answered By: Jakob Guldberg Aaes
