Check whether duplicated employees have sequential IDs in a pandas DataFrame
Question:
Imagine I have the following dataframe, in which some people appear more than once (identified by first name and last name):
ID      FirstName  LastName  Country
1       Paulo      Cortez    Brasil
2       Paulo      Cortez    Brasil
3       Paulo      Cortez    Espanha
1       Maria      Lurdes    Espanha
1       Maria      Lurdes    Espanha
1       John       Page      USA
2       Felipe     Cardoso   Brasil
2       John       Page      USA
3       Felipe     Cardoso   Espanha
2       Steve      Xis       UK
1       Peter      Dave      UK
np.nan  Peter      Dave      UK
The rule is: if a person appears only once, the ID must be 1. If a person appears more than once (matching on first name and last name only), the IDs must be sequential, starting at 1 (in any row) and increasing by 1 for each additional duplicate record.
I need a way to filter this dataframe to find the people who do not follow this rule, returning either the single record or, for duplicated people, all of their records. The expected result is:
ID      FirstName  LastName  Country
1       Maria      Lurdes    Espanha
1       Maria      Lurdes    Espanha
2       Felipe     Cardoso   Brasil
3       Felipe     Cardoso   Espanha
2       Steve      Xis       UK
1       Peter      Dave      UK
np.nan  Peter      Dave      UK
What would be the best way to achieve it?
Answers:
Since a valid ID sequence for a person with n entries is exactly the list 1, 2, ..., n, you can use a groupby and a filter to find the odd ones out.
First the code, then an explanation:
>>> df.groupby(["FirstName", "LastName"]).filter(lambda x: sorted(x.ID) != list(range(1, x.ID.count() + 1)))
     ID FirstName LastName  Country
3   1.0     Maria   Lurdes  Espanha
4   1.0     Maria   Lurdes  Espanha
6   2.0    Felipe  Cardoso   Brasil
8   3.0    Felipe  Cardoso  Espanha
9   2.0     Steve      Xis       UK
10  1.0     Peter     Dave       UK
11  NaN     Peter     Dave       UK
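First, .groupby(["FirstName", "LastName"]) ensures that each group contains only rows with exactly the same first and last name.
We then keep only the offending groups with .filter(lambda x: sorted(x.ID) != list(range(1, x.ID.count() + 1))). The lambda sorts the group's IDs and compares them with the ideal list 1, 2, ..., n, where n is the number of non-missing IDs in the group (count() skips NaN). A group is returned whenever the two lists differ, so repeated IDs, sequences that do not start at 1, and groups containing a missing ID are all flagged.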
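For anyone who wants to reproduce this, here is a minimal self-contained sketch. The dataframe is rebuilt from the table in the question, and breaks_sequence is a hypothetical helper name (not part of pandas) that spells out the lambda above with an explicit NaN check. It is one possible way to write the check, not the only one.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; np.nan marks the missing ID.
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 1, 1, 2, 2, 3, 2, 1, np.nan],
    "FirstName": ["Paulo", "Paulo", "Paulo", "Maria", "Maria", "John",
                  "Felipe", "John", "Felipe", "Steve", "Peter", "Peter"],
    "LastName": ["Cortez", "Cortez", "Cortez", "Lurdes", "Lurdes", "Page",
                 "Cardoso", "Page", "Cardoso", "Xis", "Dave", "Dave"],
    "Country": ["Brasil", "Brasil", "Espanha", "Espanha", "Espanha", "USA",
                "Brasil", "USA", "Espanha", "UK", "UK", "UK"],
})


def breaks_sequence(group):
    """Hypothetical helper: True when the group's IDs are not exactly 1..n."""
    ids = group["ID"]
    # A missing ID can never be part of a valid sequence.
    if ids.isna().any():
        return True
    # Sort the IDs and compare them with the ideal list 1, 2, ..., n.
    return sorted(ids) != list(range(1, len(ids) + 1))


# filter() keeps every row of each group for which the function returns True.
bad = df.groupby(["FirstName", "LastName"]).filter(breaks_sequence)
print(bad)
```

On the sample data this keeps the rows for Maria Lurdes, Felipe Cardoso, Steve Xis and Peter Dave, in their original row order, and drops Paulo Cortez and John Page, whose IDs already form 1..n.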