Removing duplicate rows based on values in row somewhere above

Question

Here is an example of my dataframe:

df = pd.DataFrame([['In', 'Age', 'Nat.'],
                   ['Jakub Kiwior', 22, 'Poland'],
                   ['Leandro Trossard', 28, 'Belgium'],
                   ['Jorginho', 31, 'Italy'],
                   ['Out', 'Age', 'Nat.'],
                   ['Jhon Durán', 19, 'Colombia'],
                   ['In', 'Age', 'Nat.'],
                   ['Jhon Durán', 19, 'Colombia'],
                   ['Álex Moreno', 29, 'Spain'],
                   ['Out', 'Age', 'Nat.'],
                   ['Leandro Trossard', 28, 'Belgium'],
                   ['Jorginho', 31, 'Italy'],
                   ['In', 'Age', 'Nat.'],
                   ['Out', 'Age', 'Nat.'],
                   ['In', 'Age', 'Nat.'],
                  ], columns=['Player', 'Age', 'Nat.'])

My desired output is a dataframe that removes duplicate rows if the row above (not necessarily directly above) has the value ‘Out’ in the ‘Player’ column.

For example, the desired output would remove the first "Jhon Durán" row, and the second "Leandro Trossard" and "Jorginho" rows, since these are the rows with "Out" above them and not "In".

Is this possible to achieve with pandas?

Asked By: code_machine

||

Source

Answer 1

You could use Pandas shift method to help achieve this.

df['previousPlayer'] = df['Player'].shift(1)
df

                 Player  Age      Nat.    previousPlayer
0                 In  Age      Nat.               NaN
1       Jakub Kiwior   22    Poland                In
2   Leandro Trossard   28   Belgium      Jakub Kiwior
3           Jorginho   31     Italy  Leandro Trossard
4                Out  Age      Nat.          Jorginho
5         Jhon Durán   19  Colombia               Out
6                 In  Age      Nat.        Jhon Durán
7         Jhon Durán   19  Colombia                In
8        Álex Moreno   29     Spain        Jhon Durán
9                Out  Age      Nat.       Álex Moreno
10  Leandro Trossard   28   Belgium               Out
11          Jorginho   31     Italy  Leandro Trossard
12                In  Age      Nat.          Jorginho
13               Out  Age      Nat.                In
14                In  Age      Nat.               Out

Then simply filter out any values in the new column with the word of your choice:

df = df[df.previousPlayer != 'Out'].drop('previousPlayer', axis=1)
print(df)

              Player  Age      Nat.
0                 In  Age      Nat.
1       Jakub Kiwior   22    Poland
2   Leandro Trossard   28   Belgium
3           Jorginho   31     Italy
4                Out  Age      Nat.
6                 In  Age      Nat.
7         Jhon Durán   19  Colombia
8        Álex Moreno   29     Spain
9                Out  Age      Nat.
11          Jorginho   31     Italy
12                In  Age      Nat.
13               Out  Age      Nat.

Answered By: straka86

Answer 2

It would be more beneficial to add an additional column "out/in" or another way to store your data. If we look at the header "player" it should only have players in it; "out" or "in" is not a relevant data entry for that column. It’s really important to make sure we are creating clean and easily manipulated data.

Once we add a new column "out/in" we can easily change the player’s status and drop players that are out.

Here is the above applied. Notice the addition of the status column:

df = pd.DataFrame([['Jakub Kiwior', 22, 'Poland','in'],
                   ['Leandro Trossard', 28, 'Belgium','in'],
                   ['Jorginho', 31, 'Italy','in'],
                   ['Jhon Durán', 19, 'Colombia', 'out'],
                   ['Álex Moreno', 29, 'Spain','in'],
                   ['Leandro Trossard', 28, 'Belgium', 'out'],
                   ['Jorginho', 31, 'Italy','in'],
                  ], columns=['Player', 'Age', 'Nat.', 'status'])

Here is the original data frame:

if we want to change a player from in to out:

player_name = 'Leandro Trossard'
df.loc[df['Player'] == player_name, 'status'] = 'out'

Which gives us:

if we want to drop all players that are out from the dataframe:

in_players_df = df[df['status']!='out']

Which gives us:

This creates a new df with only ‘in’ players. This allows us to have access to a filtered view (in_players_df), and a view of all players regardless of in/out (df). If we used df=... it would change the original dataframe. Just depends what you’re after 🙂

Answered By: ayliah

Answer 3

I believe this is what youre looking for:

df.loc[df['Player'].isin(['In','Out']).cumsum().mod(2).eq(1) | df['Player'].eq('Out')]

Output:

              Player  Age      Nat.
0                 In  Age      Nat.
1       Jakub Kiwior   22    Poland
2   Leandro Trossard   28   Belgium
3           Jorginho   31     Italy
4                Out  Age      Nat.
6                 In  Age      Nat.
7         Jhon Durán   19  Colombia
8        Álex Moreno   29     Spain
9                Out  Age      Nat.
12                In  Age      Nat.
13               Out  Age      Nat.
14                In  Age      Nat.

Answered By: rhug123

Answer 4

Use boolean indexing:

# keep rows that are after In and drop those after Out
m1 = df['Player'].map({'In': True, 'Out': False}).ffill()
# keep rows with In/Out
m2 = df['Player'].isin(['In', 'Out'])

out = df.loc[m1|m2]

Output:

              Player  Age      Nat.
0                 In  Age      Nat.
1       Jakub Kiwior   22    Poland
2   Leandro Trossard   28   Belgium
3           Jorginho   31     Italy
4                Out  Age      Nat.
6                 In  Age      Nat.
7         Jhon Durán   19  Colombia
8        Álex Moreno   29     Spain
9                Out  Age      Nat.
12                In  Age      Nat.
13               Out  Age      Nat.
14                In  Age      Nat.

Answered By: mozway

Removing duplicate rows based on values in row somewhere above

Question:

Answers: