# select rows based on multi conditions with keep or change group number- Dataframe

## Question:

I have a pandas DataFrame (df)

```
import pandas as pd
import numpy as np
df=pd.DataFrame({'user': ['user 1', 'user 2', 'user 3', 'user 4', 'user 1', 'user 2', 'user 3', 'user 4'],
'group': [0, 0, 0, 0, 1, 1, 1, 1],
'p1': [0.759, 1.106, 1.619, 1.260, 0.540, 1.437, 1.440, 1.332],
'p2': [0.9, 0.9, 0.9, 0.9, 0.7, 0.7, 0.7, 0.7],})
df
```

output:

```
user group p1 p2
0 user 1 0 0.759 0.9
1 user 2 0 1.106 0.9
2 user 3 0 1.619 0.9
3 user 4 0 1.260 0.9
4 user 1 1 0.540 0.7
5 user 2 1 1.437 0.7
6 user 3 1 1.440 0.7
7 user 4 1 1.332 0.7
```

I want to return each user with a condition if p1 is below p2 then return this row and if there is no row that meets this condition when p1 is below p2 then return this user with a change group number to a new group number (a random number which not in group list).

For example: for the user1, row number 4 should be selected since it returns a min value of p1 below p2 with group number 1, and even row 0 meet this condition but still, row 4 has a min value of p1.

For users 2, 3, and 4, all p1 is higher than p2 for all rows, so we should change the group number to a new value.

I used the following code but it change the group number to the max number of the group numbers (here 2).

```
mylist=df['group'].values.tolist()
lst = list(set(mylist))
df2 = (df[df['p1'].lt(df['p2'])]
.set_index('group')
.groupby('user')['p1']
.idxmin()
.reindex(df['user'].unique(), fill_value=max(set(lst))+1)
.reset_index(name='group'))
df2
```

output:

```
user group
0 user 1 1
1 user 2 2
2 user 3 2
3 user 4 2
```

The expected output: when the condition is not met (p1 is higher than p2) replace the group number with a random number that is not in the group number list (her group list=[0,1])

## Answers:

You can use:

```
# Create a replacement group for each user, start for group max
new_group = np.arange(df['user'].nunique()) + df['group'].max() + 1
# Keep one instance of user (one have the most probability to satisfy conditions)
out = (df.assign(dp=lambda x: x['p1'] - x['p2'])
.sort_values(['dp', 'p1'])
.drop_duplicates('user'))
# Set new group if needed else keep original group
out['group'] = np.where(out['dp'] < 0, out['group'], new_group)
```

Output:

```
>>> out
user group p1 p2 dp
4 user 1 1 0.540 0.7 -0.160
1 user 2 3 1.106 0.9 0.206
3 user 4 4 1.260 0.9 0.360
2 user 3 5 1.619 0.9 0.719
```

You can manipulate the group numbers before `groupby("user")`

:

- If a row has
`p1 < p2`

, keep the same`group`

- Otherwise, change it to an arbitrary number that does not appear in the original data, and also unique within the output.

For the second part, since we don’t care about the new group number, we can simply take `df["group"] + df["group"].max() + [1, 2, 3, ...]`

:

```
s = np.where(df["p1"] < df["p2"], 0, df["group"].max() + np.arange(len(df)) + 1)
result = (
df.assign(new_group=df["group"] + s)
.sort_values(["user", "p1"])
.groupby("user")
.head(1)
)
```

Result:

```
user group p1 p2 new_group
4 user 1 1 0.540 0.7 1
1 user 2 0 1.106 0.9 3
6 user 3 1 1.440 0.7 9
3 user 4 0 1.260 0.9 5
```

Trim the columns as needed.