Pandas groupby with conditions in Python, keeping all groups

Question:

I have the following pandas dataframe:

import pandas as pd
df = pd.DataFrame({
    "review_num": [2,2,2,1,1,1,1,1,3],
    "review": ["The second review","The second review","The second review",
               "This is the first review","This is the first review",
               "This is the first review","This is the first review",
               "This is the first review",'No'],
    "token_num":[1,2,3,1,2,3,4,5,1],
    "token":["The","second","review","This","is","the","first","review","No"],
    "score":[0.3,-0.6,0.4,0.5,0.8,-0.7,0.6,0.4,0.3]
})

   review_num                    review  token_num   token  score
0           2         The second review          1     The    0.3
1           2         The second review          2  second   -0.6
2           2         The second review          3  review    0.4
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.8
5           1  This is the first review          3     the   -0.7
6           1  This is the first review          4   first    0.6
7           1  This is the first review          5  review    0.4
8           3                        No          1      No    0.3

I need to select rows according to these rules:

  1. If the review contains "t" or "r": get the row of that review with the max score (considering only rows whose token contains "t" or "r")
  2. If the review doesn’t contain "t" or "r": get just one row of the review
  3. Keep the reviews in the same order as in the original table

With the code below, rules 1 and 3 are satisfied, but I don’t see how to satisfy rule 2 without breaking rule 3.

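# True for rows whose token contains "t" or "r" (the match is case-sensitive)
prelist = df['token'].str.contains('|'.join(['t','r']))

# index of the max-score row per review, among the matching rows only
token_max_score = df[prelist].groupby('review_num', sort=False)['score'].idxmax()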

Current result:

review_num
2    2
1    6

Expected result:

review_num
2    2
1    6
3    8

Asked By: SLA


Answers:

Use:

# rows with t/r in token
m = df['token'].str.contains('r|t')
# identify reviews with no match
m2 = (~m).groupby(df['review_num']).transform('all')

# for each group get idxmax
df[m|m2].groupby('review_num', sort=False)['score'].idxmax()

Output:

review_num
2    2
1    6
3    8
Name: score, dtype: int64
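
As a minimal, self-contained sketch of the approach above (assuming the sample dataframe from the question, trimmed to the columns that matter), the resulting index can be passed to df.loc to get the full rows. The names idx and selected are only illustrative. Note that str.contains is case-sensitive here, so "The" and "This" do not count as containing "t".

```python
import pandas as pd

# Sample data from the question (the "review" column is omitted for brevity)
df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token_num": [1, 2, 3, 1, 2, 3, 4, 5, 1],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

# rows whose token contains a lowercase "r" or "t"
m = df["token"].str.contains("r|t")
# rows belonging to a review in which no token matches at all
m2 = (~m).groupby(df["review_num"]).transform("all")

# keep matching rows plus the rows of unmatched reviews, then take idxmax per review
idx = df[m | m2].groupby("review_num", sort=False)["score"].idxmax()

# illustrative extra step: fetch the full rows, in the original review order
selected = df.loc[idx]
print(selected)
```

Because m2 keeps every row of a review without any match, such a review contributes its highest-scoring row, which still satisfies the "just one row" rule.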

Previous answer:

You can use a custom groupby.apply:

(df.groupby('review_num', sort=False)
   .apply(lambda g: g['score'].idxmax()
          if set(g['review'].iloc[0]).intersection(['t', 'r'])
          else g.sample(n=1).index[0])
)

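Example output (here idxmax runs over all rows of a review, not only the t/r tokens, so with the data above review 1 returns row 4, which has score 0.8):

review_num
2    2
1    4
3    8
dtype: int64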

Logic:

  • we group by "review_num", keeping the original order of the groups (sort=False)
  • for each group, we convert the "review" string to a set of characters and intersect it with t/r; if the intersection is not empty, we pick the idxmax of "score"
  • otherwise, we pick a random row
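
The logic above can also be sketched as a named helper. This is a variant under two assumptions that differ from the snippet above: the max is restricted to rows whose token contains "t" or "r" (rule 1 of the question), and the fallback takes the first row of the group instead of a random one, so the result is reproducible. The helper name pick_row is hypothetical.

```python
import pandas as pd

# Sample data from the question (only the columns used here)
df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

def pick_row(g):
    """Return the index label of the row to keep for one review (hypothetical helper)."""
    # rows of this review whose token contains a lowercase "r" or "t"
    hits = g[g["token"].str.contains("r|t")]
    if not hits.empty:
        # rule 1: best score among the matching tokens
        return hits["score"].idxmax()
    # rule 2: no matching token, so keep the first row of the review
    return g.index[0]

# sort=False keeps the reviews in their order of first appearance (rule 3);
# selecting the columns explicitly keeps the grouping column out of each group
result = df.groupby("review_num", sort=False)[["token", "score"]].apply(pick_row)
print(result)
```

This trades the vectorized masks for a per-group Python function, which is usually slower on large frames but keeps each rule visible in one place.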
Answered By: mozway