Formulating self increasing flag with end string based condition

Question:

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'Category': {0: 'onboarding segment-confirmation-unexpected-input origin',
  1: 'onboarding segment-confirmation-unexpected-input view',
  2: 'product-availability cpf-request-unexpected-input origin',
  3: 'product-availability postalcode-validation-true-unexpected-input origin',
  4: 'product-availability postalcode-validation-true-unexpected-input view'},
 'UserId': {0: 9090, 1: 4545, 2: 3266, 3: 2894, 4: 2772}})

I want to build a flag based on the part of each string that comes before the trailing word "view" or "origin". If that part is equal to the previous row's, the flag stays the same; if not, the flag value increases by one.

Desired result:

df = pd.DataFrame({'Category': {0: 'onboarding segment-confirmation-unexpected-input origin',
  1: 'onboarding segment-confirmation-unexpected-input view',
  2: 'product-availability cpf-request-unexpected-input origin',
  3: 'product-availability postalcode-validation-true-unexpected-input origin',
  4: 'product-availability postalcode-validation-true-unexpected-input view'},
 'UserId': {0: 9090, 1: 4545, 2: 3266, 3: 2894, 4: 2772},
 'Flag': {0: 'Flag_1', 1: 'Flag_1', 2: 'Flag_2', 3: 'Flag_3', 4: 'Flag_3'}})

What would be the way to do this? I tried slicing the string and building a groupby, but I am having some difficulty with the incrementing part.

Asked By: INGl0R1AM0R1

||

Answers:

Assuming you want to consider the first 2 blocks of the string (blocks being separated by spaces):

# get substrings, keep first 2 (can be changed)
df2 = df['Category'].str.split(expand=True).iloc[:, :2]

# start new group if any value is different from the previous row
group = df2.ne(df2.shift()).any(axis=1).cumsum()

# add flag
df['Flag'] = 'Flag_'+group.astype(str)

output:

                                            Category  UserId    Flag
0  onboarding segment-confirmation-unexpected-inp...    9090  Flag_1
1  onboarding segment-confirmation-unexpected-inp...    4545  Flag_1
2  product-availability cpf-request-unexpected-in...    3266  Flag_2
3  product-availability postalcode-validation-tru...    2894  Flag_3
4  product-availability postalcode-validation-tru...    2772  Flag_3
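
To see why this works, it can help to print the intermediate results. The sketch below rebuilds the sample frame from the question (with lists instead of dicts, purely for brevity) and introduces a hypothetical name, changed, for the boolean mask that the code above computes inline:

```python
import pandas as pd

# Sample data from the question, written with lists for brevity
df = pd.DataFrame({
    'Category': [
        'onboarding segment-confirmation-unexpected-input origin',
        'onboarding segment-confirmation-unexpected-input view',
        'product-availability cpf-request-unexpected-input origin',
        'product-availability postalcode-validation-true-unexpected-input origin',
        'product-availability postalcode-validation-true-unexpected-input view',
    ],
    'UserId': [9090, 4545, 3266, 2894, 2772],
})

# First two whitespace-separated blocks of each row
df2 = df['Category'].str.split(expand=True).iloc[:, :2]

# True where a row differs from the row above in at least one block;
# the first row is compared against missing values, so it is always True
changed = df2.ne(df2.shift()).any(axis=1)

# Running count of the True values: one integer per run of equal rows
group = changed.cumsum()

print(changed.tolist())  # [True, False, True, True, False]
print(group.tolist())    # [1, 1, 2, 3, 3]
```

Because the first row always counts as a change, the numbering starts at 1 without any extra offset.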
Answered By: mozway

This works for me:

df = pd.DataFrame({'Category': {0: 'onboarding segment-confirmation-unexpected-input origin',
  1: 'onboarding segment-confirmation-unexpected-input view',
  2: 'product-availability cpf-request-unexpected-input origin',
  3: 'product-availability postalcode-validation-true-unexpected-input origin',
  4: 'product-availability postalcode-validation-true-unexpected-input view'},
 'UserId': {0: 9090, 1: 4545, 2: 3266, 3: 2894, 4: 2772}})

# Use the first 40 characters as the grouping key; adjust the length to fit your data
df['temp'] = df['Category'].str[:40]

# Number each distinct key in order of first appearance, starting at 1
df['Flag'] = df.groupby(['temp'], sort=False).ngroup() + 1
df['Flag'] = 'Flag_' + df['Flag'].astype(str)
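
One thing to keep in mind with this approach: groupby(...).ngroup() numbers each distinct key once, so a key that comes back later in the frame gets its old number again, whereas a shift/cumsum comparison starts a new number at every change. On the sample data both give the same result. A small sketch with made-up keys (the Series and the names by_value and by_run are illustrative, not from the question) shows the difference:

```python
import pandas as pd

# Hypothetical keys: 'a' comes back after 'b'
keys = pd.Series(['a', 'a', 'b', 'a'], name='temp')

# ngroup() gives one number per distinct key, in order of first appearance
by_value = keys.groupby(keys, sort=False).ngroup() + 1

# shift/cumsum gives a new number every time the key changes from the previous row
by_run = keys.ne(keys.shift()).cumsum()

print(by_value.tolist())  # [1, 1, 2, 1]
print(by_run.tolist())    # [1, 1, 2, 3]
```

Which behaviour is right depends on whether a repeated category further down should reuse its earlier flag or get a fresh one.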


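Answered By: grymlin

# split on spaces and drop the last block ("view" / "origin")
df1 = df.Category.str.split(' ', expand=True).iloc[:, :-1]
# start a new group whenever a row differs from the previous one, then format as Flag_N
df.assign(flag=df1.ne(df1.shift()).any(axis=1).cumsum().map('Flag_{}'.format))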

Output:

                                            Category  UserId    flag
0  onboarding segment-confirmation-unexpected-inp...    9090  Flag_1
1  onboarding segment-confirmation-unexpected-inp...    4545  Flag_1
2  product-availability cpf-request-unexpected-in...    3266  Flag_2
3  product-availability postalcode-validation-tru...    2894  Flag_3
4  product-availability postalcode-validation-tru...    2772  Flag_3
Answered By: G.G
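
The split-based answers above work well when every row has the same number of space-separated blocks. If that is not guaranteed, one possible alternative (a sketch, not taken from any of the answers) is to strip a trailing "view" or "origin" with a regular expression and compare what remains with the previous row. It assumes those two words only ever need removing at the very end of the string; prefix and flagged are illustrative names.

```python
import pandas as pd

# Sample data from the question, written with lists for brevity
df = pd.DataFrame({
    'Category': [
        'onboarding segment-confirmation-unexpected-input origin',
        'onboarding segment-confirmation-unexpected-input view',
        'product-availability cpf-request-unexpected-input origin',
        'product-availability postalcode-validation-true-unexpected-input origin',
        'product-availability postalcode-validation-true-unexpected-input view',
    ],
    'UserId': [9090, 4545, 3266, 2894, 2772],
})

# Remove a trailing " view" or " origin"; everything before it is kept as is
prefix = df['Category'].str.replace(r'\s+(?:view|origin)$', '', regex=True)

# New group whenever the remaining text differs from the previous row
group = prefix.ne(prefix.shift()).cumsum()

# assign() returns a new frame and leaves df untouched
flagged = df.assign(Flag='Flag_' + group.astype(str))

print(flagged['Flag'].tolist())
# ['Flag_1', 'Flag_1', 'Flag_2', 'Flag_3', 'Flag_3']
```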