How to use multiple string conditions and numerical calculations on multiple columns to create multiple columns?

Question:

Input:

(Having error in uploading image, otherwise I always do.)

import pandas as pd

df = pd.DataFrame(
    {
     'keyword': ['apple store', 'apple marketing', 'apple store', 'apple marketing'],
     'rank': [10, 12, 10, 12],
     'impression': [100, 200, 100, 200],
     'landing page': ['nol.com/123', 'nol.com/123', 'oats.com/123', 'oats.com/123']
    }
)

df

Output:

import pandas as pd

df = pd.DataFrame(
    {
        'keyword': ['apple', 'store', 'marketing', 'apple', 'store', 'marketing'],
        'mean_rank': [11.0, 10.0, 12.0, 11.0, 10.0, 12.0],
        'impression': [300, 100, 200, 300, 100, 200],
        'landing page': ['nol.com/123', 'nol.com/123', 'nol.com/123', 'oats.com/123', 'oats.com/123', 'oats.com/123'],
        'keyword_length': [5, 5, 9, 5, 5, 9],
        'impression_per_char': [50.0, 16.67, 20.0, 50.0, 16.67, 20.0]
    }
)

df

Maybe this could be used to find words in keyword:

words = 'apple store'
re.findall('w+', words.casefold())

mean_rank = Mean rank of the word in keyword.

keyword_length = length of the word in keyword.

impression_per_char = Impression/(keyword_length + 1)

Actual dataset has 10,000 rows. This one is made by me, please tell if something is wrong with it. I’ll be parallelly working on this for the next few hours.

Also, for ‘mean_rank’ column, you can take weighted mean or some made up equation that (maybe also) uses ‘impression’, ‘keyword_length’ and/or ‘impression_per_char’ to find a sensible rank. If you do so, then I’ll select that as final answer instead.

Asked By: sonar

||

Answers:

Use Series.str.casefold with Series.str.split and DataFrame.explode for separate words, get legths of words by Series.str.len, then aggregate sum and mean and last create impression_per_char column:

df = df.assign(keyword = df['keyword'].str.casefold().str.split()).explode('keyword')
df['keyword_length'] = df['keyword'].str.len()
    
df = (df.groupby(['keyword','landing page', 'keyword_length' ], as_index=False, sort=False)
        .agg(mean_rank=('rank','mean'), impression=('impression', 'sum')))

df['impression_per_char'] = df['impression'].div(df['keyword_length'].add(1))
print (df)
     keyword  landing page  keyword_length  mean_rank  impression  
0      apple   nol.com/123               5         11         300   
1      store   nol.com/123               5         10         100   
2  marketing   nol.com/123               9         12         200   
3      apple  oats.com/123               5         11         300   
4      store  oats.com/123               5         10         100   
5  marketing  oats.com/123               9         12         200   

   impression_per_char  
0            50.000000  
1            16.666667  
2            20.000000  
3            50.000000  
4            16.666667  
5            20.000000  
Answered By: jezrael