Extract pattern from a column based on another column's value

Question:

given two columns of a pandas dataframe:

import pandas as pd
df = {'word': ['replay','replayed','playable','thinker','think','thoughtful', 'ex)mple'],
      'root': ['play','play','play','think','think','think', 'ex)mple']}
df = pd.DataFrame(df, columns= ['word','root'])

I’d like to extract the substring of column word that includes everything up to the end of the string in the corresponding column root or NaN if the string in root is not included in word. That is, the resulting dataframe would look as follows:

word       root    match
replay     play    replay
replayed   play    replay
playable   play    play
thinker    think   think
think      think   think
thoughtful think   NaN
ex)mple    ex)mple ex)mple

My dataframe has several thousand rows, so I’d like to avoid for-loops if possible.

Asked By: hyhno01


Answers:

You can use a regex with str.extract in a groupby+apply:

import re
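# group_keys=False keeps the original row index so the result aligns with df;
# expand=False makes str.extract return a Series rather than a one-column DataFrame
df['match'] = (df.groupby('root', group_keys=False)['word']
                 .apply(lambda g: g.str.extract(f'^(.*{re.escape(g.name)})', expand=False))
               )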

Or, if you expect few repeated "root" values:

import re
df['match'] = df.apply(lambda r: m.group()
                       if (m:=re.match(f'.*{re.escape(r["root"])}', r['word']))
                       else None, axis=1)

output:

         word   root   match
0      replay   play  replay
1    replayed   play  replay
2    playable   play    play
3     thinker  think   think
4       think  think   think
5  thoughtful  think     NaN
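
The table above stops at row 5, but the question's seventh row ('ex)mple') is the one that shows why both snippets call re.escape. A minimal sketch using only the re module, with the root and word taken from the question's data:

```python
import re

root = 'ex)mple'
word = 'ex)mple'

# Unescaped, the ")" is parsed as an unbalanced group close,
# so the pattern fails to compile.
try:
    re.match(f'.*{root}', word)
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaped, the root is treated as literal text.
escaped = re.match(f'.*{re.escape(root)}', word)

print(unescaped_ok)      # False
print(escaped.group())   # ex)mple
```
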
Answered By: mozway

Building on mozway's answer: the regex can also be assembled from several pieces, which allows a different application that seems likely to be commonly useful.

Here, there are two columns, full and tiny, with a third … context being created.

tiny holds a string such as 30 year old (the unit varies a lot: day, week, month, decade, etc.) that was extracted from the longer content in the full column (and then processed further to get just the integer in yet another column, which doesn’t matter for these purposes).

More surrounding context, rather than just the essential tiny string, turned out to be preferable, and this approach provided it without intricate surgery on existing code.

import re
df['context'] = df.groupby('tiny', group_keys=False)['full'].apply(
   lambda g: g.str.extract(
      r'\b(.{0,20}' + f'{re.escape(g.name)}' + r'.{0,20}\b)'
   )
)

To explain that regex:

r'\b(.{0,20}' + f'{re.escape(g.name)}' + r'.{0,20}\b)'

In short, it says: for whatever is found in the tiny column on each row, find its match in the full column, but include up to 20 characters before it (stopping short at a word boundary, \b, when necessary so that no word is cut off partway) and up to 20 characters after it, again ending at a word boundary (the trailing \b).
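
To try the pattern on a single string outside pandas, here is a small sketch. The helper name extract_context, its width parameter, and the sample sentence are made up for illustration and are not from the answer:

```python
import re

def extract_context(full, tiny, width=20):
    """Hypothetical helper: return `tiny` with up to `width` characters of
    context on each side, bounded by word boundaries, or None if absent."""
    # Doubled braces in the f-string produce the literal {0,20} quantifier.
    pattern = rf'\b(.{{0,{width}}}{re.escape(tiny)}.{{0,{width}}}\b)'
    m = re.search(pattern, full)
    return m.group(1) if m else None

# Illustrative sample text (an assumption, not data from the answer).
full = ('According to the intake notes, the patient is a 30 year old man '
        'who arrived late on Monday.')
tiny = '30 year old'

context = extract_context(full, tiny)
print(repr(context))
```

Because \b also matches at the end of a word, the extracted context can begin or end with punctuation or a space; call .strip() on the result if that matters.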

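group_keys=False is there to avoid a ‘FutureWarning’ about group keys in the result index; that warning comes from pandas (the 1.5 series), not from Python itself.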

Answered By: gseattle