Extract pattern from a column based on another column's value
Question:
Given two columns of a pandas DataFrame:
import pandas as pd
df = {'word': ['replay','replayed','playable','thinker','think','thoughtful', 'ex)mple'],
      'root': ['play','play','play','think','think','think', 'ex)mple']}
df = pd.DataFrame(df, columns= ['word','root'])
I’d like to extract the substring of column word that includes everything up to the end of the string in the corresponding column root, or NaN if the string in root is not included in word. That is, the resulting dataframe would look as follows:
word root match
replay play replay
replayed play replay
playable play play
thinker think think
think think think
thoughtful think NaN
ex)mple ex)mple ex)mple
My dataframe has several thousand rows, so I’d like to avoid for-loops if possible.
Answers:
You can use a regex with str.extract in a groupby + apply:
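import re
# one regex per distinct root; expand=False makes str.extract return a Series,
# and group_keys=False keeps the original row index so the assignment aligns
df['match'] = (df.groupby('root', group_keys=False)['word']
                 .apply(lambda g: g.str.extract(f'^(.*{re.escape(g.name)})', expand=False))
              )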
Or, if you expect few repeated "root" values:
import re
# row-wise version; the := operator needs Python 3.8+
df['match'] = df.apply(lambda r: m.group()
                       if (m := re.match(f'.*{re.escape(r["root"])}', r['word']))
                       else None, axis=1)
output:
word root match
0 replay play replay
1 replayed play replay
2 playable play play
3 thinker think think
4 think think think
5 thoughtful think NaN
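The output above stops before the question's last row, ex)mple, which is the one that shows why re.escape is needed. As a minimal sketch (not part of the original answer), the same matching can be done with a plain helper function and a list comprehension instead of df.apply; the helper name match_through_root is made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'word': ['replay', 'replayed', 'playable', 'thinker', 'think', 'thoughtful', 'ex)mple'],
    'root': ['play', 'play', 'play', 'think', 'think', 'think', 'ex)mple'],
})

def match_through_root(word, root):
    """Return `word` up to the end of the last occurrence of `root`, or None.

    Hypothetical helper, named here only for illustration.
    """
    # re.escape makes characters such as ')' match literally instead of
    # being parsed as regex syntax.
    m = re.match('.*' + re.escape(root), word)
    return m.group() if m else None

# Plain Python iteration over the two columns (an alternative to df.apply).
df['match'] = [match_through_root(w, r) for w, r in zip(df['word'], df['root'])]

# Without escaping, the unbalanced ')' is rejected by the regex compiler.
try:
    re.match('.*' + 'ex)mple', 'ex)mple')
    unescaped_ok = True
except re.error:
    unescaped_ok = False
```

Because .* is greedy, a word that contains the root more than once is matched through the last occurrence, not the first.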
Building on the answer by mozway: the regex can also be pieced together from several strings, which makes a different application possible, one that seems like it would be commonly useful.
Here there are two columns, full and tiny, with a third … context being created.
tiny holds a short string like 30 year old (these vary a lot: day, week, month, decade, etc.) that was extracted from the long content in the full column (and then operated on to get just the integer, in yet another column that doesn’t matter for these purposes).
It was decided that more surrounding context, instead of just the essential tiny string, would be better, and this solved that without needing to do intricate surgery on existing code.
import re
df['context'] = df.groupby('tiny', group_keys=False)['full'].apply(
    lambda g: g.str.extract(
        r'\b(.{0,20}' + f'{re.escape(g.name)}' + r'.{0,20}\b)'
    )
)
To explain that regex:
r'\b(.{0,20}' + f'{re.escape(g.name)}' + r'.{0,20}\b)'
… it basically says: for what’s found in the column titled tiny on each row, find its match over in the column named full, but add up to 20 characters before it (stopping short at a word boundary when necessary, to avoid having a word cut off partway) and also add up to 20 characters after it, with the trailing \b doing likewise at the end.
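group_keys=False is there to avoid a ‘FutureWarning’ (seen at Python 3.7); it also keeps the original row index on the result, so the assignment back to df lines up.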
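To see what the word-boundary regex actually returns, here is a small sketch on made-up data. The helper context_around, its width parameter, and the two sample rows are assumptions for illustration only, and it uses a row-wise re.search rather than the groupby + str.extract shown above, so that it does not depend on pandas groupby behaviour.

```python
import re
import pandas as pd

def context_around(full, tiny, width=20):
    """Return `tiny` plus up to `width` characters on each side, or None.

    Hypothetical helper; `width` generalises the hard-coded 20.
    """
    # Doubled braces are literal braces inside an f-string.
    pattern = rf'\b(.{{0,{width}}}{re.escape(tiny)}.{{0,{width}}}\b)'
    m = re.search(pattern, full)
    return m.group(1) if m else None

# Made-up sample rows, not taken from the original data.
df = pd.DataFrame({
    'full': ['Records show that the patient is a 30 year old male',
             'no age given here'],
    'tiny': ['30 year old', '2 week old'],
})

df['context'] = [context_around(f, t) for f, t in zip(df['full'], df['tiny'])]

# \b also matches between a word and a space, so the captured text can
# start or end with whitespace; strip it if that matters.
df['context_clean'] = df['context'].str.strip()
```

In the first row the 20 characters before the match would start in the middle of "that", so the regex backs off to the nearest word boundary and the context begins at " the patient is a ...", leading space included. The second row has no match and gets a missing value.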