Extract all phrases from a pandas dataframe based on multiple words in list

Question:

I have a list, L:

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

I have a pandas DataFrame, DF:

Text
the objects are both before and after the person
the object is behind the person
the object in right is next to top left hand side of person

I would like to extract all words in L from the DF column ‘Text’ in such a manner:

Text                                                          Extracted_Value
the objects are both before and after the person              before_after
the object is behind the person                               behind
the object in right is next to top left hand side of person   right_top left hand side

For case 1 and 2, my code is working:

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
pattern = r"(?:^|\s+)(" + "|".join(L) + r")(?:\s+|$)"
df["Extracted_Value "] = (
    df['Text'].str.findall(pattern).str.join("_").replace({"": None})
)

For CASE 3, I get right_top_hand.

As the third example shows, identified words that are contiguous should be picked up as a single phrase (one extraction). So in "the object in right is next to top left hand side of person" there are two extractions: "right" and "top left hand side". Only these two extractions are separated by an _.

I am not sure how to get it to work!

Asked By: Parsh

||

Answers:

This works for me. It compares each word of the phrase in each row with the items in the list.

import numpy as np
import pandas as pd

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

df = pd.DataFrame(
['the objects are both before and after the person',
'the object is behind the person',
'the object in right is next to top left hand side of person'], columns=['Text'])

# Keep only the words that appear in L and join them with "_"; empty results become NaN.
df['Extracted_Value'] = df['Text'].str.split().apply(lambda x: '_'.join([m for m in x if m in L])).replace('',np.nan)

My output is,

    Text    Extracted_Value
0   the objects are both before and after the person    before_after
1   the object is behind the person                     behind
2   the object in right is next to top left hand s...   right_top_left_hand_side
Answered By: anarchy

Try:

df["Extracted_Value"] = (
    df.Text.apply(
        lambda x: "|".join(w if w in L else "" for w in x.split()).strip("|")
    )
    .replace(r"\|{2,}", "_", regex=True)
    .str.replace("|", " ", regex=False)
)
print(df)

Prints:

                                                          Text           Extracted_Value
0             the objects are both before and after the person              before_after
1                              the object is behind the person                    behind
2  the object in right is next to top left hand side of person  right_top left hand side
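
To see why the pipe trick groups contiguous words, it may help to trace the intermediate strings for a single row. The following is a minimal plain-Python sketch of the same three steps, without pandas; the helper name `extract_phrases` is hypothetical and not part of the answer above.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

def extract_phrases(text, words):
    # Step 1: keep listed words, blank out the rest, join everything with "|".
    marked = "|".join(w if w in words else "" for w in text.split()).strip("|")
    # Step 2: two or more pipes mark a gap of non-listed words, so use "_".
    grouped = re.sub(r"\|{2,}", "_", marked)
    # Step 3: a single remaining pipe sits between adjacent listed words.
    return grouped.replace("|", " ")

text = "the object in right is next to top left hand side of person"
print(extract_phrases(text, L))  # right_top left hand side
```

After step 1 the row reads `right||||top|left|hand|side`, which is why a run of pipes can stand for a phrase boundary while a single pipe stands for a space.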

EDIT: Adapting @Wiktor’s answer to pandas:

pattern = fr"\b((?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*)\b"

df["Extracted_Value"] = (
    df["Text"].str.extractall(pattern).groupby(level=0).agg("_".join)
)
print(df)
Answered By: Andrej Kesely

You need to use

pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"

The regex will look like

\b(?:top|left|behind|before|right|after|hand|side)(?:\s+(?:top|left|behind|before|right|after|hand|side))*\b

See the regex demo.

It will match

  • \b – a word boundary
  • (?:{'|'.join(L)}) – one of the words in L
  • (?:\s+(?:{'|'.join(L)}))* – zero or more repetitions of one or more whitespace characters followed by a word from the L list
  • \b – a word boundary.
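
The breakdown above can be checked with the standard `re` module alone. This is a minimal sketch, assuming the words in L are plain words; wrapping each one in `re.escape` is an extra precaution that is not in the original pattern.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

# re.escape is an added precaution (an assumption, not part of the answer).
words = "|".join(re.escape(w) for w in L)
pattern = rf"\b(?:{words})(?:\s+(?:{words}))*\b"

text = "the object in right is next to top left hand side of person"
matches = re.findall(pattern, text)
print(matches)            # ['right', 'top left hand side']
print("_".join(matches))  # right_top left hand side
```

Because the pattern has no capturing groups, `re.findall` returns each whole match, so a run of adjacent listed words comes back as one string.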

Python demo:

import pandas as pd
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
df = pd.DataFrame({'Text':["the objects are both before and after the person","the object is behind the person", "the object in right is next to top left hand side of person"]})
pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"

Output:

>>> df['Text'].str.findall(pattern).str.join("_").replace({"": None})
0                before_after
1                      behind
2    right_top left hand side
Name: Text, dtype: object
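
To store the result as a new column, as the question asks, the same expression can be assigned directly. A minimal sketch follows; the fourth row is a made-up example, added only to show that a row without any listed word ends up as a missing value.

```python
import pandas as pd

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
df = pd.DataFrame({'Text': [
    "the objects are both before and after the person",
    "the object is behind the person",
    "the object in right is next to top left hand side of person",
    "no listed word here",  # hypothetical row, not in the question
]})

pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"

# findall gives a list of phrases per row; join them with "_" and
# turn the empty string (no match) into a missing value.
df["Extracted_Value"] = (
    df["Text"].str.findall(pattern).str.join("_").replace({"": None})
)
print(df["Extracted_Value"].tolist())
```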
Answered By: Wiktor Stribiżew